October 16, 2018
I made a mix of all your answers so everybody can see the good and bad
Next-generation sequencing has the potential to dramatically accelerate biological research by making the comprehensive analysis of genomes, transcriptomes and DNA-protein interactions inexpensive, routine and widespread
Next-generation sequencing technologies offer novel and rapid ways for genome-wide characterization and profiling of mRNAs, small RNAs, transcription-factor binding regions, chromatin structure, DNA methylation patterns, ancient DNA, microbiology and metagenomics
This review discusses next-generation sequencing technologies, recent advances in their systems, and their applications
Amplicon sequencing is a highly targeted approach that enables researchers to analyze genetic variation in specific genomic regions
The ultra-deep sequencing of PCR products (amplicons) allows efficient variant identification and characterization
This method uses oligonucleotide probes designed to target and capture regions of interest, followed by next-generation sequencing (NGS).
Amplicon sequencing is useful for the discovery of rare somatic mutations in complex samples (such as tumors mixed with germline DNA)
Another common application is sequencing the bacterial 16S rRNA gene across multiple species, a widely used method for phylogeny and taxonomy studies, particularly in diverse metagenomics samples
Enables researchers to efficiently discover, validate, and screen genetic variants using a highly targeted approach
Supports multiplexing of hundreds to thousands of amplicons per reaction to achieve high coverage
Delivers highly targeted re-sequencing even in difficult-to-sequence areas, such as GC-rich regions
Allows flexibility for a wide range of experimental designs
Reduces sequencing costs and turnaround time compared to broader approaches such as whole-genome sequencing
Gen Era, a biotechnology company based in Istanbul, performs amplicon sequencing
The price for a whole genome is about 3000 euros
Generic adaptors are added to the ends and these are annealed to beads, one DNA fragment per bead
The fragments are then amplified by PCR using adaptor-specific primers
Each bead is then placed in a single well of a slide
So each well will contain a single bead, covered in many PCR copies of a single sequence
The slide is flooded with one of the four NTP species
Where this nucleotide is next in the sequence, it is added to the sequence read
If that single base repeats, then more will be added
So if we flood with guanine bases and the next base in the sequence is G, one G will be added; however, if the next part of the sequence is GGGG, then four Gs will be added.
The addition of each nucleotide releases a light signal
The locations of these signals are detected and used to determine which beads the nucleotides were added to
This NTP mix is washed away
The next NTP mix is now added and the process repeated, cycling through the four NTPs.
This kind of sequencing generates graphs for each sequence read, showing the signal density for each nucleotide wash
The sequence can then be determined computationally from the signal density in each wash.
All of the sequence reads we get from 454 will be different lengths, because different numbers of bases will be added with each cycle.
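As a toy illustration of the flowgram idea described above, here is a minimal R sketch; the flow order and the function name `simulate_flowgram` are my own assumptions, not part of any 454 software, and the enzymatic light production is reduced to simply counting incorporated bases.

```r
# Toy simulation of a 454-style flowgram (not the real 454 software):
# flood the template with T, A, C, G in turn and record how many bases
# are incorporated (and hence how much light is emitted) in each flow
simulate_flowgram <- function(template, flows = rep(c("T", "A", "C", "G"), 10)) {
  bases  <- strsplit(template, "")[[1]]
  pos    <- 1                      # next template base to be synthesized
  signal <- numeric(length(flows))
  for (f in seq_along(flows)) {
    n <- 0
    # a homopolymer run incorporates several bases in a single flow
    while (pos <= length(bases) && bases[pos] == flows[f]) {
      n   <- n + 1
      pos <- pos + 1
    }
    signal[f] <- n                 # light intensity ~ number of bases added
  }
  signal
}

simulate_flowgram("TTAGGGGC")      # 2 1 0 4 0 0 1 0 0 0 ...
```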
sequencing by synthesis
can sequence 0.7 gigabase per run
run time 23 hours
read length 700-800 bases
Roche 454 pyrosequencing by synthesis (SBS) was the first commercially successful second-generation sequencing system, developed by 454 Life Sciences in 2005 and acquired by Roche in 2007 http://www.my454.com
This technology uses pyrosequencing chemistry, whereby visible light produced by an ATP sulfurylase, luciferase and DNA polymerase enzymatic system is detected and measured in proportion to the amount of pyrophosphate released during repeated nucleotide incorporation into the newly synthesized DNA chain
The system was miniaturized and massively parallelized using PicoTiterPlates to produce more than 200,000 reads at 100 to 150 bp per read with an output of 20 Mb per run in 2005
The upgraded 454 GS FLX Titanium system released by Roche in 2008 improved the average read length to 700 bp with an accuracy of 99.997% and an output of 0.7 Gb of data per run within 24 h.
The GS Junior bench-top sequencer produced a read length of 700 bp with 70 Mb throughput and runtime of 10 to 18 h.
The major drawbacks of this technology are the high cost of reagents and high error rates in homopolymer repeats.
The estimated cost per million bases is $10 by Roche 454 compared to $0.07 by Illumina HiSeq 2000.
A more serious challenge for those using this technology is the announcement by Roche that they will no longer supply or service the 454 sequencing machines or the pyrosequencing reagents and chemicals after 2016
also sequencing by synthesis
and IMPROVING…
Illumina http://www.illumina.com purchased the Solexa Genome Analyzer in 2006 and commercialized it in 2007
Today, it is the most successful sequencing system with a claimed over 70% dominance of the market, particularly with the HiSeq and MiSeq platforms
The Illumina sequencer is different from the Roche 454 sequencer in that it adopted the technology of sequencing by synthesis using removable fluorescently labeled chain-terminating nucleotides that are able to produce a larger output at lower reagent cost
The clonally enriched template DNA for sequencing is generated by PCR bridge amplification (also known as cluster generation) into miniaturized colonies called polonies
The output of sequencing data per run is higher (600 Gb), the read lengths are shorter (approximately 100 bp), the cost is cheaper, and the run times are much longer (3-10 days) than most other systems [54]
Illumina provides six industrial-scale sequencing machines (NextSeq 500; HiSeq 2500, 3000, and 4000; and HiSeq X Five and Ten) with mid to high output (120–1500 Gb), as well as a compact laboratory sequencer called the MiSeq, which, although small in size, has an output of 0.3 to 15 Gb and fast turnover rates suitable for targeted sequencing in clinical and small-laboratory applications
The MiSeq uses the same sequencing and polony technology as the high-end machines, but it can provide sequencing results in 1 to 2 days at much reduced cost
Illumina’s new method of synthetic long reads using TruSeq technology apparently improves de novo assembly and resolves complex, highly repetitive transposable elements
sequencing by ligation
Supported Oligonucleotide Ligation and Detection (SOLiD) is a next-generation sequencer instrument marketed by Life Technologies http://www.lifetechnologies.com and first released in 2008 by Applied Biosystems Instruments (ABI)
It is based on 2-nucleotide sequencing by ligation (SBL)
This procedure involves sequential annealing of probes to the template and their subsequent ligation
Sequencers on the market today, such as the 5500 W series, are suitable for small- and large-scale projects involving whole genomes, exomes, and transcriptomes
Previously, sample preparation and amplification were similar to those of Roche 454 sequencing
However, the upgrades to Wildfire chemistry have enabled greater throughput and simpler workflows by replacing beads with direct in situ amplification on FlowChips and paired-end sequencing
The SOLiD 5500 W series sequencing reactions still use fluorescently labeled octamer probes in repeated cycles of annealing and ligation that are interrogated and eventually deciphered in a complex subtractive process using Exact Call Chemistry that has been well described by others
The advantage of this method is accuracy, since each base is interrogated twice
The major disadvantages are the short read lengths (50–75 bp), the very long run times of 7 to 14 days, and the need for state-of-the-art computational infrastructure and expert computing personnel for analysis of the raw data
Complete Genomics http://www.completegenomics.com developed DNA nanoball sequencing (DNBS) as a hybrid of sequencing by hybridization and ligation
Small fragments (440–500 bp) of genomic DNA or cDNA are amplified into DNA nanoballs by rolling-circle replication that requires the construction of complete circular templates before the generation of nanoballs
The DNA nanoballs are deposited onto an arrayed flow cell, with one nanoball per well sequenced at high density
Up to 10 bases of the template are read in the 5′ and 3′ directions from each adapter
Since only short sequences, adjacent to adapters, are read, this sequencing format resembles a multiplexed form of mate-pair sequencing similar to using Exact Call Chemistry in SOLiD sequencing
Ligated sequencing probes are removed, and a new pool of probes is added, specific for different interrogated positions
The cycle of annealing, ligation, washing, and image recording is repeated for all 10 positions adjacent to one terminus of one adapter
This process is repeated for all seven remaining adapter termini
Although the developers have sequenced the whole human genome, the major disadvantages of DNBS are the short read lengths and the long turnaround time of sequencing projects
The claimed reagent cost for sequencing a whole human genome is under $5000
The major advantage of this approach is the high density of arrays and therefore the high number of DNBs (~350 million) that can be sequenced
In 2015, the Chinese genomics service company BGI-Shenzhen acquired Complete Genomics and introduced the Revolocity system for large-scale, high-quality whole-genome and whole-exome sequencing, with 50x coverage per genome and a turnaround from sample to assembled genome of less than 8 days
Complete Genomics claims to have sequenced more than 20,000 whole human genomes over 5 years and published widely on the use of their NGS platform
They provide public access to a repository of 69 human genome datasets and a cancer data set of two matched tumor and normal sample pairs at http://www.completegenomics.com/public-data/
Ion Torrent technology http://www.iontorrent.com was developed by the inventors of 454 sequencing, introducing two major changes.
Firstly, the nucleotide sequences are detected electronically by changes in the pH of the surrounding solution proportional to the number of incorporated nucleotides rather than by the generation of light and detection using optical components.
Secondly, the sequencing reaction is performed within a microchip that is amalgamated with flow cells and electronic sensors at the bottom of each cell
The incorporated nucleotide is converted to an electronic signal detected by the electronic sensors
The two sequencers in the market that use Ion Torrent technology are the high-throughput Proton sequencer with more than 165 million sensors and the Ion Personal Genome Machine (PGM), a bench-top sequencer with 11.1 million sensors
There are four sequencing chips to choose from
The Ion PI Chip is used with the Proton sequencer, and the Ion 314, 316, or 318 Chips are used with the Ion PGM.
The Ion 314 Chip provides the lowest reads at 0.5 million reads per chip, whereas the Ion 318 Chip provides the highest reads of up to 5.5 million reads per chip
The Proton sequencer provides a higher throughput (10–100 Gb vs. 20 Mb–1 Gb) and more reads per run (660 Mb vs. 11 Mb) than the PGM chips, but the read lengths (200–500 bp), run time (4–5 h), and accuracy (99%) are similar
Sample preparation for the generation of DNA libraries is similar to the one used for Roche 454 sequencing but can be simplified with the use of the Ion Chef system for automated template preparation and chip loading
The Ion Torrent chip is used with an ion-sensitive field-effect transistor sensor that has been engineered to detect individual protons produced during the sequencing reaction
The chip is placed within the flow cell and is sequentially flushed with individual unlabeled dNTPs in the presence of the DNA polymerase
Incorporation of a nucleotide into the DNA chain releases hydrogen ions (protons), changing the pH of the surrounding solution in proportion to the number of incorporated nucleotides
The major disadvantages of the system are problems in reading homopolymer stretches and repeats
The major advantages seem to be the relatively longer read lengths, flexible workflow, reduced turnaround time, and a cheaper price than those provided by the other platforms.
Method | Generation | Read length (bp, single pass) | Error rate (%) | No. of reads per run | Time per run | Cost per million bases (USD) |
---|---|---|---|---|---|---|
Sanger ABI 3730xl | 1st | 600-1000 | 0.001 | 96 | 0.5-3 h | 500 |
Illumina HiSeq 2500 (Rapid Run) | 2nd | 2 × 250 | 0.1 | 1.2 × 10^9 (paired) | 1-6 days | 0.04 |
PacBio RS II: P6-C4 | 3rd | 1.0-1.5 × 10^4 | 13 | 3.5-7.5 × 10^4 | 0.5-4 h | 0.4-0.8 |
Oxford Nanopore MinION | 3rd | 2-5 × 10^3 | 38 | 1.1-4.7 × 10^4 | 50 h | 6.44-17.9 |
Instrument | Primary Errors | Single-pass Error Rate (%) | Final Error Rate (%) |
---|---|---|---|
3730xl (capillary) | substitution | 0.1-1 | 0.1-1 |
454 All models | Indel | 1 | 1 |
Illumina All Models | substitution | ~0.1 | ~0.1 |
Ion Torrent – all chips | Indel | ~1 | ~1 |
SOLiD – 5500xl | A-T bias | ~5 | ≤0.1 |
Oxford Nanopore | deletions | ≥4 | 4 |
PacBio RS | Indel | ~13 | ≤1 |
As we saw in the last class, mapping reads to a reference genome is a useful technique
Mapping is finding where each read maps on the genome
Here “maps” means “the genome sequence is very similar to the read”
The idea is that each read comes from a fragment of the genome, and we want to know where that fragment came from
For now, let’s assume that the read is 100% identical to a part of the genome
The problem is identical to “finding a word in a text”
How does CTRL-F work?
    # Naive search: slide the read along the genome, comparing base by base
    for (i in 1:(length(genome) - length(read) + 1)) {
      n <- 0
      for (j in 1:length(read)) {
        if (genome[i + j - 1] == read[j]) {
          n <- n + 1
        }
      }
      if (n == length(read)) {
        print(i)  # the read matches the genome starting at position i
      }
    }
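To try the loop above on a toy example (the sequences below are illustrative only), the genome and read can be stored as vectors of single bases before running it:

```r
# Toy data for the naive search above (illustrative sequences only)
genome <- strsplit("GATCACAGGTCTATCACCCT", "")[[1]]
read   <- strsplit("TCTATC", "")[[1]]
# with these vectors defined, the loop above prints 10,
# the position where the read matches the genome
```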
How long does it take? (time)
The time of a program is usually called its computational cost
It depends on the genome length \(G\) and the read length \(L\)
There are two loops. The inner part of the loop is done \(G\cdot L\) times
We say that the cost of this program is \(O(GL)\)
Cost is “on the order of” \(G\cdot L\)
For example, for a genome of 3 million bp and a read of 500 bp, the cost is on the order of 1.5E9 steps
That is 1,500,000,000 steps for one read
We have to look for several reads. Millions. Let’s say \(N\)
Then we have to repeat everything \(N\) times
If we have 2 million reads, we multiply 1.5E9 by 2E6
That is 3E15 = 3,000,000,000,000,000 steps
Assuming that the computer can do 1000 million steps per second, this is \[3\cdot 10^{15}\text{ steps}\cdot \frac{1\text{ sec}}{10^{9}\text{ steps}}\cdot\frac{1\text{ day}}{86400\text{ sec}}\approx 34.7\text{ days}\]
Clearly, a direct search is a bad idea
All modern systems use a speed-up technique based on some kind of index
An index is an auxiliary table that shows where we can find each “word” in the “text”
For example, like the index at the end of a book
“Think Complexity” by Allen B. Downey CC-BY-NC-SA http://thinkcomplex.com/thinkcomplexity.pdf
Most of the algorithms used in bioinformatics were initially created for other purposes
Here we are using ideas that were initially developed for searching patterns (such as words) inside long texts, like in a book
In our context, the text will be the full genome, and the words are subsequences of DNA of a fixed length.
Usually, the word length is called k, and the words are called k-mers
To look for a read in the genome, we see which k-mers are in the read, and we look for them first
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTAT
key | position | key | position |
---|---|---|---|
AA | 24 | GA | 1, 36 |
AC | 5, 16, 25, 28, 32 | GC | 38, 47 |
AG | 7, 37 | GG | 8, 34, 35, 53 |
AT | 2, 13, 21, 45, 49, 56 | GT | 9, 54 |
CA | 4, 6, 15, 27, 31, 44, 48 | TA | 12, 20, 23, 55 |
CC | 17, 18, 26, 43 | TC | 3, 10, 14, 30, 40, 42 |
CG | 33 | TG | 46, 52 |
CT | 11, 19, 29, 39, 41 | TT | 22, 50, 51 |
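A minimal R sketch of how such an index could be built (the function name `build_index` is assumed for illustration); with k = 2 it reproduces the positions shown in the table above:

```r
# Build a k-mer index: for each k-mer, record the positions where it starts
build_index <- function(genome, k = 2) {
  index <- list()
  for (i in 1:(nchar(genome) - k + 1)) {
    kmer <- substr(genome, i, i + k - 1)
    index[[kmer]] <- c(index[[kmer]], i)   # append this start position
  }
  index
}

genome <- "GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTAT"
idx <- build_index(genome, k = 2)
idx[["AC"]]   # 5 16 25 28 32, matching the table above
```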
Searching in the index is easy, because it is sorted
When data is sorted, we can search quickly by dividing the list in half every step
This is called Binary Search. Google it
Using this strategy of dividing the index in half on every step, we need \[\log_2(\text{Number of words in the index})\cdot (\text{word length})\] steps to find any word in the index
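As a small illustration (the function `binary_search` below is an assumed name, not part of any mapping tool), here is how a binary search over the sorted index keys could look in R:

```r
# Binary search for a k-mer in a sorted vector of index keys
binary_search <- function(keys, target) {
  lo <- 1
  hi <- length(keys)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2          # integer division: middle of the range
    if (keys[mid] == target) {
      return(mid)                   # position of the k-mer in the index
    } else if (keys[mid] < target) {
      lo <- mid + 1                 # target is in the upper half
    } else {
      hi <- mid - 1                 # target is in the lower half
    }
  }
  NA                                # k-mer not present in the index
}

keys <- sort(c("AA", "AC", "AG", "AT", "CA", "CC", "CG", "CT",
               "GA", "GC", "GG", "GT", "TA", "TC", "TG", "TT"))
binary_search(keys, "GC")   # returns 10
```

Each comparison of two k-mers costs about k steps, which is where the word-length factor in the formula above comes from.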
It is easy to see that there are \(4^k\) different k-mers
Some may appear several times in the genome, and some may not appear at all
On average, each k-mer appears \(G/4^k\) times on the genome
In any case, the number of words in the index is \(4^k\)
Using the strategy of binary search, we can find each word in the index in \[\log_2(4^k)\cdot k=\log_2(2^{2k})\cdot k=2k^2\] steps
After we find the k-mer in the index, we have to check whether the rest of the read matches the genome
This time we do not need to scan the whole genome
We only have to check \(G/4^k\) positions on the genome
As before, testing whether the read matches at a given position takes \(L\) steps
The total cost of a single read is then \[2k^2\cdot\frac{G}{4^k}\cdot L\]
For the same genome of 3 million bp, using k=4, the index will have 256 entries, each of length \[\frac{3\times 10^{6}}{4^4}\approx 1.17\times 10^{4}\] searching one read of 500bp will take \[2\cdot 4^2\cdot 1.17\times 10^{4} \cdot 500=1.88\times 10^{8}\] steps, that is, 0.19 seconds per read. For 2 million reads, this takes \[2\times 10^{6}\text{ reads}\cdot 0.19\frac{\text{sec}}{\text{read}}\cdot\frac{1\text{ day}}{86400\text{ sec}}\approx 4.34\text{ days}\]
For the same genome of 3 million bp, using k=6, the index will have 4096 entries, each of length \[\frac{3\times 10^{6}}{4^6}\approx 732.422\] searching one read of 500bp will take \[2\cdot 6^2\cdot 732.422 \cdot 500=2.637\times 10^{7}\] steps, that is, 0.026 seconds per read. For 2 million reads, this takes \[2\times 10^{6}\text{ reads}\cdot 0.026\frac{\text{sec}}{\text{read}}\cdot\frac{1\text{ hour}}{3600\text{ sec}}\approx 14.648\text{ hours}\]
For the same genome of 3 million bp, using k=8, the index will have \(6.554\times 10^{4}\) entries, each of length \[\frac{3\times 10^{6}}{4^8}\approx 45.776\] searching one read of 500bp will take \[2\cdot 8^2\cdot 45.776 \cdot 500=2.93\times 10^{6}\] steps, that is, 0.003 seconds per read. For 2 million reads, this takes \[2\times 10^{6}\text{ reads}\cdot 0.003\frac{\text{sec}}{\text{read}}\cdot\frac{1\text{ hour}}{3600\text{ sec}}\approx 1.628\text{ hours}\]
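The three estimates above can be reproduced with a short R loop, under the same assumptions (genome of 3 Mb, reads of 500 bp, 2 million reads, 10^9 steps per second):

```r
# Reproduce the cost estimates above for k = 4, 6 and 8
# using the formula 2*k^2 * (G / 4^k) * L steps per read
G <- 3e6              # genome length (bp)
L <- 500              # read length (bp)
N <- 2e6              # number of reads
steps_per_sec <- 1e9
for (k in c(4, 6, 8)) {
  steps_per_read <- 2 * k^2 * (G / 4^k) * L
  total_hours <- N * steps_per_read / steps_per_sec / 3600
  cat("k =", k, "-> steps per read:", steps_per_read,
      "- total time:", round(total_hours, 2), "hours\n")
}
# k = 4 -> ~1.9e8 steps per read, ~104 hours (about 4.3 days)
# k = 6 -> ~2.6e7 steps per read, ~14.6 hours
# k = 8 -> ~2.9e6 steps per read, ~1.6 hours
```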
As we can see, using an index is much faster. But we need to make and store the index
Can you estimate how many steps we need to make the index?
Can you estimate how much memory is needed?
Obviously, the answer depends on the value of \(k\)