In broad terms there are three alternatives
In all these cases we get a DNA fragment, about a few thousands bp long (let’s say 500–5000bp)
Each fragment is a physical object
Each strand can be sequenced and produce a read
Usually we have two reads for each fragment
They are sometimes called “forward” and “reverse” reads
There are many sequencing methods, based on different physicochemical properties
They are in constant development. We will focus on their key properties
See DNA sequencing on the WikiPedia for more details
DNA is separated in 4 parts, each part is mixed with a different restriction enzyme. Then the fragments are separated by electrophoresis.
In a second version, each part is marked with a different fluorophores and then placed in a single capilar electrophoresis, with a light detector
Part of the vector will be sequenced in every read
The result is a chromatogram. Filenames ending in .AB1
or .SCF
Third generation
Generic adaptors are added to the ends and annealed to beads, one DNA fragment per bead
The fragments are then amplified by PCR
Each bead is placed in a single well of a slide
Each well will contain a single bead, covered in many PCR copies of a single sequence
The slide is flooded with one of the four NTP species
The nucleotide is incorporated when it matches the template
If that single base repeats, then more will be added
The addition of each nucleotide releases a light signal, and is registered with a hight resolution video
This NTP mix is washed away
The next NTP mix is now added and the process repeated, cycling through the four NTPs.
This kind of sequencing generates graphs for each sequence read, showing the signal density for each nucleotide wash
The sequence can then be determined computationally from the signal density in each wash.
Sequencing by synthesis
can sequence 0.7 gigabase per run
run time 23 hours
read length 700-800 bases
also sequencing by synthesis
over 70% of the market
Sequencing by ligation
Supported Oligonucleotide Ligation and Detection (SOLiD)
The advantage of this method is accuracy with each base interrogated twice
Major disadvantages
Method | Read length (bp) | Error rate (%) | No. of reads per run | Time per run | Cost per million bases (USD) |
---|---|---|---|---|---|
Sanger ABI 3730x1 | 600-1000 | 0.001 | 96 | 0.5–3 h | 500 |
Illumina HiSeq 2500 (Rapid Run) | 2 × 250 | 0.1 | 1.2 × 109 (paired) | 1–6 days | 0.04 |
PacBio RS II: P6-C4 | 1.0–1.5 × 104 | 13 | 3.5–7.5 × 104 | 0.5–4 h | 0.4–0.8 |
Oxford Nanopore MinION | 2–5 × 103 | 38 | 1.1–4.7 × 104 | 50 h | 6.44–17.9 |
Sequencers produce signals
each method has their own tool to call the bases
For Sanger, that tool is called Phred
Signal–Ratio is different for each base
As usual we use logarithms, multiply by a constant, and take the integer part \[\lfloor10\cdot\log_{10}(\text{Signal}/\text{Noise})\rfloor\]
Q | Error.probability | Probability.of.correct |
---|---|---|
10 | 0.1000000 | 0.9000000 |
15 | 0.0316228 | 0.9683772 |
20 | 0.0100000 | 0.9900000 |
30 | 0.0010000 | 0.9990000 |
40 | 0.0001000 | 0.9999000 |
50 | 0.0000100 | 0.9999900 |
60 | 0.0000010 | 0.9999990 |
70 | 0.0000001 | 0.9999999 |
80 | 0.0000000 | 1.0000000 |
.phd
)BEGIN_SEQUENCE GTFAG75TF.scf
BEGIN_COMMENT
CHROMAT_FILE: GTFAG75TF.scf
ABI_THUMBPRINT: 0
TRACE_PROCESSOR_VERSION:
BASECALLER_VERSION: phred 0.071220.c
PHRED_VERSION: 0.071220.c
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Fri Nov 22 15:00:37 2019
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 9116
TRIM: 34 555 0.0500
TRACE_PEAK_AREA_RATIO: 0.0255
CHEM: prim
DYE: rhod
END_COMMENT
BEGIN_DNA
g 6 8
a 6 24
g 6 28
a 8 45
g 8 50
a 8 59
t 8 77
c 8 87
t 9 100
a 9 115
g 11 124
t 11 140
g 15 147
t 8 162
g 9 170
c 12 182
t 9 197
g 17 207
g 20 221
a 21 234
.qual
This is FASTA format, with numbers instead of letters
Each number is the quality of the corresponding letter
>GTFAG75TF.scf 702 0 702 SCF
6 6 6 8 8 8 8 8 9 9 11 11 15 8 9 12 9 17 20 21 20
18 9 10 6 6 6 6 8 16 11 6 6 8 20 17 30 30 44 44 29
16 9 9 9 22 23 44 44 32 32 32 34 34 34 38 38 38 38
34 38 38 38 38 44 34 34 34 39 44 47 47 53 53 53 53
53 47 47 47 47 47 47 53 53 47 53 53 53 47 59 59 59
59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
59 59 59 59 59 59 59 59 59 59 59 24 24 59 59 59 59
62 62 62 62 62 62 62 59 59 59 59 59 59 51 59 59 62
59 59 59 59 59 59 59 59 56 59 59 59 59 59 56 56 56
39 38 38 38 56 56 56 56 59 62 62 62 62 59 59 59 59
59 59 59 62 62 62 62 62 62 44 40 40 28 28 28 59 59
56 59 59 56 56 56 56 59 56 56 59 59 59 59 59 59 59
59 56 56 56 56 56 56 59 59 59 59 62 59 59 56 56 56
59 56 59 59 59 59 59 59 47 59 59 59 59 59 59 59 59
59 59 59 59 59 56 56 56 56 56 59 62 59 59 59 59 59
59 59 59 59 59 59 59 62 62 62 62 59 59 59 59 59 59
59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
25 25 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
59 53 53 43 43 43 53 53 53 53 53 53 53 53 53 53 47
47 47 47 47 44 39 53 39 39 39 39 39 53 53 53 53 53
47 47 43 44 38 38 32 38 27 27 32 31 29 29 35 35 36
35 35 38 38 43 43 43 38 53 39 34 34 53 53 47 43 47
53 53 53 53 53 53 47 47 43 39 32 32 29 30 30 38 34
35 26 44 44 44 38 33 26 24 27 29 33 33 35 38 38 38
34 34 34 38 32 38 34 34 34 39 32 39 34 30 32 32 34
34 38 38 38 38 38 38 38 38 29 29 27 18 17 27 27 27
20 20 33 29 33 32 32 32 35 38 38 34 34 34 35 47 34
38 32 32 25 27 20 20 15 24 30 32 32 32 31 29 29 27
29 32 31 24 29 33 32 33 31 28 29 29 24 24 29 29 29
32 32 31 29 24 29 24 24 23 19 20 23 21 21 17 18 10
10 10 9 7 9 9 9 9 11 19 25 21 21 20 23 24 24 24 18
18 10 10 10 18 20 20 14 24 15 20 20 29 24 17 20 15
14 14 12 11 9 8 6 6 6 8 8 8 10 10 10 18 18 24 24 16
18 13 11 11 14 10 9 16 11 10 10 11 10 13 10 19 21
17 8 6 6 6 11 10 11 11 10 10 12 10 19 18 9 7 10 6
6 6 9 11 9 10 10 12 8 6 6 6 6 8 8 9 9 9 9 8 11 11
11 7 7 11 6 6 6 6 6 6 9 7 6 6 8 8 11 6 6 6 6 6 6 6
9 6 6 6 8 6 8 9 8 6 6 6 6 6 9 6 6 8 6 6 6 6 6 8 11
12 11 9 9 9 7 12 8 9 7 9 9 9 6 6 9 9 7 7 7 7
Inspired by FASTA, combining sequences and qualities
@
instead of >
, and the read ID+
, can be followed by the ID again@001043_1783_0863
CTGATCATTGGGCGTAAAGAGTGCGCAGGCGGTTTGTTAAGCGAGATGTGAAAGCCCCGGGCTCAACCTGGGAATTGCATTTCGAACTGGCGAACTAGAGTCTTGTAGAGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGCACGAAAGCGTGGGGAGCAAACAGGATTAAATACCCTCGTA
+001043_1783_0863
<<<<<<<>;>=0<<<=<-<<<<<<<<<>:<>:>=/<?;?;<<<<<<<<<<>=0<<<5(=<-<<<>;?;<>=0>;?;<<<>=/<<>:;<<8<<?;<<<:<<<<?;<<<<<;:71(;;;?;>9>:<>:;<<<;<>9<<>=1<<<<<<<<<<<<<>;;?;>:<:?;?;<>:<<?;?;<?;;;70';>;9<>=1;:<;;;:<<<<<?;;<<<>=2<<;<<<5(<;9>=0;<?;<>;=<2:8>=1:9<<9
Each quality value between 0 and 90 is represented by the corresponding symbol in the following text
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
Initially FASTQ were supposed to have sequences and qualities in several lines, but the symbols +
and @
can be quality values
To avoid ambiguity, each entry has exactly 4 lines.
Once we have all the reads, how do we get the genome?
Imagine a book of 1000 pages:
The problem is to reconstruct the original book
If we already know the genome, we do not need to assemble it
This is obvious. Still, it is an important case
Also called ab initio, these are Latin words meaning “from new” or “from the start”
There are basically three strategies used to assemble genomes
Today we will speak about the first two strategies
Given a set of sequence fragments, the object is to find a longer sequence that contains all the fragments.
The final sequence may be wrong
Two different reads A, B may match the same read X
To decide which one is correct, a Layout stage is used
In this approach, the overlap between reads is used to build a graph of reads relationships
This graph determines the layout of all reads,
that is, their relative positions
Usually there are several independent groups of reads that are not connected in the layout graph
(we say that the graph has several connected components)
All the reads that are connected together form a contig
Once the layout is clear, the last step is to retrieve the consensus sequence of the contig
In fact, it is not a consensus. It is a vote. The majority wins
High quality votes are more important than low quality ones
The human genome project used Phred for base-calling and Phrap for assembly
Phrap produces several files. The most important has extension .ace
These programs are free for academic usage (non-commercial)
AS 36 39
CO Contig1| 131 1 1 U
gctagaaaaaaaaggactcccagtagaaatacgtacaataaagtaggttc
ctctagttaactgttacaaaataagtttcccattggtaatataatagatt
tataactgttatatccagagcaacctagggg
BQ
15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15
15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15
15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15
AF 5765593 U 1
BS 1 131 5765593
RD 5765593 131 0 6
gctagaaaaaaaaggactcccagtagaaatacgtacaataaagtaggttc
ctctagttaactgttacaaaataagtttcccattggtaatataatagatt
tataactgttatatccagagcaacctagggg
CO Contig36 111 2 63 U
taTAAAGTCGATGGGGAGGAAGATAGGGGAGCTAAAGCCATAGGGAAACC
ACGTAGTTCTGCGTCAAGCGTTgccttcCGAGGTGCTCTCCGCTTTTCCA
TGCtccaatcg
BQ
15 15 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25
25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 15 15 15 15 15 15 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25 25
25 25 25 15 15 15 15 15 15 15 15
AF 5648323 U 1
AF 5703145 U 1
Indiana University Bloomington, “Introduction to Bioinformatics”, Lecture by Yuzhen Ye
Université Lyon I, Network Algorithms for Molecular Biology lesson on “Introduction to (de novo) assembly”, by Blerina Sinaimeri
MIT, “Foundations of Computational Systems Biology”, Lecture by David K. Gifford