We have reads that are close related to a reference genome
For example, RNA messenger from the same individual
Or DNA from a new individual from the same species
Or DNA from a species on the same genus
October 23, 2018
We have reads that are close related to a reference genome
For example, RNA messenger from the same individual
Or DNA from a new individual from the same species
Or DNA from a species on the same genus
bwa
As we discussed previously, bwa
uses an index to speedup the search
So the first step is
bwa index yeast.fasta
List the files in your folder and verify that there are index files
Let’s first align only one set of reads
$ bwa mem yeast.fasta y1.fastq -o out1.sam
The output is in Sequence Alignment/Map file format
We use the extension .sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 25000 sequences (900000 bp)... [M::mem_process_seqs] Processed 25000 reads in 0.867 CPU sec, 0.867 real sec [main] Version: 0.7.17-r1188 [main] CMD: bwa mem -o out1.sam yeast.fasta y1.fastq [main] Real time: 0.961 sec; CPU: 0.939 sec
.sam
file?@SQ SN:XI LN:666816 @SQ SN:IX LN:439888 @SQ SN:IV LN:1531933 @SQ SN:III LN:316620 @SQ SN:X LN:745751 @SQ SN:XII LN:1078177 @SQ SN:VIII LN:562643 @SQ SN:VI LN:270161 @SQ SN:V LN:576874 @SQ SN:II LN:813184 @SQ SN:XIV LN:784333 @SQ SN:XIII LN:924431 @SQ SN:XVI LN:948066 @SQ SN:XV LN:1091291 @SQ SN:I LN:230218 @SQ SN:Mito LN:85779 @SQ SN:VII LN:1090940 @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem yeast.fasta y1.fastq -o out1.sam SRR507778.1 0 Mito 20158 60 36M * 0 0 ANTATAATATTATCCCCACGAGGGCCACACATGTGT ?#?;<BBB?BBGGEGGGGDDE8EEDB:?<=?BB?B7 NM:i:1 MD:Z:1A34 AS:i:34 XS:i:0 SRR507778.2 16 XV 767197 60 36M * 0 0 ACCCATTTCATAATTAATAAGTGGTATATCAGGANG <?DGEDBHHHHHHHDHFGFHFGFGHG@@A=@;;7#: NM:i:1 MD:Z:34C1 AS:i:34 XS:i:0 SRR507778.3 0 II 569438 60 36M * 0 0 GNAAGAACCCTTATACTAACAGCAGCATAGGAAAGT ;#>45<4454GEDGGGGGGECEACBBBB=FGADAG8 NM:i:1 MD:Z:1C34 AS:i:34 XS:i:0
Lines starting with @
are metadata
They describe the conditions of the alignment
Line like this
@SQ SN:XI LN:666816 @SQ SN:IX LN:439888 @SQ SN:VII LN:1090940
describe the sequence
Line like this
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem yeast.fasta y1.fastq -o out1.sam
describe how the program was executed. It includes
Many lines like this:
SRR507778.1 0 Mito 20158 60 36M * 0 0 ANTATAATATTATCCCCACGAGGGCCACACATGTGT ?#?;<BBB?BBGGEGGGGDDE8EEDB:?<=?BB?B7 NM:i:1 MD:Z:1A34 AS:i:34 XS:i:0 SRR507778.2 16 XV 767197 60 36M * 0 0 ACCCATTTCATAATTAATAAGTGGTATATCAGGANG <?DGEDBHHHHHHHDHFGFHFGFGHG@@A=@;;7#: NM:i:1 MD:Z:34C1 AS:i:34 XS:i:0 SRR507778.3 0 II 569438 60 36M * 0 0 GNAAGAACCCTTATACTAACAGCAGCATAGGAAAGT ;#>45<4454GEDGGGGGGECEACBBBB=FGADAG8 NM:i:1 MD:Z:1C34 AS:i:34 XS:i:0
Sequence Alignment/Map (SAM) format is TAB-delimited.
1 QNAME Query template/pair NAME 2 FLAG bitwise FLAG 3 RNAME Reference sequence NAME 4 POS 1-based leftmost POSition/coordinate of clipped sequence 5 MAPQ MAPping Quality (Phred-scaled) 6 CIGAR extended CIGAR string 7 MRNM Mate Reference sequence NaMe (`=' if same as RNAME) 8 MPOS 1-based Mate POSistion 9 TLEN inferred Template LENgth (insert size) 10 SEQ query SEQuence on the same strand as the reference 11 QUAL query QUALity (ASCII-33 gives the Phred base quality) 12+ OPT variable OPTional fields in the format TAG:VTYPE:VALUE
Each bit in the FLAG field is defined as:
1 p the read is paired in sequencing 2 P the read is mapped in a proper pair 4 u the query sequence itself is unmapped 8 U the mate is unmapped 16 r strand of the query (1 for reverse) 32 R strand of the mate 64 1 the read is the first read in a pair 128 2 the read is the second read in a pair 256 s the alignment is not primary 512 f the read fails platform/vendor quality checks 1024 d the read is either a PCR or an optical duplicate 2048 S the alignment is supplementary
The FLAG column is a number
It is the sum of numbers on the previous table
For example
SRR507778.1 0 ... SRR507778.2 16 ....
The first read is not paired, and aligned in the Forward strand
The second read is not paired, and aligned in the Reverse strand
Most of the flags only make sense when we combine both read files
Let’s do that
bwa mem yeast.fasta y1.fastq y2.fastq > out2.sam
[M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 50000 sequences (1800000 bp)... [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (103, 4007, 13219, 91) [M::mem_pestat] analyzing insert size distribution for orientation FF... [M::mem_pestat] (25, 50, 75) percentile: (240, 2467, 3114) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 8862) [M::mem_pestat] mean and std.dev: (1893.02, 1414.42) [M::mem_pestat] low and high boundaries for proper pairs: (1, 11736) [M::mem_pestat] analyzing insert size distribution for orientation FR... [M::mem_pestat] (25, 50, 75) percentile: (243, 275, 314) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (101, 456) [M::mem_pestat] mean and std.dev: (278.34, 50.30) [M::mem_pestat] low and high boundaries for proper pairs: (30, 527) [M::mem_pestat] analyzing insert size distribution for orientation RF... [M::mem_pestat] (25, 50, 75) percentile: (3111, 3320, 3565) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (2203, 4473) [M::mem_pestat] mean and std.dev: (3352.46, 321.09) [M::mem_pestat] low and high boundaries for proper pairs: (1749, 4927) [M::mem_pestat] analyzing insert size distribution for orientation RR... [M::mem_pestat] (25, 50, 75) percentile: (291, 2611, 3188) [M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 8982) [M::mem_pestat] mean and std.dev: (2036.04, 1501.65) [M::mem_pestat] low and high boundaries for proper pairs: (1, 11879) [M::mem_pestat] skip orientation FF [M::mem_pestat] skip orientation RR [M::mem_process_seqs] Processed 50000 reads in 6.397 CPU sec, 6.396 real sec [main] Version: 0.7.17-r1188 [main] CMD: bwa mem yeast.fasta y1.fastq y2.fastq [main] Real time: 6.557 sec; CPU: 6.503 sec
SRR507778.1 99 Mito 20158 60 36M = 16941 -3183 ANTATAATATTATCCCCACGAGGGCCACACATGTGT ?#?;<BBB?BBGGEGGGGDDE8EEDB:?<=?BB?B7 NM:i:1 MD:Z:1A34 MC:Z:36M AS:i:34 XS:i:0 SRR507778.1 147 Mito 16941 60 36M = 20158 3183 TTCATAGTACCCAAATTTAATTTAAATAAAGTGAGA GFFCFDDG<DIIGHIIEEE>EBAGG@GBDDDBG??B NM:i:0 MD:Z:36 MC:Z:36M AS:i:36 XS:i:0 SRR507778.2 83 XV 767197 60 36M = 767020 -213 ACCCATTTCATAATTAATAAGTGGTATATCAGGANG <?DGEDBHHHHHHHDHFGFHFGFGHG@@A=@;;7#: NM:i:1 MD:Z:34C1 MC:Z:36M AS:i:34 XS:i:0 SRR507778.2 163 XV 767020 60 36M = 767197 213 CTAGTCATTTTTCTAACACCGAAATAACGGCAGAGC GIIIIG:IIIIIIIIIHIIIGG?GDEGDGGHIHBII NM:i:0 MD:Z:36 MC:Z:36M AS:i:36 XS:i:0
Each read has its “pair”
fragment ======================================== SE read ---------> PE reads R1---------> <---------R2 unknown gap ....................
The two FASTQ files need to have the same number of reads
The first read of one file matches the first read of the other file
The n-th read of one file matches the n-th read of the other file
Basically
Good ----> <---- bad <---- ----> bad ----> ---->