We have reads that are close related to a reference genome
For example, RNA messenger from the same individual
Or DNA from a new individual from the same species
Or DNA from a species on the same genus
October 19, 2018
We have reads that are close related to a reference genome
For example, RNA messenger from the same individual
Or DNA from a new individual from the same species
Or DNA from a species on the same genus
The answer depends on the case
There are several tool for the same goal
Two tools that are popular today are:
They have a similar philosophy
can you find others?
We will use bwa today. You can try with bowtie and the same data
Let’s connect to the server and change to our working directory using cd bioinfo
We start with the genome of S. cerevisiae, which is on the server at /home/anaraven/2018/bioinfo/Saccharomyces_cerevisiae.R64-1-1.dna_sm.toplevel.fa.gz
This is a compressed file, as we can realize by the .gz
extension
Since NGS produce huge amounts of data, the files are usually compressed
Many programs, such as bwa can read files either compressed or not
But other tools cannot, so we will decompress the file
That is bad because we use more time and memory
Check every case if you can avoid decompressing
The original file is in read-only mode, so we cannot replace it
Instead each one will have an uncompressed copy
Type this in one line (using [TAB] to save time)
zcat /home/anaraven/2018/bioinfo/Saccharomyces_ cerevisiae.R64-1-1.dna_sm.toplevel.fa.gz > yeast.fasta
The first way to verify is to take a look using
less yeast.fasta
Since the file is big, we cannot read it all. Instead we use the computer
grep '>' yeast.fasta
and
wc yeast.fasta
For today’s exercise we will use two files
/home/anaraven/2018/bioinfo/y1.fastq /home/anaraven/2018/bioinfo/y2.fastq
You can copy them to your folder
cp /home/anaraven/2018/bioinfo/y?.fastq .
Better: link them to your folder, so we use less memory
ln -s /home/anaraven/2018/bioinfo/y?.fastq .
Take a look at the first lines of each fastq file
You can see that the sequences are different, but they have the same ID. Why?
These are paired ends. There are two reads for each fragment.
As we discussed previously, these programs use an index to speedup the search
So the first step is
bwa index yeast.fasta
List the files in your folder and verify that there are new files. You can take a look at them
$ bwa index yeast.fasta [bwa_index] Pack FASTA... 0.18 sec [bwa_index] Construct BWT for the packed sequence... [bwa_index] 7.65 seconds elapse. [bwa_index] Update BWT... 0.08 sec [bwa_index] Pack forward-only FASTA... 0.06 sec [bwa_index] Construct SA from BWT and Occ... 2.26 sec [main] Version: 0.7.17-r1188 [main] CMD: bwa index yeast.fasta [main] Real time: 10.400 sec; CPU: 10.244 sec