- Some organisms have been sequenced
- We can read the sequences from FASTA files
read.fasta
returns a list of vectors of characters- We can create functions to do things with the sequence
- For example we do statistics of the sequence
March 11, 2020
read.fasta
returns a list of vectors of charactersStatistics is a way to tell a story that makes sense of the data
In genomics, we look for biological sense
That story can be global: about the complete genome
Or can be local: about some region of the genome
We will start with global properties
The percentage of nitrogenous bases on a DNA molecule that are either guanine or cytosine.
For example, measuring the melting temperature of the DNA double helix using spectrophotometry
If the DNA has been sequenced then the GC-content can be accurately calculated by simple arithmetic.
GC-content percentage is calculated as \[\frac{G+C}{A+T+G+C}\]
Write the code for this function:
gc_content <- function(genome) { # count letters return(answer) }
Then use it to find the GC content of E.coli
Is the GC content uniform through all genome?
Do all genes have the same GC content?
What function do we need to answer this question?
What is the name, the inputs, and the output?
Write the code for this function:
genes_gc_content <- function(genes) { # calculate GC content of each gene return(answer) }
What is the data structure of answer
?
.fna
: FASTA nucleotide. Full genome.ffn
: FASTA nucleotide with features, such as genes.faa
: FASTA amino-acids. Proteins or peptidesCalculate the GC content for only part of the genome
Instead of all the genome, we only look through a window
The result should depend on:
The new function looks like this
window_gc_content <- function(sequence, position, size) { # code here return(answer) }
What should be the code?
window_gc_content()
in many placesWe want to evaluate window_gc_content
on different positions of the genome
positions <- seq(from=1, to=length(genome), by= window_size)
Now we apply window_gc_content
to each element of positions
?
for(i in 1:length(positions)) { gc[i] <- window_gc_content(genome, positions[i], window_size) }
What is the size of \(G-C\) respect to the total \(G+C\)?
Calculate the ratio \[\frac{G-C}{G+C}\] for each gene and plot it
What do you see?
There is a richness of G over C and T over A in the leading strand
There is a richness of C over G and A over T in the lagging strand
GC skew changes sign at the boundaries of the two replichores
This corresponds to DNA replication origin or terminus
GC skew changes sign at the boundaries of the two replichores
This corresponds to DNA replication origin or terminus
The replication origin is usually called ori.
A
, turtle moves leftT
, turtle moves rightG
, turtle moves upC
, turtle moves downlibrary(TurtleGraphics) turtle_init(mode = "clip") turtle_hide() for(i in 1:1000) { if(dna[i]=="A") { turtle_setangle(90) } else if(dna[i]=="T") { turtle_setangle(270) } else if(dna[i]=="G") { turtle_setangle(0) } else if(dna[i]=="C") { turtle_setangle(180) } turtle_forward(1) }
This program takes a lot of time (it is a turtle, after all)
This is the drawing of Candidatus Carsonella ruddii genome
It is the bacteria with the smallest genome
How can you answer these questions?
If you want to learn more about the science, read:
Roten C-AH, Gamba P, Barblan J-L, Karamata D. Comparative Genometrics (CG): a database dedicated to biometric comparisons of whole genomes. Nucleic Acids Research. 2002;30(1):142-144