Class 5

March 8th, 2016

Welcome

to “Computing for Molecular Biology 2”

Plan for Today

Global vs. local genome statistics
- GC content
- GC skew
What are they useful for
How to calculate them
- function
- lapply/sapply
- cumsum

What do we know?

Some organisms have been sequenced
We can read the sequences from FASTA files
read.fasta returns a list of vectors of text
We can create functions to do things with the sequence
Functions are black boxes with inputs and outputs
For example, we can calculate statistics of the sequence

Why do we do statistics on the sequence?

Statistics is a way to tell a story that makes sense of the data

In genomics, we look for biological sense

That story can be about the complete genome

for example, the average GC content

or can be about some region of the genome

for example, the GC skew on regions of size 10K
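The difference between a global and a local statistic can be sketched on a toy sequence. The sequence below is invented for the example, and it is written in lowercase on the assumption that this is how the letters arrive from read.fasta.

```r
# Toy sequence in the same shape we get from read.fasta:
# a vector with one letter per element (the letters are made up)
s1 <- c("a", "g", "c", "t", "t", "g", "g", "c", "a", "t", "g", "g")

# Global statistic: one number for the complete sequence
is.gc <- s1 == "g" | s1 == "c"
global.gc <- mean(is.gc)          # 7 of the 12 letters are g or c

# Local statistic: the same calculation on a window of 4 letters
window <- s1[5:8]                 # "t" "g" "g" "c"
local.gc <- mean(window == "g" | window == "c")   # 3 of 4, so 0.75
```

The global value hides the fact that some windows are richer in G and C than others.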

What is the story told by GC skew?

Remember Chargaff's second rule

For each DNA strand we have \(\%A \approx \%T\) and \(\%G \approx \%C\)

This is presumably because substitution rates are the same on both strands

Hence, the second parity rule holds only when mutation and selection have no strand bias

But the ratio of G over C is not uniform over the genome

Why?

GC skew and genome replication

GC skew changes sign at the boundaries of the two replichores

This corresponds to DNA replication origin or terminus

The replication origin is usually called ori.
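To make "changes sign" concrete, here is a commonly used definition of GC skew. It is an assumption of this note, since the slides do not fix a formula. \(G\) and \(C\) are the counts of each base in one window of the sequence:

\[
\text{GC skew} = \frac{G - C}{G + C}
\]

With this definition the value lies between \(-1\) and \(1\). It is positive when the window has more G than C, negative when it has more C than G, and zero when both counts are equal.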

How DNA replicates

How to calculate values on regions

All our calculations are done using functions

What should be

the output?
the input(s)?
the name?

Defining a GC skew function

Remember that an R function is defined like this

name <- function(input1, input2, ...) {
    Calculate
    return(output)
}
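As a concrete instance of the template, here is a small function with one input and one output. The name gc.content and the parameter name sequence are choices made for this sketch only.

```r
# gc.content is a hypothetical name chosen for this example
gc.content <- function(sequence) {
    # Calculate: count the letters that are "g" or "c"
    n.gc <- sum(sequence == "g" | sequence == "c")
    output <- n.gc / length(sequence)
    return(output)
}

gc.content(c("a", "g", "c", "t"))   # 2 of the 4 letters are g or c: 0.5
```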

The inputs

The GC skew result should depend on:

the genomic sequence
the start of the window
the length of the window

These are the parameters of the function.
Do they have

a name?
a default value?

The output

How do we transform the input parameters into the output value?

We can use any R function available, such as

seq(from, to, by, length.out)
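A few calls show what seq produces. Note that in R the last argument is spelled length.out, and that we normally give either by or length.out, not both. The numbers used here are arbitrary.

```r
# Fixed step: from 1 to 10, jumping 3 each time
a <- seq(from = 1, to = 10, by = 3)          # 1 4 7 10

# Fixed number of values: 5 values evenly spread between 0 and 1
b <- seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# Start positions of windows of 10000 letters on a sequence of 50000 letters
starts <- seq(from = 1, to = 50000, by = 10000)   # 1 10001 20001 30001 40001
```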

Task 1: write a gc.skew function
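If you get stuck, compare your answer with the sketch below. It is only one possible version, not the official one. It assumes that skew means (G - C) / (G + C), that the sequence is a vector of lowercase letters, and the parameter names and default values are choices made for this sketch.

```r
# One possible gc.skew, assuming skew = (G - C) / (G + C)
gc.skew <- function(sequence, start = 1, len = 10000) {
    # Do not read past the end of the sequence
    end <- min(start + len - 1, length(sequence))
    window <- sequence[start:end]
    n.g <- sum(window == "g")
    n.c <- sum(window == "c")
    if (n.g + n.c == 0) {
        return(NA)   # skew is undefined when the window has no G and no C
    }
    return((n.g - n.c) / (n.g + n.c))
}

# Toy sequence invented for the example
toy <- c("g", "g", "g", "c", "a", "t", "c", "c")
gc.skew(toy, start = 1, len = 4)   # 3 G and 1 C: (3 - 1) / (3 + 1) = 0.5
gc.skew(toy, start = 5, len = 4)   # 0 G and 2 C: (0 - 2) / (0 + 2) = -1
```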

Applying the function to many values

We want to evaluate gc.skew on different positions of the genome

pos <- seq(from=1, to=length(s[[1]]), by=10000)

How do we apply gc.skew to each element of pos?

A side note: simplify notation

Since we are working with a single sequence we can do

s1 <- s[[1]]

It is shorter and less error-prone

The `sapply` function

sapply(X, FUN, ...)

sapply returns a vector of the same length as X, each element of which is the result of applying FUN to the corresponding element of X

X: a vector or list
FUN: the function to be applied to each element of X
…: optional arguments to FUN
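The optional arguments are the part that is easiest to get wrong, so here is a small sketch. The function count.g is hypothetical and exists only for this illustration. Each element of X is given to the first argument of FUN that we did not name ourselves.

```r
# count.g is a hypothetical helper, used only to illustrate sapply
count.g <- function(sequence, start, len) {
    window <- sequence[start:(start + len - 1)]
    return(sum(window == "g"))
}

# Toy sequence invented for the example
toy <- c("g", "g", "a", "c", "g", "t", "t", "t", "g", "g", "g", "a")
pos <- seq(from = 1, to = length(toy), by = 4)   # 1 5 9

# sequence and len are the optional arguments passed on to count.g.
# Since they are named, each element of pos is used as start.
counts <- sapply(pos, count.g, sequence = toy, len = 4)
counts   # 2 1 3
```

Naming the extra arguments lets us keep the sequence as the first parameter of the function and still iterate over the start positions.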

Put it in your toolbox

Example of sapply

s1[1:8]

[1] "a" "g" "c" "t" "t" "t" "t" "c"

sapply(s1[1:8], DNA.to.RNA)

  a   g   c   t   t   t   t   c 
"a" "g" "c" "u" "u" "u" "u" "c"

Can you guess the DNA.to.RNA function?

In summary: the inputs are a vector and a function; the output is the result of applying the function to each element of the input vector

My Version of `DNA.to.RNA`

DNA.to.RNA <- function(base) {
  if(base=="t") {
    return("u")
  } else {
    return(base)
  }
}

Please describe how it works
Why do we use indentation?
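For comparison, the same transformation can be done without sapply, because some R functions already work on every element of a vector at once. This is an alternative to the version above, not a replacement for it. The variable names are chosen for this sketch.

```r
dna <- c("a", "g", "c", "t", "t", "t", "t", "c")

# ifelse() tests every element at once:
# where the letter is "t" take "u", otherwise keep the letter
rna1 <- ifelse(dna == "t", "u", dna)

# chartr() translates characters: every "t" becomes "u"
rna2 <- chartr("t", "u", dna)

rna1   # "a" "g" "c" "u" "u" "u" "u" "c"
```

The if inside DNA.to.RNA looks at a single letter, which is why that version needs sapply to handle a complete sequence.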

Homework

There are other apply functions. Describe them

lapply
mapply
tapply
apply
replicate
mclapply

Task 2:

Write an HTML document (using Rmarkdown) describing the location of the replication origin (ori).

You can use the same function three times with different parameters to draw the GC skew of E. coli for windows of length 1K, 10K and 100K.

How would you validate this result experimentally?
How do we find the exact position of the replication origin?
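One idea to explore for the last question uses cumsum, which is in today's plan. The sketch below builds a running total of G minus C along a toy sequence. The sequence is invented, and the idea that the lowest point of this running total lies near ori is an assumption you should check against the literature and against your own plots.

```r
# Toy sequence invented for this example: C-rich first, G-rich afterwards
toy <- c("c", "c", "a", "c", "t", "g", "g", "a", "g", "g")

# +1 for each G, -1 for each C, 0 for A and T
step <- (toy == "g") - (toy == "c")

# Running total of G minus C along the sequence
cum.skew <- cumsum(step)
cum.skew             # -1 -2 -2 -3 -3 -2 -1 -1  0  1

# First position where the running total is lowest
which.min(cum.skew)  # 4
```

Unlike the window version, this gives one value per letter, so the turning point can be located without choosing a window size.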

Homework

For next week

No winter school this time

What is the genetic code? How was it discovered?
Given the FASTA of a prokaryotic coding gene, how can you get the sequence of the protein?
Write a function to transform the sequence of a gene into the corresponding protein
How can we combine a FASTA file and a GFF file to get the gene sequences?
How can we read a GFF file in R?
What is an ORF? What is a CDS?

Long term

We need a summary of the previous classes, including the 1st semester

If you write the summary you will learn more
And I will learn what I did wrong
Choose one class and summarize it in Rmarkdown

Make it ~~easy~~ right!
