March 15, 2019
for(){}
, while(){}
if(){}
, if(){} else {}
fac(N) = N * fac(N-1)
plot()
, segments()
read.fasta()
FASTA files show only one strand
We call it the forward strand
The other strand is not shown, but can be calculated
We call it the reverse strand
According to Watson & Crick, the reverse strand is complementary
a
by t
and vice-versac
by g
and vice-versaMoreover, the reverse strand is reversed
opposite_strand <- function(dna) { for(i in 1:length(dna)) { j <- length(dna) - i + 1 if(dna[j]=='a') { ans[i] <- 't' } else if(dna[j]=='c') { ans[i] <- 'g' } else if(dna[j]=='g') { ans[i] <- 'c' } else if(dna[j]=='t') { ans[i] <- 'a' } } return(ans) }
Warning
There is something wrong here
opposite_strand(dna)
Error in opposite_strand(dna): object 'dna' not found
We can use the debugger to see what is happening
We cannot use ans[i]
if the vector ans
does not exist
It has to be created before
We can create ans
as a vector of NA
values
ans <- rep(NA, size)
We have the new vector, but we don’t know yet the values
We can create ans
with this command
ans <- rep(NA, size)
For this, we need to know the size
of the new vector
In this case it must be the same size as the dna
vector
(because opposite strands have the same length)
opposite_strand <- function(dna) { ans <- rep(NA, length(dna)) for(i in 1:length(dna)) { j <- length(dna) - i + 1 if(dna[j]=='a') { ans[i] <- 't' } else if(dna[j]=='c') { ans[i] <- 'g' } else if(dna[j]=='g') { ans[i] <- 'c' } else if(dna[j]=='t') { ans[i] <- 'a' } } return(ans) }
This is the correct version
dna <-c('t','a','a','c','g','t') dna
[1] "t" "a" "a" "c" "g" "t"
opposite_strand(dna)
[1] "a" "c" "g" "t" "t" "a"
Will GC content change?
Will GC skew change?
DNA from any cell of all organisms has a 1:1 ratio of pyrimidine and purine bases
The amount of guanine is equal to cytosine and the amount of adenine is equal to thymine.
Chargaff E, Lipshitz R, Green C (1952). Composition of the deoxypentose nucleic acids of four genera of sea-urchin. J Biol Chem 195 (1): 155–160
Rudner, R; Karkas, JD; Chargaff, E (1968). Separation of B. Subtilis DNA into complementary strands. 3. Direct analysis. PNAS. 60 (3): 921–2.
A double-stranded DNA molecule globally has
Why this rule is always valid?
A single-stranded DNA molecule globally has
This is valid globaly for each one of the two DNA strands
If we look into small local parts of the genome then this may not be true
There is a richness of G over C and T over A in the leading strand
(and vice versa for the lagging strand)
GC skew changes sign at the boundaries of the two replichores
This corresponds to DNA replication origin or terminus
Marie Touchon, Eduardo P.C. Rocha, (2008), From GC skews to wavelets: A gentle guide to the analysis of compositional asymmetries in genomic data,
Biochimie, Volume 90, Issue 4,
The relation between %G and %C is not constant over all the genome
In some parts of the genome there are more G than C
We can evaluate this using the GC skew \[\frac{G-C}{G+C}\]
nG <- sum(dna=="g") nC <- sum(dna=="c") ans <- (nG-nC)/(nG+nC)
What happens if nG+nC==0
?
When we examine a small part of the genome, it can happen that there are no G or C. Then, the previous program fails
This version will work better
nG <- sum(dna=="g") nC <- sum(dna=="c") if( (nG+nC) == 0) { ans <- 0 } else { ans <- (nG-nC)/(nG+nC) }
Local statistics look only through a small window
A Window is a part of the genome, from an initial position, and with a fixed size
For example, we can analyze a region of 500 bp starting at position 3000 of the genome
position 3000, size 500
window <- dna[3000:3499]
In the general case, we can write
window <- dna[start:(start+size-1)]
Remember the function seq()
for numeric sequences
window <- dna[seq(from=3000, length.out=500)]
In the general case, we can write
window <- dna[seq(from=start, length.out=size)]
seq()
functionseq(from, to, by, length.out)
All inputs are optional
Choose carefully which ones to use
seq()
seq(from=5, to=14)
[1] 5 6 7 8 9 10 11 12 13 14
seq(from=5, length.out=10)
[1] 5 6 7 8 9 10 11 12 13 14
seq(from=5, to=30, by=3)
[1] 5 8 11 14 17 20 23 26 29
gc_skew <- function(dna, start, size) { window <- dna[seq(from=start, length.out=size)] nG <- sum(window=="g") nC <- sum(window=="c") if( (nG+nC) == 0) { return(0) } else { return( (nG-nC)/(nG+nC) ) } }
gc_skew <- function(dna, start, size) { window <- dna[seq(from=start, length.out=size)] nG <- sum(window=="g" | window=="G") nC <- sum(window=="c" | window=="C") if( (nG+nC) == 0) { return(0) } else { return( (nG-nC)/(nG+nC) ) } }
gc_skew()
on different parts of the genomeWe traverse the genome with several windows of fixed size
Each window starts after the end of the previous one
In other words, the windows do not overlap
We start at the first DNA position, until the last
Be careful to not step out of DNA
win_pos <- seq(from=1, to=length(dna)-size, by=size)
Now we apply gc_skew()
to each element of win_pos
For example, for windows of size 100000
size <- 1E5 win_pos <- seq(from=1, to=length(dna)-size, by=size) skew <- rep(NA, length(win_pos)) for(i in 1:length(win_pos)) { skew[i] <- gc_skew(dna, win_pos[i], size) }
The result depends on window’s size
size=1E5
size=1E4
size=1E3
size=1E2
The reason for this asymmetry is the DNA replication process
The origin of replication is in one of the sign changing places
We want to write a program so we can find Ori on many different genomes
We can answer questions like:
The values can change a lot between a window and the next one
skew
data is noisy. A lot of small random changes
Instead of looking directly at skew
, we can look at cumulative skew
\[\text{acc_skew}[i]=\sum_{j=1}^i \text{skew}[j]\]
Now, instead of the change of sign, we look for the minimum and maximum (Why?)
Everyday you spend some money
Some days you receive money
How much money you have each day?
The income can be positive (salary) or negative (expenses)
The balance on first day is your initial money
The balance of today is the balance of yesterday plus the income of today
balance[today] <- balance[yesterday] + income[today]
Since they have different values every day, we represent the income on day i
by income[i]
We also represent the balance on day i
by balance[i]
The vector income
has the values for all days
The element income[i]
has the value for a single day
We will see several “days”, represented by i
It is useful to think that
i
represents “today”i-1
represents “yesterday”The balance on first day is your initial money
balance[1] <- income[1]
The balance of today is the balance of yesterday plus the income of today
for(i in 2:length(income)) { balance[i] <- balance[i-1] + income[i] }
The GC skew in each window is like the income in each day
And we want to see the “balance”: cumulative skew
acc_skew[1] <- skew[1] for(i in 2:length(skew)) { acc_skew[i] <- acc_skew[i-1] + skew[i] }
Same pattern, same solution
How can we calculate acc_skew
?
It is easy to see that
acc_skew[i]=acc_skew[i-1] + skew[i]
So we can write something like this:
acc_skew[1] <- skew[1] for(i in 2:length(skew)) { acc_skew[i] <- acc_skew[i-1] + skew[i] }
Can you write this as a function?
max
& min
show the change of signThe maximum value of acc_skew
is
max(acc_skew)
[1] 4.9877
It is not really important
We are looking for position, not value of maximum
This is the index of the maximum in the acc_skew
vector
which.max(acc_skew)
[1] 155
This is always an integer number, between 1 and length(acc_skew)
It is not the position in the genome
Since acc_skew
vector has the same size as win_pos
, we can find the Ori position with this
win_pos[which.max(acc_skew)]
[1] 1540001
win_pos
contains the genome positions of each window
This code gives us the position of the window with maximum GC-skew
Ori is located in the interval between
win_pos[which.max(acc_skew)]
win_pos[which.max(acc_skew)] + size
First plot is GC-skew versus Position
Second plot is AT-skew versus Position
What is the relationship between GC-skew and AT-skew?
How can you show it?
These ideas are useful to understand the reality
A System is a set of parts that interact
Each part can be a smaller system
The behavior of the system depends on the parts AND on the interactions
Complex systems can have emergent properties
That is, the system can do things that the parts cannot
The system is different than each of the parts
In a Petri dish we have one cell
Every hour, 10% of the cells replicate, and they never die
How many cells we have after i
hours?
There is only one element: the cells
There are two values important to them:
The vector ncells
is the number of cells
The value ncells[i]
is the number of cells on time i
The vector growth
is the growth of the number of cells
There are growth[i]
new cells between time i-1
and i
Let’s simulate for 100 hours
First we create empty vectors to store the result
ncells <- rep(NA, 100) growth <- rep(NA, 100)
Then we start with 1 cell/cm2 and zero growth
ncells[1] <- 1 growth[1] <- 0
growth[i] <- ncells[i-1]*0.1
The growth on time i
depends on the cells existing on time i-1
ncells
will be the cumulative sum of growth
But growth
depends on ncells
Therefore, we calculate both together
How many cells we have after 100 hours?
ncells[1] <- 1 growth[1] <- 0 for(i in 2:100) { growth[i] <- ncells[i-1]*0.1 ncells[i] <- ncells[i-1] + growth[i] } ncells[100]
[1] 12527.83
In practice cells never grow that much
There is a limited amount of food
So we need to include food in the system
We represent it with the vector food
Let’s say that initially food[1] <- 20
[ng/uL]
More cells eat more food, limited by the eat_rate
:food_change[i] <- ncells[i-1]*food[i-1]*eat_rate
We assume that when a cell gets food, it divides:growth[i] <- ncells[i-1]*food[i-1]*eat_rate
What happens now?
Make a program to simulate this system
What happens when food finish?
Is that realistic?
ncells <- growth <- rep(NA, 100) food <- food_change <- rep(NA, 100) ncells[1] <- 1 growth[1] <- 0 food[1] <- 20 food_change[1] <- 0 eat_rate <- 0.003 for(i in 2:100) { growth[i] <- ncells[i-1]*food[i-1]*eat_rate food_change[i] <- -ncells[i-1]*food[i-1]*eat_rate ncells[i] <- ncells[i-1] + growth[i] food[i] <- food[i-1] + food_change[i] }
ncells <- growth <- rep(NA, N) food <- food_change <- rep(NA, N) ncells[1] <- initial_cells growth[1] <- 0 food[1] <- initial_food food_change[1] <- 0 for(i in 2:N) { growth[i] <- ncells[i-1]*food[i-1]*eat_rate food_change[i] <- -ncells[i-1]*food[i-1]*eat_rate ncells[i] <- ncells[i-1] + growth[i] food[i] <- food[i-1] + food_change[i] }