- Computational Thinking
- Decomposition
- Pattern Recognition
- Abstraction
- Algorithm Design
- Not about memory
- About Problem solving
April 7, 2020
for(){}
, while(){}
if(){}
, if(){} else {}
plot()
, segments()
read.fasta()
if(){}
has two parts: condition and commandFASTA files show only one strand
We call it the forward strand
The other strand is not shown, but can be calculated
We call it the reverse strand
According to Watson & Crick, the reverse strand is complementary
a
by t
and vice-versac
by g
and vice-versaMoreover, the reverse strand is reversed
reverse_complement <- function(dna) { for(i in 1:length(dna)) { j <- length(dna) - i + 1 if(dna[i]=='a') { answer[j] <- 't' } else if(dna[i]=='c') { answer[j] <- 'g' } else if(dna[i]=='g') { answer[j] <- 'c' } else if(dna[i]=='t') { answer[j] <- 'a' } } return(answer) }
Warning
There is something wrong here.
We cannot use answer[i]
if the vector answer
does not exist
It has to be created before
We can create answer
as a vector of NA
values
answer <- rep(NA, size)
We have the new vector, but we don’t know yet the values
We can create answer
with this command
answer <- rep(NA, size)
For this, we need to know the size
of the new vector
In this case it must be the same size as the dna
vector
(because opposite strands have the same length)
reverse_complement <- function(dna) { answer <- rep(NA, length(dna)) for(i in 1:length(dna)) { j <- length(dna) - i + 1 if(dna[j]=='a') { answer[i] <- 't' } else if(dna[j]=='c') { answer[i] <- 'g' } else if(dna[j]=='g') { answer[i] <- 'c' } else if(dna[j]=='t') { answer[i] <- 'a' } } return(answer) }
This is the correct version
dna <-c('t','a','a','c','g','t') dna
[1] "t" "a" "a" "c" "g" "t"
reverse_complement(dna)
[1] "a" "c" "g" "t" "t" "a"
Will GC–content change?
Will GC–skew change?
Local statistics look only through a small window
A Window is a part of the genome, from an initial position, and with a fixed size
For example, we can analyze a region of 500 bp starting at position 3000 of the genome
The function seq()
makes numeric sequences
window <- dna[seq(from=3000, length.out=500)]
In the general case, we can write
window <- dna[seq(from=start, length.out=size)]
seq()
functionseq(from, to, by, length.out)
All inputs are optional
Choose carefully which ones to use
seq()
seq(from=5, to=14)
[1] 5 6 7 8 9 10 11 12 13 14
seq(from=5, length.out=10)
[1] 5 6 7 8 9 10 11 12 13 14
seq(from=5, to=30, by=3)
[1] 5 8 11 14 17 20 23 26 29
gc_skew()
on different parts of the genomeWe traverse the genome with several windows of fixed size
Each window starts after the end of the previous one
In other words, the windows do not overlap
We start at the first DNA position, until the last
Be careful to not step out of DNA
win_pos <- seq(from=1, to=length(dna)-size, by=size)
Now we apply gc_skew()
to each element of win_pos
The result depends on window’s size
size=1E5
size=1E4
size=1E3
size=1E2
The relation between %G and %C is not constant over all the genome
In some genome regions there are more G than C
We can evaluate this using the GC skew \[\frac{G-C}{G+C}\]
There is a richness of G over C and T over A in the leading strand
(and vice versa for the lagging strand)
GC skew changes sign at the boundaries of the two replichores
This corresponds to DNA replication origin or terminus
Marie Touchon, Eduardo P.C. Rocha, (2008), From GC skews to wavelets: A gentle guide to the analysis of compositional asymmetries in genomic data,
Biochimie, Volume 90, Issue 4,
The reason for this asymmetry is the DNA replication process
The origin of replication is in one of the sign changing places
We want to write a program so we can find Ori on many different genomes
We can answer questions like:
The values can change a lot between a window and the next one
skew
data is noisy. A lot of small random changes
Instead of looking directly at skew
, we can look at accumulative skew
\[\text{acc_skew}[i]=\text{skew}[i]+\text{skew}[i-1]+\cdots+\text{skew}[1]\]
Now, instead of the change of sign, we look for the minimum and maximum (Why?)
Everyday you spend some money
Some days you receive money
How much money you have each day?
The income can be positive (salary) or negative (expenses)
The balance on first day is your initial money
The balance of today is the balance of yesterday plus the income of today
balance[today] <- balance[yesterday] + income[today]
Since they have different values every day, we represent the income on day i
by income[i]
We also represent the balance on day i
by balance[i]
The vector income
has the values for all days
The element income[i]
has the value for a single day
We will see several “days”, represented by i
It is useful to think that
i
represents “today”i-1
represents “yesterday”The balance on first day is your initial money
balance[1] <- income[1]
The balance of today is the balance of yesterday plus the income of today
for(i in 2:length(income)) { balance[i] <- balance[i-1] + income[i] }
The GC skew in each window is like the income in each day
And we want to see the “balance”: accumulative skew
acc_skew[1] <- skew[1] for(i in 2:length(skew)) { acc_skew[i] <- acc_skew[i-1] + skew[i] }
Same pattern, same solution
Can you write this as a function?
max
& min
show the change of signThe maximum value of acc_skew
is
max(acc_skew)
[1] 4.9877
It is not really important
We are looking for position, not value of maximum
This is the index of the maximum in the acc_skew
vector
which.max(acc_skew)
[1] 155
This is always an integer number, between 1 and length(acc_skew)
It is not the position in the genome
Since acc_skew
vector has the same size as win_pos
, we can find the Ori position with this
win_pos[which.max(acc_skew)]
[1] 1540001
win_pos
contains the genome positions of each window
This code gives us the position of the window with maximum GC-skew
Ori is located in the interval between
win_pos[which.max(acc_skew)]
win_pos[which.max(acc_skew)] + size