The second exam question is a nice one. It is not too hard, not too easy, and is very practical. As you can read in the references given in the exam, this is what real molecular biologists use to identify horizontal transference. World-class molecular biology is 50% computing.
2.1 Load all genes
Download the FASTA file containing all the genes of your bacteria and read them into the list
genes
.
This was asked also in the midterm, and it is very easy. Everybody can have a different filename. The filename is not important.
library(seqinr)
<- read.fasta("NC_000913.ffn") genes
There are two typical errors in this question. Some people forgot to
say library(seqinr)
so the code does not work. On the other
hand, other people included install.packages("seqinr")
.
That is wrong. Installing packages is never part of the answer, since
you must install in only once, and the code is executed several
times.
Use just these two lines.
2.2 Count codons in each gene
Create a list called
genes_codon_count
, containing one vector for each gene in the bacterial genome. Each vector is the result of usingcodon_count()
on each gene.
This is also from the midterm exam. Very easy. With a
for()
loop we apply the function codon_count()
to each element of the list genes
, and we store the result
in the list genes_codon_count
.
<- list()
genes_codon_count for(i in 1:length(genes)) {
<- codon_count(genes[[i]])
genes_codon_count[[i]] }
It is important to use double brackets to work with list elements.
Also, keep in mind that the input of the codon_count()
function can only be a vector, not a list.
There is a difference between vectors and lists when we create them.
In most cases we can know the vector size before we know its content, so
we can use rep(NA, size)
. But we cannot do that for lists.
It is not easy to prepare a list with a fixed size. So we create
genes_codon_count
using
genes_codon_count <- list()
and do not worry about its
size. We then just add elements, using the double brackets, like in
genes_codon_count[[i]] <- codon_count(genes[[i]])
.
Some clever students used the advanced function
lapply()
. This allows us to solve this kind of questions in
one single line.
<- lapply(genes, codon_count) genes_codon_count
It is worth learning about lapply()
and similar
functions. This is even faster if you work in a server, since
lapply()
can be done in multiple processors at the same
time. That allows us to run complex code very fast.
2.3 Count codons in all genes
Create the vector
total_codon_count
containing the total number of times that each codon appears on all genes.
Many people got confused in this question. That is a good way to measure if the student understands or not. If you only study, you may not understand. Studying is not enough. You need to learn.
The direct way to solve this question is to just add all the values, one by one. That was explained in the blog. Apparently people get confused when one variable gets updated. You should not. Variables are variable, that is, they change. Otherwise they will be called constants. Look at the blog and at this code.
<- rep(0, 64)
total_codon_count names(total_codon_count) <- valid_codons
<- total_codon_count + genes_codon_count[[1]]
total_codon_count <- total_codon_count + genes_codon_count[[2]]
total_codon_count <- total_codon_count + genes_codon_count[[3]]
total_codon_count <- total_codon_count + genes_codon_count[[4]]
total_codon_count <- total_codon_count + genes_codon_count[[5]]
total_codon_count <- total_codon_count + genes_codon_count[[6]]
total_codon_count <- total_codon_count + genes_codon_count[[7]]
total_codon_count <- total_codon_count + genes_codon_count[[8]]
total_codon_count <- total_codon_count + genes_codon_count[[9]]
total_codon_count <- total_codon_count + genes_codon_count[[10]]
total_codon_count
…<- total_codon_count +
total_codon_count length(genes_codon_count)]] genes_codon_count[[
Obviously this is not practical. It is too complicated, it is easy to make mistakes, and you really do not know how many genes are in the organism you will study. Notice that your code should work for all organisms, not only for the one you got assigned.
So, if you understand this question, you realize that you must use a
for(){}
loop.
<- rep(0, 64)
total_codon_count names(total_codon_count) <- valid_codons
for(i in 1:length(genes_codon_count)) {
<- total_codon_count + genes_codon_count[[i]]
total_codon_count }
That is the solution. It is very easy, since you can —and should— add vectors to vectors. If you do not feel comfortable adding vectors, you can do like this:
<- rep(0, 64)
total_codon_count names(total_codon_count) <- valid_codons
for(i in 1:length(genes_codon_count)) {
for(j in valid_codons) {
<- total_codon_count[j] + genes_codon_count[[i]][j]
total_codon_count[j]
} }
But really, get over it. Add vectors to vectors.
2.4 Compute the relative adaptiveness of each codon
We need to calculate the relative adaptiveness of each codon, using the values in the
total_codon_count
vector.You can calculate the adaptiveness of every codon, and store only these codons it in the vector
codon_adaptness
, by doing the following steps:
- For each amino-acid AA:
- Find the codons that encode AA it and store them in a vector called
codons_for_aa
.- Find the maximum of
total_codon_count
among the codons incodons_for_aa
. Look only at these codons, ignore the rest. Store this maximum inmax_count_for_aa
.- For each codon in
codons_for_aa
:
- Find the value of
total_codon_count
for that particular codon.- Divide it by
max_count_for_aa
.- Calculate the logarithm of the result of the last division, and store it in
codon_adaptness
using the codon as the index.- Repeat for each codon in
codons_for_aa
.- Repeat for each amino acid.
Besides this description, this code was provided:
<- list(Ala = c("gca", "gcc", "gcg", "gct"),
genetic_code Arg = c("aga", "agg", "cga", "cgc", "cgg", "cgt"),
Asn = c("aac", "aat"), Asp = c("gac", "gat"),
Cys = c("tgc", "tgt"), Gln = c("caa", "cag"),
Glu = c("gaa", "gag"), Gly = c("gga", "ggc", "ggg", "ggt"),
His = c("cac", "cat"), Ile = c("ata", "atc", "att"),
Leu = c("cta", "ctc", "ctg", "ctt", "tta", "ttg"),
Lys = c("aaa", "aag"), Phe = c("ttc", "ttt"),
Pro = c("cca", "ccc", "ccg", "cct"),
Ser = c("agc", "agt", "tca", "tcc", "tcg", "tct"),
Thr = c("aca", "acc", "acg", "act"),
Tyr = c("tac", "tat"), Val = c("gta", "gtc", "gtg", "gtt"))
<- rep(0, 64)
codon_adaptness names(codon_adaptness) <- valid_codons
To get the solution, we translate the instructions from English to R language.
# For each amino-acid _AA_:
for(aa in 1:length(genetic_code)) {
# Find the codons that encode _AA_ it and store them in a vector called `codons_for_aa`.
<- genetic_code[[aa]]
codons_for_aa # Find the maximum of `total_codon_count` _among the codons in_ `codons_for_aa`.
# Look only at these codons, ignore the rest.
# Store this maximum in `max_count_for_aa`.
<- max(total_codon_count[codons_for_aa])
max_count_for_aa # For each codon in `codons_for_aa`:
for(j in 1:length(codons_for_aa)) {
<- codons_for_aa[j]
codon # Find the value of `total_codon_count` for that particular codon.
# Divide it by `max_count_for_aa`.
# Calculate the logarithm of the result of the last division,
# and store it in `codon_adaptness` using the codon as the index.
<- log(total_codon_count[codon]/max_count_for_aa)
codon_adaptness[codon] # Repeat for each codon in `codons_for_aa`.
}# Repeat for each amino acid.
}
Most of the problems that students show in their answers are related to two things: confusing lists and vectors, and remembering the different ways to use indices.
The goal of this course is to learn to use a computer, the same way you learn to use a bike or you learn to walk and talk. If you really learn, you never forget. It is not about memory, it is about understanding, which you show when you use your tools in a new way.
For example, to go through all amino-acids, we can use a
for()
loop with aa in 1:length(genetic_code)
or we can use aa in names(genetic_code)
. In the first case
the variable aa
takes numeric values like
1, 2, …, 20
. In the second case the variable
aa
takes text values like
"Ala", "Arg", "Asn", …, "Val"
. Both can be used as indices
for the list genetic_code
:
- Positive numbers can always be used as indices
- Text can be used as indices, if the list elements have names. The
elements of
genetic_code
have names: the amino-acid names.
This is a second valid solution, using text vectors instead of numeric vectors
for(aa in names(genetic_code)) {
<- genetic_code[[aa]]
codons_for_aa <- max(total_codon_count[codons_for_aa])
max_count_for_aa for(codon in codons_for_aa) {
<- log(total_codon_count[codon]/max_count_for_aa)
codon_adaptness[codon]
} }
You can even read it as if it was written in English language. This
answer can be shortened if we remember that we can use a vector as an
index. This allows us to get rid of the internal for()
loop:
for(aa in names(genetic_code)) {
<- genetic_code[[aa]]
codons_for_aa <- max(total_codon_count[codons_for_aa])
max_count_for_aa <- log(total_codon_count[codons_for_aa]/max_count_for_aa)
codon_adaptness[codons_for_aa] }
Finally, the code can be even shorter if the main for()
loop goes directly on the list genetic_code
. Remember that
a for()
loop can go over each element of a vector or a
list.
for(codons_for_aa in genetic_code) {
<- total_codon_count[codons_for_aa]
counts_for_aa <- log(counts_for_aa/max(counts_for_aa))
codon_adaptness[codons_for_aa] }
That is three lines of code.
The main problem in this question was misunderstanding what was
requested. Several people combined all codons in one big vector, using
something like unlist(genetic_code)
. This does not work,
because we need to calculate the most used codon of each
amino-acid. If we mix all the codons then you cannot tell which
codons correspond to each amino-acid.
It seems like some people believe that the question is “tricky”, as if the professor is intentionally making things harder, and you have to undo it. That is a wrong idea. In reality the professor gives you all the parts that you need to solve the problem easily. The hardest parts are already solved for you. You only need to do the easy part.
Remember that CAI is a real tool used for scientific research. You
can read about it in the paper mentioned in the question, or you can
search the internet for it. Moreover, there is a cai()
function in R. The math in the exam question is a little more easy than
the real CAI, just to make a nicer exam.
2.5 Evaluate CAI for one gene
Write a function, called
gene_cai()
, that takes a vectornumber_of_codons
and returns the weighted average ofcodon_adaptness
for each codon.In other words, you must multiply each component of
number_of_codons
with the corresponding component ofcodon_adaptness
, sum all the results, and then divide by the sum of allnumber_of_codons
.
<- function(number_of_codons, codon_adaptness) {
gene_cai # we need a place to store the sum of all results
<- 0
ans # for each component of `number_of_codons`
for(i in 1:length(number_of_codons)) {
# multiply each component of `number_of_codons` with the corresponding
# component of `codon_adaptness` and sum all the results
<- ans + codon_adaptness[i]*number_of_codons[i]
ans
}# then divide by the sum of all `number_of_codons`
return(ans/sum(number_of_codons))
}
Some clever students remembered that you can multiply directly two
vectors, and sum them, so the for()
loop can be
avoided.
<- function(number_of_codons, codon_adaptness) {
gene_cai <- sum(number_of_codons * codon_adaptness)
ans return(ans/sum(number_of_codons))
}
and then we can realize that ans
is only used once, so
we can get rid of it.
<- function(number_of_codons, codon_adaptness) {
gene_cai return(sum(codon_adaptness * number_of_codons)/sum(number_of_codons))
}
That is two lines of code.
Some curious answers took the long way to calculate the sum of
number_of_codons
. For example some students wrote something
like
<- 0
acc for (i in length(number_of_codons)) {
<- number_of_codons[i] + acc_skew
acc_skew }
This is a correct answer, but unnecessarily complicated. Just say
sum(number_of_codons)
. Another unnecessarily complicated
code is
<- rep(0, 64)
m <- 0
add for(i in 1:64) {
<- number_of_codons[i]*codon_adaptness[i]
m[i] <- m[i] + add
add <- add/acc
ans }
There is no need for the m
vector. And the line
ans <- add/acc
does not need to be inside the
for()
loop, since we only care about it after we
added all the m[i]
. A short code, with the same idea, would
be
<- 0
add for(i in 1:64) {
<- number_of_codons[i]*codon_adaptness[i] + add
add
}<- add/acc ans
This is very similar to the first solution shown above.
2.6 Calculate CAI for all genes.
Calculate the Codon Adaptation Index for every gene and store these values in a vector called
CAI
.
This is similar to question 2.3. The only difference is that the result is a vector, not a list.
{r echo=FALSE} # prepare a vector to receive the answer CAI <- rep(NA, length(genes_codon_count)) for(i in 1:length(genes_codon_count)) { CAI[i] <- gene_cai(genes_codon_count[[i]], codon_adaptness) }
Like in question 2.3, this can be written in a single line using the
function sapply
, as in the following code:
<- sapply(genes_codon_count, codon_adaptness) CAI
The function sapply()
works exactly like
lapply()
, except that it returns a vector instead of a
list. It is worth learning about sapply()
,
lapply()
and similar functions.
2.7 Find the gene with the smallest CAI
Write the code to find the index of the smallest value in the vector
CAI
. Assign that index in the variableindex_of_smallest_CAI
.
Here the only important thing is to use which.min()
and
not min()
. We are not loonking for the smallest
value. Instead we are looking for the index that
points to the location of the smallest value.
<- which.min(CAI) index_of_smallest_CAI