Question 1 of the midterm exam dealt with codon usage. The first three parts are identical to those of the final exam, so we will not repeat the discussion here.
1.4 Absolute to relative frequencies
Write a function called count_to_frequency() that takes a single vector with integer numbers (such as the number of times each codon appears), and returns a new vector of the same size with the relative frequencies. In other words, each value is divided by the total.
The input is the vector number_of_codons. The total is sum(number_of_codons). Therefore the answer is just this:
count_to_frequency <- function(number_of_codons) {
  number_of_codons/sum(number_of_codons)
}
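As a quick check, here is a minimal sketch that applies the function to a small made-up vector. The codon names and counts in toy_count are invented for illustration; they are not the exam data.

```r
count_to_frequency <- function(number_of_codons) {
  number_of_codons/sum(number_of_codons)
}

# hypothetical counts for three codons (not real data)
toy_count <- c(AAA = 2, AAC = 3, AAG = 5)
toy_frequency <- count_to_frequency(toy_count)
toy_frequency

# the relative frequencies should add up to 1
sum(toy_frequency)
```

The names of the input vector are kept in the result, so each frequency can still be matched to its codon.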
1.5 Apply count_to_frequency()
Calculate the relative frequencies of codon usage for each gene, and in total. Create a vector called cell_codon_frequency using the function count_to_frequency() on total_codon_count. Then create a list called gene_codon_frequency. Each element of the list contains the result of using count_to_frequency() applied to each element of genes_codon_count.
The first part is trivial. We just “Create a vector called cell_codon_frequency using the function count_to_frequency() on total_codon_count.”
Translate English to R.
cell_codon_frequency <- count_to_frequency(total_codon_count)
The second part is easy, and follows a pattern we have seen before.
We create an empty list using the function list(). Notice that, unlike vectors, it is not easy to make a list of a predetermined size. But this is not important, since the list grows automatically.
gene_codon_frequency <- list()
for(i in 1:length(genes_codon_count)) {
  gene_codon_frequency[[i]] <- count_to_frequency(genes_codon_count[[i]])
}
You can also use the advanced function lapply(). This allows us to solve this kind of question in a single line.
gene_codon_frequency <- lapply(genes_codon_count, count_to_frequency)
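These two versions give the same values, but the second is shorter and usually faster. One difference: lapply() keeps the names of genes_codon_count, while the for() loop produces a list without names.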
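To compare both approaches side by side, here is a small sketch on invented data. The list toy_genes_count and its gene names are hypothetical; only count_to_frequency() comes from the answer above.

```r
count_to_frequency <- function(number_of_codons) {
  number_of_codons/sum(number_of_codons)
}

# hypothetical data: two genes, three codons each
toy_genes_count <- list(geneA = c(1, 1, 2), geneB = c(3, 0, 1))

# version with for()
freq_loop <- list()
for(i in 1:length(toy_genes_count)) {
  freq_loop[[i]] <- count_to_frequency(toy_genes_count[[i]])
}

# version with lapply()
freq_lapply <- lapply(toy_genes_count, count_to_frequency)

# the values are the same once the names are removed
all.equal(freq_loop, unname(freq_lapply))

# the loop result has no names; the lapply() result keeps them
names(freq_loop)
names(freq_lapply)
```

If the names matter later, the loop version needs an extra step such as names(freq_loop) <- names(toy_genes_count).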
1.6 Absolute distance
Write a function called abs_distance(a, b) that takes two vectors and returns the sum of the absolute values of each a[i] minus b[i]. The result is a single non-negative number.
There are several possible solutions. The first one uses an auxiliary variable to accumulate the sum:
abs_distance <- function(a, b) {
  add <- 0
  for(i in 1:length(a))
    add <- add + abs(a[i]-b[i])
  return(add)
}
For some people it is easier to think about building a vector with the absolute differences, and then adding all its elements:
abs_distance <- function(a, b) {
  ans <- rep(NA, length(a))
  for(i in 1:length(a)) {
    ans[i] <- abs(a[i]-b[i])
  }
  return(sum(ans))
}
If you remember that vectors can be combined with arithmetic operations, you can rewrite the last solution without using for(), like this:
abs_distance <- function(a, b) {
  ans <- abs(a-b)
  return(sum(ans))
}
In this version the ans vector is built in one step. It is faster and shorter. If you want it even shorter, skip ans and go directly:
abs_distance <- function(a, b) {
  sum(abs(a-b))
}
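A short sketch can confirm that the loop version and the vectorized version agree. The names abs_distance_loop and abs_distance_vec, and the two toy vectors, are made up for this comparison.

```r
# loop version, accumulating the sum in `add`
abs_distance_loop <- function(a, b) {
  add <- 0
  for(i in 1:length(a))
    add <- add + abs(a[i]-b[i])
  return(add)
}

# vectorized version
abs_distance_vec <- function(a, b) {
  sum(abs(a-b))
}

# hypothetical frequency vectors of the same length
toy_a <- c(0.5, 0.25, 0.25)
toy_b <- c(0.25, 0.25, 0.5)

abs_distance_loop(toy_a, toy_b)
abs_distance_vec(toy_a, toy_b)

# the distance from a vector to itself is zero
abs_distance_vec(toy_a, toy_a)
```

Both calls on toy_a and toy_b add 0.25 + 0 + 0.25, so they return the same number.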
An example of a wrong answer:
## WRONG CODE
abs_distance <- function(a, b) {
  # what is `length(a, b)`?
  for(i in 1:length(a, b)) {
    add <- abs(a[i]-b[i])
    # the variable `add` gets only one value
    # it is updated on every loop
    # at the end it gets only the last `abs(a[i]-b[i])`
    # the rest is forgotten
  }
  return(sum(add))
  # this sum is only adding one number
  # the rest is forgotten
}
It is interesting to think about length(a, b). In R, length() accepts only one argument, so this call produces an error. Probably the student was thinking that we should consider the length of both vectors. The question is “how to combine them?”. If we do length(c(a, b)) we get a number that is too big: length(a) + length(b). This is bigger than the length of either vector.
It is safer to consider max(length(a), length(b)), but we get into trouble again if one of the vectors is longer: indexing the shorter vector past its end gives NA, and then the sum is NA too. In fact the distance only makes sense if both vectors have the same length. This is always the case in this exam.
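Outside the exam, one way to protect against vectors of different lengths is to check the lengths first. The following is only a sketch of that idea; the function name abs_distance_checked is hypothetical and was not part of the expected answer.

```r
# hypothetical variant of abs_distance() that refuses
# to work with vectors of different lengths
abs_distance_checked <- function(a, b) {
  # stop with an error if the lengths differ
  stopifnot(length(a) == length(b))
  sum(abs(a-b))
}

# same length: works as before
abs_distance_checked(c(1, 2, 3), c(3, 2, 1))

# different lengths: we get an error instead of a misleading number
result <- try(abs_distance_checked(c(1, 2, 3), c(1, 2)), silent = TRUE)
inherits(result, "try-error")
```

Without the check, the vectorized version would recycle the shorter vector, which gives a number that looks valid but means nothing.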
1.7 Calculate all distances
Calculate the distances between every vector in gene_codon_frequency and the vector cell_codon_frequency, and store them in a vector called distance. The result contains one entry for each gene.
This is again a typical pattern, in which the same function is applied to each element of a list, and we assign the result to a vector.
distance <- rep(NA, length(gene_codon_frequency))
for(i in 1:length(gene_codon_frequency))
  distance[i] <- abs_distance(gene_codon_frequency[[i]],
                              cell_codon_frequency)
As we discussed earlier, this can be done in one line with the function sapply(). You should really learn it.
distance <- sapply(gene_codon_frequency, abs_distance, cell_codon_frequency)
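To see how the extra argument travels through sapply(), here is a minimal sketch on invented data. The objects toy_gene_frequency and toy_cell_frequency are hypothetical stand-ins for gene_codon_frequency and cell_codon_frequency.

```r
abs_distance <- function(a, b) {
  sum(abs(a-b))
}

# hypothetical frequencies: two genes, three codons
toy_gene_frequency <- list(geneA = c(0.25, 0.25, 0.5),
                           geneB = c(1, 0, 0))
toy_cell_frequency <- c(0.5, 0.125, 0.375)

# each element of the list becomes `a`;
# the third argument of sapply() is passed on as `b`
toy_distance <- sapply(toy_gene_frequency, abs_distance, toy_cell_frequency)
toy_distance
```

Since the list has names, the resulting vector is named too, which is handy when looking for the gene behind each distance.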
1.8 Find the most different gene.
Write the code to find the name of the gene which has the greatest value in distance.
The greatest value in distance is found using
max(distance)
but we do not want that. We want the position of the greatest value
which.max(distance)
and then we use that position as an index for the vector names(genes)
names(genes)[which.max(distance)]
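The difference between the value and its position is easy to see on a tiny example. The vectors toy_distance and toy_names below are invented; toy_names plays the role of names(genes).

```r
# hypothetical distances and gene names
toy_distance <- c(0.2, 0.9, 0.4)
toy_names <- c("geneA", "geneB", "geneC")

# the greatest value
max(toy_distance)

# the position of the greatest value
which.max(toy_distance)

# the name at that position
toy_names[which.max(toy_distance)]
```

When several elements share the greatest value, which.max() returns only the first position.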
1.9 (Bonus) Find the 6 genes most different from cell_codon_frequency.
To get the most different genes, you need to sort the distance vector
sort(distance, decreasing = TRUE)
but this will give you all the genes, and we want only the first 6. We can use an index
sort(distance, decreasing = TRUE)[1:6]
or we can use the function head(), which (who would guess?) gives us the first 6 elements by default
head(sort(distance, decreasing = TRUE))
The problem is that we get the values, not the names. One solution is to assign names to the distance vector, and then look at the names of the top genes.
names(distance) <- names(genes)
names(head(sort(distance, decreasing = TRUE)))
Another possibility is to replace sort() by order(), which is a more general way to solve these questions:
names(genes)[head(order(distance, decreasing = TRUE))]
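Both approaches can be compared on a small invented vector. In this sketch toy_distance is hypothetical, and we ask for the top 2 instead of the top 6 because the example has only four genes.

```r
# hypothetical named distances
toy_distance <- c(geneA = 0.2, geneB = 0.9, geneC = 0.4, geneD = 0.7)

# sort() returns the values in order, carrying the names along
top_by_sort <- names(head(sort(toy_distance, decreasing = TRUE), 2))

# order() returns the positions, which can index any other vector
top_positions <- head(order(toy_distance, decreasing = TRUE), 2)
top_by_order <- names(toy_distance)[top_positions]

top_by_sort
top_by_order
```

The order() version is more general because the positions can be used to index any vector with the same arrangement, not only the one that was sorted.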