I received some interesting answers to Homework 5. I want to share them with everybody, so everybody can learn.
Student 1
Student 1 wrote this answer to Question 1:
library(seqinr)
<- read.fasta("NC_000913.fna")
sequences
<- function(genome) {
genes_gc_content <- seq(from=1, to=length(genome)-window_size, by= window_size)
positions <-0
nA<-0
nG<-0
nC<-0
nT
for(i in 1:length(genome)) {
<- genome_gc_content(genome[[i]])
positions[i]if(genome[i]=="a") {
<- na+1
na else if(genome[i]=="t") {
}<- nt+1
nt else if(genome[i]=="g") {
}<- ng+1
ng else if(genome[i]=="c") {
}<- nc+1
nc
}
}<- (ng+nc)/(ng+nc+nt+na)
answerreturn(positions)
}
The first two lines are right, but they are not necessary. You can assume that the user of the function will do that work. Remember that there are two scenarios: first when you define the function, and second when someone else uses the function.
The line
positions <- seq(from=1, to=length(genome)-window_size, by= window_size)
is for Question 2 and is useless here.
Then we have the line
positions[i]<- genome_gc_content(genome[[i]])
. For one
side, if we use genome[[i]]
it means that
genome
is a list. But on the next lines we use
genome
as a vector. So, what is it?
Now, read again the text of the question. The function
must be called window_gc_content
, not
genes_gc_content
as here. Also, it takes three inputs:
sequence
, position
, and size
.
There is a detail in the if()
commands. Here we test
only for lower case letters, but as we saw on classes, sometimes they
are upper-case letters.
Student 2
Student 2 wrote something like this:
<- function(coli_seq){
gc_content<- sum(coli_seq=="a"|| "A")
A<- sum(coli_seq=="t"|| "T")
T<- sum(coli_seq=="g"|| "G")
G<- sum(coli_seq=="c"|| "C")
C<- (G+C)/(G+C+A+T)
ansreturn(ans)
}
The symbol ||
represents the logical operation
“or”.
If x
and y
are logical values
(that is, TRUE
or FALSE
), then the expression
x||y
will be TRUE
if x
is true or
if y
is true, or both. It will be false only when
x
and y
are false.
This does not work, because we are mixing apples and oranges. The
part coli_seq=="a"
is a comparison between a vector of
letters and a letter; the result is a logic vector. Then the part
"A"
is a single letter, not a vector, not a logic
value.
As a human I can understand that the student wants to say ‘if the sequence is “a” or “A”’, but the computer is dumb and does not understand. We need to say ‘if the sequence is “a” or the sequence is “A”’.
Use help(``||``)
to se the
manual Moreover, the operand ||
may not be the
correct one, since it returns a single value. Better use |
,
as indicated in the help
page.
Student 3
Now we have this:
<- function (genome, position, size){
window_gc_content <- seq(from=250000, length=100)
genome <- 250000
position <- 100
size <- 0
nA <- 0
nC <- 0
nG <- 0
nT for(i in 1:length(seq(from=250000, length=100))) {
if(genome[i]=="A" | genome[i]=="a"){
<- nA+1
nA else if(genome[i]=="C" | genome[i]=="c"){
} <- nC+1
nC else if(genome[i]=="G" | genome[i]=="g"){
} <- nG+1
nG else if(genome[i]=="T" | genome[i]=="t"){
} <- nT+1
nT
}
}<- (nG+nC)/(nA+nT+nG+nC)
answer
return(answer)
}
This time the name and inputs are correct. But then immediately —in lines 2, 3, and 4— we are destroying the input. That is a “no–no”. Do not do that, never. Input is sacred. Treat it with utmost respect.
We need to answer the following questions:
- What is
genome
? a list? a vector? a number? - What is
position
? a list? a vector? a number? - What is
size
? a list? a vector? a number? - Are we using these inputs in the code?
The result should change when any of the inputs changes. Otherwise they are not inputs.
Student 4
Here I don’t even know who answered, because the answer file is like this:
#'title: "Homework 5"
#'subtitle: "Computing in Molecular Biology and Genetics II"
#'author: "Put your name here"
#'number: STUDENT_NUMBER
#'date: "March 11, 2020"
Remember that this is the last year that we teach this content. Next year we will do something completely different. Maybe Python, maybe other language.
This person has no name, no student number, and lives in the past. If this was the exam, this person gets 0. Not a good idea if you want to pass the course. There is only one way to pass this course: get good grades.
Be careful with these details. Be professional. If you are not careful in the computer, how can we trust you with expensive reactives in the lab? How can we trust you with the analysis of critical data?
Student 4 did something good: recycling the function
gc_content()
that we created in the class. This function
takes only one input: a vector of letters, representing a DNA
sequence.
Then we have this proposed solution:
<- function(sequence, position, size) {
window_gc_content <- rep("a", 1000)
sequen <- genome[[1]][seq(from=position, length=size)]
sequen for(i in 1: length(sequen)) {
<- gc_content(sequen[i])
sequen [i] <- gc_content(sequne)
answer
}return(sequne)
}
In the second line we assign a value to sequen
, and then
in the next line we change it immediately. So we do not need the second
line. The variable sequen
is a vector of character. It
takes part of genome[[1]]
, using
seq(from=position, length=size)
as index. This is a very
good idea.
On the other side, there is no genome
among the inputs.
Where did genome
come from?
Line 5 is weird. We assign a number to sequen[i]
, which
was before a vector of characters. So we changed letters to numbers.
That is like putting apples in the egg box.
It is a good idea to use the gc_content()
function. It
takes a vector of character, but in line 5 we give it
sequen[i]
, which is a single character, not a vector.
What about answer
? We create this variable but we never
use it.
Student 5
At the last minute I received this:
<- function(sequence, position, size)
window_gc_content for(i in 1:length(genome)) {
if(genome[i]=="A"| genome[i]=="a") {
<- nA+1
nA else if (genome[i]=="C" | genome[i]=="c"){
} <- nC+1
nC else if (genome[i]=="G" | genome[i]=="g") {
} <- nG+1
nG else if (genome[i]=="T" | genome[i]== "t") {
} <- nT+1
nT
}
}<- (nG+nC)/(nA+nT+nG+nC)
answer
<- rep("a", 1000)
answer <- seq(from=sequences[[1]][250000], length=1000)
answer return(answer)
}
The first line is nice, but it does not have the {
symbol. The second one uses genome
, which is not an input
nor something we created. It appeared from nowhere… it is
magic. In other words, it is wrong, since magic does not
exist.
Again, the three inputs are never used in the code. They should be used.
Finally, answer
is assigned three times. The first time
it gets a number, which is what it should be. The second time we destroy
the good answer and we replace it with a vector 1000 letters
"a"
. Finally we destroy it again and we assign a vector of
numbers, that do not depend on the GC content.
Moreover, the question says that sequences
is a vector
of character, not a list. So we cannot use sequences[[1]]
,
which is only valid for lists.
Finally, the function seq()
can only take numeric
inputs. In other words, when we say seq(from=x, length=y)
,
then x
and y
must be numbers. But
sequences[[1]][250000]
is not a number.
By the way, the numbers 1000
and 250000
should not be in your answer. They are examples. Your
function should not depend on them. Always think on the general case,
not the particular one.
Conclusion
“I am a great believer in luck. The harder I work, the more of it
I seem to have.”
Some famous guy
- Be sure to understand the question. If you do not understand, ask in the forum. Explain what do you understand and what you do not understand.
- This is LEGO. Identify all the pieces and understand how they
connect.
- Write the name of each variable, its structure (vector, list, data frame, etc.) and its type (numeric, logic, character, etc.).
- You can only use the inputs and variables that you create. If you did not create it, do not change it.
- The output should change if the input change. Check that each input is used somewhere.
- Understand what you have and what you want to have. It is like a biochemistry process.
- Sometimes it is useful to work backwards. Start with what you want
to have, and decompose it in simpler terms. For example, to get the
GC-skew, you need to know
nG
andnC
. The you just need to findnG
andnC
. - Look at previous examples and recycle them, by using the old
functions inside the new function.
- If you cannot recycle, then you can adapt the old code for the new case.
- It is always wise to return to our very first class: How to Solve It.