The midterm exam has three mandatory questions and one optional one. All questions aim to evaluate Computational Thinking skills: decomposition, pattern recognition, abstraction, and algorithm design.
1. Turtle draws Snow
This question asks us to make a recursive function. Have we seen them before?
Yes! We saw examples of recursive functions, such as the factorial, the Fibonacci and the one drawing trees. We made tree drawings on Quiz 2 and also in classes.
We have seen that recursive functions are functions that call themselves. The factorial of N is N times the factorial of N-1. A tree (N levels) has a trunk and branches, and each branch is a smaller tree (with N-1 levels). We also saw that all recursive functions always have an if() to handle the exit condition, typically when N==1.
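As a reminder, here is a minimal sketch of the factorial written as a recursive R function (the name fact is just for illustration):

fact <- function(n) {
  if(n == 1) {            # exit condition
    return(1)
  }
  return(n * fact(n-1))   # the function calls itself with a smaller input
}
fact(5)                   # 5 * 4 * 3 * 2 * 1 = 120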
So we can start with a copy of the tree() function. Here it is:
tree <- function(n, length, angle) {
  turtle_lwd(n)
  turtle_forward(length)
  if (n > 1) {
    turtle_left(angle)
    tree(n-1, length*0.8, angle)
    turtle_right(2*angle)
    tree(n-1, length*0.8, angle)
    turtle_left(angle)
    tree(n-1, length*0.8, angle)
  }
  turtle_backward(length)
}
We have to change it to draw snow instead of trees. First we change the name from tree to snow. We also change the inputs: instead of n we use N, and instead of length we use L. There are no more inputs to snow().
snow <- function(N, L) {
  turtle_lwd(N)
  turtle_forward(L)
  if (N > 1) {
    turtle_left(angle)
    snow(N-1, L*0.8, angle)
    turtle_right(2*angle)
    snow(N-1, L*0.8, angle)
    turtle_left(angle)
    snow(N-1, L*0.8, angle)
  }
  turtle_backward(L)
}
This version does not work, because it uses angle, which is not defined. Also, the question says that, instead of L*0.8, each part must be of length L/3. The important parts of the question are:
The turtle draws the first part, turns left 60 degrees, draws the second part, turns 120 degrees to the right, draws another part, turns 60 degrees left, and draws the last part.
[…]
In general, snow level N of length L is made of four parts of snow level N-1 of length L/3.
So we need to insert a fourth snow(N-1) call (the tree only had three), and we know the angles:
snow <- function(N, L) {
  turtle_lwd(N)
  turtle_forward(L)
  if (N > 1) {
    snow(N-1, L/3)
    turtle_left(60)
    snow(N-1, L/3)
    turtle_right(120)
    snow(N-1, L/3)
    turtle_left(60)
    snow(N-1, L/3)
  }
  turtle_backward(L)
}
We can test this function and it works 😀 but the drawing is weird 🤪. The lines have different widths, and the snow is outside the box. We prefer all lines of the same width, so we have to delete the turtle_lwd(N) command.
The command snow(1, 80) is ok, but the other calls draw outside the box. So we need to do turtle_forward(L) only when N==1. The new version is
snow <- function(N, L) {
  if(N == 1) {
    turtle_forward(L)
  }
  if (N > 1) {
    snow(N-1, L/3)
    turtle_left(60)
    snow(N-1, L/3)
    turtle_right(120)
    snow(N-1, L/3)
    turtle_left(60)
    snow(N-1, L/3)
  }
  turtle_backward(L)
}
This one is also weird, but at least we get most of the snow inside the box. It looks like we are drawing and moving outside. What can be wrong?
The first if() is ok, because the question says “Snow level 1 is just a straight line of length L”. The second if() is also right. It does exactly what the question asks.
What about the turtle_backward(L)? The question does not say anything about “going backwards” or “returning to the initial condition”. In fact each snow has to start at the end of the previous snow. Let’s delete turtle_backward(L) and see what happens.
snow <- function(N, L) {
  if(N == 1) {
    turtle_forward(L)
  }
  if (N > 1) {
    snow(N-1, L/3)
    turtle_left(60)
    snow(N-1, L/3)
    turtle_right(120)
    snow(N-1, L/3)
    turtle_left(60)
    snow(N-1, L/3)
  }
}
Now it works. We can simplify it one last time using if() {} else {}.
snow <- function(N, L) {
  if(N == 1) {
    turtle_forward(L)
  } else {
    snow(N-1, L/3)
    turtle_left(60)
    snow(N-1, L/3)
    turtle_right(120)
    snow(N-1, L/3)
    turtle_left(60)
    snow(N-1, L/3)
  }
}
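To see it on screen we can call it like this, assuming the TurtleGraphics package, which provides turtle_init() and the other turtle_* functions used above (the sizes are just an example):

library(TurtleGraphics)            # assumed package for the turtle_* functions
turtle_init(width=200, height=200)
turtle_do({                        # turtle_do() batches the drawing updates
  snow(4, 150)                     # level 4 snow of length 150
})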
Ready! Let’s go to the next question.
2. Algorithm design
We have to write a function, and we got this starting point:
locate_half <- function(x) {
  # write your code here
}
Have we seen this before? Maybe it is like finding the minimum or the maximum. We have never programmed our own version of min(), but it should not be so hard, should it?
Wait. There are min() and which.min(). This question asks for an index, so it is more like which.min(). It is not about the half value itself, it is about the location of the half value.
The half value is the average of the minimum and the maximum. Let’s rewrite this phrase in R using the starting point:
locate_half <- function(x) {
  half_value <- (min(x) + max(x))/2
  return(half_value)
}
Is this enough? Let’s test:
x <- 1:9
locate_half(x)
## [1] 5
Good!
locate_half(x+20)
## [1] 25
Mmm…🤔 It does not look right. We got the half value, not the location of the half value. It is like doing min() instead of which.min() (it cannot be that easy: the teacher does not ask easy questions, and we are not lazy students).
Let’s read the question again. The question says:
The location of the half value of the vector x is the index of the first value that is equal or bigger than the half value of x.
Therefore we will need an index. Let’s say that the index is i. And we need to check each one of the components of x from the first to the last. So we need a for(){} loop using i as a counter.
locate_half <- function(x) {
  half_value <- (min(x) + max(x))/2
  for(i in 1:length(x)) {
    ...
    return(...)
  }
}
For each element of x pointed to by i we need to make a decision. We want to know if x[i] is greater than or equal to half_value. Every time we make a decision, there must be an if(){}. Let’s write that.
locate_half <- function(x) {
  half_value <- (min(x) + max(x))/2
  for(i in 1:length(x)) {
    if(x[i] >= half_value) {
      return(...)
    }
  }
}
Now, what shall we do if we find that x[i] >= half_value? In that case we have found the location of the half value. It is the index of x[i]; in other words, it is i. We have to return that number.
locate_half <- function(x) {
  half_value <- (min(x) + max(x))/2
  for(i in 1:length(x)) {
    if(x[i] >= half_value) {
      return(i)
    }
  }
}
And we test again
x <- 1:9
locate_half(x)
## [1] 5
locate_half(x + 20)
## [1] 5
locate_half(x * x)
## [1] 7
locate_half(sqrt(x))
## [1] 4
Good! We got the expected result. Can we make it better?
Well, the question says that x is increasing, so x[1]=min(x) and x[length(x)]=max(x). Using indices is faster than using functions (speed is not very important now, but if one day we do big data then speed is essential; the program has to finish before we get old), so we can rewrite our function as:
locate_half <- function(x) {
  half_value <- (x[1] + x[length(x)])/2
  for(i in 1:length(x)) {
    if(x[i] >= half_value) {
      return(i)
    }
  }
}
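As an aside, R’s vectorized functions let us write the same idea without an explicit loop. A sketch (locate_half_vec is a hypothetical name, not part of the question):

locate_half_vec <- function(x) {
  # which() gives all indices where the condition is TRUE; [1] keeps the first
  which(x >= (x[1] + x[length(x)]) / 2)[1]
}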
Let’s move to the next question.
3. Systems: Polymerase Chain Reaction
Simulate several PCR cycles
The PCR system is shown in the question as a diagram of circles, boxes, and arrows.
Have we seen a similar system before?
Yes! It is the same diagram we got for cell growth in Class 11 and in the blog document. This is pattern recognition again. A system depends only on how many circles connect to how many boxes with how many arrows. It does not depend on the names in the circles/boxes, or the way we draw them. This is abstraction again.
So we can start with the code for cell_culture() from the class.
cell_culture <- function(N, replication_rate, cells_init, food_ini) {
  # create empty vectors
  cells <- d_cells <- rep(NA, N)
  food <- d_food <- rep(NA, N)
  # initialize values
  cells[1] <- cells_init
  food[1] <- food_ini
  d_cells[1] <- d_food[1] <- 0
  for(i in 2:N) {
    # update `d_cells` and `d_food`
    d_cells[i] <- +replication_rate * cells[i-1] * food[i-1]
    d_food[i] <- -replication_rate * cells[i-1] * food[i-1]
    # update `cells` and `food` as a cumulative sum
    cells[i] <- cells[i-1] + d_cells[i]
    food[i] <- food[i-1] + d_food[i]
  }
  return(data.frame(cells, food))
}
We have to change the name of the function. We also change the names of the variables: from cells to dna, from food to primers, and from eating to cycle. We also change the names of the derived variables like d_cells. This is easy to do with the Search & Replace option in the editor. Also, replication_rate becomes cycle_rate, or just rate, as the question indicates. We got this:
pcr <- function(N, rate, dna_ini, primers_ini) {
  # create empty vectors
  dna <- d_dna <- rep(NA, N)
  primers <- d_primers <- rep(NA, N)
  # initialize values
  dna[1] <- dna_ini
  primers[1] <- primers_ini
  d_dna[1] <- d_primers[1] <- 0
  for(i in 2:N) {
    # update `d_dna` and `d_primers`
    d_dna[i] <- +rate * dna[i-1] * primers[i-1]
    d_primers[i] <- -rate * dna[i-1] * primers[i-1]
    # update `dna` and `primers` as a cumulative sum
    dna[i] <- dna[i-1] + d_dna[i]
    primers[i] <- primers[i-1] + d_primers[i]
  }
  return(data.frame(dna, primers))
}
Now we need to include the default values for rate and primers_ini.
pcr <- function(N, dna_ini, primers_ini=1e8, rate=1e-8) {
  # create empty vectors
  dna <- d_dna <- rep(NA, N)
  primers <- d_primers <- rep(NA, N)
  # initialize values
  dna[1] <- dna_ini
  primers[1] <- primers_ini
  d_dna[1] <- d_primers[1] <- 0
  for(i in 2:N) {
    # update `d_dna` and `d_primers`
    d_dna[i] <- +rate * dna[i-1] * primers[i-1]
    d_primers[i] <- -rate * dna[i-1] * primers[i-1]
    # update `dna` and `primers` as a cumulative sum
    dna[i] <- dna[i-1] + d_dna[i]
    primers[i] <- primers[i-1] + d_primers[i]
  }
  return(data.frame(dna, primers))
}
Now we test and see if we did it right.
pcr(N=6, dna_ini=1e6)
##        dna   primers
## 1  1000000 100000000
## 2  2000000  99000000
## 3  3980000  97020000
## 4  7841396  93158604
## 5 15146331  85853669
## 6 28150012  72849988
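A quick sanity check, if we want one: in this model dna + primers must stay constant on every row, because each new DNA strand consumes exactly one primer (res is just an illustrative name):

res <- pcr(N=6, dna_ini=1e6)
res$dna + res$primers    # always 1.01e+08: nothing is created or destroyed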
Yup! It’s done. And we only used Search & Replace and pattern recognition.
PCR depends on initial concentration
This question asks us to write code directly, not a function. It gives us some initial data:
initial_dna <- 10**(0:6)
initial_dna
## [1] 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 1e+06
For each of these values we need to simulate 30 PCR cycles and get the dna value. Let’s see if we can do just one first.
pcr(N=30, dna_ini=initial_dna[1])
## dna primers
## 1 1.000000e+00 100000000.0
## 2 2.000000e+00 99999999.0
## 3 4.000000e+00 99999997.0
## 4 8.000000e+00 99999993.0
## 5 1.600000e+01 99999985.0
## 6 3.200000e+01 99999969.0
## 7 6.399998e+01 99999937.0
## 8 1.279999e+02 99999873.0
## 9 2.559997e+02 99999745.0
## 10 5.119987e+02 99999489.0
## 11 1.023995e+03 99998977.0
## 12 2.047979e+03 99997953.0
## 13 4.095916e+03 99995905.1
## 14 8.191665e+03 99991809.3
## 15 1.638266e+04 99983618.3
## 16 3.276263e+04 99967238.4
## 17 6.551454e+04 99934486.5
## 18 1.309861e+05 99869014.9
## 19 2.618007e+05 99738200.3
## 20 5.229161e+05 99477084.9
## 21 1.043098e+06 98956903.3
## 22 2.075315e+06 97924686.1
## 23 4.107561e+06 95892440.5
## 24 8.046401e+06 91953600.4
## 25 1.544536e+07 84554645.4
## 26 2.850512e+07 71494879.8
## 27 4.888482e+07 51115177.6
## 28 7.387239e+07 26127613.3
## 29 9.317348e+07 6826521.5
## 30 9.953399e+07 466013.9
Ok, we got something. Maybe too much, since we got dna and primers. We only want dna. Since the result is a data frame, we can use any kind of index to get the first column, such as $dna or [[1]] or [,"dna"]. Data frames can be used in many different ways, as we learned last semester (did we learn that? At least it was in the lectures).
Let’s try again
pcr(N=30, dna_ini=initial_dna[1])$dna
## [1] 1.000000e+00 2.000000e+00 4.000000e+00 8.000000e+00 1.600000e+01
## [6] 3.200000e+01 6.399998e+01 1.279999e+02 2.559997e+02 5.119987e+02
## [11] 1.023995e+03 2.047979e+03 4.095916e+03 8.191665e+03 1.638266e+04
## [16] 3.276263e+04 6.551454e+04 1.309861e+05 2.618007e+05 5.229161e+05
## [21] 1.043098e+06 2.075315e+06 4.107561e+06 8.046401e+06 1.544536e+07
## [26] 2.850512e+07 4.888482e+07 7.387239e+07 9.317348e+07 9.953399e+07
Now we got a vector. We need to do it seven times. We can do it one by one.
conc <- data.frame(V1=pcr(N=30, dna_ini=initial_dna[1])$dna,
                   V2=pcr(N=30, dna_ini=initial_dna[2])$dna,
                   V3=pcr(N=30, dna_ini=initial_dna[3])$dna,
                   V4=pcr(N=30, dna_ini=initial_dna[4])$dna,
                   V5=pcr(N=30, dna_ini=initial_dna[5])$dna,
                   V6=pcr(N=30, dna_ini=initial_dna[6])$dna,
                   V7=pcr(N=30, dna_ini=initial_dna[7])$dna)
It is not very elegant, but it works. Let’s test it.
plot(x=c(1,30), y=c(0, max(conc)), type="n", xlab="PCR cycle",
ylab="DNA Concentration", main="Effect of initial DNA on PCR")
for(i in 1:ncol(conc)) {
points(conc[[i]], pch=i, type="b")
}
We are ready. We can move forward.
But we can do better. If we still have time, we can improve this code to be more generic. We would like to have a program that works for any number of initial conditions. We have to go step by step.
First we need an “empty” data frame. We cannot make a completely empty data frame, but we can start with a single column with an empty vector. Each simulation has 30 steps, so we need a vector of size 30.
conc <- data.frame(V1=rep(NA, 30))
Now we need a loop for each element of initial_dna:
for(i in 1:length(initial_dna)) {
  ...
}
Inside the loop we have to fill each of the columns of conc. Can we do that, even if we have only one column? Yes, since data frames can grow. Last semester we saw that it was quite easy to add new columns to a data frame. We just use an index and assign a vector. If the indexed column does not exist in the data frame, it is created automagically. Let’s do it.
for(i in 1:length(initial_dna)) {
  conc[[i]] <- pcr(N=30, dna_ini=initial_dna[i])$dna
}
Let’s see if this works:
plot(x=c(1,30), y=c(0, max(conc)), type="n", xlab="PCR cycle",
ylab="DNA Concentration", main="Effect of initial DNA on PCR")
for(i in 1:ncol(conc)) {
points(conc[[i]], pch=i, type="b")
}
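By the way, the loop that fills conc can also be written with sapply(), which applies a function to each element of initial_dna and binds the results as columns. A sketch (conc2 is a hypothetical name):

# one column per initial DNA value; sapply() returns a 30 x 7 matrix here
conc2 <- as.data.frame(sapply(initial_dna,
                              function(d) pcr(N=30, dna_ini=d)$dna))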
These curves can be measured in real PCR experiments. In some cases we can measure the optical density (OD) through the PCR cycles. In other cases we can use fluorescence. Both methods give us a signal that is proportional to the DNA concentration. If the real curves match our model, then our model is useful.
Now we can go home.
4. Bonus: Quantitative PCR
Bonus! This is optional, but it gives more points and more knowledge. Also, it makes us prouder of our accomplishments. So we do not go home yet.
Here again we have to write code, not a function. We have to make a vector called CT, which contains the location of the half value of each column of the data frame conc. We find that location with the function locate_half() which we already created. And we have to do it for each column of conc. Like in the previous question, we can do it one by one:
CT <- c(locate_half(conc[[1]]),
        locate_half(conc[[2]]),
        locate_half(conc[[3]]),
        locate_half(conc[[4]]),
        locate_half(conc[[5]]),
        locate_half(conc[[6]]),
        locate_half(conc[[7]]))
Hey, doing it one by one is boring 😒 and it makes
me feel like I’m working for the computer 😖, like in a dystopian movie.
Let’s make the computer work for us. If we have to do the same thing
more than two times, we write a for(){}
loop to do it for
us. Life is short, let’s leave the boring stuff for the computers.
We use the same pattern we have done many times. We create an empty vector of the correct size (we can even be wrong about the size of the vector: if we make a small vector it will grow as needed later, but it is more efficient to do it right from the beginning) and then we use a loop.
CT <- rep(NA, ncol(conc))
for(i in 1:ncol(conc)) {
  CT[i] <- locate_half(conc[[i]])
}
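Since a data frame is a list of columns, sapply() can do the same job in one line. A sketch:

# apply locate_half() to every column of conc; returns a numeric vector
CT <- sapply(conc, locate_half)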
Then we draw the plot and we see the points in a “straight” line.
plot(initial_dna ~ CT, log="y")
This result is important because it means that there is a
relationship between the initial DNA concentration and the number of
cycles required to cross a fixed threshold. The name CT
means cycle threshold.
If we know this relationship, we can determine with high precision what was the initial DNA concentration. That is the idea of Quantitative PCR.
Let’s find that relationship. We remember that straight lines can be modeled with linear models (so linear models are part of molecular biology? Yes, very much. Good that we learned them last semester). The name says it all. The plot uses a log scale on the vertical coordinate, so the linear model also needs a logarithmic scale. We can use any logarithm, since they all do the trick. In this case we use log10 since it is easier to understand.
model <- lm(log10(initial_dna) ~ CT)
model
## 
## Call:
## lm(formula = log10(initial_dna) ~ CT)
## 
## Coefficients:
## (Intercept)           CT
##      8.3241      -0.3006
Since the intercept is 8.3241, the concentration corresponding to CT=0 is \(10^{8.3241} = 2.1091137 \times 10^{8}\). With so much initial DNA the threshold is crossed before we start the experiment.
The slope coefficient for CT is -0.3, meaning that when CT increases by one, the initial DNA concentration decreases \(10^{-0.3} = 0.5011872\) times. That is about half. In other words, double the DNA means one less cycle. This means that the PCR reaction duplicates the number of DNA molecules on each cycle. That happens only when we have 100% efficiency.
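We can verify this directly from the fitted model with coef(), which returns the intercept and slope. A quick sketch:

b <- coef(model)   # named vector: "(Intercept)" and "CT"
10^b[["CT"]]       # per-cycle factor, close to 0.5: half the DNA per extra cycle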
Let’s make a table:
ans <- data.frame(CT = seq(from=0, to=40, by=2))
ans$initial_DNA <- 10^predict(model, newdata = ans)
knitr::kable(ans, format.args=list(digits=2))
| CT | initial_DNA |
|---:|------------:|
|  0 |     2.1e+08 |
|  2 |     5.3e+07 |
|  4 |     1.3e+07 |
|  6 |     3.3e+06 |
|  8 |     8.3e+05 |
| 10 |     2.1e+05 |
| 12 |     5.2e+04 |
| 14 |     1.3e+04 |
| 16 |     3.3e+03 |
| 18 |     8.2e+02 |
| 20 |     2.1e+02 |
| 22 |     5.2e+01 |
| 24 |     1.3e+01 |
| 26 |     3.2e+00 |
| 28 |     8.1e-01 |
| 30 |     2.0e-01 |
| 32 |     5.1e-02 |
| 34 |     1.3e-02 |
| 36 |     3.2e-03 |
| 38 |     8.0e-04 |
| 40 |     2.0e-04 |
This means that, if we can do 40 PCR cycles, we would be able to detect DNA at concentrations as low as 0.0002, which is about 2 molecules in 10 liters. Amazing.
Now, what is the margin of error for this measurement? That is matter for another day.