Blog of Andrés Aravena
CMB2:

Comments on Midterm Exam

05 April 2018

The midterm exam has three mandatory questions and one optional. All questions point to evaluate Computational Thinking skills: decomposition, pattern recognition, abstraction and algorithm design

1. Turtle draws Snow

This question asks us to make a recursive function. Have we seen them before?

Yes! We saw examples of recursive functions, such as the factorial, the Fibonacci and the one drawing trees. We made tree drawings on Quiz 2 and also in classes.

We have seen that recursive functions are functions that call themselves. The factorial of N is N times the factorial of N-1. A tree (N levels) has trunk and branches, and each branch is a smaller tree (with N-1 levels). We also saw that all recursive functions always have an if() to handle the exit condition, typically when N==1.

So we can start with a copy of the tree() function. Here it is:

tree <- function(n, length, angle) {
    turtle_lwd(n)
    turtle_forward(length)
    if (n > 1) {
        turtle_left(angle)
        tree(n-1,length*0.8, angle)
        turtle_right(2*angle)
        tree(n-1, length*0.8, angle)
        turtle_left(angle)
        tree(n-1, length*0.8, angle)
    }
    turtle_backward(length)
}

We have to change it to draw snow instead of trees. First we change the name from tree to snow. We also change the inputs. Instead of n we use N and instead of length we use L. There are no more inputs to snow().

snow <- function(N, L) {
    turtle_lwd(N)
    turtle_forward(L)
    if (N > 1) {
        turtle_left(angle)
        snow(N-1,L*0.8, angle)
        turtle_right(2*angle)
        snow(N-1, L*0.8, angle)
        turtle_left(angle)
        snow(N-1, L*0.8, angle)
    }
    turtle_backward(L)
}

This version does not work, because it uses angle which is not defined. Also, the question says that, instead of L*0.8, each part must be of length L/3. The important parts of the question are:

The turtle draws the first part, turns left 60 degrees, draws the second part, turns 120 degrees to the right, draws another part, turns 60 degrees left, and draws the last part.
[…]
In general snow level N of length L is made of four parts of snow level N-1 of length L/3.

So we need to insert one snow level N-1 and we know the angles:

Output of the first version of snow().
snow <- function(N, L) {
    turtle_lwd(N)
    turtle_forward(L)
    if (N > 1) {
        snow(N-1, L/3)
        turtle_left(60)
        snow(N-1, L/3)
        turtle_right(120)
        snow(N-1, L/3)
        turtle_left(60)
        snow(N-1, L/3)
    }
    turtle_backward(L)
}

We can test this function and it works 😀 but the drawing is weird 🤪. The lines have different width, and the snow is outside the box. We prefer all lines of the same length, so we have to delete the turtle_lwd(N) command.

The command snow(1, 80) is ok, but the others are outside. So we need to do turtle_forward(L) only when N==1. The new version is

Output of second version of snow().
snow <- function(N, L) {
    if(N==1) {
    turtle_forward(L)
    }
    if (N > 1) {
        snow(N-1, L/3)
        turtle_left(60)
        snow(N-1, L/3)
        turtle_right(120)
        snow(N-1, L/3)
        turtle_left(60)
        snow(N-1, L/3)
    }
    turtle_backward(L)
}

This one is also weird, but at least we get most of the snow inside the box. It looks like we are drawing and moving outside. What can be wrong?

The first if() is ok, because the question says “Snow level 1 is just a straight line of length L. The second if() is also right. It does exactly what the question asks.

What about the turtle_backward(L)? The question does not say anything about “going backwards” or “return to the initial condition”. In fact each snow has to start at the end of the previous snow. Let’s delete turtle_backward(L) and see what happens.

Output of final version of snow().
snow <- function(N, L) {
    if(N==1) {
    turtle_forward(L)
    }
    if (N > 1) {
        snow(N-1, L/3)
        turtle_left(60)
        snow(N-1, L/3)
        turtle_right(120)
        snow(N-1, L/3)
        turtle_left(60)
        snow(N-1, L/3)
    }
}

Now it works. We can simplify it a last time using if() {} else {}.

snow <- function(N, L) {
    if(N==1) {
        turtle_forward(L)
    } else {
        snow(N-1, L/3)
        turtle_left(60)
        snow(N-1, L/3)
        turtle_right(120)
        snow(N-1, L/3)
        turtle_left(60)
        snow(N-1, L/3)
    }
}

Ready! Let’s go to the next question.

2. Algorithm design

We have to write a function, and we got this starting point:

locate_half <- function(x) {
    # write your code here
}

Have we seen this before? Maybe is like finding minimum or maximum. We have never programed our own version of min(), but it should not be so hard, isn’t?

Wait. There is min() and which.min(). This question ask for an index, so it is more like which.min(). It is not about the value of the half value, it is about the location of the half value.

The half value is the average of the minimum and the maximum. Let’s re-write this phrase in R language using the starting point:

locate_half <- function(x) {
    half_value <- (min(x) + max(x))/2
    return(half_value)
}

Is this enough? Let’s test:

x <- 1:9
locate_half(x)
## [1] 5

Good!

locate_half(x+20)
## [1] 25

Mmm…🤔 It does not look right. We got the half value, not the location of the half value. Is like doing min() instead of which.min()It cannot be that easy. The teacher does not asks easy questions. And we are not lazy students.

.

Let’s read the question again. The question says:

The location of the half value of the vector x is the index of the first value that is equal or bigger than the half value of x.

Therefore we will need an index. Let’s say that the index is i. And we need to check each one of the components of x from the first to the last. So we need a for(){} loop using i as a counter.

locate_half <- function(x) {
    half_value <- (min(x) + max(x))/2
    for(i in 1:length(x)) {
        ...
        return(...)
    }
}

For each element of x pointed by i we need to make a decision. We want to know if x[i] is bigger or equal to half_value. Every time we make a decision, there must be an if(){}. Let’s write that.

locate_half <- function(x) {
    half_value <- (min(x) + max(x))/2
    for(i in 1:length(x)) {
        if(x[i] >= half_value) {
            return(...)
        }
    }
}

Now, what shall we do if we find that x[i] >= half_value? In that case we have found the location of the half value. It is the index of x[i], in other words, it is i. We have to return that number.

locate_half <- function(x) {
    half_value <- (min(x) + max(x))/2
    for(i in 1:length(x)) {
        if(x[i] >= half_value) {
            return(i)
        }
    }
}

And we test again

x <- 1:9
locate_half(x)
## [1] 5
locate_half(x + 20)
## [1] 5
locate_half(x * x)
## [1] 7
locate_half(sqrt(x))
## [1] 4

Good! we got the expected result. Can we make it better?

Well, the question says that x is growing, so x[1]=min(x) and x[length(x)]=max(x). Using indices is faster than using functionsSpeed is not very important now, but if one day we do big data then speed is essential. The program has to finish before we get old.

so we can rewrite our function as:

locate_half <- function(x) {
    half_value <- (x[1] + x[length(x)])/2
    for(i in 1:length(x)) {
        if(x[i] >= half_value) {
            return(i)
        }
    }
}

Let’s move to the next question.

3. Systems: Polymerase Chain Reaction

The PCR system is shown in the question as this:

%3

Have we seen a similar system before?

Simulate several PCR cycles

cell_food This is the cell_culture system we saw on Class 11.

Yes! It is the same plot we got for cell growth in Class 11 and in the blog document. This is pattern recognition again. A system depends only on how many circles connect to how many boxes with how many arrows. It does not depend on the names in the circles/boxes, or the way we draw them. This is abstraction again.

So we can start with the code for cell_culture() from the class.

cell_culture <- function(N, replication_rate, cells_init, food_ini) {
  # create empty vectors
  cells <- d_cells <- rep(NA, N)
  food  <- d_food  <- rep(NA, N)
  # initialize values  
  cells[1]   <- cells_init
  food[1]    <- food_ini
  d_cells[1] <- d_food[1] <- 0
  for(i in 2:N) {
    # update `d_cells` and `d_food`
    d_cells[i] <- +replication_rate * cells[i-1] * food[i-1]
    d_food[i]  <- -replication_rate * cells[i-1] * food[i-1]
    # update `cells` and `food` as a cumulative sum
    cells[i] <- cells[i-1] + d_cells[i]
    food[i]  <- food[i-1]  + d_food[i]
  }
  return(data.frame(cells, food))
}

We have to change the name of the function. We also change the names of the variables, from cells to DNA, from food to primers and from eating to cycle. We change also the names for the variables with names like d_cells. This is easy to do with the Search & Replace option in the editor. Also, replication_rate becomes cycle_rate, or just rate, as the question indicates. We got this:

pcr <- function(N, rate, dna_ini, primers_ini) {
  # create empty vectors
  dna     <- d_dna     <- rep(NA, N)
  primers <- d_primers <- rep(NA, N)
  # initialize values  
  dna[1]     <- dna_ini
  primers[1] <- primers_ini
  d_dna[1]   <- d_primers[1] <- 0
  for(i in 2:N) {
    # update `d_dna` and `d_primers`
    d_dna[i]      <-  +rate * dna[i-1] * primers[i-1]
    d_primers[i]  <-  -rate * dna[i-1] * primers[i-1]
    # update `dna` and `primers` as a cumulative sum
    dna[i]      <-  dna[i-1] + d_dna[i]
    primers[i]  <-  primers[i-1]  + d_primers[i]
  }
  return(data.frame(dna, primers))
}

Now we need to include the default values for rate and primer_ini.

pcr <- function(N, dna_ini, primers_ini=1e8, rate=1e-8) {
  # create empty vectors
  dna     <- d_dna     <- rep(NA, N)
  primers <- d_primers <- rep(NA, N)
  # initialize values  
  dna[1]     <- dna_ini
  primers[1] <- primers_ini
  d_dna[1]   <- d_primers[1] <- 0
  for(i in 2:N) {
    # update `d_dna` and `d_primers`
    d_dna[i]      <-  +rate * dna[i-1] * primers[i-1]
    d_primers[i]  <-  -rate * dna[i-1] * primers[i-1]
    # update `dna` and `primers` as a cumulative sum
    dna[i]      <-  dna[i-1] + d_dna[i]
    primers[i]  <-  primers[i-1]  + d_primers[i]
  }
  return(data.frame(dna, primers))
}

Now we test and see if we did it right.

pcr(N=6, dna_ini=1e6)
##        dna   primers
## 1  1000000 100000000
## 2  2000000  99000000
## 3  3980000  97020000
## 4  7841396  93158604
## 5 15146331  85853669
## 6 28150012  72849988

Yup! It’s done. And we only used Search & Replace and pattern recognition.

PCR depends on initial concentration

This question asks us to write code directly, not any function. It gives us some initial data:

initial_dna <- 10**(0:6)
initial_dna
## [1] 1e+00 1e+01 1e+02 1e+03 1e+04 1e+05 1e+06

For each of these values we need to simulate 30 PCR cycles and get the dna value. Let’s see if we can do just one first.

pcr(N=30, dna_ini=initial_dna[1])
##             dna     primers
## 1  1.000000e+00 100000000.0
## 2  2.000000e+00  99999999.0
## 3  4.000000e+00  99999997.0
## 4  8.000000e+00  99999993.0
## 5  1.600000e+01  99999985.0
## 6  3.200000e+01  99999969.0
## 7  6.399998e+01  99999937.0
## 8  1.279999e+02  99999873.0
## 9  2.559997e+02  99999745.0
## 10 5.119987e+02  99999489.0
## 11 1.023995e+03  99998977.0
## 12 2.047979e+03  99997953.0
## 13 4.095916e+03  99995905.1
## 14 8.191665e+03  99991809.3
## 15 1.638266e+04  99983618.3
## 16 3.276263e+04  99967238.4
## 17 6.551454e+04  99934486.5
## 18 1.309861e+05  99869014.9
## 19 2.618007e+05  99738200.3
## 20 5.229161e+05  99477084.9
## 21 1.043098e+06  98956903.3
## 22 2.075315e+06  97924686.1
## 23 4.107561e+06  95892440.5
## 24 8.046401e+06  91953600.4
## 25 1.544536e+07  84554645.4
## 26 2.850512e+07  71494879.8
## 27 4.888482e+07  51115177.6
## 28 7.387239e+07  26127613.3
## 29 9.317348e+07   6826521.5
## 30 9.953399e+07    466013.9

Ok, we got something. Maybe too much, since we got dna and primers. We only want dna. Since the result is a data frame, we can use any kind of index to get the first column, such as $dna or [[1]] or [,"dna"]. Data frames can be used in many different ways, as we learned last semesterDid we learn that? At least it was in the lectures.

.

Let’s try again

pcr(N=30, dna_ini=initial_dna[1])$dna
##  [1] 1.000000e+00 2.000000e+00 4.000000e+00 8.000000e+00 1.600000e+01
##  [6] 3.200000e+01 6.399998e+01 1.279999e+02 2.559997e+02 5.119987e+02
## [11] 1.023995e+03 2.047979e+03 4.095916e+03 8.191665e+03 1.638266e+04
## [16] 3.276263e+04 6.551454e+04 1.309861e+05 2.618007e+05 5.229161e+05
## [21] 1.043098e+06 2.075315e+06 4.107561e+06 8.046401e+06 1.544536e+07
## [26] 2.850512e+07 4.888482e+07 7.387239e+07 9.317348e+07 9.953399e+07

Now we got a vector. We need to do it seven times. We can do it one by one.

conc <- data.frame(V1=pcr(N=30, dna_ini=initial_dna[1])$dna,
                   V2=pcr(N=30, dna_ini=initial_dna[2])$dna,
                   V3=pcr(N=30, dna_ini=initial_dna[3])$dna,
                   V4=pcr(N=30, dna_ini=initial_dna[4])$dna,
                   V5=pcr(N=30, dna_ini=initial_dna[5])$dna,
                   V6=pcr(N=30, dna_ini=initial_dna[6])$dna,
                   V7=pcr(N=30, dna_ini=initial_dna[7])$dna)

It is not very elegant, but it works. Let’s test it.

plot(x=c(1,30), y=c(0, max(conc)), type="n", xlab="PCR cycle",
     ylab="DNA Concentration", main="Effect of initial DNA on PCR")
for(i in 1:ncol(conc)) {
    points(conc[[i]], pch=i, type="b")
}

We are ready. We can move forward.

But we can do better. If we still have time, we can improve this code to be more generic. We would like to have a program that works for any number of initial conditions. We have go step by step.

First we need an “empty” data frame. We cannot make a completely empty data frame, but we can start with a single column with an empty vector. Each simulation has 30 steps, so we need a vector of size 30.

conc <- data.frame(V1=rep(NA, 30))

Now we need a loop for each element of initial_dna:

for(i in 1:length(initial_dna)) {
    ...
}

Inside the loop we have to fill each of the columns of conc. Can we do that, even if we have only one column? Yes, since data frames can grow. Last semester we saw that it was quite easy to add new columns to a data frame. We just use an index and assign a vector. If the indexed column does not exist on the data frame, it is created automagically. Let’s do it.

for(i in 1:length(initial_dna)) {
    conc[[i]] <- pcr(N=30, dna_ini=initial_dna[i])$dna
}

Let’s see if this works:

plot(x=c(1,30), y=c(0, max(conc)), type="n", xlab="PCR cycle",
     ylab="DNA Concentration", main="Effect of initial DNA on PCR")
for(i in 1:ncol(conc)) {
    points(conc[[i]], pch=i, type="b")
}

This curves can be measured in real PCR experiments. In some cases we can measure the optical density (OD) through the PCR cycle. In other cases we can use fluorescence. Both methods give us a signal that is proportional to DNA concentration. If the real curves match our model, then our model is useful.

Now we can go home.

4. Bonus: Quantitative PCR

Bonus! This is optional, but it gives more points and more knowledge. Also, it makes us prouder of our accomplishments. So we do not go home yet.

Here again we have to write code, not a function. We have make a vector called CT, which contains the half value of each column of the data frame conc. We find the half value with function locate_half() which we already created. And we have to do it for each column of conc. Like in the previous question we can do it one by one:

CT <- c(locate_half(conc[[1]]),
        locate_half(conc[[2]]),
        locate_half(conc[[3]]),
        locate_half(conc[[4]]),
        locate_half(conc[[5]]),
        locate_half(conc[[6]]),
        locate_half(conc[[7]]))

Hey, doing it one by one is boring 😒 and it makes me feel like I’m working for the computer 😖, like in a dystopian movie. Let’s make the computer work for us. If we have to do the same thing more than two times, we write a for(){} loop to do it for us. Life is short, let’s leave the boring stuff for the computers.

We use the same pattern we have done many times. We create an empty vector of the correct sizeWe can even be wrong about the size of the vector. If we make a small vector it will grow as needed later. But is more efficient to do it right from the beginning.

and then we use a loop.

If we know CT we can find the initial DNA concentration
CT <- rep(NA, ncol(conc))
for(i in 1:ncol(conc)) {
    CT[i] <- locate_half(conc[[i]])
}

Then we draw the plot and we see the points in a “straight” line.

plot(initial_dna ~ CT, log="y")

This result is important because it means that there is a relationship between the initial DNA concentration and the number of cycles required to cross a fixed threshold. The name CT means cycle threshold.

If we know this relationship, we can determine with high precision what was the initial DNA concentration. That is the idea of Quantitative PCR.

Let’s find that relationship We remember that straight lines can be modeled with linear modelsSo linear models are part of molecular biology? Yes, very much. Good that we learned them last semester.

. The name says it all. The plot uses log in the vertical coordinate, so the linear model also needs logarithmic scale. We can use any logarithm, since all do the trick. In this case we can use log10 since it is easier to understand.

model <- lm(log10(initial_dna) ~ CT)
model
## 
## Call:
## lm(formula = log10(initial_dna) ~ CT)
## 
## Coefficients:
## (Intercept)           CT  
##      8.3241      -0.3006

Since Intercept is 8.3241 then the concentration corresponding to CT=0 is \(10^{8.3241}\)=2.1091137 × 108. With so much initial DNA the threshold is crossed before we start the experiment.

The slope constant for CT is -0.3, meaning that when CT increases by one, the initial DNA concentration decreases \(10^{-0.3}\)=0.5011872 times. That is like half. In other words, double DNA means one less cycle. This means that the PCR reaction duplicates the number of DNA molecules on each cycle. That happens only when we have 100% efficiency.

Let’s make a table:

ans <- data.frame(CT = seq(from=0, to=40, by=2))
ans$initial_DNA <- 10^predict(model, newdata = ans)
knitr::kable(ans, format.args=list(digits=2))
CT initial_DNA
0 2.1e+08
2 5.3e+07
4 1.3e+07
6 3.3e+06
8 8.3e+05
10 2.1e+05
12 5.2e+04
14 1.3e+04
16 3.3e+03
18 8.2e+02
20 2.1e+02
22 5.2e+01
24 1.3e+01
26 3.2e+00
28 8.1e-01
30 2.0e-01
32 5.1e-02
34 1.3e-02
36 3.2e-03
38 8.0e-04
40 2.0e-04

This means that, if we can do 40 PCR cycles, we would be able to detect DNA in low concentrations, as low as 0.0002, which is 2 molecules in 10 liters. Amazing.

Now, what is the margin of error for this measurement? That is matter for another day.

Originally published at https://anaraven.bitbucket.io/blog/2018/cmb2/comments-on-midterm-exam.html