Blog of Andrés Aravena
CMB2:

Exercises for Final Exam

10 May 2018. Deadline: Exam day, of course.

Work on this list every day without exception, at least 25 minutes without interruption. Use an alarm clock to know when to stop. Do not stop until the alarm rings. Always stop when the alarm rings and do something else for at least 5 minutes.

If you can, repeat this once every day. If you do it twice you have one hour every day, roughly equivalent to one day. That is exactly one more day than most people have studied so far, so doing this will be a huge advantage.

Things to avoid:

1. Computational thinking

1.1 Exploring vectors

You will program your own version of some standard functions using only for(), if() and indices. All the following functions receive a vector.

Please write your own version of the following functions:

  1. vector_min(x), equivalent to min(x). Returns the smallest element in x.

  2. vector_max(x), equivalent to max(x). Returns the largest element in x.

  3. vector_which_min(x), equivalent to which.min(x). Returns the index of the smallest element in x.

  4. vector_which_max(x), equivalent to which.max(x). Returns the index of the largest element in x.

  5. vector_mean(x), equivalent to mean(x). Returns the average of all elements in x.

  6. vector_cumsum(x), equivalent to cumsum(x). Returns a vector of the same length as x with the cumulative sum of x.

  7. vector_diff(x), equivalent to diff(x). Returns a vector one element shorter than x with the differences between consecutive elements of x.

  8. vector_apply(x, f), equivalent to sapply(x, f). Inputs are a vector x and a function f. Returns a new vector y of the same length as x where y[i] is f(x[i]) for all i.

You can test your function with the following code.

x <- sample(5:20, size=10, replace=TRUE)
min(x)
vector_min(x)

The two results must be the same. Obviously, you have to replace min and vector_min with the corresponding functions.
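
The comparison above can be repeated automatically over many random vectors. The following is only a sketch: check_equivalent is a hypothetical helper name, not part of the course material, and it uses all.equal() so that small rounding differences (for example in vector_mean) do not count as errors.

```r
# Hypothetical helper: compares your function against the standard one
# on several random vectors and stops at the first mismatch.
check_equivalent <- function(my_f, ref_f, trials = 20) {
  for (t in 1:trials) {
    x <- sample(5:20, size = 10, replace = TRUE)
    if (!isTRUE(all.equal(my_f(x), ref_f(x)))) {
      stop("results differ for x = ", paste(x, collapse = " "))
    }
  }
  TRUE
}
```

Once you have written vector_min, calling check_equivalent(vector_min, min) should return TRUE. The same call works for the other pairs of functions.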

1.2 Merging vectors

Please write a function called vector_merge(x, y) that receives two sorted vectors x and y and returns a new vector with the elements of x and y together sorted. The output vector has size length(x)+length(y).

You must assume that each of the input vectors is already sorted.

To do this, use three indices, i, j and k, pointing into x, y and the output vector ans respectively. On each step you have to compare x[i] and y[j]. If x[i] < y[j] then ans[k] <- x[i], otherwise ans[k] <- y[j].

You have to increment i or j, and k carefully. To test your function, you can use this code:

a <- sample(letters)
x <- sort(a[1:13])
y <- sort(a[14:26])
vector_merge(x, y)

The output must be the alphabet in sorted order.
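
One possible way to organize the three indices is sketched below. It is not the only solution. The extra test on j and i is an assumption about what "carefully" means here: once one of the input vectors is used up, the remaining elements must come from the other one.

```r
vector_merge <- function(x, y) {
  ans <- c(x, y)  # right type and length; every position is overwritten below
  i <- 1
  j <- 1
  for (k in seq_along(ans)) {
    # take from x when y is used up, or when x still has elements and x[i] < y[j]
    take_x <- j > length(y) || (i <= length(x) && x[i] < y[j])
    if (take_x) {
      ans[k] <- x[i]
      i <- i + 1
    } else {
      ans[k] <- y[j]
      j <- j + 1
    }
  }
  ans
}
```

Exactly one of i or j grows on each step, while k grows on every step, so the loop ends after length(x)+length(y) steps.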

1.3 Sorting

Please write a function called vector_mergesort(x) that takes a single vector x and returns a new vector with the same elements as x, sorted from the smallest to the largest.

To do so you have to use a recursive strategy as follows:

  • If the input vector x has length 1, then it is already sorted. In that case the output is a copy of x.
  • If the length of the input is larger than 1 then you split x into two parts. The new vector x1 contains the first half of x, and x2 has the second half.
  • Be careful when length(x) is odd.
  • Now you have to sort x1 and x2 by using the same function vector_mergesort(). Store the results in ans1 and ans2.
  • Finally you have to merge ans1 and ans2 using the function vector_merge() of the previous exercise, and return the merged vector.
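
The splitting step is where odd lengths matter. As a sketch, integer division gives a middle point that works for both even and odd lengths. The name split_halves is hypothetical and used only for illustration.

```r
# Hypothetical helper: splits x into two halves, assuming length(x) >= 2.
split_halves <- function(x) {
  mid <- length(x) %/% 2   # integer division: 7 %/% 2 is 3
  list(x1 = x[1:mid], x2 = x[(mid + 1):length(x)])
}
```

With length(x) equal to 7, x1 gets 3 elements and x2 gets 4. Note that this does not work when length(x) is 1, because 1:0 counts backwards in R. That is one more reason to handle the length 1 case first.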

2. Random processes

  1. Please write a function called my_sample(x, size, replace, prob), equivalent to the function sample(x, size, replace, prob), using only sample.int(n, size, replace, prob)

  2. Simulate an experiment with N independent dice. The result of the experiment is the sum of all dice.

    • Plot the histogram of the result for 100 replicas, for different values of N. You can write a function for this.
    • Plot the average of the results of 100 replicas, depending on different values of N such as 10, 1010, 2010, …, 2E4.
    • What is the relationship between the averages of the results and N? Build a linear model and explain the result.
    • Use the quantile(x, ...) function to find a 95% confidence interval for the result of the experiment.
  3. Simulate an experiment with N independent coins. The two sides of each coin are labeled +1 and -1. The result of the experiment is the sum of all coin labels.

    • Plot the histogram of the result for 100 replicas, for different values of N.
    • Plot the average of the result depending on different values of N, like 10, 1010, 2010, …, 2E4. What is the relationship?
    • Write a function called squared_vector(N) taking N as input, simulating 400 replicas, and returning a vector with the square of each replica. For example, if the replicas are c(1,-2,0,...,-1,3), the function must return c(1,4,0,...,1,9). Hint: you can take the square before doing the replicas.

    • Plot the mean of the output of squared_vector(N) versus N for different values of N, like 10, 1010, 2010, …, 2E4.
    • What is the relationship between the mean of the squares of the results and N? Build a linear model and explain the result.
  4. How many times do you have to throw a die to get a 6? Give the average and a 95% confidence interval. For this and the following questions you can use either quantile(x) or the formulas from Section 4.

  5. How many times do you have to throw a die to get two consecutive 6s? Give the average and a 95% confidence interval.

  6. How many times do you have to throw a die to get two 6s, consecutive or not? Give the average and a 95% confidence interval.

  7. How many times do you have to throw two dice to get a sum equal to 6? Give the average and a 95% confidence interval.

  8. We have six lamps labeled 1 to 6. Initially they are all turned off. You throw a die and get a number x. Then you switch the lamp that has the label x.

    How many times do you have to throw the die until all six lamps are turned on? Give a range that is valid at least 95% of the time.

  9. What is the effect of the read length on the number of contigs? Assume shotgun assembly of a genome of size 1E6, and make a plot for different read lengths and number of reads.
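
Most of the questions in this section follow the same pattern: write a function that simulates one experiment, repeat it with replicate(), and then summarize the results. A minimal sketch for the sum of N dice is shown below. The name dice_sum and the value N = 10 are only illustrative choices.

```r
# One experiment: throw N dice and add them up.
dice_sum <- function(N) {
  sum(sample.int(6, size = N, replace = TRUE))
}

N <- 10
results <- replicate(100, dice_sum(N))   # 100 replicas of the experiment

mean(results)                                # average result
quantile(results, probs = c(0.025, 0.975))   # range holding the central 95% of the replicas
# hist(results) draws the histogram of the 100 replicas
```

For the other questions only the function that simulates one experiment changes. The replicate() and quantile() steps stay the same.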

3. Hypothesis testing: Blind test of normal cola v/s cola zero

We want to know if you can taste the difference between normal cola and sugarless cola. To test this, we prepare 8 cups that look identical. Four of the cups are filled with normal cola, the other four cups have cola zero. The 8 cups are randomly shuffled using sample.int(8). We write the shuffling order on a piece of paper and hide it in an envelope that you cannot see.

You taste all of them and you write which ones you believe are normal cola and which ones are cola zero. For example you can write that cups 2, 3, 5 and 7 have cola zero. Then we open the envelope and compare your results to the original order, and we find that you guessed correctly all of them. Then we have two possible explanations:

What is the probability of choosing correctly just by luck (i.e. under hypothesis zero)?
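
One way to check your answer is to compare a counting argument with a simulation. The sketch below assumes that you know there are exactly four cups of each kind, so a guess means choosing 4 cups out of 8. If that is not the intended setup, the numbers change. The names truth, guess_once, p_exact and p_sim are illustrative.

```r
# Counting: all choices of 4 cups out of 8 are equally likely under pure luck,
# and only one of them is completely right.
p_exact <- 1 / choose(8, 4)

# Simulation: the real cola zero cups are, for example, 2, 3, 5 and 7.
truth <- c(2, 3, 5, 7)
guess_once <- function() setequal(sample.int(8, size = 4), truth)
p_sim <- mean(replicate(20000, guess_once()))

p_exact
p_sim
```

The two values should be close to each other. The simulated value changes a little every time you run it.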

4. Theory: event frequency v/s event probability

  1. You want to do an experiment where the probability of an event is 0.70. How many replicas do you need to guarantee that the relative frequency of the event in the experiment is between 0.65 and 0.75 at least 95% of the time? What is the formula to answer that question?
  2. You simulated a process with 100 replicas. The relative frequency of the event is 0.7. What is the 95% confidence interval for the real probability? What is the formula to answer that question?
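
The exercise does not say which formula is expected, so the following is only a sketch using two standard choices, both of them assumptions: the normal approximation, and Chebyshev's inequality, which gives a more conservative guarantee. All variable names are illustrative.

```r
p     <- 0.70   # probability of the event
eps   <- 0.05   # allowed distance between frequency and probability
alpha <- 0.05   # 1 - 0.95

z <- qnorm(1 - alpha / 2)   # about 1.96

# Question 1, normal approximation: n >= z^2 * p * (1 - p) / eps^2
n_normal <- ceiling(z^2 * p * (1 - p) / eps^2)

# Question 1, Chebyshev's inequality: n >= p * (1 - p) / (alpha * eps^2)
n_chebyshev <- ceiling(p * (1 - p) / (alpha * eps^2))

# Question 2, normal approximation: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
n <- 100
p_hat <- 0.7
half_width <- z * sqrt(p_hat * (1 - p_hat) / n)
ci <- c(p_hat - half_width, p_hat + half_width)
```

Chebyshev's inequality asks for several times more replicas than the normal approximation, because it makes no assumption about the shape of the distribution.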


Originally published at https://anaraven.bitbucket.io/blog/2018/cmb2/exercises-for-final-exam.html