It may take some time to find the ultimate question
We are used to looking at the procedure, not the final outcome
Hint: ask “why do we want to know this?”
Ask “why” five times
It is important to be explicit here
This is Hooke’s law
We may be tempted to draw this
But this limits us to a fixed gene. Better to think big
and we want to see inside the “?” boxes
Thousands of years of experience show that it is good to write a formula
For example, coil length \(l\) depends on the applied force \(f\)
\[l(f)\]
We should see the formula as another view of the drawing
The weight \(w\) depends on the sex \(s\) and the height \(h.\) \[w(h, s)\]
Then we can answer questions like \[Δw(h) = w(h, \text{“Male”}) - w(h, \text{“Female”})\] and we realize that our answer must depend on the height
Gene expression \(e\) depends on age \(a,\) diet \(d,\) and gene \(g\) \[e(a, d, g)\]
Then the change in gene expression due to diet is \[Δe(a, g) = e(a, \text{“AL”}, g) - e(a, \text{“CR”}, g)\]
We can see that \(Δe\) depends only on \(a\) and \(g\)
The relationship is true for all \(a\) and \(g\)
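As a minimal sketch of this idea, the Python below uses a made-up function `e` to show that the difference \(Δe\) needs only \(a\) and \(g\) as inputs. The base levels, the slopes, and the gene names `geneA` and `geneB` are hypothetical placeholders, not real data.

```python
# Hypothetical stand-in for the unknown function e(a, d, g).
# All numbers and gene names here are invented for illustration only.
BASE = {"geneA": 10.0, "geneB": 4.0}
SLOPE = {
    ("AL", "geneA"): 1.0,
    ("CR", "geneA"): 0.5,
    ("AL", "geneB"): 0.25,
    ("CR", "geneB"): 0.75,
}


def e(a, d, g):
    """Toy gene expression: depends on age a, diet d, and gene g."""
    return BASE[g] + SLOPE[(d, g)] * a


def delta_e(a, g, expr=e):
    """Change in expression due to diet: e(a, "AL", g) - e(a, "CR", g)."""
    return expr(a, "AL", g) - expr(a, "CR", g)


# The diet d no longer appears among the arguments of delta_e.
for gene in ("geneA", "geneB"):
    print(gene, delta_e(2.0, gene))
```

Passing the function as the `expr` argument keeps the question (the difference) separate from what is inside the function, so the same `delta_e` would work with any other candidate for `e`.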
If we knew what is inside the function \[e(a, d, g)\] we could answer many questions
We will try later to find that formula
For this class we do not care what is inside each function
Just how they are related to the questions we ask
They are important tools of communication
With your collaborators
With your readers
With yourself
number on the scale at some time during the day
light intensity in a microarray
CT values for samples taken at different times
number of centimeters on a measuring tape
As before, let’s be explicit about the dependencies
For example, we measure weight \(w_M(h,s,r,t)\) of a person with a given height \(h\) and sex \(s\) in several replicas \(r\) using technique \(t\)
Technique here means the experimental procedure, such as the scale (weighing apparatus) used
We want to know the true relationship \(w(h,s)\)
But we cannot see it directly
We can only see the experimental results
Therefore we need to understand how they are connected
The real value \(w\) “plus” the variability \(v\) \[w_M(h,s,r,t) = w(h,s)⊕v(h,s,r,t)\]
The ⊕ symbol may be a +, or a ×, or something else
We will find the correct one later
For now we take it as the normal sum +
The value will be different for each replica and for each technique
To get rid of the technique variability, we normalize our results
Normalized data depends on the real value \(w\) “plus” the variability \(v\) \[w_N(h,s,r) = w(h,s)⊕v(h,s,r)\]
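The text does not say which normalization is used. As one possible sketch, assume that ⊕ is the normal sum and that each technique \(t\) (here, each scale) adds its own constant offset. Under that assumption we can estimate the offset by weighing a reference object of known weight and subtract it. The scale names, the reference weight, and all readings below are hypothetical.

```python
from statistics import mean

REFERENCE_KG = 10.0  # assumed known weight of a calibration object


def technique_offset(reference_readings):
    """Estimate the additive bias of one technique from reference readings."""
    return mean(reference_readings) - REFERENCE_KG


def normalize(readings, offset):
    """Remove the technique offset from every reading."""
    return [x - offset for x in readings]


# Hypothetical readings of the reference object on two scales.
reference = {"scale_A": [10.0, 10.0], "scale_B": [11.5, 11.5]}
# Hypothetical readings of the same person on the same two scales.
samples = {"scale_A": [70.0, 70.5], "scale_B": [71.5, 72.0]}

normalized = {
    t: normalize(samples[t], technique_offset(reference[t])) for t in samples
}
print(normalized)
```

After this step the readings from both scales are comparable, and the remaining differences between replicas no longer depend on \(t\).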
After normalization, all variability is random
We will see that this variability has two sources: noise and diversity
For a coil the variability is simple: \(l_N(f,r) = l(f) + v(r)\)
The true function \(l(f)\) is simply \(k⋅f\)
The only variability comes from the measurement error
In other words, it is noise
Here \(v\) represents the noise
Typically noise follows a normal distribution with mean 0 \[v(r) \sim \mathcal{N}(0,σ^2)\]
The variance is a measure of the instrument quality
Better instruments have smaller \(σ^2\)
Often the noise is independent of the value measured
(but not always)
In this case we use the classical statistical tools
For example we take the average of \(n\) replicas \[\frac{1}{n}\sum_{r=1}^{n} l_N(f,r) = l(f) + \frac{1}{n}\sum_r v(r)\]
We will find that \[\frac{1}{n}\sum_{r=1}^{n} l_N(f,r) \sim \mathcal{N}(l(f), σ^2/n)\]
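The result above can be checked with a small simulation. This is only a sketch: the coil constant `K`, the noise level `SIGMA`, the force, and the random seed are arbitrary values chosen for illustration. It repeats the experiment many times and looks at the mean and the variance of the averages, which should land close to \(l(f)\) and \(σ^2/n\).

```python
import random
from statistics import mean, variance

K = 0.5      # hypothetical coil constant, so l(f) = K * f
SIGMA = 2.0  # hypothetical standard deviation of the instrument noise


def true_length(f):
    """The true function l(f) = k * f."""
    return K * f


def measure(f, rng):
    """One replica: true value plus normal noise with mean 0."""
    return true_length(f) + rng.gauss(0.0, SIGMA)


def average_of_replicas(f, n, rng):
    """Average of n replicas measured at the same force f."""
    return mean([measure(f, rng) for _ in range(n)])


def simulate(f=10.0, n=16, trials=5000, seed=1):
    """Repeat the averaging experiment and summarize the averages."""
    rng = random.Random(seed)
    averages = [average_of_replicas(f, n, rng) for _ in range(trials)]
    return mean(averages), variance(averages)


center, spread = simulate()
print("mean of averages:", center, "expected about", true_length(10.0))
print("variance of averages:", spread, "expected about", SIGMA**2 / 16)
```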
Everything that we measure has a margin of error
We should consider the margin of error on every step of the analysis
Better instruments, and technical replicas, give narrower intervals
and a narrow interval is good
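To make "narrower" concrete, here is a sketch of the margin of error of an average, assuming normal noise with a known standard deviation \(σ\) and a 95% confidence level. Both the known-\(σ\) assumption and the function name `margin_of_error` are choices made for this example.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(sigma, n, level=0.95):
    """Half-width of the interval around the average of n replicas.

    Assumes normal noise with known standard deviation sigma.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    return z * sigma / sqrt(n)


# A better instrument (smaller sigma) or more replicas (larger n)
# both give a smaller margin of error.
print(margin_of_error(sigma=2.0, n=16))
print(margin_of_error(sigma=1.0, n=16))
print(margin_of_error(sigma=2.0, n=64))
```

When \(σ\) has to be estimated from a few replicas, the interval is usually built with Student's t distribution instead, which makes it somewhat wider.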
The instrumental noise is not avoided with normalization
A good protocol is to measure several times
and take the average
That does not change the instrument variance \(σ^2\), but the average of \(n\) replicas has the smaller variance \(σ^2/n\)
But that may not be the most important part
Every individual is different, probably due to many reasons
When we measure the weight of a person, the weight depends on the biological diversity \(b\) and on the noise \(v\) \[w_N(h,s,r) = w(h,s)⊕v(r)⊕b(h,s,r)\]
The biological diversity is often much larger than the noise
And it may not follow a normal distribution
The real challenge comes from the biological diversity
Even with perfect instruments (without noise), we have \[w_N(h,s,r) = w(h,s)⊕b(h,s,r)\] so \(w(h,s)\) represents the average case for our population
The average may not be very common
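A tiny example shows how this can happen. The weights below are invented: a hypothetical population made of two groups, one light and one heavy, measured with perfect instruments. The average falls in the gap between the groups, where nobody is.

```python
from statistics import mean

# Hypothetical weights (kg) of six individuals with the same height and sex.
weights = [50.0, 51.0, 52.0, 88.0, 89.0, 90.0]

average = mean(weights)


def count_near(values, center, tolerance):
    """Count how many values lie within `tolerance` of `center`."""
    return sum(1 for v in values if abs(v - center) <= tolerance)


near_average = count_near(weights, average, 5.0)
print("average:", average)
print("individuals within 5 kg of the average:", near_average)
```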
Measured = Real ⊕ Diversity ⊕ Instrument ⊕ Noise