April 13, 2018
Try this
sample(c("H","T"), size=10, replace=TRUE)
[1] "T" "T" "H" "T" "H" "H" "H" "H" "H" "H"
Each element can appear several times
Shuffle, take one, put it back in the set
Most of the time we will use sample() with replace=TRUE
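To see what replace=TRUE changes, here is a minimal sketch comparing it with replace=FALSE. The names coins, with_replacement, without_replacement and result are only illustrative, and set.seed() is there just so the run can be repeated.

```r
set.seed(1)  # only to make this illustration repeatable

coins <- c("H", "T")

# With replacement: each draw goes back into the set,
# so we can ask for more draws than there are elements.
with_replacement <- sample(coins, size = 10, replace = TRUE)
length(with_replacement)

# Without replacement: each element can be drawn only once,
# so size cannot be larger than the set.
without_replacement <- sample(coins, size = 2, replace = FALSE)
sort(without_replacement)

# Asking for 10 draws from 2 elements without replacement is an error
result <- try(sample(coins, size = 10, replace = FALSE), silent = TRUE)
inherits(result, "try-error")
```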
table(sample(c("a","c","g","t"), size=40, replace=TRUE))
 a  c  g  t
 9  9 12 10
table(sample(c("a","c","g","t"), size=40, replace=TRUE))
 a  c  g  t
10 16  6  8
table(sample(c("a","c","g","t"), size=40, replace=TRUE))
 a  c  g  t
 8 10 13  9
Each result is different
table(sample(c("a","c","g","t"), size=400, replace=TRUE))
  a   c   g   t
102 103 106  89
table(sample(c("a","c","g","t"), size=400, replace=TRUE))
  a   c   g   t
 89 103  99 109
table(sample(c("a","c","g","t"), size=400, replace=TRUE))
  a   c   g   t
102 110  86 102
When size increases, the frequency of each letter also increases
table(sample(c("a","c","g","t"), size=4000, replace=TRUE))
   a    c    g    t
1054  972  994  980
table(sample(c("a","c","g","t"), size=4000, replace=TRUE))
   a    c    g    t
1031 1076  974  919
table(sample(c("a","c","g","t"), size=4000, replace=TRUE))
   a    c    g    t
1050 1012  994  944
When size increases, the frequencies change less in proportion to size
table(sample(c("a","c","g","t"), size=40000, replace=TRUE))
    a     c     g     t
10011 10009 10027  9953
table(sample(c("a","c","g","t"), size=40000, replace=TRUE))
    a     c     g     t
 9983 10025  9878 10114
table(sample(c("a","c","g","t"), size=40000, replace=TRUE))
    a     c     g     t
10012  9954 10092  9942
Each frequency is very close to 1/4 of size
table(sample(c("a","c","g","t"), size=400000, replace=TRUE))
     a      c      g      t
 99546 100646  99717 100091
table(sample(c("a","c","g","t"), size=400000, replace=TRUE))
     a      c      g      t
100154 100070  99835  99941
table(sample(c("a","c","g","t"), size=400000, replace=TRUE))
     a      c      g      t
 99841  99757 100285 100117
If size increases a lot, the relative frequencies are very close to 1/4 each
The sum of all absolute frequencies is the number of cases
The sum of all relative frequencies is always 1.
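Both sums can be checked directly. This is a small sketch; the names size, draws, absolute and relative are chosen here only for illustration.

```r
size <- 400
draws <- sample(c("a", "c", "g", "t"), size = size, replace = TRUE)

absolute <- table(draws)      # how many times each letter appears
relative <- absolute / size   # proportion of each letter

sum(absolute)   # the number of cases, that is, size
sum(relative)   # 1, up to rounding
```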
table(sample(c("a", "c", "g", "t"), size=1000000, replace=TRUE))/1000000
       a        c        g        t
0.250086 0.249995 0.250020 0.249899
What will each relative frequency be when size is BIG?
In this case we can find it by thinking
This ideal relative frequency is called Probability
Each device or random system may have some preferred outcomes
All outcomes are possible, but some can be more probable than others
In general we do not know each probability
But we can estimate it using the relative frequency
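As a sketch of this idea, we can simulate a device with preferred outcomes using the prob argument of sample(). The probabilities in true_prob are invented for this illustration; in a real experiment we would not know them, and we would only see the draws.

```r
set.seed(42)  # only to make this illustration repeatable

# Hypothetical biased device: these probabilities are an assumption
true_prob <- c(a = 0.1, c = 0.2, g = 0.3, t = 0.4)

size <- 100000
draws <- sample(names(true_prob), size = size, replace = TRUE,
                prob = true_prob)

# The relative frequencies estimate the hidden probabilities
estimate <- table(draws) / size
round(estimate, 2)
```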
That is what we will do in this course
Population is a very very big set of things
You can assume that population size is infinite
We can do and redo an experiment for ever
(we just need a lot of money and grandchildren)
We can throw a die 🎲 forever
All the results are a population
For example
⚀ ⚁ ⚂ ⚃ ⚄ ⚅
A, C, G, T
For example
A, C, C, G, G, T
A, A, C, C, C, G, G, G, T
For A, C, C, G, G, T the proportions are
P(A)=1/6, P(C)=1/3, P(G)=1/3, P(T)=1/6
What about A, A, C, C, C, G, G, G, T?
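The proportions for A, C, C, G, G, T can be checked by counting. This is a minimal sketch; the names population and proportions are only illustrative, and the same two lines can be tried on the other population.

```r
population <- c("A", "C", "C", "G", "G", "T")

# Proportion of each letter = count / total number of elements
proportions <- table(population) / length(population)
proportions
```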
Normally it is easy to know the possible outcomes
Normally it is hard to know the probabilities
Knowing the probabilities is knowing the population
Probabilities describe what we know about the population
If we know the probabilities, we know something about
This is why we do Science
If we make a single experiment, we learn about the experiment
But we do not learn the truth about the population
We need several experiments to learn about the population
For example someone can say
“My grandpa smoked and lived 102 years”
Does that mean that smoking is healthy?
Medicine cares about people. Science cares about knowledge
Each patient is an individual case
Scientific knowledge is useful for medicine
But medicine is about healing each one
Science is about everybody, not each one
There are two different ways of figuring out probabilities
Here we will do (mostly) the second way
Each experiment gives us some outcomes
They are random but connected to the population
A sample is a small part of the population
Some people even say “empirical probabilities”
table(sample(c("a","c","g","t"), size=40, replace=TRUE))/40
    a     c     g     t
0.250 0.225 0.250 0.275
table(sample(c("a","c","g","t"), size=4000, replace=TRUE))/4000
      a       c       g       t
0.25425 0.24550 0.25000 0.25025
table(sample(c("a","c","g","t"), size=400000, replace=TRUE))/400000
        a         c         g         t
0.2502300 0.2501450 0.2496825 0.2499425
The results of our experiments give us empirical frequencies
They are close to 1/4, the theoretical probabilities
When size is bigger, the empirical frequencies are closer and closer to the real probabilities
We know for sure that when size grows we will get the probabilities
But size has to be really big
How can we be really sure that when size grows we will get the probabilities?
How do we know?
We know because people have proven a theorem
It is called the Law of Large Numbers
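The theorem can be illustrated (not proven) with a small simulation. This is only a sketch: letters4, max_error, sizes and errors are names invented here, and the seed is fixed just to make the run repeatable.

```r
set.seed(2018)  # only to make this illustration repeatable

letters4 <- c("a", "c", "g", "t")

# Largest distance between a relative frequency and 1/4
max_error <- function(size) {
  draws <- factor(sample(letters4, size = size, replace = TRUE),
                  levels = letters4)
  max(abs(table(draws) / size - 1/4))
}

sizes <- c(40, 4000, 400000)
errors <- sapply(sizes, max_error)
data.frame(size = sizes, max_error = errors)
```

For the largest size the error should be tiny, although any single run with a small size can be lucky or unlucky.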
Mathematics is not really about numbers
Mathematics is about theorems
Finding the logical consequences of what we know
But it is all in our mind
Experiments give us Nature without Truth
Math gives us Truth without Nature
Science gives us Truth about Nature
The main consequence of the Law of Large Numbers is
Samples tell us something about populations
Therefore we can learn about populations if we do experiments
In our course, an experiment means sample(x, size, replace=TRUE)
sample(c("a","c","g","t"), size, replace=TRUE)
sample(seqinr::a(), size, replace=TRUE)
sample(c("AA","Aa","aa"), size, replace=TRUE)
In that case instead of writing
sample(1:100, size, replace=TRUE)
we can write
sample.int(100, size, replace=TRUE)
(replace 100 by any natural number)
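A quick way to check that both forms behave the same is to use the same seed for each. This is a sketch; the names a and b and the seed value are arbitrary.

```r
set.seed(7)
a <- sample(1:100, size = 5, replace = TRUE)

set.seed(7)  # same seed again, so the random draws are repeated
b <- sample.int(100, size = 5, replace = TRUE)

identical(a, b)
```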
We throw two dice. What is the sum 🎲 +🎲 ?
dice <- function() {
  return(sample.int(6, size=1, replace=TRUE))
}
dice() + dice()
[1] 7
Try it. What is your result?
One experiment is meaningless
We need to replicate the experiment
We can use replicate(n, expression)
replicate(15, dice() + dice())
[1] 5 12 7 9 5 4 7 4 6 4 3 7 8 3 5
replicate(200, dice() + dice())
[1] 7 5 6 4 8 3 7 11 9 8 8 9 7 4 3 8 6 7 5 9 8 7 4 6 6
[26] 9 3 3 7 4 5 10 5 12 2 9 7 10 3 8 7 9 8 6 4 9 5 8 4 8
[51] 5 10 6 9 8 8 3 9 9 8 4 11 8 10 6 8 5 5 9 6 6 5 5 7 10
[76] 6 9 8 3 7 10 3 9 11 8 6 2 4 10 4 7 7 8 6 10 7 9 7 8 6
[101] 10 6 7 9 5 10 8 3 6 7 9 6 4 11 8 3 7 6 11 5 8 4 7 7 11
[126] 2 7 8 3 5 7 8 7 8 10 4 2 6 4 4 6 2 5 11 6 10 6 9 7 9
[151] 5 7 8 8 9 3 9 9 8 7 7 8 3 8 5 11 8 7 4 3 4 7 7 10 10
[176] 9 2 9 11 7 9 6 8 3 5 7 7 3 10 6 7 6 6 8 8 7 3 9 7 4
table(replicate(200, dice() + dice()))
 2  3  4  5  6  7  8  9 10 11 12
 5 10 22 25 28 30 18 22 21 11  8
table(replicate(200, dice() + dice()))/200
    2     3     4     5     6     7     8     9    10    11    12
0.045 0.055 0.115 0.115 0.120 0.140 0.135 0.105 0.080 0.045 0.045
barplot(table(replicate(200, dice() + dice()))/200)
dice() + dice()=7
Our approximate answer is
prob <- table(replicate(200, dice() + dice()))/200
prob["7"]
   7
0.22
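For comparison, this is a case where we can also find the probability by thinking. The sketch below assumes both dice are fair, so that all 36 combinations are equally likely; the names sums and exact are chosen here for illustration.

```r
# All 36 combinations of two fair dice, as a 6 x 6 table of sums
sums <- outer(1:6, 1:6, "+")

# Theoretical probability of each possible sum
exact <- table(sums) / 36
exact["7"]   # 6 of the 36 combinations add up to 7
```

With only 200 replicates, the empirical value can be noticeably different from this theoretical one.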
prob["7"] is not prob[7]
Remember that we can use text and numbers as indices.
Here prob is
prob
    2     3     4     5     6     7     8     9    10    11    12
0.020 0.055 0.080 0.095 0.105 0.220 0.180 0.095 0.060 0.070 0.020
What is prob[12]? Be careful
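The difference between indexing by name and by position can be tried with the values shown above. As an assumption of this sketch, prob is rebuilt here as a named numeric vector instead of a table, so the example does not depend on random draws.

```r
# Values copied from the prob output shown above.
# A named vector is used here instead of a table (an assumption of this sketch).
prob <- c("2" = 0.020, "3" = 0.055, "4" = 0.080, "5" = 0.095,
          "6" = 0.105, "7" = 0.220, "8" = 0.180, "9" = 0.095,
          "10" = 0.060, "11" = 0.070, "12" = 0.020)

prob["7"]    # by name: the frequency of the sum 7
prob[7]      # by position: the 7th element, which is the sum 8
prob["12"]   # by name: the frequency of the sum 12
prob[12]     # by position: there is no 12th element, so this is NA
```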