- GEO
- Entity - Relationship Models
- Accessing GEO from R
- Exercises with R
- Desired difficulty
- Error minimization
February 16, 2016
GEO (Gene Expression Omnibus) is an international public repository that archives and distributes gene expression data submitted by the research community
MIAME (Minimum Information About a Microarray Experiment) describes the minimum information that should be included when describing a microarray experiment.
Many journals and funding agencies require microarray data to comply with MIAME.
GEO encourages submitters to supply MIAME-compliant data.
Homework: Write down the new words or concepts and Google them
All these entities are related to each other
GEO is an example of a pattern that is important in Data Science:
Data can have a structure that helps us to manage and understand it
E.g., the genes of an organism have expression levels that depend on the growth medium
Entities and their attributes
Relationships and attributes
It is important to declare the possible number of relationships for each entity instance
For a given relationship between entities A and B
Technical names: Surjective and Injective relationships. Check them
The genes of an organism have expression levels that depend on the growth medium
Each instance of an entity should have an attribute to identify it
Is “given name” a good identifier for a person?
Identify the entities and relationships in the following cases.
Draw E-R diagrams for each of them
A Platform record is composed of a summary description of the array or sequencer and, for array-based Platforms, a data table defining the array template.
Each Platform record is assigned a unique and stable GEO accession number (GPLxxx).
A Platform may reference many Samples that have been submitted by multiple submitters.
A Sample record describes how a single sample was handled, the manipulations it underwent, and the abundance measurement of each element derived from it.
Each Sample record is assigned a unique and stable GEO accession number (GSMxxx).
A Sample entity must reference only one Platform and may be included in multiple Series.
A Series record links together a group of related Samples and provides a focal point and description of the whole study.
Series records may also contain tables describing extracted data, summary conclusions, or analyses.
Each Series record is assigned a unique and stable GEO accession number (GSExxx).
A Profile consists of the expression measurements for an individual gene across all Samples in a DataSet. Profiles can be searched using the GEO Profiles interface.
For more information, see the About GEO Profiles page.
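The relationships between Platform, Sample and Series records described above can be sketched with plain R data frames. This is only an illustration under stated assumptions: the accession numbers and titles are made up, and the table and column names (platforms, samples, series.samples) are hypothetical, not the way GEO stores its data.

```r
# Toy tables. All accession numbers and titles are made up for illustration.
platforms <- data.frame(gpl   = c("GPL1", "GPL2"),
                        title = c("array A", "sequencer B"),
                        stringsAsFactors = FALSE)

# Each Sample references exactly one Platform (many samples to one platform)
samples <- data.frame(gsm = c("GSM1", "GSM2", "GSM3"),
                      gpl = c("GPL1", "GPL1", "GPL2"),
                      stringsAsFactors = FALSE)

# A Sample may be included in several Series, so this relationship
# needs its own table with one row per (series, sample) pair
series.samples <- data.frame(gse = c("GSE1", "GSE1", "GSE2", "GSE2"),
                             gsm = c("GSM1", "GSM2", "GSM2", "GSM3"),
                             stringsAsFactors = FALSE)

# Identifiers must be unique within their own entity
stopifnot(anyDuplicated(platforms$gpl) == 0,
          anyDuplicated(samples$gsm) == 0)

# Every Sample must point to a Platform that exists
stopifnot(all(samples$gpl %in% platforms$gpl))

# How many Samples does each Series contain?
samples.per.series <- table(series.samples$gse)

# Follow the relationships: Series -> Sample -> Platform
full <- merge(merge(series.samples, samples, by = "gsm"),
              platforms, by = "gpl")
```

Printing full shows one row per (Series, Sample) pair together with the Platform of that Sample, which is the kind of question an E-R diagram helps to answer.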
Ask Google
Notice that the second answer is not related to us
Remember that R can be extended with packages
library(package)
Anybody can make new packages. Most of them are found in a repository
Bioconductor is a set of packages used to analyze molecular biology results
GEOquery is a package found in Bioconductor
Please follow the instructions on the webpage to install this package
getGEO
function getGEO(GEO = NULL, filename = NULL, destdir = tempdir(), GSElimits = NULL, GSEMatrix = TRUE, AnnotGPL = FALSE, getGPL = TRUE)
Example accessions: GDS2225, GSE3541, GSM81022, GPL341
Use the destdir parameter to avoid excessive Internet usage
Practice at home. We will need this next week
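A minimal sketch of how the destdir parameter might be used to keep downloaded files between sessions. The helper name fetch.geo and the folder name geo_cache are hypothetical, not part of GEOquery. The sketch also assumes that getGEO reuses a file it already finds in destdir, which is worth checking in the GEOquery documentation.

```r
# Hypothetical helper (not part of GEOquery): download a GEO record
# into a fixed local folder instead of a temporary one.
fetch.geo <- function(accession, cache.dir = "geo_cache") {
  # Create the folder the first time it is needed
  dir.create(cache.dir, showWarnings = FALSE, recursive = TRUE)
  # getGEO saves the downloaded file in destdir
  GEOquery::getGEO(accession, destdir = cache.dir)
}

# Usage (needs GEOquery installed and an Internet connection):
# gsm <- fetch.geo("GSM81022")
# class(gsm)
```

Keeping the download inside a function means nothing is fetched until the function is called, so the definition can be kept in a script without using the network.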
A situation which
“That a learning difficulty can be desirable in the long run is counterintuitive; students and instructors typically conflate immediate performance with longterm learning and therefore strive to avoid impediments to performance.”
Many studies show that doing things the easy way leads to bad results
It is a good idea for manual and repetitive work
It is a very bad idea for creative and intellectual work
Remember: Thinking is like the gym
- It is hard if you don’t practice
- We are lazy and we avoid it
- We improve with practice
- Then we like it. And we look better
In statistics the word “error” does not mean “mistake”.
Mistakes are about us. Errors are the difference between “models” and “reality”.
If we have a vector x and we want to find a single value y to “represent” it, how do we choose y?
Since usually all x[i] are different, there will be an error if we represent them with a single y.
The best y will be the one which has the lowest error.
How do we measure this error?
For example, let’s consider the parity values from birth.txt
> x <- birth$parity
What is x?
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   2.000   2.611   4.000   9.000
This result shows that x is a vector (why?)
Now we choose any arbitrary value for y
> y <- 3
This is a single value, while x has many
> length(x)
[1] 694
> length(y)
[1] 1
What is x-y?
> summary(x-y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -2.000  -2.000  -1.000  -0.389   1.000   6.000
> length(x-y)
[1] 694
So, a vector minus a number is still a vector
But the error should be a positive number. Let’s try again
What is (x-y)*(x-y)?
> summary( (x-y)*(x-y) )
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   1.000   1.000   2.703   4.000  36.000
> length( (x-y)*(x-y) )
[1] 694
So, a vector times itself is still a vector
And now all the numbers are non-negative
Let’s average all the individual errors to get a mean error
> mean( (x-y)*(x-y) )
[1] 2.70317
This is the Mean Quadratic Error of y=3 with respect to x
How can we generalize this for different values of y?
In Math and Informatics, a function is a “black box”
A rule to transform the input elements into an output
The same input should always produce the same output
Notice that there may be more than one input element
In R functions are a type of data. We have
To create a function we need to assign it to a variable
> newFunc <- function(input) {
+   commands
+   commands
+   return(output)
+ }
The call to return can be omitted. The function output is then the result of the last command
> my.func <- function(a, b=2) {return(a*b)}
then
> my.func(3, 3)
[1] 9
> my.func(3)
[1] 6
> my.func()
Error in my.func(): argument "a" is missing, with no default
> my.func(1,2,3,4)
Error in my.func(1, 2, 3, 4): unused arguments (3, 4)
Write a function quadratic.error(y, x)
The x input should be optional, with default birth$parity
Test it with different values of y
> err <- quadratic.error(1.5)
What is err?
We would like to calculate the quadratic error for several values of y
For example,
> many.y <- seq(from=min(x), to=max(x), length.out=100)
Now we do
> err <- sapply(many.y, quadratic.error)
What is err?
> plot(err ~ many.y, type="l")
The best y is the one at the minimum
> min(err)
[1] 2.551838
> many.y[which.min(err)]
[1] 2.616162
> mean(x)
[1] 2.610951
> plot(err ~ many.y, type="l")
> abline(v=many.y[which.min(err)])
> abline(v=mean(x))
> plot(err ~ many.y, type="l", xlim=c(2, 3), ylim=c(0, 5))
> abline(v=many.y[which.min(err)])
> abline(v=mean(x))
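The steps above can be put together in one self-contained sketch. Since birth.txt is not included here, the vector toy is made-up data standing in for birth$parity, and the body of quadratic.error is only one possible solution to the exercise, not necessarily the intended one.

```r
# Made-up data standing in for birth$parity
toy <- c(1, 1, 2, 2, 3, 4, 4, 6, 9)

# Mean quadratic error of representing every element of x by the single value y.
# The default is called toy (not x) because a default written as x = x
# would refer to itself and fail.
quadratic.error <- function(y, x = toy) {
  mean((x - y) * (x - y))
}

# Candidate values of y, evenly spaced between the smallest and largest value
many.y <- seq(from = min(toy), to = max(toy), length.out = 100)

# One error value for each candidate y
err <- sapply(many.y, quadratic.error)

# The candidate with the lowest error
best.y <- many.y[which.min(err)]

# Compare the grid minimum with the mean
print(c(best.y = best.y, mean = mean(toy)))
```

The grid only contains 100 candidates, so best.y and the mean are close but not identical, as in the output shown earlier for birth$parity.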
One common way to write the error is \[E(y) = \frac{1}{N}\sum_{i=1}^N (x_i-y)^2\] This is the same as what we wrote in R
> mean((x-y)*(x-y))
or, taking N <- length(x),
> sum((x-y)*(x-y))/N
How do we find the minimum?
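One way to answer this, sketched here with basic calculus and assuming E(y) is the mean quadratic error written above: take the derivative with respect to y and set it to zero.
\[\frac{dE}{dy} = -\frac{2}{N}\sum_{i=1}^N (x_i-y) = 2\left(y - \frac{1}{N}\sum_{i=1}^N x_i\right) = 0 \quad\Rightarrow\quad y = \frac{1}{N}\sum_{i=1}^N x_i\]
The second derivative is 2, which is positive, so this point is a minimum: the value with the lowest mean quadratic error is the mean of x. This agrees with the plot, where the minimum found on the grid (2.616162) lies next to mean(x) (2.610951).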
Which concepts are not clear?
Show me your calendar
After reading it, do you want to have perfect memory? Why?
Practice the usage of GEOquery
We will use both in the next class