These are the subjects that every CMB student needs to know by the end of the course. Some of these topics will be evaluated in the makeup exam; others have already been evaluated.
- To understand and value the use of computers in Science.
- Data handling and processing are essential to Science
- Computers are useful beyond the web, email and text editing. Computers are not typewriters
- Know the essential parts of computers and what makes them different from other tools
- What computer memory is
- Why there are different kinds of memory
- Understand the role of text files for Scientific computing
- Learn the basic rules of the R platform
- How to write commands and give orders to the computer
- How to assign values to variables and read them back
- How to use pre-existing functions
- How to add new functions using `install.packages()` and `library()`
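
As a small illustration of these basic rules, the sketch below assigns values, reads them back and calls pre-existing functions. The variable names and numbers are invented for the example, and `somepackage` in the commented lines is only a placeholder, not a real package.

```r
# Assign values to variables with <- and read them back by typing the name
weight <- 70.5
height <- 1.75
weight                       # shows the stored value

# Use pre-existing functions: the inputs go inside the parentheses
bmi <- weight / height^2
round(bmi, 1)                # round to one decimal place
sqrt(16)                     # square root

# Adding new functions from a package (commented out, not run here).
# "somepackage" is a placeholder name used only for illustration.
# install.packages("somepackage")   # download and install, needed only once
# library(somepackage)              # load the package in each new session
```
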
- Know the four basic data types of R
- numeric, character, factor and logical
- Recognize them in R output
- Know when to use each one
- Be able to create a vector of each type
- How to make a numeric vector
- How to make a character vector
- The same with factor and logical vectors
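
The four types can be sketched as follows. This is a minimal example with invented values; `class()` is one way to check which type a vector has.

```r
# One vector of each basic type (values are invented for illustration)
num <- c(1.5, 2, 3.25)                  # numeric
chr <- c("liver", "brain", "liver")     # character
fac <- factor(chr)                      # factor: a fixed set of categories
lgl <- c(TRUE, FALSE, TRUE)             # logical

# class() reports the type of each vector
class(num)
class(chr)
class(fac)
class(lgl)

# A factor remembers its categories, called levels
levels(fac)
```
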
- Know the four basic data structures of R
- vectors, matrices, lists, data frames
- Recognize them in R output
- Know when to use each one
- Be able to create objects of each type
- Make new vectors
- Make new matrices
- Make lists with several vectors and other lists
- Read data frames from an existing text file
- Write data frames to new text files
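
A possible sketch of the four structures, including writing a data frame to a text file and reading it back. The object names, the values and the tab-separated layout are assumptions made for this example; a temporary file is used so that no real file is overwritten.

```r
# A vector, a matrix and a list (values are invented for illustration)
v <- c(10, 20, 30)
m <- matrix(1:6, nrow = 2, ncol = 3)
l <- list(numbers = v, words = c("a", "b"), inner = list(flag = TRUE))

# A small data frame: each column is a vector with a name
df <- data.frame(id    = 1:3,
                 organ = c("liver", "brain", "heart"),
                 mass  = c(1.5, 1.4, 0.3))

# Write the data frame to a text file (here a temporary one, tab-separated)
path <- tempfile(fileext = ".txt")
write.table(df, file = path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the data frame back from the text file
df2 <- read.table(path, header = TRUE, sep = "\t")
str(df2)                     # compact view of the structure
```
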
- Use indices to access and modify components of data structures
- Single indices for vectors and lists
- Double indices (separated by comma) for matrices and data frames
- An empty index means “all rows” or “all columns”
- Positive numbers can be used as indices: they select the elements at those positions
- Negative numbers can be used as indices: they exclude the elements at those positions
- Be sure to understand the difference
- Logical vectors as indices.
- This is the most important case
- Characters as indices
- In this case you need to give names to each element
- For vectors and lists use `names()`. You can also assign names when you create the vector or list
- For matrices and data frames use `rownames()` and `colnames()`
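
The kinds of indices listed above can be compared side by side in a short sketch. The vector `v` and the matrix `m` are invented for the example.

```r
# A named numeric vector; the names are assigned when it is created
v <- c(a = 5, b = 12, c = 8, d = 20)

v[2]          # positive index: keep the second element
v[-2]         # negative index: keep everything except the second element
v[v > 10]     # logical index: keep the elements where the condition is TRUE
v["c"]        # character index: select an element by its name

# A matrix needs two indices separated by a comma: [row, column]
m <- matrix(1:6, nrow = 2)
rownames(m) <- c("r1", "r2")
colnames(m) <- c("x", "y", "z")

m[1, 2]       # row 1, column 2
m[1, ]        # empty column index: the whole first row
m[, "z"]      # empty row index: the whole column named "z"
m[2, 3] <- 0  # indices can also be used to modify a component
```
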
- Handling data frames
- Data frames are the most useful structures
- Each column is a vector and has a name
- You can see the names using `colnames()` or `summary()`
- It is always a good idea to use `summary()` and verify if the values make sense or if there is an error
- Verify the minimum, maximum and if there are `NA` values
- You can access any complete column using the `$` sign
- This does not work in matrices
- You can change a column by assigning a new vector to an existing column
- You can add a new column by assigning a vector to a new column name
- You can delete a column by assigning `NULL` to an existing column
- You can compare any column to a fixed value and get a logical vector
- Then you can use that vector as an index for the rows
- This is the most common case
- You should be able to index any column, any row and any combination of rows and columns
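
The data frame operations above could look like this in practice. The data frame `df`, its column names and its values are invented for the example.

```r
# A small data frame with one missing value (invented for illustration)
df <- data.frame(
  gene   = c("g1", "g2", "g3", "g4"),
  length = c(1200, 850, NA, 2300),
  expr   = c(5.1, 0.4, 2.2, 9.8)
)

colnames(df)                     # names of the columns
summary(df)                      # check minimum, maximum and NA values

df$expr                          # access a complete column with $
df$length <- df$length / 1000    # change an existing column
df$log_expr <- log10(df$expr)    # add a new column
df$log_expr <- NULL              # delete a column

# Compare a column to a fixed value, then use the result to index the rows
high <- df$expr > 5              # logical vector, one value per row
df[high, ]                       # all columns of the selected rows
df[high, c("gene", "expr")]      # selected rows and selected columns
```
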
- Plotting vectors one by one
- You can plot vectors one by one using `plot()`, `points()` and `lines()`
- You can choose the type of plot (lines, symbols, both, none)
- You can choose colors (`col=`), symbols (`pch=`) and sizes (`cex=`)
- You can include a title, subtitle and axis names
- You can select the range to plot with `xlim=` and `ylim=`
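
A minimal sketch of these plotting options, using two invented vectors. The first line opens a null graphics device only so that the example runs without opening a window or creating a file; remove it and the last line to see the plot on screen.

```r
pdf(file = NULL)   # assumption: null device, so nothing is shown or saved

# Two invented vectors to plot one by one
y1 <- c(1, 4, 9, 16, 25)
y2 <- c(2, 3, 5, 8, 13)

# First vector: lines and symbols together (type = "b"),
# with color, symbol, size, titles, axis names and ranges
plot(y1, type = "b", col = "blue", pch = 16, cex = 1.2,
     main = "Two example vectors", sub = "Invented values",
     xlab = "Index", ylab = "Value",
     xlim = c(1, 5), ylim = c(0, 30))

# Second vector added to the same plot
lines(y2, col = "red")
points(y2, col = "red", pch = 17)

invisible(dev.off())   # close the graphics device
```
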
- Scatter plots
- You can also plot two vectors at the same time, one on each axis
- You will probably need to use indices to get the correct vectors from a data frame
- You can use the same options as before to make a nicer plot
- There are many other options that you can find in `help(par)`
- The drawing changes if one of the vectors is a factor
- When `x` is a factor you get a boxplot
- This is an important case
- You need to know the meaning of the symbols
- When the vectors are numeric you can use logarithmic axes using `log=`
- Make sure you understand when to use a logarithmic scale
- You can use `plot(data$x, data$y)` or a formula like `plot(y~x, data)`
- When you use the formula, the axis names are taken automatically from the column names
- But you can always change the axis names using `xlab=` and `ylab=`
- Always look at the data first. You have to understand the data. There is no easy way.
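
The scatter plot variants above can be sketched with one invented data frame. The names `data`, `x`, `y` and `group`, the axis labels and all values are assumptions for the example. As before, the null graphics device is there only so the code runs without opening a window.

```r
pdf(file = NULL)   # assumption: null device, so nothing is shown or saved

# Invented data spanning several orders of magnitude, plus a factor
data <- data.frame(
  x     = c(1, 10, 100, 1000, 10000, 100000),
  y     = c(2, 15, 180, 2100, 19000, 230000),
  group = factor(c("a", "a", "b", "b", "c", "c"))
)

# Two numeric vectors, one on each axis
plot(data$x, data$y)

# The same plot with a formula; axis names come from the column names
plot(y ~ x, data)

# Logarithmic scale on both axes, with axis names chosen by hand
plot(y ~ x, data, log = "xy", xlab = "Dose", ylab = "Response")

# When the x variable is a factor, the result is a boxplot
plot(y ~ group, data)

invisible(dev.off())   # close the graphics device
```
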
- Histograms
- To understand a single vector you can see the distribution of values using a histogram
- The command `hist()` groups all values into classes and counts them
- The number of classes can be controlled using `nclass=`
- It is better to use `col=` to fill the bars so they are easier to see
- This is just another tool you have to understand the data
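
A short histogram sketch, assuming a vector of simulated values in place of real data. The value returned by `hist()` is kept here only to show that every value is counted in exactly one class.

```r
pdf(file = NULL)   # assumption: null device, so nothing is shown or saved

# Simulated values standing in for a real data column
set.seed(1)
values <- rnorm(200, mean = 50, sd = 10)

# Group the values into classes and count them;
# nclass suggests how many classes to use, col fills the bars
h <- hist(values, nclass = 20, col = "grey",
          main = "Distribution of values", xlab = "Value")

sum(h$counts)      # total of all class counts

invisible(dev.off())   # close the graphics device
```
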
In summary
The purpose of the course is to help you learn how to handle data. Only you can learn. The course is a guide.
Each problem is different. You need to understand each problem and use the best tool for each case. Memorization is not important. There is no easy way.
The important part is to understand the data, see the common patterns and generalize. Learn the rules and apply them.
If you succeed in this course you will be able to apply the concepts in Molecular Biology, in Science, in any other job and in everyday life. Data is the essence of the 21st century.