What "Computing in Molecular Biology" is about

These are the subjects that every student of CMB needs to know by the end of the course. Some of these topics will be evaluated in the makeup exam; others have already been evaluated.

To understand and value the use of computers in Science.
- Data handling and processing are essential to Science
- Computers are useful beyond the web, email and text editing. Computers are not typewriters
- Know the essential parts of computers and what makes them different from other tools
  - What is the computer memory
  - Why there are different kinds of memory
- Understand the role of text files for Scientific computing
Learn the basic rules of the R platform
- How to write commands and give orders to the computer
- How to assign values to variables and read them back
- How to use pre-existing functions
- How to add new functions using install.packages() and library()
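
The rules above can be sketched in a few lines of R. This is only an illustration: the variable names are made up, and `seqinr` is used purely as an example of a package name, so those two lines are left commented out.

```r
# Assign a value to a variable, then read it back by typing its name
weight <- 72.5
weight

# Use a pre-existing function: sqrt() is part of base R
root <- sqrt(16)
root

# Functions take arguments, separated by commas
rounded <- round(3.14159, digits = 2)
rounded

# install.packages() downloads a package (needed once per computer);
# library() loads it, and is needed in every new session.
# "seqinr" is only an example package name; uncomment to try it.
# install.packages("seqinr")
# library(seqinr)

# Packages that come with R can be loaded directly
library(stats)
```
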
Know the four basic data types of R
- numeric, character, factor and logical
- Recognize them in R output
- Know when to use each one
- Be able to create a vector of each type
  - How to make a numeric vector
  - How to make a character vector
  - The same with factor and logical vectors
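
As a minimal sketch, one vector of each type can be created and checked as follows. The values and variable names are invented for illustration.

```r
# Numeric vector: numbers you can do arithmetic with
lengths_bp <- c(1200, 850, 2300)

# Character vector: free text, always written in quotes
gene_names <- c("lacZ", "recA", "dnaK")

# Factor: a fixed set of categories (levels)
strand <- factor(c("plus", "minus", "plus"))

# Logical vector: only TRUE or FALSE (or NA)
is_long <- lengths_bp > 1000

# class() reports the type of each vector
class(lengths_bp)  # "numeric"
class(gene_names)  # "character"
class(strand)      # "factor"
class(is_long)     # "logical"

# levels() lists the categories of a factor, in alphabetical order by default
levels(strand)     # "minus" "plus"
```

In R output, character values are printed in quotes, factors are printed without quotes and followed by a `Levels:` line, and logical values appear as `TRUE` or `FALSE`.
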
Know the four basic data structures of R
- vectors, matrices, lists, data frames
- Recognize them in R output
- Know when to use each one
- Be able to create objects of each type
  - Make new vectors
  - Make new matrices
  - Make lists with several vectors and other lists
  - Read data frames from an existing text file
  - Write data frames to new text files
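
The four structures, and the round trip through a text file, might look like this. The values are invented, and `tempfile()` is used only so the example does not overwrite anything; in practice you would give your own file name.

```r
# Vector: elements of a single type
conc <- c(0.5, 1.0, 2.0, 4.0)

# Matrix: a rectangular table of a single type (filled column by column)
m <- matrix(1:6, nrow = 2, ncol = 3)

# List: elements can have different types and lengths, including other lists
sample_info <- list(id = "S1", values = conc, extra = list(ok = TRUE))

# Data frame: a table where each column is a vector of the same length
df <- data.frame(conc = conc, od = c(0.11, 0.20, 0.41, 0.79))

# Write the data frame to a tab-separated text file, then read it back
path <- tempfile(fileext = ".txt")
write.table(df, file = path, sep = "\t", row.names = FALSE, quote = FALSE)
df2 <- read.table(path, header = TRUE, sep = "\t")

# str() shows the structure of any object
str(df2)
```
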
Use indices to access and modify components of data structures
- Single indices for vectors and lists
- Double indices (separated by comma) for matrices and data frames
  - An empty index means “the whole row” or “the whole column”
- Positive numbers can be used as indices, to select elements
- Negative numbers can be used as indices, to exclude elements
  - Be sure to understand the difference
- Logical vectors as indices
  - This is the most important case
- Characters as indices
  - In this case you need to give names to each element
  - For vectors and lists use names(). You can also assign names when you create the vector or list
  - For matrices and data frames use rownames() and colnames()
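
The kinds of index listed above can be compared side by side in a short sketch. The objects are invented for illustration. The double brackets `[[ ]]` on the last line are not part of the list above; they are shown only for contrast with single brackets.

```r
v <- c(a = 10, b = 20, c = 30, d = 40)  # names assigned at creation

v[2]            # positive index: keep the 2nd element (20)
v[-2]           # negative index: drop the 2nd element (10, 30, 40)
v[v > 15]       # logical index: keep elements where the test is TRUE (20, 30, 40)
v["c"]          # character index: select by name (30)
v[2] <- 25      # indices also work on the left side, to modify an element

m <- matrix(1:6, nrow = 2)
rownames(m) <- c("r1", "r2")
colnames(m) <- c("x", "y", "z")

m[1, 3]         # row 1, column 3
m[1, ]          # empty column index: the whole row 1
m[, "y"]        # empty row index: the whole column "y"

lst <- list(first = 1:3, second = "text")
names(lst)      # names of the elements
lst["first"]    # single brackets return a shorter list
lst[["first"]]  # double brackets return the element itself
```
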
Handling data frames
- Data frames are the most useful structures
- Each column is a vector and has a name
- You can see the names using colnames() or summary()
  - It is always a good idea to use summary() to check whether the values make sense or there is an error
  - Check the minimum, the maximum and whether there are NA values
- You can access any complete column using the $ sign
  - This does not work in matrices
- You can change a column just by assigning a new vector to an existing column
- You can add new columns just by assigning a vector to a new column name
- You can delete a column just by assigning NULL to an existing column
- You can compare any column to a fixed value and get a logical vector
  - Then you can use that vector as an index for the rows
  - This is the most common case
- You should be able to index any column, any row and any combination of rows and columns
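
The operations above can be tried on a small data frame. This is a sketch with invented values and column names, including one NA so that `summary()` has something to report.

```r
df <- data.frame(
  gene       = c("g1", "g2", "g3", "g4"),
  length     = c(900, 1500, NA, 2100),
  gc_content = c(0.42, 0.51, 0.48, 0.39)
)

colnames(df)   # names of the columns
summary(df)    # minimum, maximum, quartiles and number of NA values per column

df$gc_content                        # access a complete column with $
df$gc_content <- df$gc_content * 100 # change an existing column (now a percentage)
df$long <- df$length > 1000          # add a new column
df$gene <- NULL                      # delete a column

# Compare a column to a fixed value to get a logical vector ...
high_gc <- df$gc_content > 45
# ... then use it as a row index (the empty column index keeps all columns)
df[high_gc, ]

# Any combination of rows and columns
df[2, "length"]
df[c(1, 4), c("length", "gc_content")]
```
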
Plotting vectors one by one
- You can plot vectors one by one using plot(), points() and lines()
  - You can choose type of plot (lines, symbols, both, none)
  - You can choose colors (col=), symbols (pch=) and sizes (cex=)
  - You can include title, subtitle and axis names
  - You can select the range to plot with xlim= and ylim=
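
A possible way to combine these options is sketched below. The two vectors are generated from a formula only to have something to draw; they are not real measurements.

```r
# Illustrative values: a logistic curve and a scaled copy of it
y1 <- 1 / (1 + exp(-(0:10 - 5)))
y2 <- 0.8 * y1

# plot() starts a new figure; type = "b" draws both lines and symbols
# (other choices: "l" lines, "p" symbols, "n" nothing)
plot(y1, type = "b", col = "blue", pch = 16, cex = 1.2,
     main = "Two vectors on one plot", sub = "Illustrative values",
     xlab = "Index", ylab = "Value",
     xlim = c(1, 11), ylim = c(0, 1))

# lines() and points() add to the existing figure
lines(y2, col = "red")
points(y2, col = "red", pch = 17)
```
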
Scatter plots
- You can also plot two vectors at the same time, one on each axis
  - You probably need to use indices to get the correct vectors from a data frame
- You can use the same options as before to make a nicer plot
  - There are many other options that you find in help(par)
- The drawing changes if one of the vectors is a factor
- When x is a factor you get a boxplot
  - This is an important case
  - You need to know the meaning of the symbols
- When the vectors are numeric you can draw logarithmic axes with log=
  - Make sure you understand when to use a logarithmic scale
- You can use plot(data$x, data$y) or a formula like plot(y~x, data)
  - When you use the formula then the axis names are automatic
  - But you can always change the axis names using xlab= and ylab=
  - Always look at the data first. You have to understand the data. There is no easy way.
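
The variants described above can be compared with a small simulated data frame. Everything here is invented for illustration: the column names, the doses and the random responses.

```r
set.seed(1)  # make the random numbers reproducible
df <- data.frame(
  dose  = rep(c(1, 10, 100, 1000), each = 5),
  group = factor(rep(c("control", "treated"), times = 10))
)
df$response <- df$dose * exp(rnorm(20, sd = 0.3))

# Two vectors: the axis names are the expressions "df$dose" and "df$response"
plot(df$dose, df$response)

# Formula: the axis names are taken from the column names
plot(response ~ dose, data = df)

# Logarithmic axes are useful when values span several orders of magnitude
plot(response ~ dose, data = df, log = "xy",
     xlab = "Dose", ylab = "Response", col = "darkgreen", pch = 16)

# When x is a factor the result is a boxplot: the thick line is the median,
# the box spans the first to third quartile, the whiskers reach the most
# extreme values within 1.5 times the box height, and isolated points are outliers
plot(response ~ group, data = df)
```
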
Histograms
- To understand a single vector you can see the distribution of values using a histogram
- The command hist() groups all values into classes and counts them
- The number of classes can be controlled using nclass=
- Filling the bars with col= makes them easier to see
- This is just another tool to help you understand the data
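
A minimal sketch with simulated values follows. Note that `hist()` treats `nclass=` as a suggestion and may choose a slightly different number of classes to get round limits.

```r
set.seed(42)                          # make the random numbers reproducible
x <- rnorm(200, mean = 50, sd = 10)   # simulated values, not real data

# hist() draws the histogram and also returns the classes and counts
h <- hist(x, nclass = 20, col = "grey",
          main = "Distribution of x", xlab = "x")

h$breaks       # limits of the classes
h$counts       # number of values in each class
sum(h$counts)  # every value is counted once: 200
```
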

In summary

The purpose of the course is to help you learn how to handle data. Only you can learn. The course is a guide.

Each problem is different. You need to understand each problem and use the best tool for each case. Memorization is not important. There is no easy way.

The important part is to understand the data, see the common patterns and generalize. Learn the rules and apply them.

If you succeed in this course you will be able to apply the concepts in Molecular Biology, in Science, in any other job and in everyday life. Data is the essence of the 21st century.