November 26, 2019
We have our own data
survey <- read.table("survey1-tidy.txt")
survey$weight
 [1]  67.0  58.0  56.0  94.0  60.0  77.0  56.0  75.0
 [9]  80.0 105.0  59.0  70.0  57.0  50.0  78.0  55.0
[17] 106.0  68.0  68.0  65.0  76.0  42.5  55.0  69.0
[25]  60.0  58.0  52.0  47.0  65.0  67.0  68.0  74.0
[33]  55.0  55.0  60.0  50.0  55.0  58.0  75.0  53.0
[41]  81.0  54.0  55.0  72.0  65.0  64.0  54.0  85.0
[49]  63.0  75.0  77.0
How do we sort weight?
sort(survey$weight)
 [1]  42.5  47.0  50.0  50.0  52.0  53.0  54.0  54.0
 [9]  55.0  55.0  55.0  55.0  55.0  55.0  56.0  56.0
[17]  57.0  58.0  58.0  58.0  59.0  60.0  60.0  60.0
[25]  63.0  64.0  65.0  65.0  65.0  67.0  67.0  68.0
[33]  68.0  68.0  69.0  70.0  72.0  74.0  75.0  75.0
[41]  75.0  76.0  77.0  77.0  78.0  80.0  81.0  85.0
[49]  94.0 105.0 106.0
sort(survey$weight, decreasing=TRUE)
 [1] 106.0 105.0  94.0  85.0  81.0  80.0  78.0  77.0
 [9]  77.0  76.0  75.0  75.0  75.0  74.0  72.0  70.0
[17]  69.0  68.0  68.0  68.0  67.0  67.0  65.0  65.0
[25]  65.0  64.0  63.0  60.0  60.0  60.0  59.0  58.0
[33]  58.0  58.0  57.0  56.0  56.0  55.0  55.0  55.0
[41]  55.0  55.0  55.0  54.0  54.0  53.0  52.0  50.0
[49]  50.0  47.0  42.5
The command sort() works only for vectors
To sort a data frame, we first need to choose which column we use to order
We know the position of the smallest and the largest
which.min(survey$weight)
[1] 22
which.max(survey$weight)
[1] 17
We need the positions in between
For that we use the order() command
survey$weight
 [1]  67.0  58.0  56.0  94.0  60.0  77.0  56.0  75.0
 [9]  80.0 105.0  59.0  70.0  57.0  50.0  78.0  55.0
[17] 106.0  68.0  68.0  65.0  76.0  42.5  55.0  69.0
[25]  60.0  58.0  52.0  47.0  65.0  67.0  68.0  74.0
[33]  55.0  55.0  60.0  50.0  55.0  58.0  75.0  53.0
[41]  81.0  54.0  55.0  72.0  65.0  64.0  54.0  85.0
[49]  63.0  75.0  77.0
order() to sort a data frame
order(survey$weight)
 [1] 22 28 14 36 27 40 42 47 16 23 33 34 37 43  3  7 13
[18]  2 26 38 11  5 25 35 49 46 20 29 45  1 30 18 19 31
[35] 24 12 44 32  8 39 50 21  6 51 15  9 41 48  4 10 17
This gives us the position of the smallest, the second smallest, and so on up to the largest
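A minimal sketch of how order() relates to sort(), which.min() and which.max(), using only the first five weights printed above. The names w and pos are chosen here for illustration.

```r
# First five weights from the survey output above
w <- c(67, 58, 56, 94, 60)

# Positions of the values, from the smallest to the largest
pos <- order(w)
pos

# Indexing with those positions gives the sorted vector
w[pos]
identical(w[pos], sort(w))

# The first and last positions agree with which.min() and which.max()
pos[1] == which.min(w)
pos[length(pos)] == which.max(w)
```

Since order() returns positions and not values, the same positions can be reused to index any other vector of the same length.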
survey[order(survey$weight),]
     Gender birth_day birth_month birth_year height_cm weight_kg handness hand_span_cm
st22 Female        13          10       1997       155      42.5    Right           20
st28 Female         7           7       1997       166      47.0    Right           20
st14 Female         3           7       1997       160      50.0    Right           15
st36 Female        24           3       1998       167      50.0    Right           30
st27 Female        13          10       1997       171      52.0    Right           25
st40 Female         5           2       1998       157      53.0    Right           20
st42 Female        18           5       1997       165      54.0    Right           18
st47 Female        29           7       1997       160      54.0    Right           20
st16 Female         3           9       2018       164      55.0    Right           20
st23 Female         2          10       1998       172      55.0    Right           20
st33 Female        21           5       1998       168      55.0    Right           14
st34 Female         3           9       1998       174      55.0    Right           22
st37 Female        17           9       1998       173      55.0    Right            8
st43 Female        23           5       1999       178      55.0    Right           12
st3  Female        28           1       1995       170      56.0     Left           18
st7  Female         5           4       1996       173      56.0    Right           21
st13 Female         9           6       1998       158      57.0    Right           19
st2  Female         9          10       1995       167      58.0    Right           18
st26 Female        17           5       1998       165      58.0    Right           19
st38 Female         2           1       1999       162      58.0    Right           19
st11   Male        26          12       1997       176      59.0    Right           24
st5  Female         1           1       1991       160      60.0    Right           19
st25 Female        17           8       1998       163      60.0    Right           15
st35 Female         1           9       1998       174      60.0    Right           24
st49 Female         2           5       1999       165      63.0     Left           17
st46   Male         6          11       1998       163      64.0    Right           15
st20 Female        30           6       1997       158      65.0    Right            8
st29   Male        28           7       1998       185      65.0     Left           22
st45   Male         6          12       1997       166      65.0    Right           15
st1    Male         1           2       1993       179      67.0    Right           15
st30   Male         5           1       1997       178      67.0    Right           24
st18 Female        16          11       1998       163      68.0    Right           13
st19 Female         3           5       1998       162      68.0    Right           13
st31   Male        27          11       1997       180      68.0    Right           19
st24 Female        10           6       1998       159      69.0    Right           18
st12   Male         9           2       1997       183      70.0    Right           20
st44 Female        19           9       1997       174      72.0    Right           16
st32   Male        29           8       1998       170      74.0    Right           25
st8  Female        14           1       1997       162      75.0     Left           18
st39   Male        19          11       1998       175      75.0    Right           20
st50   Male        31          10       1998       184      75.0    Right           22
st21   Male        15           1       2018       175      76.0    Right           20
st6    Male        26           9       1996       175      77.0    Right           18
st51   Male         9           3       1996       177      77.0    Right           23
st15   Male        13          10       1998       182      78.0    Right           21
st9    Male         1           5       1997       173      80.0    Right           16
st41   Male        18           5       1997       181      81.0    Right           20
st48   Male        14           3       1993       195      85.0    Right           30
st4    Male        11           8       1992       180      94.0    Right           25
st10   Male        25           6       1997       188     105.0    Right           20
st17   Male        10           1       1998       175     106.0    Right           15
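The same idea also handles descending order and ties. The sketch below copies five rows of the survey (two columns only) into a small data frame; the names small, heavy_first and by_weight_height are made up for this example.

```r
# Five students taken from the table above (heights and weights only)
small <- data.frame(
  height_cm = c(179, 164, 172, 168, 180),
  weight_kg = c(67, 55, 55, 55, 94),
  row.names = c("st1", "st16", "st23", "st33", "st4")
)

# Heaviest first: order() accepts decreasing = TRUE, like sort()
heavy_first <- small[order(small$weight_kg, decreasing = TRUE), ]
heavy_first

# Three students weigh 55 kg; a second argument to order() breaks the tie
by_weight_height <- small[order(small$weight_kg, small$height_cm), ]
by_weight_height
```

The comma after the row index matters: small[positions, ] selects rows and keeps every column.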
library() loads a package that is already installed
install.packages() installs a package that is missing
knitr: a package for Rmarkdown
Knitr is the system that merges R code and Markdown to produce documents that depend on data
It has many functions. We used two of them:
knitr::kable() is a function to produce nicer tables (an alternative is pander() from the pander package)
knitr::opts_chunk$set() to set the default options for each chunk
Without kable()
survey[1:5,]
    Gender birth_day birth_month birth_year height_cm weight_kg handness hand_span_cm
st1   Male         1           2       1993       179        67    Right           15
st2 Female         9          10       1995       167        58    Right           18
st3 Female        28           1       1995       170        56     Left           18
st4   Male        11           8       1992       180        94    Right           25
st5 Female         1           1       1991       160        60    Right           19
With kable()
library(knitr)
kable(survey[1:5,])
|  | Gender | birth_day | birth_month | birth_year | height_cm | weight_kg | handness | hand_span_cm |
|---|---|---|---|---|---|---|---|---|
| st1 | Male | 1 | 2 | 1993 | 179 | 67 | Right | 15 |
| st2 | Female | 9 | 10 | 1995 | 167 | 58 | Right | 18 |
| st3 | Female | 28 | 1 | 1995 | 170 | 56 | Left | 18 |
| st4 | Male | 11 | 8 | 1992 | 180 | 94 | Right | 25 |
| st5 | Female | 1 | 1 | 1991 | 160 | 60 | Right | 19 |
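A short sketch of the two knitr functions together, assuming the knitr package is installed. The chunk options and the caption text are only examples, and the small data frame copies two rows of the survey.

```r
library(knitr)

# Default options for every chunk that follows: hide the code and the messages
opts_chunk$set(echo = FALSE, message = FALSE)

# Two students taken from the survey table
small <- data.frame(
  height_cm = c(179, 167),
  weight_kg = c(67, 58),
  row.names = c("st1", "st2")
)

# digits and caption are optional arguments of kable()
tab <- kable(small, digits = 1, caption = "First two students")
tab
```

In an Rmarkdown document the call to opts_chunk$set() usually goes in the first chunk, so that it applies to all the chunks after it.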
Load library(knitr) before using any function of the package
X: drive (when using lab computers)
Using the data from the exam
world
  income population    area
1   1810   31700000  653000
2  10500    2920000   28800
3  13300   38300000 2380000
4   6190   26000000 1250000
5  18900      97800     440
6  19500   42500000 2780000
Long format, with a single value column
     variable    value
1      income     1810
2      income    10500
3      income    13300
4      income     6190
5      income    18900
6      income    19500
7  population 31700000
8  population  2920000
9  population 38300000
10 population 26000000
11 population    97800
12 population 42500000
13       area   653000
14       area    28800
15       area  2380000
16       area  1250000
17       area      440
18       area  2780000
We use the reshape2 library
melt takes wide-format data and melts it into long-format data.
cast takes long-format data and casts it into wide-format data.
Think of working with metal.
library(reshape2)
melt(world, id=NULL)
     variable    value
1      income     1810
2      income    10500
3      income    13300
4      income     6190
5      income    18900
6      income    19500
7  population 31700000
8  population  2920000
9  population 38300000
10 population 26000000
11 population    97800
12 population 42500000
13       area   653000
14       area    28800
15       area  2380000
16       area  1250000
17       area      440
18       area  2780000
Consider this case
countries
              country income population    area
1         Afghanistan   1810   31700000  653000
2             Albania  10500    2920000   28800
3             Algeria  13300   38300000 2380000
4              Angola   6190   26000000 1250000
5 Antigua and Barbuda  18900      97800     440
6           Argentina  19500   42500000 2780000
country is the identifier
library(reshape2)
melt(countries, id="country")
               country   variable    value
1          Afghanistan     income     1810
2              Albania     income    10500
3              Algeria     income    13300
4               Angola     income     6190
5  Antigua and Barbuda     income    18900
6            Argentina     income    19500
7          Afghanistan population 31700000
8              Albania population  2920000
9              Algeria population 38300000
10              Angola population 26000000
11 Antigua and Barbuda population    97800
12           Argentina population 42500000
13         Afghanistan       area   653000
14             Albania       area    28800
15             Algeria       area  2380000
16              Angola       area  1250000
17 Antigua and Barbuda       area      440
18           Argentina       area  2780000
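The two new columns do not have to be called variable and value. A small sketch, assuming reshape2 is installed; it uses three of the countries above, and the column names indicator and amount are invented for the example. id.vars is the full name of the argument that is abbreviated as id in the call above.

```r
library(reshape2)

# Three rows of the countries table, two measured columns
countries <- data.frame(
  country    = c("Afghanistan", "Albania", "Algeria"),
  income     = c(1810, 10500, 13300),
  population = c(31700000, 2920000, 38300000)
)

# variable.name and value.name rename the two columns created by melt()
long <- melt(countries, id.vars = "country",
             variable.name = "indicator", value.name = "amount")
long
```

Each original cell becomes one row, so 3 countries with 2 measured columns give 6 rows.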
reshape2 has several cast functions, for different structures
dcast for data frames
acast for vector, matrix, or array
long <- melt(countries, id="country")
long
               country   variable    value
1          Afghanistan     income     1810
2              Albania     income    10500
3              Algeria     income    13300
4               Angola     income     6190
5  Antigua and Barbuda     income    18900
6            Argentina     income    19500
7          Afghanistan population 31700000
8              Albania population  2920000
9              Algeria population 38300000
10              Angola population 26000000
11 Antigua and Barbuda population    97800
12           Argentina population 42500000
13         Afghanistan       area   653000
14             Albania       area    28800
15             Algeria       area  2380000
16              Angola       area  1250000
17 Antigua and Barbuda       area      440
18           Argentina       area  2780000
dcast(long, country~variable)
              country income population    area
1         Afghanistan   1810   31700000  653000
2             Albania  10500    2920000   28800
3             Algeria  13300   38300000 2380000
4              Angola   6190   26000000 1250000
5 Antigua and Barbuda  18900      97800     440
6           Argentina  19500   42500000 2780000
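A sketch of the full round trip, assuming reshape2 is installed and using three of the countries above. It also shows acast(), which returns a matrix instead of a data frame. Passing value.var explicitly is a choice made here to avoid the message that dcast() prints when it has to guess the value column.

```r
library(reshape2)

countries <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria"),
  income  = c(1810, 10500, 13300),
  area    = c(653000, 28800, 2380000)
)

# Wide to long
long <- melt(countries, id.vars = "country")

# Long back to wide, as a data frame: rows ~ columns
wide <- dcast(long, country ~ variable, value.var = "value")
wide

# Same formula with acast(): a numeric matrix with countries as row names
mat <- acast(long, country ~ variable, value.var = "value")
mat["Albania", "area"]
```

In the formula, the left side of ~ names the identifier that defines the rows and the right side names the column whose values become the new column names.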
So far all the files we have used are structured
That is, they have rows and columns
We use read.table and write.table to read and write a data frame
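A minimal round trip with these two functions. The file is written to a temporary location so that nothing in the working folder is touched; the names small, path and back are chosen for this example, and the three rows are copied from the survey.

```r
# Three students taken from the survey table
small <- data.frame(
  height_cm = c(179, 167, 170),
  weight_kg = c(67, 58, 56),
  row.names = c("st1", "st2", "st3")
)

# Write the data frame to a temporary text file
path <- tempfile(fileext = ".txt")
write.table(small, path)

# Read it back; the row names and the header are recovered from the file
back <- read.table(path)
back

# Delete the temporary file
unlink(path)
```

With the default options, write.table() stores the row names in the first column and read.table() recognizes them because the header line has one field less than the other lines.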
Sometimes the data is not a table
people <- list(Ali=list(age=18, sex='M'),
               Bahar=list(age=19, sex='F'),
               valid=c(TRUE,FALSE))
people
$Ali
$Ali$age
[1] 18

$Ali$sex
[1] "M"


$Bahar
$Bahar$age
[1] 19

$Bahar$sex
[1] "F"


$valid
[1]  TRUE FALSE
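A few ways to look inside a nested list like this one. This is only a sketch; the name ages is made up for the example.

```r
people <- list(Ali   = list(age = 18, sex = "M"),
               Bahar = list(age = 19, sex = "F"),
               valid = c(TRUE, FALSE))

# Names of the top-level elements
names(people)

# Two equivalent ways to reach a nested element
people$Ali$age
people[["Bahar"]][["sex"]]

# Collect the age of each person; "valid" is left out because it is not a person
ages <- sapply(people[c("Ali", "Bahar")], function(p) p$age)
ages

# Compact overview of the whole structure
str(people)
```

Unlike a data frame, the elements of a list can have different lengths and types, which is why this data does not fit in a table.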
How can we read and write lists?
There are several options to store lists into files.
A good one is YAML, which looks like this:
Ali:
  age: 18.0
  sex: M
Bahar:
  age: 19.0
  sex: F
valid:
- yes
- no
: separates a name from its value
- starts each element of a list
--- goes before and after the YAML code
Google “YAML” for more info
We use YAML for the Rmarkdown metadata. For example
---
title: "Midterm Exam"
subtitle: "Computing in Molecular Biology 1"
author: "Put your name here"
number: STUDENT_NUMBER
date: "October 25, 2018"
output: html_document
---
library(yaml)
write_yaml(people, "datafile.yml")
persons <- read_yaml("datafile.yml")
persons
$Ali
$Ali$age
[1] 18

$Ali$sex
[1] "M"


$Bahar
$Bahar$age
[1] 19

$Bahar$sex
[1] "F"


$valid
[1]  TRUE FALSE
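The same conversion can be done in memory, without a file, which is handy to see what the YAML text looks like. A sketch assuming the yaml package is installed; as.yaml() and yaml.load() are the string counterparts of write_yaml() and read_yaml(). The names txt and back are chosen for this example.

```r
library(yaml)

people <- list(Ali   = list(age = 18, sex = "M"),
               Bahar = list(age = 19, sex = "F"),
               valid = c(TRUE, FALSE))

# Convert the list to a YAML string and show it
txt <- as.yaml(people)
cat(txt)

# Parse the string back into an R list
back <- yaml.load(txt)
back$Bahar$age
back$valid
```

The exact text printed by cat() can vary between versions of the package, but parsing it back should return the same values.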
references:
- type: article-journal
  id: WatsonCrick1953
  title: 'Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid'
  author:
  - family: Watson
    given: J. D.
  - family: Crick
    given: F. H. C.
  container-title: Nature
  volume: 171
  issue: 4356
  page: 737-738
  issued:
    date-parts:
    - - 1953
      - 4
      - 25
Put all the references somewhere in the document, with --- before and after.
[@WatsonCrick1953] produces (Watson and Crick 1953)
[@WatsonCrick1953, pp. 33-35, 38-39] becomes (Watson and Crick 1953, 33–35, 38–39).
[@WatsonCrick1953; @Collado-Vides2009a] becomes (Watson and Crick 1953; Collado-Vides et al. 2009).
@WatsonCrick1953 [p. 33] says blah becomes Watson and Crick (1953, 33) says blah
If you have a long list of all papers, and you use it on several documents, then you should put the references in a separate file
Then you write
bibliography: references.yml
in the document metadata
Bibliographies will be placed at the end of the document. Normally, you will want to end your document like this:
last paragraph...

# References
The bibliography will be inserted after this header. More info at
http://rmarkdown.rstudio.com/authoring_bibliographies_and_citations.html
Collado-Vides, J, H Salgado, E Morett, S Gama-Castro, V Jiménez-Jacinto, I Martínez-Flores, A Medina-Rivera, L Muñiz-Rascado, M Peralta-Gil, and A Santos-Zavaleta. 2009. “Bioinformatics Resources for the Study of Gene Regulation in Bacteria.” Journal of Bacteriology 191 (1): 23–31.
Watson, J. D., and F. H. C. Crick. 1953. “Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid.” Nature 171 (4356): 737–38. https://doi.org/10.1038/171737a0.