October 10th, 2016
Many disciplines, including Molecular Biology and Genetics, have become increasingly data-driven.
Starting now, we will use R, a free software environment for data analysis, through the RStudio interface
Many users of R are molecular biologists, but it is also used by economists, psychologists and marketing specialists
You have to install R and RStudio on your computer
Then you have to run RStudio
RStudio, like almost all serious programs, is controlled by the keyboard
The mouse can be used for some shortcuts, but the real deal is the keyboard
A goal of this course is to become comfortable with the keyboard
These tools are for people who read books and don’t watch TV
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
…
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>
This > symbol is called the prompt
From “New Oxford American Dictionary”
and repeat
In RStudio you can press TAB to autocomplete names, and get superpowers!
You can also repeat and edit previous commands using the up and down arrow keys
You can clear the whole line by pressing Escape
Each phrase in a program is imperative.
Involves nouns, verbs and adverbs
Today we will focus on nouns
The first verb we need today is assign <-
Every object in R has 2 important properties:
Nouns are names of objects
To handle objects we give them names
We “store” the objects in variables
If we don’t give a name to an object, it is lost forever
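A minimal sketch of this idea. The name `total` is only an illustration and does not come from the course material: the first result is printed once and cannot be reused, while the second is stored under a name.

```r
# Without a name the result is printed once and then lost
2 + 3

# With a name the object is stored and can be used again
# ('total' is a hypothetical name chosen for this example)
total <- 2 + 3
total
total * 10
```

Typing the name alone at the prompt prints the stored object.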
> rivers
  [1]  735  320  325  392  524  450 1459  135  465  600  330  336  280  315
 [15]  870  906  202  329  290 1000  600  505 1450  840 1243  890  350  407
 [29]  286  280  525  720  390  250  327  230  265  850  210  630  260  230
 [43]  360  730  600  306  390  420  291  710  340  217  281  352  259  250
 [57]  470  680  570  350  300  560  900  625  332 2348 1171 3710 2315 2533
 [71]  780  280  410  460  260  255  431  350  760  618  338  981 1306  500
 [85]  696  605  250  411 1054  735  233  435  490  310  460  383  375 1270
 [99]  545  445 1885  380  300  380  377  425  276  210  800  420  350  360
[113]  538 1100 1205  314  237  610  360  540 1038  424  310  300  444  301
[127]  268  620  215  652  900  525  246  360  529  500  720  270  430  671
[141] 1770
Also known as categorical variables.
They are used for discrete values, for example when there is no natural order
These are variables that you would never average
> state.name
 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"
 [5] "California"     "Colorado"       "Connecticut"    "Delaware"
 [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"
[13] "Illinois"       "Indiana"        "Iowa"           "Kansas"
[17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"
[21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"
[25] "Missouri"       "Montana"        "Nebraska"       "Nevada"
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"
[33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"
[37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
[41] "South Dakota"   "Tennessee"      "Texas"          "Utah"
[45] "Vermont"        "Virginia"       "Washington"     "West Virginia"
[49] "Wisconsin"      "Wyoming"
> state.abb
 [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN"
[15] "IA" "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV"
[29] "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN"
[43] "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY"
> state.area
 [1]  51609 589757 113909  53104 158693 104247   5009   2057  58560  58876
[11]   6450  83557  56400  36291  56290  82264  40395  48523  33215  10577
[21]   8257  58216  84068  47716  69686 147138  77227 110540   9304   7836
[31] 121666  49576  52586  70665  41222  69919  96981  45333   1214  31055
[41]  77047  42244 267339  84916   9609  40815  68192  24181  56154  97914
> state.area > 80000
 [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[12]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[23]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
[34] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
[45] FALSE FALSE FALSE FALSE FALSE  TRUE
> state.region
 [1] South         West          West          South         West
 [6] West          Northeast     South         South         South
[11] West          West          North Central North Central North Central
[16] North Central South         South         Northeast     South
[21] Northeast     North Central North Central South         North Central
[26] West          North Central West          Northeast     Northeast
[31] West          Northeast     South         North Central North Central
[36] South         West          Northeast     Northeast     South
[41] North Central South         South         West          Northeast
[46] South         West          South         North Central West
Levels: Northeast South North Central West
> c(1,2,3)
[1] 1 2 3
> c(10,20)
[1] 10 20
The function c() takes many values and makes a single vector
All values should be of the same type
Question for home: what happens if they have different types?
> x <- c(1,2,3)
> y <- c(10,20)
We use the <- operator for assignment.
> x
[1] 1 2 3
> y
[1] 10 20
x and y are two numeric vectors. We can concatenate them
> c(x, y, 5)
[1] 1 2 3 10 20 5
> c(TRUE, TRUE, FALSE, TRUE)
[1] TRUE TRUE FALSE TRUE
We can also write c(T,T,F,T), but T and F are ordinary variables that can be overwritten, so TRUE and FALSE are safer
A comparison creates a logical vector
> weight <- c(60, 72, 57, 90, 95, 72)
> weight > 25
[1] TRUE TRUE TRUE TRUE TRUE TRUE
Same idea. Concatenation
Each element must be enclosed in single or double quotes
> c("alpha", 'beta', "gamma")
[1] "alpha" "beta" "gamma"
> c('he said "yes"', "I don't know")
[1] "he said \"yes\"" "I don't know"
Special characters are coded with two symbols: \", \\, \n, \t
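A short sketch of how these escapes behave. The variable name `s` and the use of `cat()` and `nchar()` are additions for illustration, not part of the slides. The console shows the escaped form, while `cat()` writes the text as it really is.

```r
# 's' is a hypothetical name used only for this example
s <- "he said \"yes\""

print(s)      # shows the escaped form, with backslashes
cat(s, "\n")  # writes the actual text: he said "yes"

nchar(s)      # 13: each escaped quote counts as one character
nchar("\n")   # 1: the two symbols stand for a single character

cat("name\tvalue\nweight\t60\n")  # \t is a tab, \n starts a new line
```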
Easy. Any character vector can be transformed into a factor
> chr.vector <- c("female", "male", "male", "female", "male", "male", "female", "female")
> chr.vector
[1] "female" "male"   "male"   "female" "male"   "male"   "female" "female"
> fact.vector <- factor(chr.vector)
> fact.vector
[1] female male   male   female male   male   female female
Levels: female male
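To look inside a factor, a sketch along these lines may help. The name `sex` is hypothetical, and `levels()`, `as.integer()` and `table()` are base R functions not shown in the slides. It suggests that a factor stores a small set of categories plus an integer code for each element.

```r
# 'sex' is an illustrative name, not taken from the slides
sex <- factor(c("female", "male", "male", "female"))

levels(sex)      # the distinct categories: "female" "male"
as.integer(sex)  # the internal codes: 1 2 2 1
table(sex)       # how many elements fall in each category
```

Counting with `table()` makes sense for a factor, whereas averaging it does not.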
> 4:9
[1] 4 5 6 7 8 9
> seq(4,9)
[1] 4 5 6 7 8 9
> seq(4,10,2)
[1] 4 6 8 10
> seq(from=4, by=2, length.out=4)
[1] 4 6 8 10
> rep(1,3)
[1] 1 1 1
> rep(c(7,9,13), 3)
[1] 7 9 13 7 9 13 7 9 13
> rep(c(7,9,13), 1:3)
[1] 7 9 9 13 13 13
> rep(1:2,c(10,5))
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
> rep(c(TRUE,FALSE),3)
[1] TRUE FALSE TRUE FALSE TRUE FALSE
> rep(c(TRUE,FALSE),c(3,3))
[1] TRUE TRUE TRUE FALSE FALSE FALSE
> c(NA,TRUE, FALSE)
[1] NA TRUE FALSE
> c(NA,1,2)
[1] NA 1 2
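NA marks a missing value and can appear in a vector of any type. The sketch below assumes a hypothetical vector `w` where one measurement is missing. `is.na()` is a base R function that the slides do not show.

```r
# 'w' is a hypothetical vector; the second value was not measured
w <- c(60, NA, 57)

is.na(w)  # FALSE  TRUE FALSE: tells which elements are missing
w + 1     # 61 NA 58: operations on a missing value stay missing
```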
Every element can have a name
> weight <- c(Ali=60, Deniz=72, Fatma=57, Emre=90, Volkan=95, Onur=72)
> names(weight)
[1] "Ali" "Deniz" "Fatma" "Emre" "Volkan" "Onur"
> height <- c(1.75,1.80,1.65,1.90,1.74, 1.91)
> names(height) <- names(weight)
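The two ways of naming elements can be checked side by side. This sketch repeats the vectors above so that it runs on its own; `identical()` is a base R function used here only to compare the results.

```r
# Names given at creation time
weight <- c(Ali=60, Deniz=72, Fatma=57, Emre=90, Volkan=95, Onur=72)

# Names assigned afterwards, copied from another vector
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)
names(height) <- names(weight)

height                                   # now prints with the names on top
identical(names(height), names(weight))  # TRUE: both carry the same names
```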
To get the i-th element of a vector v we use v[i]
> weight[3]
Fatma
   57
> weight
   Ali  Deniz  Fatma   Emre Volkan   Onur
    60     72     57     90     95     72
> weight[c(1,3,5)]
   Ali  Fatma Volkan
    60     57     95
> weight[2:4]
Deniz Fatma  Emre
   72    57    90
Negative indices are used to indicate omitted elements
> weight
   Ali  Deniz  Fatma   Emre Volkan   Onur
    60     72     57     90     95     72
> weight[c(-1,-3,-5)]
Deniz  Emre  Onur
   72    90    72
Useful when you need almost all elements
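A few more cases of negative indices, as a sketch that repeats the `weight` vector so it runs on its own. The last line is commented out because mixing positive and negative indices is an error in R.

```r
weight <- c(Ali=60, Deniz=72, Fatma=57, Emre=90, Volkan=95, Onur=72)

weight[-1]      # everything except the first element
weight[-(1:3)]  # drop the first three: Emre, Volkan and Onur remain

# weight[c(-1, 2)]  # error: positive and negative indices cannot be mixed
```

The parentheses in `-(1:3)` matter, since `-1:3` would mean the sequence from -1 to 3.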
A vector can be indexed by a logical vector
The index must be of the same length as the vector
> weight>72
   Ali  Deniz  Fatma   Emre Volkan   Onur
 FALSE  FALSE  FALSE   TRUE   TRUE  FALSE
> weight[weight>72]
  Emre Volkan
    90     95
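Because the comparison is itself a vector, it can be stored and reused. The sketch below assumes a hypothetical name `heavy` and uses the operators `!` (not) and `&` (and), which the slides have not introduced.

```r
weight <- c(Ali=60, Deniz=72, Fatma=57, Emre=90, Volkan=95, Onur=72)

# 'heavy' is an illustrative name for the stored logical vector
heavy <- weight > 72

names(weight)[heavy]  # "Emre" "Volkan": the same index works on the names
weight[!heavy]        # the complement: everyone at 72 or below

# Two conditions combined element by element
weight[weight >= 60 & weight <= 80]  # Ali, Deniz and Onur
```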
If a vector has names, we can use them:
> weight[c("Deniz", "Volkan", "Fatma")]
 Deniz Volkan  Fatma
    72     95     57
> names(vector)
NULL