We know that Computational Thinking has four parts
- Decomposition
- Pattern recognition
- Abstraction
- Algorithm design
Today we will talk about patterns
March 8, 2019
We know that Computational Thinking has four parts
Today we will talk about patterns
When we see a pattern we can make
There is another kind of pattern
Sometimes the easiest way to define a pattern is to use the same pattern.
For example:
To represent these patterns we use functions that are part of themselves
example<-function(input) { code example(data) code return(output) }
One useful mathematical function is factorial
The factorial of a \(n\) is \(n!\), defined as \[n! = n\cdot(n-1)!\]
That is, the factorial of n
is n
times the factorial of n-1
How do we write it in R?
factorial <- function(n) { ans <- n * factorial(n-1) return(ans) }
There is an error here. Can you find it?
Use the debugger to see what is happening
Each recursive function needs an exit condition
That is, some rule to decide when to finish
There is always an easy case that does not require recursion
For example, the factorial of 1 is \(1! = 1\).
factorial <- function(n) { if(n <= 1) { ans <- 1 } else { ans <- n * factorial(n-1) } return(ans) }
if <> then
There are two version: if-then and if-then-else
These command takes one logic condition and one or two code blocks
Most of times we use only if-then
if(cond) { code if true }
if(cond) { code if true } else { code if false }
library(TurtleGraphics) turtle_init(400, 400, mode="clip") turtle_setpos(100, 250) turtle_setangle(60) for(i in 1:40) { turtle_move(200) turtle_right(10) if(i>20) { turtle_col("blue") } }
find_min <-function(vector) { n <- length(vector) if(n==1) { return(vector[1]) } else { a <- find_min(vector[1:(n/2)]) b <- find_min(vector[(n/2+1):n]) if(a<b) { return(a) } else { return(b) } } }
if()
to make decisions
if(cond){A}
to do A
only when cond
is TRUEif(cond){A} else {B}
to do either A
or B
but not bothDespite being large molecules, DNA and proteins are made with pieces of only a few types
DNA is made of four bases. We can represent it with four letters
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
Proteins have 20 aminoacids and three stop codons. We can represent them with 23 letters.
MKRISTTITTTITITTGNGAG
In molecular biology we often work with sequences
The main reason why computing is useful for molecular biology
ATGAATACTATATTTTCAAGAATAACACCATTAGGAAATGGTACGTTATGTGTTATAAGAATTTCTGGAA AAAATGTAAAATTTTTAATACAAAAAATTGTAAAAAAAAATATAAAAGAAAAAATAGCTACTTTTTCTAA ATTATTTTTAGATAAAGAATGTGTAGATTATGCAATGATTATTTTTTTTAAAAAACCAAATACGTTCACT GGAGAAGATATAATCGAATTTCATATTCACAATAATGAAACTATTGTAAAAAAAATAATTAATTATTTAT TATTAAATAAAGCAAGATTTGCAAAAGCTGGCGAATTTTTAGAAAGACGATATTTAAATGGAAAAATTTC TTTAATAGAATGCGAATTAATAAATAATAAAATTTTATATGATAATGAAAATATGTTTCAATTAACAAAA AATTCTGAAAAAAAAATATTTTTATGTATAATTAAAAATTTAAAATTTAAAATAAATTCTTTAATAATTT GTATTGAAATCGCAAATTTTAATTTTAGTTTTTTTTTTTTTAATGATTTTTTATTTATAAAATATACATT TAAAAAACTATTAAAACTTTTAAAAATATTAATTGATAAAATAACTGTTATAAATTATTTAAAAAAGAAT TTCACAATAATGATATTAGGTAGAAGAAATGTAGGAAAGTCTACTTTATTTAATAAAATATGTGCACAAT ATGACTCGATTGTAACTAATATTCCTGGTACTACAAAAAATATTATATCAAAAAAAATAAAAATTTTATC TAAAAAAATAAAAATGATGGATACAGCAGGATTAAAAATTAGAACTAAAAATTTAATTGAAAAAATTGGA ATTATTAAAAATATAAATAAAATTTATCAAGGAAATTTAATTTTGTATATGATTGATAAATTTAATATTA AAAATATATTTTTTAACATTCCAATAGATTTTATTGATAAAATTAAATTAAATGAATTAATAATTTTAGT TAACAAATCAGATATTTTAGGAAAAGAAGAAGGAGTTTTTAAAATAAAAAATATATTAATAATTTTAATT TCTTCTAAAAATGGAACTTTTATAAAAAATTTAAAATGTTTTATTAATAAAATCGTTGATAATAAAGATT TTTCTAAAAATAATTATTCTGATGTTAAAATTCTATTTAATAAATTTTCTTTTTTTTATAAAGAATTTTC ATGTAACTATGATTTAGTGTTATCAAAATTAATTGATTTTCAAAAAAATATATTTAAATTAACAGGAAAT TTTACTAATAAAAAAATAATAAATTCTTGTTTTAGAAATTTTTGTATTGGTAAATGAATATTTTTAATAT AATTATTATTGGAGCAGGACATTCTGGTATAGAAGCAGCTATATCTGCATCTAAAATATGTAATAAAATA
A
, turtle moves leftT
, turtle moves rightG
, turtle moves upC
, turtle moves downlibrary(TurtleGraphics) turtle_init(mode = "clip") turtle_hide() for(i in 1:1000) { if(dna[i]=="A") { turtle_setangle(90) } else if(dna[i]=="T") { turtle_setangle(270) } else if(dna[i]=="G") { turtle_setangle(0) } else if(dna[i]=="C") { turtle_setangle(180) } turtle_forward(1) }
This program takes a lot of time (it is a turtle, after all)
This is the drawing of Candidatus Carsonella ruddii genome
It is the bacteria with the smallest genome
How can you answer these questions?
There are several ways to store DNA or protein data
Most of the times they are stored in FASTA format
FASTA files are text files, with some rules
You should never use word to store sequences
>AP009180.1 Candidatus Carsonella ruddii PV DNA, complete genome ATGAATACTATATTTTCAAGAATAACACCATTAGGAAATGGTACGTTATGTGTTATAAGAATTTCTGGAA AAAATGTAAAATTTTTAATACAAAAAATTGTAAAAAAAAATATAAAAGAAAAAATAGCTACTTTTTCTAA ATTATTTTTAGATAAAGAATGTGTAGATTATGCAATGATTATTTTTTTTAAAAAACCAAATACGTTCACT GGAGAAGATATAATCGAATTTCATATTCACAATAATGAAACTATTGTAAAAAAAATAATTAATTATTTAT TATTAAATAAAGCAAGATTTGCAAAAGCTGGCGAATTTTTAGAAAGACGATATTTAAATGGAAAAATTTC TTTAATAGAATGCGAATTAATAAATAATAAAATTTTATATGATAATGAAAATATGTTTCAATTAACAAAA AATTCTGAAAAAAAAATATTTTTATGTATAATTAAAAATTTAAAATTTAAAATAAATTCTTTAATAATTT GTATTGAAATCGCAAATTTTAATTTTAGTTTTTTTTTTTTTAATGATTTTTTATTTATAAAATATACATT TAAAAAACTATTAAAACTTTTAAAAATATTAATTGATAAAATAACTGTTATAAATTATTTAAAAAAGAAT TTCACAATAATGATATTAGGTAGAAGAAATGTAGGAAAGTCTACTTTATTTAATAAAATATGTGCACAAT ATGACTCGATTGTAACTAATATTCCTGGTACTACAAAAAATATTATATCAAAAAAAATAAAAATTTTATC TAAAAAAATAAAAATGATGGATACAGCAGGATTAAAAATTAGAACTAAAAATTTAATTGAAAAAATTGGA ATTATTAAAAATATAAATAAAATTTATCAAGGAAATTTAATTTTGTATATGATTGATAAATTTAATATTA AAAATATATTTTTTAACATTCCAATAGATTTTATTGATAAAATTAAATTAAATGAATTAATAATTTTAGT TAACAAATCAGATATTTTAGGAAAAGAAGAAGGAGTTTTTAAAATAAAAAATATATTAATAATTTTAATT TCTTCTAAAAATGGAACTTTTATAAAAAATTTAAAATGTTTTATTAATAAAATCGTTGATAATAAAGATT TTTCTAAAAATAATTATTCTGATGTTAAAATTCTATTTAATAAATTTTCTTTTTTTTATAAAGAATTTTC ATGTAACTATGATTTAGTGTTATCAAAATTAATTGATTTTCAAAAAAATATATTTAAATTAACAGGAAAT TTTACTAATAAAAAAATAATAAATTCTTGTTTTAGAAATTTTTGTATTGGTAAATGAATATTTTTAATAT AATTATTATTGGAGCAGGACATTCTGGTATAGAAGCAGCTATATCTGCATCTAAAATATGTAATAAAATA
… and more
>NC_000913.3_prot_NP_414542.1_1 [gene=thrL] [protein=thr operon leader peptide] [protein_id=NP_414542.1] [location=190..255] MKRISTTITTTITITTGNGAG >NC_000913.3_prot_NP_414543.1_2 [gene=thrA] [protein=Bifunctional aspartokinase/homoserine dehydrogenase 1] [protein_id=NP_414543.1] [location=337..2799] MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNISDAERI FAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEA RGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLGRNGSDYS AAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPC LIKNTGNPQAPGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLIT QSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSW LKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAV ADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSRRKFLYDTNVGAGLPVIENLQNLLNAGDELM KFSGILSGSLSYIFGKLDEGMSFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIE IEPVLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFK VKNGENALAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV >NC_000913.3_prot_NP_414544.1_3 [gene=thrB] [protein=homoserine kinase] [protein_id=NP_414544.1] [location=2801..3733] MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEPRENIVYQCWE RFCQELGKQIPVAMTLEKNMPIGSGLGSSACSVVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHY DNVAPCFLGGMQLMIEENDIISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCIAHGRHLAGF IHACYSRQPELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPETAQRVA DWLGKNYLQNQEGFVHICRLDTAGARVLEN >NC_000913.3_prot_NP_414545.1_4 [gene=thrC] [protein=L-threonine synthase] [protein_id=NP_414545.1] [location=3734..5020] MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLTEIDEMLKLDFVTRSAKILSAFIGDEIPQE ILEERVRAAFAFPAPVANVESDVGCLELFHGPTLAFKDFGGRFMAQMLTHIAGDKPVTILTATSGDTGAA VAHAFYGLPNVKVVILYPRGKISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNS ANSINISRLLAQICYYFEAVAQLPQETRNQLVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVNDTVP RFLHDGQWSPKATQATLSNAMDVSQPNNWPRVEELFRRKIWQLKELGYAAVDDETTQQTMRELKELGYTS EPHAAVAYRALRDQLNPGEYGLFLGTAHPAKFKESVEAILGETLDLPKELAERADLPLLSHNLPADFAAL RKLMMNHQ >NC_000913.3_prot_NP_414546.1_5 [gene=yaaX] [protein=DUF2502 family putative periplasmic protein] [protein_id=NP_414546.1] [location=5234..5530] MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHL HGPPPPPRHHKKAPHDHHGGHGPGKHHR >NC_000913.3_prot_NP_414547.1_6 [gene=yaaA] [protein=peroxide resistance protein, lowers intracellular iron] [protein_id=NP_414547.1] [location=complement(5683..6459)] MLILISPAKTLDYQSPLTTTRYTLPELLDNSQQLIHEARKLTPPQISTLMRISDKLAGINAARFHDWQPD FTPANARQAILAFKGDVYTGLQAETFSEDDFDFAQQHLRMLSGLYGVLRPLDLMQPYRLEMGIRLENARG
NCBI stores all public biological sequences at
https://www.ncbi.nlm.nih.gov/nuccore
Anybody can upload sequences, and they may be wrong
NCBI has a curation process to validate the sequences
If a sequence is good enough to be a reference, then it is stored in the RefSeq collection https://www.ncbi.nlm.nih.gov/refseq
Accession numbers are the best way to identify a biological sequence
Different sequences can have the same name, but never the same accession
NCBI search box looks for patterns everywhere, not only on the title
and many sequences can have the same name
To be sure of using the correct sequence, go to NCBI Taxonomy
https://www.ncbi.nlm.nih.gov/taxonomy
Try with “Escherichia coli”. There are many
You will see that each organism has a taxonomic ID
You click on Nucleotide–Direct Links
In the search box add the phrase “complete genome”
You download the FASTA file
Store it on your computer, and change the name
If you use the Lab’s computers, save it on X:
To handle sequence data in R, we use the seqinr library
You have to install it once.
install.packages("seqinr")
Then you have to load it on every session
library(seqinr)
read.fasta(file, seqtype = "DNA", ...)
A list of vectors of chars. Each element is a sequence object.
library(seqinr) sequences <- read.fasta("NC_000913.fna") dna <- sequences[[1]]
read.fasta
returns a list of vectors of charactersStatistics is a way to tell a story that makes sense of the data
In genomics, we look for biological sense
That story can be global: about the complete genome
Or can be local: about some region of the genome
We will start with global properties
The percentage of nitrogenous bases on a DNA molecule that are either guanine or cytosine.
Measuring the melting temperature of the DNA double helix using spectrophotometry
If the DNA has been sequenced then the GC-content can be accurately calculated by simple arithmetic.
GC-content percentage is calculated as \[\frac{G+C}{A+T+G+C}\]
Determine the GC content of E.coli
Is this GC content uniform through all genome?
The result should depend on:
gc_content <- function(sequence, position, size) { # count letters return(ans) }
gc_content()
in many placesWe want to evaluate gc_content
on different positions of the genome
positions <- seq(from=1, to=length(dna), by= window_size)
Now we apply gc_content
to each element of positions
?
for(pos in seq(from=1, to=length(dna), by= window_size)) { skew[pos] <- gc_skew(pos, dna, window_size) }
What is the difference \(G-C\) respect to the total \(G+C\)?
Calculate the ratio \[\frac{G-C}{G+C}\] using sliding windows and plot it
What do you see?
There is a richness of G over C and T over A in the leading strand
There is a richness of C over G and A over T in the lagging strand
GC skew changes sign at the boundaries of the two replichores
This corresponds to DNA replication origin or terminus
GC skew changes sign at the boundaries of the two replichores
This corresponds to DNA replication origin or terminus
The replication origin is usually called ori.
If you want to learn more about the science, read:
Roten C-AH, Gamba P, Barblan J-L, Karamata D. Comparative Genometrics (CG): a database dedicated to biometric comparisons of whole genomes. Nucleic Acids Research. 2002;30(1):142-144