Finding Patterns in Nature and Genetics

March 8, 2019

Computational Thinking

We know that Computational Thinking has four parts

Decomposition
Pattern recognition
Abstraction
Algorithm design

Today we will talk about patterns

Patterns

in the dictionary

a repeated decorative design: a wallpaper pattern
an arrangement or sequence regularly found in comparable objects: the house had been built on the usual pattern
a regular form or sequence discernible in certain actions or situations: the problems followed a repeated pattern

Science is about Patterns

The goal of Science is to understand Nature
We find the laws that explain Patterns in Nature
The first step is to find Patterns
- Day & Night, Winter & Summer are Patterns
- Understanding seasons allows us to survive
- Some food makes us healthy
- Understanding which food makes us sick is essential to survive

Patterns in Technology

Patterns in Nature

Patterns in Art

Patterns in Decoration

Patterns in Music

Patterns in programs: loops

Patterns in programs

When we see a pattern we can make

loops, if we need to repeat the pattern in one place
- Like drawing each angle in a star
functions, if the pattern is used in several places
- Like drawing several stars in different places.

There is another kind of pattern

Recursive patterns in Nature

Recursive patterns in Trees

Recursive patterns in Rivers

Recursive patterns in the Body

A Pattern can be part of itself

Sometimes the easiest way to define a pattern is to use the same pattern.

For example:

A tree has a trunk and branches. Each branch is a trunk with branches.
- In other words, each branch is a little tree
- Real trees or phylogenetic trees
Rivers are made of small rivers
blood vessels connect smaller blood vessels

Recursive functions

To represent these patterns we use functions that are part of themselves

example<-function(input) {
    code
    example(data)
    code
    return(output)
}

Example of recursive function

One useful mathematical function is factorial

The factorial of a \(n\) is \(n!\), defined as \[n! = n\cdot(n-1)!\]

That is, the factorial of n is n times the factorial of n-1

How do we write it in R?

Let’s write our own factorial function

factorial <- function(n) {
    ans <- n * factorial(n-1)
    return(ans)
}

There is an error here. Can you find it?

Use the debugger to see what is happening

Exit condition

Each recursive function needs an exit condition

That is, some rule to decide when to finish

There is always an easy case that does not require recursion

For example, the factorial of 1 is \(1! = 1\).

Exit condition for factorial

factorial <- function(n) {
    if(n <= 1) {
        ans <- 1
    } else {
        ans <- n * factorial(n-1)
    }
    return(ans)
}

Conditionals in Scratch

if <> then

There are two version: if-then and if-then-else

These command takes one logic condition and one or two code blocks

Most of times we use only if-then

Conditionals in R

if(cond) {
    code if true
}

if(cond) {
    code if true
} else {
    code if false
}

Example of decision

library(TurtleGraphics)
turtle_init(400, 400, mode="clip")
turtle_setpos(100, 250)
turtle_setangle(60)
for(i in 1:40) {
  turtle_move(200)
  turtle_right(10)
  if(i>20) {
      turtle_col("blue")
  }
}

Example of recursive function to find minimum

find_min <-function(vector) {
    n <- length(vector)
    if(n==1) {
        return(vector[1])
    } else {
        a <- find_min(vector[1:(n/2)])
        b <- find_min(vector[(n/2+1):n])
        if(a<b) {
            return(a)
        } else {
            return(b)
        }
    }
}

Summary

use if() to make decisions
- use if(cond){A} to do A only when cond is TRUE
- use if(cond){A} else {B} to do either A or B but not both
a recursive function is used inside itself
- there must be an exit condition
We have all the “lego” pieces we will need
- loops
- decisions
- functions

Big picture

You have learned the words (vocabulary)
You have learned the rules (grammar)
Now it is time to write some poetry
- At least some stories

Next steps

We will use computational thinking to solve real problems
We start with DNA analysis
Then we will learn how to analyze systems
- Such as chemical reactions, cells, ecosystems, and organizations
We will simulate systems and find how they react to environmental changes

Let’s have the break now

Figures from DNA

DNA

A big molecule, but not too complex. It is a polymer
All made of only 4 pieces (pattern again!)

Proteins

Looks more complex, but it is still a polymer

DNA and proteins are easy to model

Despite being large molecules, DNA and proteins are made with pieces of only a few types

DNA is made of four bases. We can represent it with four letters

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

Proteins have 20 aminoacids and three stop codons. We can represent them with 23 letters.

MKRISTTITTTITITTGNGAG

Sequence data

In molecular biology we often work with sequences

DNA sequences use 4 letters to represent the nucleotides in one of the two strands
Protein sequences use 20 letters to represent the amino-acids, from amino to carboxyl terminal
Other sequences are sometime used:
- RNA,
- DNA with ambiguous nucleotides,
- amino-acid sequences with stop codons

Sequence data is digital data

The main reason why computing is useful for molecular biology

DNA is discrete data
Either “A”, “C”, “G” or “T”
No middle-values
All other things we measure are in a range
- Like temperature, concentration, expression
We can forget all the chemistry and work with symbols

Example

DNA looks like this

and many more lines

ATGAATACTATATTTTCAAGAATAACACCATTAGGAAATGGTACGTTATGTGTTATAAGAATTTCTGGAA
AAAATGTAAAATTTTTAATACAAAAAATTGTAAAAAAAAATATAAAAGAAAAAATAGCTACTTTTTCTAA
ATTATTTTTAGATAAAGAATGTGTAGATTATGCAATGATTATTTTTTTTAAAAAACCAAATACGTTCACT
GGAGAAGATATAATCGAATTTCATATTCACAATAATGAAACTATTGTAAAAAAAATAATTAATTATTTAT
TATTAAATAAAGCAAGATTTGCAAAAGCTGGCGAATTTTTAGAAAGACGATATTTAAATGGAAAAATTTC
TTTAATAGAATGCGAATTAATAAATAATAAAATTTTATATGATAATGAAAATATGTTTCAATTAACAAAA
AATTCTGAAAAAAAAATATTTTTATGTATAATTAAAAATTTAAAATTTAAAATAAATTCTTTAATAATTT
GTATTGAAATCGCAAATTTTAATTTTAGTTTTTTTTTTTTTAATGATTTTTTATTTATAAAATATACATT
TAAAAAACTATTAAAACTTTTAAAAATATTAATTGATAAAATAACTGTTATAAATTATTTAAAAAAGAAT
TTCACAATAATGATATTAGGTAGAAGAAATGTAGGAAAGTCTACTTTATTTAATAAAATATGTGCACAAT
ATGACTCGATTGTAACTAATATTCCTGGTACTACAAAAAATATTATATCAAAAAAAATAAAAATTTTATC
TAAAAAAATAAAAATGATGGATACAGCAGGATTAAAAATTAGAACTAAAAATTTAATTGAAAAAATTGGA
ATTATTAAAAATATAAATAAAATTTATCAAGGAAATTTAATTTTGTATATGATTGATAAATTTAATATTA
AAAATATATTTTTTAACATTCCAATAGATTTTATTGATAAAATTAAATTAAATGAATTAATAATTTTAGT
TAACAAATCAGATATTTTAGGAAAAGAAGAAGGAGTTTTTAAAATAAAAAATATATTAATAATTTTAATT
TCTTCTAAAAATGGAACTTTTATAAAAAATTTAAAATGTTTTATTAATAAAATCGTTGATAATAAAGATT
TTTCTAAAAATAATTATTCTGATGTTAAAATTCTATTTAATAAATTTTCTTTTTTTTATAAAGAATTTTC
ATGTAACTATGATTTAGTGTTATCAAAATTAATTGATTTTCAAAAAAATATATTTAAATTAACAGGAAAT
TTTACTAATAAAAAAATAATAAATTCTTGTTTTAGAAATTTTTGTATTGGTAAATGAATATTTTTAATAT
AATTATTATTGGAGCAGGACATTCTGGTATAGAAGCAGCTATATCTGCATCTAAAATATGTAATAAAATA

We can use the Turtle to see the DNA

For each nucleotide
- if the nucleotide is A, turtle moves left
- if the nucleotide is T, turtle moves right
- if the nucleotide is G, turtle moves up
- if the nucleotide is C, turtle moves down

The Turtle walks the DNA sequence

library(TurtleGraphics)
turtle_init(mode = "clip")
turtle_hide()
for(i in 1:1000) {
    if(dna[i]=="A") {
        turtle_setangle(90)
    } else if(dna[i]=="T") {
        turtle_setangle(270)
    } else if(dna[i]=="G") {
        turtle_setangle(0)
    } else if(dna[i]=="C") {
        turtle_setangle(180)
    }
    turtle_forward(1)
}

After a long time we get this

This program takes a lot of time (it is a turtle, after all)

This is the drawing of Candidatus Carsonella ruddii genome

It is the bacteria with the smallest genome

Same plot for Escherichia coli

Same plot for Schizosaccaromyces Pombe

Scientific questions about DNA-walk

Why do we got this shape?
What is the shape for other organisms?
Do you think this shape is random?
Do you think there is some logic? some structure?
What are the minimum and maximum of each axis?

How can you answer these questions?

Technical questions

How can you answer the previous questions?
How can we make these drawings faster?
How can we extract meaningful biological information from these drawings?

Handling DNA Sequences in R

Sequences are stored in FASTA format

There are several ways to store DNA or protein data

Most of the times they are stored in FASTA format

FASTA files are text files, with some rules

You should never use word to store sequences

Example FASTA file

>AP009180.1 Candidatus Carsonella ruddii PV DNA, complete genome
ATGAATACTATATTTTCAAGAATAACACCATTAGGAAATGGTACGTTATGTGTTATAAGAATTTCTGGAA
AAAATGTAAAATTTTTAATACAAAAAATTGTAAAAAAAAATATAAAAGAAAAAATAGCTACTTTTTCTAA
ATTATTTTTAGATAAAGAATGTGTAGATTATGCAATGATTATTTTTTTTAAAAAACCAAATACGTTCACT
GGAGAAGATATAATCGAATTTCATATTCACAATAATGAAACTATTGTAAAAAAAATAATTAATTATTTAT
TATTAAATAAAGCAAGATTTGCAAAAGCTGGCGAATTTTTAGAAAGACGATATTTAAATGGAAAAATTTC
TTTAATAGAATGCGAATTAATAAATAATAAAATTTTATATGATAATGAAAATATGTTTCAATTAACAAAA
AATTCTGAAAAAAAAATATTTTTATGTATAATTAAAAATTTAAAATTTAAAATAAATTCTTTAATAATTT
GTATTGAAATCGCAAATTTTAATTTTAGTTTTTTTTTTTTTAATGATTTTTTATTTATAAAATATACATT
TAAAAAACTATTAAAACTTTTAAAAATATTAATTGATAAAATAACTGTTATAAATTATTTAAAAAAGAAT
TTCACAATAATGATATTAGGTAGAAGAAATGTAGGAAAGTCTACTTTATTTAATAAAATATGTGCACAAT
ATGACTCGATTGTAACTAATATTCCTGGTACTACAAAAAATATTATATCAAAAAAAATAAAAATTTTATC
TAAAAAAATAAAAATGATGGATACAGCAGGATTAAAAATTAGAACTAAAAATTTAATTGAAAAAATTGGA
ATTATTAAAAATATAAATAAAATTTATCAAGGAAATTTAATTTTGTATATGATTGATAAATTTAATATTA
AAAATATATTTTTTAACATTCCAATAGATTTTATTGATAAAATTAAATTAAATGAATTAATAATTTTAGT
TAACAAATCAGATATTTTAGGAAAAGAAGAAGGAGTTTTTAAAATAAAAAATATATTAATAATTTTAATT
TCTTCTAAAAATGGAACTTTTATAAAAAATTTAAAATGTTTTATTAATAAAATCGTTGATAATAAAGATT
TTTCTAAAAATAATTATTCTGATGTTAAAATTCTATTTAATAAATTTTCTTTTTTTTATAAAGAATTTTC
ATGTAACTATGATTTAGTGTTATCAAAATTAATTGATTTTCAAAAAAATATATTTAAATTAACAGGAAAT
TTTACTAATAAAAAAATAATAAATTCTTGTTTTAGAAATTTTTGTATTGGTAAATGAATATTTTTAATAT
AATTATTATTGGAGCAGGACATTCTGGTATAGAAGCAGCTATATCTGCATCTAAAATATGTAATAAAATA

… and more

FASTA file (amino acids)

>NC_000913.3_prot_NP_414542.1_1 [gene=thrL] [protein=thr operon leader peptide] [protein_id=NP_414542.1] [location=190..255]
MKRISTTITTTITITTGNGAG
>NC_000913.3_prot_NP_414543.1_2 [gene=thrA] [protein=Bifunctional aspartokinase/homoserine dehydrogenase 1] [protein_id=NP_414543.1] [location=337..2799]
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNISDAERI
FAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEA
RGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLGRNGSDYS
AAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPC
LIKNTGNPQAPGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLIT
QSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL
ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSW
LKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAV
ADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSRRKFLYDTNVGAGLPVIENLQNLLNAGDELM
KFSGILSGSLSYIFGKLDEGMSFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIE
IEPVLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFK
VKNGENALAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV
>NC_000913.3_prot_NP_414544.1_3 [gene=thrB] [protein=homoserine kinase] [protein_id=NP_414544.1] [location=2801..3733]
MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEPRENIVYQCWE
RFCQELGKQIPVAMTLEKNMPIGSGLGSSACSVVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHY
DNVAPCFLGGMQLMIEENDIISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCIAHGRHLAGF
IHACYSRQPELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPETAQRVA
DWLGKNYLQNQEGFVHICRLDTAGARVLEN
>NC_000913.3_prot_NP_414545.1_4 [gene=thrC] [protein=L-threonine synthase] [protein_id=NP_414545.1] [location=3734..5020]
MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLTEIDEMLKLDFVTRSAKILSAFIGDEIPQE
ILEERVRAAFAFPAPVANVESDVGCLELFHGPTLAFKDFGGRFMAQMLTHIAGDKPVTILTATSGDTGAA
VAHAFYGLPNVKVVILYPRGKISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNS
ANSINISRLLAQICYYFEAVAQLPQETRNQLVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVNDTVP
RFLHDGQWSPKATQATLSNAMDVSQPNNWPRVEELFRRKIWQLKELGYAAVDDETTQQTMRELKELGYTS
EPHAAVAYRALRDQLNPGEYGLFLGTAHPAKFKESVEAILGETLDLPKELAERADLPLLSHNLPADFAAL
RKLMMNHQ
>NC_000913.3_prot_NP_414546.1_5 [gene=yaaX] [protein=DUF2502 family putative periplasmic protein] [protein_id=NP_414546.1] [location=5234..5530]
MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHL
HGPPPPPRHHKKAPHDHHGGHGPGKHHR
>NC_000913.3_prot_NP_414547.1_6 [gene=yaaA] [protein=peroxide resistance protein, lowers intracellular iron] [protein_id=NP_414547.1] [location=complement(5683..6459)]
MLILISPAKTLDYQSPLTTTRYTLPELLDNSQQLIHEARKLTPPQISTLMRISDKLAGINAARFHDWQPD
FTPANARQAILAFKGDVYTGLQAETFSEDDFDFAQQHLRMLSGLYGVLRPLDLMQPYRLEMGIRLENARG

Analyzing Biological Sequences in R

For simple cases you can use data from my blog

In most cases we get our sequences from NCBI

NCBI stores all public biological sequences at
https://www.ncbi.nlm.nih.gov/nuccore

Anybody can upload sequences, and they may be wrong

NCBI has a curation process to validate the sequences

If a sequence is good enough to be a reference, then it is stored in the RefSeq collection https://www.ncbi.nlm.nih.gov/refseq

It is easy if you know the Accession Number

Accession numbers are the best way to identify a biological sequence

Candidatus Carsonella ruddii PV DNA has accession AP009180
Escherichia coli str. K-12 substr. MG1655 has accession NC_000913

Different sequences can have the same name, but never the same accession

If you do not know the accession use Taxonomy

NCBI search box looks for patterns everywhere, not only on the title

and many sequences can have the same name

To be sure of using the correct sequence, go to NCBI Taxonomy

https://www.ncbi.nlm.nih.gov/taxonomy

Try with “Escherichia coli”. There are many

Once you find the correct species…

You will see that each organism has a taxonomic ID

You click on Nucleotide–Direct Links

In the search box add the phrase “complete genome”

When you find the correct one..

You download the FASTA file

Store it on your computer, and change the name

If you use the Lab’s computers, save it on X:

Reading FASTA in R

To handle sequence data in R, we use the seqinr library

You have to install it once.

install.packages("seqinr")

Then you have to load it on every session

library(seqinr)

read.fasta

read FASTA formatted files

read.fasta(file, seqtype = "DNA", ...)

Inputs

file: The name of the file which the sequences in FASTA format are to be read from
seqtype: the nature of the sequence: DNA or AA, defaulting to DNA

Output

A list of vectors of chars. Each element is a sequence object.

Example

library(seqinr)
sequences <- read.fasta("NC_000913.fna")
dna <- sequences[[1]]

What have we learned?

Some organisms have been sequenced
We can read the sequences from FASTA files
read.fasta returns a list of vectors of characters
We can create functions to do things with the sequence
Functions are black boxes with inputs and outputs
For example we do statistics of the sequence

Why we do statistics on the sequence?

Statistics is a way to tell a story that makes sense of the data

In genomics, we look for biological sense

That story can be global: about the complete genome

Or can be local: about some region of the genome

We will start with global properties

GC-content

From Wikipedia, the free encyclopedia

The percentage of nitrogenous bases on a DNA molecule that are either guanine or cytosine.

GC content is found to be variable with different organisms.
The committee on bacterial systematics has recommended use of GC ratios in higher level hierarchical classification

GC-content can be measured by several means

Measuring the melting temperature of the DNA double helix using spectrophotometry

The absorbance of DNA at a wavelength of 260nm increases when DNA separates into two single strands at melting temperature

Determination of GC content

If the DNA has been sequenced then the GC-content can be accurately calculated by simple arithmetic.

GC-content percentage is calculated as \[\frac{G+C}{A+T+G+C}\]

Exercise 1

Determine the GC content of E.coli

Exercise 2

Is this GC content uniform through all genome?

The result should depend on:

the genomic sequence
the position of the window
the size of the window

Define the function

gc_content <- function(sequence, position, size) {
    # count letters
    return(ans)
}

Using `gc_content()` in many places

We want to evaluate gc_content on different positions of the genome

positions <- seq(from=1, to=length(dna), by= window_size)

Now we apply gc_content to each element of positions?

for(pos in seq(from=1, to=length(dna), by= window_size)) {
    skew[pos] <- gc_skew(pos, dna, window_size)
}

GC Skew

What is the difference \(G-C\) respect to the total \(G+C\)?

Calculate the ratio \[\frac{G-C}{G+C}\] using sliding windows and plot it

What do you see?

GC skew

There is a richness of G over C and T over A in the leading strand

There is a richness of C over G and A over T in the lagging strand

GC skew changes sign at the boundaries of the two replichores

This corresponds to DNA replication origin or terminus

What is the story told by GC skew?

GC skew and genome replication

GC skew changes sign at the boundaries of the two replichores

This corresponds to DNA replication origin or terminus

The replication origin is usually called ori.

How DNA replicates

Comparative Genometrics

If you want to learn more about the science, read:

Roten C-AH, Gamba P, Barblan J-L, Karamata D. Comparative Genometrics (CG): a database dedicated to biometric comparisons of whole genomes. Nucleic Acids Research. 2002;30(1):142-144

Computational Thinking

Patterns

in the dictionary

Science is about Patterns

Patterns in Technology

Patterns in Technology

Patterns in Nature

Patterns in Nature

Patterns in Nature

Patterns in Nature

Patterns in Art

Patterns in Decoration

Patterns in Music

Patterns in programs: loops

Patterns in programs

Recursive patterns in Nature

Recursive patterns in Trees

Recursive patterns in Rivers

Recursive patterns in the Body

A Pattern can be part of itself

Recursive functions

Example of recursive function

Let’s write our own factorial function

Exit condition

Exit condition for factorial

Conditionals in Scratch

Conditionals in R

Example of decision

Example of recursive function to find minimum

Summary

Big picture

Next steps

Let’s have the break now

Figures from DNA

DNA

Proteins

Looks more complex, but it is still a polymer

DNA and proteins are easy to model

Sequence data

Sequence data is digital data

Example

DNA looks like this

and many more lines

We can use the Turtle to see the DNA

The Turtle walks the DNA sequence

After a long time we get this

Same plot for Escherichia coli

Same plot for Schizosaccaromyces Pombe

Scientific questions about DNA-walk

Technical questions

Handling DNA Sequences in R

Sequences are stored in FASTA format

Example FASTA file

FASTA file (amino acids)

Analyzing Biological Sequences in R

For simple cases you can use data from my blog

In most cases we get our sequences from NCBI

It is easy if you know the Accession Number

If you do not know the accession use Taxonomy

Once you find the correct species…

When you find the correct one..

Reading FASTA in R

read.fasta

read FASTA formatted files

Inputs

Output

Example

What have we learned?

What have we learned?

Why we do statistics on the sequence?

GC-content

From Wikipedia, the free encyclopedia

GC-content can be measured by several means

Determination of GC content

Exercise 1

Exercise 2

Define the function

Using gc_content() in many places

GC Skew

GC skew

Using `gc_content()` in many places