March 10, 2020

DNA

  • A big molecule, but not too complex. It is a polymer
  • All made of only 4 pieces (so there is a pattern)

Proteins

Looks more complex, but it is still a polymer

DNA and proteins are easy to model

Despite being large molecules, DNA and proteins are made with pieces of only a few types

DNA is made of four bases. We can represent it with four letters

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

Proteins have 20 aminoacids. We can represent them with 20 letters.

MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNISDAERI

Sequence data

In molecular biology we often work with sequences

  • DNA sequences use 4 letters to represent the nucleotides in one of the two strands
  • Protein sequences use 20 letters to represent the amino-acids, from amino to carboxyl terminal
  • Other sequences are sometime used:
    • RNA,
    • DNA with ambiguous nucleotides,
    • amino-acid sequences with stop codons

Sequence data is digital data

The main reason why computing is useful for molecular biology

  • DNA is discrete data
  • Either “A”, “C”, “G” or “T”
  • No middle-values
  • All other things we measure are in a range
    • Like temperature, concentration, expression
  • We can forget all the chemistry and work with symbols

Example

DNA looks like this

and many more lines

ATGAATACTATATTTTCAAGAATAACACCATTAGGAAATGGTACGTTATGTGTTATAAGAATTTCTGGAA
AAAATGTAAAATTTTTAATACAAAAAATTGTAAAAAAAAATATAAAAGAAAAAATAGCTACTTTTTCTAA
ATTATTTTTAGATAAAGAATGTGTAGATTATGCAATGATTATTTTTTTTAAAAAACCAAATACGTTCACT
GGAGAAGATATAATCGAATTTCATATTCACAATAATGAAACTATTGTAAAAAAAATAATTAATTATTTAT
TATTAAATAAAGCAAGATTTGCAAAAGCTGGCGAATTTTTAGAAAGACGATATTTAAATGGAAAAATTTC
TTTAATAGAATGCGAATTAATAAATAATAAAATTTTATATGATAATGAAAATATGTTTCAATTAACAAAA
AATTCTGAAAAAAAAATATTTTTATGTATAATTAAAAATTTAAAATTTAAAATAAATTCTTTAATAATTT
GTATTGAAATCGCAAATTTTAATTTTAGTTTTTTTTTTTTTAATGATTTTTTATTTATAAAATATACATT
TAAAAAACTATTAAAACTTTTAAAAATATTAATTGATAAAATAACTGTTATAAATTATTTAAAAAAGAAT
TTCACAATAATGATATTAGGTAGAAGAAATGTAGGAAAGTCTACTTTATTTAATAAAATATGTGCACAAT
ATGACTCGATTGTAACTAATATTCCTGGTACTACAAAAAATATTATATCAAAAAAAATAAAAATTTTATC
TAAAAAAATAAAAATGATGGATACAGCAGGATTAAAAATTAGAACTAAAAATTTAATTGAAAAAATTGGA
ATTATTAAAAATATAAATAAAATTTATCAAGGAAATTTAATTTTGTATATGATTGATAAATTTAATATTA
AAAATATATTTTTTAACATTCCAATAGATTTTATTGATAAAATTAAATTAAATGAATTAATAATTTTAGT
TAACAAATCAGATATTTTAGGAAAAGAAGAAGGAGTTTTTAAAATAAAAAATATATTAATAATTTTAATT
TCTTCTAAAAATGGAACTTTTATAAAAAATTTAAAATGTTTTATTAATAAAATCGTTGATAATAAAGATT
TTTCTAAAAATAATTATTCTGATGTTAAAATTCTATTTAATAAATTTTCTTTTTTTTATAAAGAATTTTC
ATGTAACTATGATTTAGTGTTATCAAAATTAATTGATTTTCAAAAAAATATATTTAAATTAACAGGAAAT
TTTACTAATAAAAAAATAATAAATTCTTGTTTTAGAAATTTTTGTATTGGTAAATGAATATTTTTAATAT
AATTATTATTGGAGCAGGACATTCTGGTATAGAAGCAGCTATATCTGCATCTAAAATATGTAATAAAATA

Handling DNA Sequences in R

Sequences are stored in FASTA format

There are several ways to store DNA or protein data

Most of the times they are stored in FASTA format

FASTA files are text files, with some rules

Microsoft Word files (doc or docx) are NOT text files

You should never use Microsoft Word to store sequences

Example FASTA file (genome)

>AP009180.1 Candidatus Carsonella ruddii PV DNA, complete genome
ATGAATACTATATTTTCAAGAATAACACCATTAGGAAATGGTACGTTATGTGTTATAAGAATTTCTGGAA
AAAATGTAAAATTTTTAATACAAAAAATTGTAAAAAAAAATATAAAAGAAAAAATAGCTACTTTTTCTAA
ATTATTTTTAGATAAAGAATGTGTAGATTATGCAATGATTATTTTTTTTAAAAAACCAAATACGTTCACT
GGAGAAGATATAATCGAATTTCATATTCACAATAATGAAACTATTGTAAAAAAAATAATTAATTATTTAT
TATTAAATAAAGCAAGATTTGCAAAAGCTGGCGAATTTTTAGAAAGACGATATTTAAATGGAAAAATTTC
TTTAATAGAATGCGAATTAATAAATAATAAAATTTTATATGATAATGAAAATATGTTTCAATTAACAAAA
AATTCTGAAAAAAAAATATTTTTATGTATAATTAAAAATTTAAAATTTAAAATAAATTCTTTAATAATTT
GTATTGAAATCGCAAATTTTAATTTTAGTTTTTTTTTTTTTAATGATTTTTTATTTATAAAATATACATT
TAAAAAACTATTAAAACTTTTAAAAATATTAATTGATAAAATAACTGTTATAAATTATTTAAAAAAGAAT
TTCACAATAATGATATTAGGTAGAAGAAATGTAGGAAAGTCTACTTTATTTAATAAAATATGTGCACAAT
ATGACTCGATTGTAACTAATATTCCTGGTACTACAAAAAATATTATATCAAAAAAAATAAAAATTTTATC
TAAAAAAATAAAAATGATGGATACAGCAGGATTAAAAATTAGAACTAAAAATTTAATTGAAAAAATTGGA
ATTATTAAAAATATAAATAAAATTTATCAAGGAAATTTAATTTTGTATATGATTGATAAATTTAATATTA
AAAATATATTTTTTAACATTCCAATAGATTTTATTGATAAAATTAAATTAAATGAATTAATAATTTTAGT
TAACAAATCAGATATTTTAGGAAAAGAAGAAGGAGTTTTTAAAATAAAAAATATATTAATAATTTTAATT
TCTTCTAAAAATGGAACTTTTATAAAAAATTTAAAATGTTTTATTAATAAAATCGTTGATAATAAAGATT
TTTCTAAAAATAATTATTCTGATGTTAAAATTCTATTTAATAAATTTTCTTTTTTTTATAAAGAATTTTC
ATGTAACTATGATTTAGTGTTATCAAAATTAATTGATTTTCAAAAAAATATATTTAAATTAACAGGAAAT
TTTACTAATAAAAAAATAATAAATTCTTGTTTTAGAAATTTTTGTATTGGTAAATGAATATTTTTAATAT
AATTATTATTGGAGCAGGACATTCTGGTATAGAAGCAGCTATATCTGCATCTAAAATATGTAATAAAATA

… and more

FASTA file (amino acids)

>NC_000913.3_prot_NP_414542.1_1 [gene=thrL] [protein=thr operon leader peptide] [protein_id=NP_414542.1] [location=190..255]
MKRISTTITTTITITTGNGAG
>NC_000913.3_prot_NP_414543.1_2 [gene=thrA] [protein=Bifunctional aspartokinase/homoserine dehydrogenase 1] [protein_id=NP_414543.1] [location=337..2799]
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNISDAERI
FAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEA
RGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLGRNGSDYS
AAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPC
LIKNTGNPQAPGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLIT
QSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL
ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSW
LKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAV
ADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSRRKFLYDTNVGAGLPVIENLQNLLNAGDELM
KFSGILSGSLSYIFGKLDEGMSFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIE
IEPVLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFK
VKNGENALAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV
>NC_000913.3_prot_NP_414544.1_3 [gene=thrB] [protein=homoserine kinase] [protein_id=NP_414544.1] [location=2801..3733]
MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLGRFADKLPSEPRENIVYQCWE
RFCQELGKQIPVAMTLEKNMPIGSGLGSSACSVVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHY
DNVAPCFLGGMQLMIEENDIISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDCIAHGRHLAGF
IHACYSRQPELAAKLMKDVIAEPYRERLLPGFRQARQAVAEIGAVASGISGSGPTLFALCDKPETAQRVA
DWLGKNYLQNQEGFVHICRLDTAGARVLEN
>NC_000913.3_prot_NP_414545.1_4 [gene=thrC] [protein=L-threonine synthase] [protein_id=NP_414545.1] [location=3734..5020]
MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLTEIDEMLKLDFVTRSAKILSAFIGDEIPQE
ILEERVRAAFAFPAPVANVESDVGCLELFHGPTLAFKDFGGRFMAQMLTHIAGDKPVTILTATSGDTGAA
VAHAFYGLPNVKVVILYPRGKISPLQEKLFCTLGGNIETVAIDGDFDACQALVKQAFDDEELKVALGLNS
ANSINISRLLAQICYYFEAVAQLPQETRNQLVVSVPSGNFGDLTAGLLAKSLGLPVKRFIAATNVNDTVP
RFLHDGQWSPKATQATLSNAMDVSQPNNWPRVEELFRRKIWQLKELGYAAVDDETTQQTMRELKELGYTS
EPHAAVAYRALRDQLNPGEYGLFLGTAHPAKFKESVEAILGETLDLPKELAERADLPLLSHNLPADFAAL
RKLMMNHQ
>NC_000913.3_prot_NP_414546.1_5 [gene=yaaX] [protein=DUF2502 family putative periplasmic protein] [protein_id=NP_414546.1] [location=5234..5530]
MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHL
HGPPPPPRHHKKAPHDHHGGHGPGKHHR
>NC_000913.3_prot_NP_414547.1_6 [gene=yaaA] [protein=peroxide resistance protein, lowers intracellular iron] [protein_id=NP_414547.1] [location=complement(5683..6459)]
MLILISPAKTLDYQSPLTTTRYTLPELLDNSQQLIHEARKLTPPQISTLMRISDKLAGINAARFHDWQPD
FTPANARQAILAFKGDVYTGLQAETFSEDDFDFAQQHLRMLSGLYGVLRPLDLMQPYRLEMGIRLENARG

Analyzing Biological Sequences in R

For simple cases you can use data from my blog

Candidatus Carsonella ruddii has the smallest genome known

We use this genome in classes because it is a small example

Candidatus Carsonella ruddii is a obligate symbiont of Pachpsylla venusta.

There is much controversy over whether this is a living cell or simply an organelle as it is missing genes needed for living independently.

Published as “The 160-kilobase genome of the bacterial endosymbiont Carsonella.” https://www.ncbi.nlm.nih.gov/pubmed/17038615

In most cases we get our sequences from NCBI

It is easy if you know the Accession Number

Accession numbers are the best way to identify a biological sequence

  • Candidatus Carsonella ruddii PV DNA has accession AP009180
  • Escherichia coli str. K-12 substr. MG1655 has accession NC_000913

Different sequences can have the same name, but never the same accession

If you do not know the accession use Taxonomy

NCBI search box looks for patterns everywhere, not only on the title

and many sequences can have the same name

To be sure of using the correct sequence, go to NCBI Taxonomy

https://www.ncbi.nlm.nih.gov/taxonomy

Try with “Escherichia coli”. There are many

Once you find the correct species…

 

You will see that each organism has a taxonomic ID

You click on Nucleotide–Direct Links

In the search box add the phrase “complete genome”

When you find the correct one..

 

You download the FASTA file

Store it on your computer, and change the name

If you use the Lab’s computers, save it on X:

Reading FASTA in R

To handle sequence data in R, we use the seqinr library

You have to install it once.

install.packages("seqinr")

Then you have to load it on every session

library(seqinr)

read.fasta

read FASTA formatted files

read.fasta(file, seqtype = "DNA", ...)

Inputs

file
The name of the file which the sequences in FASTA format are to be read from
seqtype
the nature of the sequence: DNA or AA, defaulting to DNA

Output

A list of vectors of chars. Each element is a sequence object.

Example

library(seqinr)
sequences <- read.fasta("NC_000913.faa")

This is a list of protein sequences

The first sequence is sequences[[1]]

What have we learned?

What have we learned?

  • Some organisms have been sequenced
  • We can read the sequences from FASTA files
  • read.fasta returns a list of vectors of characters
  • We can create functions to do things with the sequence
  • For example we do statistics of the sequence

Why we do statistics on the sequence?

Statistics is a way to tell a story that makes sense of the data

In genomics, we look for biological sense

That story can be global: about the complete genome

Or can be local: about some region of the genome

We will start with global properties

GC-content

From Wikipedia, the free encyclopedia

The percentage of nitrogenous bases on a DNA molecule that are either guanine or cytosine.

  • GC content is found to be variable with different organisms.
  • The committee on bacterial systematics has recommended use of GC ratios in higher level hierarchical classification

GC-content can be measured by several means

Measuring the melting temperature of the DNA double helix using spectrophotometry

  • The absorbance of DNA at a wavelength of 260nm increases when DNA separates into two single strands at melting temperature

Determination of GC content

If the DNA has been sequenced then the GC-content can be accurately calculated by simple arithmetic.

GC-content percentage is calculated as \[\frac{G+C}{A+T+G+C}\]

Exercise 1

Write the code for this function:

genome_gc_content <- function(genome) {
    # count letters
    return(answer)
}

Then use it to find the GC content of E.coli

Exercise 2

Is the GC content uniform through all genome?

Do all genes have the same GC content?

What function do we need to answer this question?

What is the name, the inputs, and the output?

Exercise 2

Write the code for this function:

genes_gc_content <- function(genes) {
    # calculate GC content of each gene
    return(answer)
}

What is the data structure of answer?

Exercise 3

Calculate the GC content for only part of the genome?

Instead of all the genome, we only look through a window

The result should depend on:

  • the genomic sequence
  • the position of the window
  • the size of the window

Write the function

The new function looks like this

gc_content <- function(sequence, position, size) {
    # code here
    return(answer)
}

What should be the code?

Using gc_content() in many places

We want to evaluate gc_content on different positions of the genome

positions <- seq(from=1, to=length(genome), by= window_size)

Now we apply gc_content to each element of positions?

for(i in 1:length(positions)) {
    gc[i] <- gc_content(genome, positions[i], window_size)
}