Class 2: Handling DNA in the computer

Computing for Molecular Biology 2

Andrés Aravena, PhD

12 March 2021

DNA

  • A big molecule, but not too complex. It is a polymer
  • All made of only 4 pieces (so there is a pattern)

Proteins

They seem more complex, but they are still polymers

DNA and proteins are easy to model

DNA and proteins are large molecules

They are made using only a few types of pieces

And they form a string or chain

They are easy to represent with symbols

DNA is made of four bases

We can represent it with four letters. The sequence

ATGAATACTATATTTTCAAGAATAACACCATTAGGAAATGGTACGTTATGTGTTATAAGAATTTCTGGAA
AAAATGTAAAATTTTTAATACAAAAAATTGTAAAAAAAAATATAAAAGAAAAAATAGCTACTTTTTCTAA
ATTATTTTTAGATAAAGAATGTGTAGATTATGCAATGATTATTTTTTTTAAAAAACCAAATACGTTCACT
GGAGAAGATATAATCGAATTTCATATTCACAATAATGAAACTATTGTAAAAAAAATAATTAATTATTTAT
TATTAAATAAAGCAAGATTTGCAAAAGCTGGCGAATTTTTAGAAAGACGATATTTAAATGGAAAAATTTC
TTTAATAGAATGCGAATTAATAAATAATAAAATTTTATATGATAATGAAAATATGTTTCAATTAACAAAA
AATTCTGAAAAAAAAATATTTTTATGTATAATTAAAAATTTAAAATTTAAAATAAATTCTTTAATAATTT

uses only A, C, G, T.
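In R, a sequence like this can be held as one character string, or split into a vector with one letter per element. The lines below are a minimal sketch using only base R. The variable names are illustrative, and the string is just the first 30 bases of the sequence above.

```r
# First 30 bases of the sequence shown above, as one character string
dna <- "ATGAATACTATATTTTCAAGAATAACACCA"

# Split the string into a vector with one base per element
bases <- strsplit(dna, split = "")[[1]]

# Number of bases in the sequence
n_bases <- length(bases)

# TRUE only if every symbol is one of the four DNA letters
only_acgt <- all(bases %in% c("A", "C", "G", "T"))

print(n_bases)
print(only_acgt)
```

The vector form is convenient for counting and indexing, while the string form is convenient for printing and searching.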

Proteins have 20 amino-acids

We can represent proteins as combinations of 20 letters.

MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNISDAERI
FAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEA
RGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLGRNGSDYS
AAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPC
LIKNTGNPQAPGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLIT
QSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL
ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSW
LKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAV
ADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSRRKFLYDTNVGAGLPVIENLQNLLNAGDELM

Sequence data

In molecular biology we often work with sequences

  • DNA sequences use 4 letters to represent the nucleotides in one of the two strands, from 5’ to 3’
  • Protein sequences use 20 letters to represent the amino-acids, from the amino terminal to the carboxyl terminal
  • Other sequences are sometimes used:
    • RNA,
    • DNA with ambiguous nucleotides,
    • amino-acid sequences with stop codons

Sequence data is digital data

The main reason why computing is useful for molecular biology

  • DNA is discrete data
  • Either “A”, “C”, “G” or “T”
    • There is nothing in between.
  • Amino-acids are also discrete values

All other things we measure are continuous

For example:

  • temperature,
  • concentration,
  • gene expression

Each of these values is a number with decimals,
and with a margin of error

We can forget all the chemistry and work with symbols

Sequence statistics

Why do we do statistics on the sequence?

Statistics is a way to tell a story that makes sense of the data

In genomics, we look for biological sense

That story can be global: about the complete genome

Or it can be local: about some region of the genome

We will start with global properties
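As a rough illustration of the difference between a global and a local statistic, the sketch below counts the bases in a whole toy sequence and then in one region of it. It uses only base R. The sequence is the first line of the genome example, and the region (positions 31 to 50) is an arbitrary choice made for this example.

```r
# Toy sequence: the first 70 bases of the example genome
dna <- "ATGAATACTATATTTTCAAGAATAACACCATTAGGAAATGGTACGTTATGTGTTATAAGAATTTCTGGAA"
bases <- strsplit(dna, split = "")[[1]]
alphabet <- c("A", "C", "G", "T")

# Global property: how many times each base appears in the whole sequence
global_counts <- table(factor(bases, levels = alphabet))

# Local property: the same count, restricted to positions 31 to 50
region <- bases[31:50]
local_counts <- table(factor(region, levels = alphabet))

print(global_counts)
print(local_counts)
```

Using `factor()` with fixed levels keeps all four letters in the table, even when one of them does not appear in the region.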

GC-content

From Wikipedia, the free encyclopedia

The percentage of nitrogenous bases on a DNA molecule that are either guanine or cytosine.

  • GC content varies between organisms.
  • The committee on bacterial systematics has recommended the use of GC ratios in higher-level hierarchical classification

GC-content can be measured by several means

Measuring the melting temperature of the DNA double helix using spectrophotometry

The absorbance of DNA at a wavelength of 260 nm increases when the DNA separates into two single strands at the melting temperature

Determination of GC content

If the DNA has been sequenced then the GC-content can be accurately calculated by simple arithmetic.

GC-content percentage is calculated as \[\frac{G+C}{A+T+G+C} \times 100\%\]
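The arithmetic can be sketched in a few lines of base R. The function name `gc_content` is hypothetical (it is not part of any library used in this course), and the input is a toy sequence. The function returns the ratio of G plus C to all four bases as a fraction, which is then multiplied by 100 to express it as a percentage.

```r
# Hypothetical helper: GC content of a DNA string, as a fraction between 0 and 1
gc_content <- function(dna) {
  bases <- strsplit(toupper(dna), split = "")[[1]]
  g_plus_c <- sum(bases %in% c("G", "C"))
  # Count only A, C, G and T, so any other symbol is left out of the total
  acgt <- sum(bases %in% c("A", "C", "G", "T"))
  g_plus_c / acgt
}

gc_fraction <- gc_content("ATGCATGCAT")
gc_percent <- 100 * gc_fraction
print(gc_percent)
```

Leaving symbols other than A, C, G and T out of the denominator is a design choice of this sketch. Other tools may treat ambiguous nucleotides differently.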

Exercise 1

Write a step-by-step plan to find the GC content of the first gene of E. coli

Write it in English

Representing DNA Sequences in the computer

Sequences are stored in FASTA format

There are several ways to store DNA or protein data

Most of the time they are stored in FASTA format

FASTA files are text files, with some rules

Microsoft Word is not your friend

Microsoft Word files (doc or docx) are NOT text files

You should never use Microsoft Word to store sequences

A good alternative is Visual Studio Code, also by Microsoft

Example of genome in FASTA format

>AP009180.1 Candidatus Carsonella ruddii PV DNA, complete genome
ATGAATACTATATTTTCAAGAATAACACCATTAGGAAATGGTACGTTATGTGTTATAAGAATTTCTGGAA
AAAATGTAAAATTTTTAATACAAAAAATTGTAAAAAAAAATATAAAAGAAAAAATAGCTACTTTTTCTAA
ATTATTTTTAGATAAAGAATGTGTAGATTATGCAATGATTATTTTTTTTAAAAAACCAAATACGTTCACT
GGAGAAGATATAATCGAATTTCATATTCACAATAATGAAACTATTGTAAAAAAAATAATTAATTATTTAT
TATTAAATAAAGCAAGATTTGCAAAAGCTGGCGAATTTTTAGAAAGACGATATTTAAATGGAAAAATTTC
TTTAATAGAATGCGAATTAATAAATAATAAAATTTTATATGATAATGAAAATATGTTTCAATTAACAAAA
AATTCTGAAAAAAAAATATTTTTATGTATAATTAAAAATTTAAAATTTAAAATAAATTCTTTAATAATTT
GTATTGAAATCGCAAATTTTAATTTTAGTTTTTTTTTTTTTAATGATTTTTTATTTATAAAATATACATT
TAAAAAACTATTAAAACTTTTAAAAATATTAATTGATAAAATAACTGTTATAAATTATTTAAAAAAGAAT
TTCACAATAATGATATTAGGTAGAAGAAATGTAGGAAAGTCTACTTTATTTAATAAAATATGTGCACAAT
ATGACTCGATTGTAACTAATATTCCTGGTACTACAAAAAATATTATATCAAAAAAAATAAAAATTTTATC
TAAAAAAATAAAAATGATGGATACAGCAGGATTAAAAATTAGAACTAAAAATTTAATTGAAAAAATTGGA
ATTATTAAAAATATAAATAAAATTTATCAAGGAAATTTAATTTTGTATATGATTGATAAATTTAATATTA
AAAATATATTTTTTAACATTCCAATAGATTTTATTGATAAAATTAAATTAAATGAATTAATAATTTTAGT
TAACAAATCAGATATTTTAGGAAAAGAAGAAGGAGTTTTTAAAATAAAAAATATATTAATAATTTTAATT
TCTTCTAAAAATGGAACTTTTATAAAAAATTTAAAATGTTTTATTAATAAAATCGTTGATAATAAAGATT
TTTCTAAAAATAATTATTCTGATGTTAAAATTCTATTTAATAAATTTTCTTTTTTTTATAAAGAATTTTC
ATGTAACTATGATTTAGTGTTATCAAAATTAATTGATTTTCAAAAAAATATATTTAAATTAACAGGAAAT
TTTACTAATAAAAAAATAATAAATTCTTGTTTTAGAAATTTTTGTATTGGTAAATGAATATTTTTAATAT
AATTATTATTGGAGCAGGACATTCTGGTATAGAAGCAGCTATATCTGCATCTAAAATATGTAATAAAATA
AAAATAATTACTTCAAATTTAGAAAACTTAGGTATAATGTCTTGTAATCCTTCAATAGGAGGTATTGGAA
AATCACATTTAGTTAAAGAATTAGAATTATTTGGTGGAATAATGCCAGAAGCATCTGATTATAGTAGAAT
ACATTCTAAATTATTAAATTATAAAAAAGGAGAATCTGTTCATTCTTTAAGATATCAAATTGATAGAATT
TTATATAAAAATTACATATTGAAAATTTTATTTTTAAAAAAAAATATTTTAATAGAACAAAATGAAATAA
ATAAAATTATTAGATTTAAAAAAAAAATTTTAATCTTTAACAAATTAAAATTTTTTAATATAGCAAAAAT
TATTATTGTTTGTGCTGGTACTTTTATTAATTCTAAAATATATATAGGCAAAAATATTAAAGCTTTGAAC
AAAGCAGAAAAAAAATCTATTTCTTATTCTTTTAAAAAAATAAATTTATTTATTTCAAAATTAAAAACAG
GCACACCTCCAAGATTAGATTTAAATTATTTAAATTATAAAAAATTAAGTGTTCAATATAGTGATTATAC
TATTTCATATGGTAAAAATTTCAATTTTAATAATAACGTAAAATGCTTTATAACAAATACTGATAATAAA
ATTAATAACTTTATTAAAAAAAATATTAAAAATTCATCTTTATTTAATTTAAAATTTAAATCTATAGGAC
CCAGATATTGTCCAAGTATTGAAGATAAAATTTTTAAATTTCCAAATAATAAAAATCATCAAATTTTTTT
AGAGCCAGAAAGTTATTTTAGTAAAGAAATTTACGTTAATGGATTATCTAATTCATTATCTTATAATATT
CAAAAAAAATTAATAAAAAAAATTTTAGGAATTAAAAAAAGTTATATTATAAGATATGCGTATAATATTC
AATATGATTATTTTGACCCTAGGTGTTTAAAAATTTCTTTAAATATTAAATTTGCTAATAATATATTTTT
AGCAGGACAAATTAATGGTACAACTGGTTATGAAGAAGCTTCTTCACAAGGTTTTGTTGCAGGAATAAAT
TCCGCAAGAAAAATTTTAAAACTACCTTTATGGAAACCAAAAAAATGGAATTCTTATATAGGAGTTTTAT
TGTATGACTTAACTAATTTTGGAATTCAAGAACCTTATAGAATTTTTACTTCAAAATCAGACAATCGCTT
ATTTTTAAGATTTGATAATGCAATATTTAGATTAATAAATATTTCTTATTATTTAGGATGTTTACCTATT
GTTAAATTTAAATATTATAATTCTTTAATATACAAATTTTACAAAAATTTAATTAATATTAGAAAAATAA
AGTTATTTGATAATTTTTATTTGTTTAAGTTAATAATTATAATGTCAAAATATTATGGTTATATTAAAAA
AAAATATTTTAAATAATTTTCTTAATTTTAAAATAATTGATTTAAATTTAATAATATTATTATTATTTAT
ACATTTAATTGTATTTTATTTATTAAAAAATAATAATTTAATGATATTATTATCAATATATTTAAACAAT
TTTATTAAAAATTCTATCAACCTAAATTCAAGAAATATAATTTTTTTTTTTTCACTAGTATTGTTTAATA
TAATATTATTTTCTAATTTTATTGATTTATTTCCAAATAATTTAATAAAAAATTTTTTAAATTTAAAACA
AATTGAAATTGTTCCAACTTCAAATATAAATATAACTTTTTGTTTTTCAATAATTTCTTTTTTAATAATT
ATAATGTTAACACATAAAAAAATAGGTTTTAAAAAGTATATATATAGTTTTTTTATTTATCCAATAAACA
CTGAATACTTATATTTATTTAATTTTATTATTGAAAGTATTTCTTATATAATGAAACCGATATCTTTATC
TTTAAGATTATTTGGAAATATTTTTTCTTCTGAAATTATATTTAATATAATTAATAATATGAATGTATTT
ATTAATAGTTTTTTAAATTTAATTTGGGGAATTTTTCATTTTATAATTTTACCTCTTCAATCTTTTATTT
TTATTACATTGGTTATAATATATGTTTCACAAACTTTAAATCATTAAAAAAAAAAATGAATAATTTATTA
ATATTATCTTCATCAATAATGATAGGATTATCATCTATTGGAACAGGTATAGGATTTGGAATTTTAGGAG
GAAAACTTTTAGATTCCATATCAAGACAACCAGAATTAGATAATTTATTATTAACTAGAACTTTTTTAAT
GACAGGATTATTAGATGCTATTCCAATGATAAGCGTAGGTATAGGTTTATACTTAATATTTGTTTTATCA
AATAAATAATATGAATTTCAATTATACTATTATTAATGAATTTGTATCTTTTTTAATTTTTTTTTATGTT
TCATTTAAAATTATATTTCCAGTTATATTAAAAAAAATAAATAATTTTTTAATAATTGATTATAAAAATT
TTGTTTTTAACAATCAAGAAAAAATTATTAAAAAAAAATTATTAGATGAAATAGTTAAAAACGAAAATTT
AACAAATAAGAAATTTATATCTTTAATAGAAAAAATAAAAAAAAGTATTTTATTAGAAAAACAAAATTTT
ATTAATTTTATAAAATTAGAAAAAATAAACGTTCTAAAAATTTTTAAAAAAAAAATATTAAATAATAATA
TGTTAATTATTAAAAACTTTTTAATTGAGATTAAAAAATTGTTTATAAATAGCTTTAAAAATATTTTTAA
TGAAATTATTTGTTATAACAATGAATTTATAATTAATTATGTTTAAATTTATAAACAGGTTTTTAAATTT
AAAAAAAAGATATTTTTATATTTTTTTAATAAATTTTTTTTATTTTTTTAATAAATGTAATTTTATTAAA
AAAAAAAAAATATATAAAAAAATAATTACTAAAAAATTTGAAAATTATTTATTAAAATTAATTATTCAAA
AATATGCTAAATGAAGGAATAATAAACAAAATTTATGATAGTGTAGTTGAAGTTCTTGGATTGAAAAATG
CTAAATATGGTGAAATGATTTTATTTAGTAAAAATATTAAAGGAATAGTATTCAGTTTAAACAAAAAAAA
TGTAAATATAATTATATTAAATAATTATAACGAGTTAACACAAGGAGAAAAATGTTATTGCACAAACAAA
ATATTTGAAGTTCCTGTTGGAAAACAATTAATAGGTAGAATAATAAATTCTAGAGGAGAAACTCTCGATT
TGTTACCAGAAATTAAAATAAATGAATTTTCACCTATTGAAAAAATAGCACCAGGTGTTATGGATAGAGA
AACAGTAAATGAGCCATTATTAACTGGAATAAAATCTATTGATTCAATGATTCCTATTGGAAAAGGACAA
CGAGAATTAATTATTGGTGATAGACAAACTGGAAAAACTACAATTTGTATTGATACTATTATTAATCAAA
AAAATAAAAATATTATTTGTGTTTATGTTTGTATAGGTCAAAAAATATCTTCTTTAATAAATATTATTAA
TAAGCTTAAAAAATTTAATTGCTTAGAATATACAATTATTGTAGCTTCAACTGCCTCAGATAGTGCAGCG
GAGCAGTATATTGCTCCATATACTGGAAGCACAATAAGTGAATATTTTCGTGATAAAGGACAAGATTGCC
TAATTGTTTATGATGATTTAACAAAACATGCTTGGGCATATAGACAAATTTCTTTACTATTAAGACGTCC
ACCTGGTCGTGAAGCTTATCCTGGTGATGTATTTTATCTTCATTCAAGATTATTAGAAAGATCATCTAAA
GTGAACAAATTTTTTGTAAATAAAAAATCTAATATTTTAAAAGCAGGTTCTTTAACTGCATTTCCTATAA
TTGAAACTTTAGAAGGAGACGTAACTTCTTTTATTCCAACAAATGTTATTTCTATAACTGATGGTCAAAT
TTTTTTAGATACAAATTTATTTAATTCAGGAATTAGACCATCAATAAACGTTGGATTATCTGTTTCTAGA
GTTGGTGGCGCTGCTCAATATAAAATTATTAAAAAATTAAGTGGAGACATTAGAATTATGTTAGCTCAGT
ATAGAGAATTAGAAGCATTTTCTAAATTTTCATCCGATCTTGATAGTGAAACTAAAAATCAATTAATAAT
TGGAGAAAAAATAACAATATTAATGAAACAAAATATACATGATGTTTATGATATATTTGAATTAATATTA
ATATTATTGATAATTAAACATGATTTTTTTAGACTAATTCCAATAAACCAAGTTGAATATTTTGAAAATA
AAATTATAAATTATTTAAGAAAAATTAAATTTAAAAATCAAATTGAAATTGACAACAAAAATTTAGAAAA
TTGTTTAAACGAATTAATAAGTTTTTTTATATCAAACAGTATATTATGATTATTAAAGAAATAAATAGTA
AAATAAAAATAACAACAAATATCAATAAATTAACTAATACTTTGAGTATGATTTCATTGTCTAAAATGAA
TAAATATATAAATTTAATTAATAATTTAGATTATATTAACATTGAATTAAAAAAAATTTTAGAATATATT
ATTATTAACATTAAAAGTAACGTATTTTGTTTAATAATAATTACTTCAAACAAAGGATTGTGTGGAAATT
TAAATAATGAAATTATTAAATACTCGCTTAATTATATTAAAAACAATAAAAATTTAGATTTAATTTTAAT
AGGAAAAAAAGGAATAGATTTTTTTAATAAAAAAAATTTTTATATTAAAGAAAAAATAATTTTTAAAGAC
AATGAATTAAAAAATTTAGTTTTTAATAATAAAATTTTAAATGATTTAAAAAAATACGAAAATATTTTTT
TTATTAGTTCAAAAATTATTAAAAATAACGTTAAAATAATAAAAACAGATTTGTATTTAAAAAAAAAATA
TAATTATTTAATAAAACATAATTTTAATTATGATTGTTTTTTAAAAAATTTTTATAATTATAATTTAAAA
TGTTTGTATTTAAATAACTTGTTTTGTGAATTAAAATCTAGAATGATTACAATGAAGTCTGCTGCTGATA
ATTCAAAAAAAATAATTAAAGACATGAAATTAATAAAAAATAAAATTAGACAATTTAAAGTTACTCAAGA
TATGCTTGAAATAATAAATGGAAGTAATTTATGATAGGAAGAATTGTACAAATTTTAGGTTCTATAGTAG
ACGTTGAATTTAAAAAAAACAATATTCCATATATATATAATGCTTTATTTATTAAAGAATTTAATTTATA
TTTAGAAGTTCAACAACAAATTGGAAATAATATTGTAAGAACTATAGCTTTAGGTAGTACCTATGGATTA
AAAAGATATCTTTTAGTAATAGATACTAAAAAACCAATTTTAACTCCTGTTGGAAATTGTACTTTAGGAC
GTATATTGAATGTTTTAGGTAATCCCATTGATAATAATGGTGAAATTATTTCAAACAAAAAAAAACCAAT
ACATTGTTCACCGCCAAAATTTTCAGATCAAGTATTTTCAAATAATATATTAGAAACTGGAATAAAAGTA
ATAGATTTATTGTGTCCATTTTTAAGAGGAGGAAAAATTGGTTTATTTGGTGGAGCAGGTGTTGGTAAAA
CTATAAATATGATGGAATTAATAAGAAATATTGCAATTGAACATAAAGGATGTTCTGTATTTATAGGAGT
TGGTGAAAGAACTCGTGAAGGAAATGATTTTTATTATGAAATGAAAGAATCAAATGTATTAGACAAAGTT
TCTTTAATATATGGTCAAATGAATGAACCTTCAGGTAATAGATTAAGAGTTGCATTAACTGGATTAAGTA
TAGCAGAAGAATTTAGAGAAATGGGTAAAGATGTACTTTTATTTATAGATAATATTTACAGATTTACGTT
AGCAGGTACTGAAATTTCAGCATTATTGGGAAGAATGCCTTCAGCTGTTGGATATCAGCCTACTTTAGCA
GAAGAAATGGGAAAATTACAAGAAAGAATTTCTTCAACAAAAAATGGAAGTATTACTTCAGTACAAGCTA
TATACGTACCTGCTGATGATTTAACAGATCCATCTCCAAGTACTACTTTTACTCATTTAGATTCTACTAT
TGTTTTGTCTAGACAAATAGCGGAATTAGGAATTTATCCTGCTATTGATCCATTAGAATCTTATTCTAAA
CAATTAGATCCTTATATAGTAGGAATTGAACATTATGAAATTGCTAATTCTGTAAAATTTTATTTACAAA
AATATAAAGAATTAAAAGATACAATAGCTATTTTAGGAATGGACGAATTATCAGAAAATGATCAAATTAT
TGTTAAAAGAGCAAGAAAGTTGCAAAGATTTTTTTCTCAACCTTTTTTTGTTGGTGAAATATTTACAGGA
ATAAAAGGAGAATATGTAAATATAAAAGATACAATTCAATGTTTTAAAAATATTTTAAATGGTGAATTTG
ATAATATTAATGAAAAAAATTTTTATATGATAGGAAAAATATGAATTTATTAATTTTAAGTATAAAAAAT
ATTATAGAATATAAAAATGCTTCTATATTAAATGTAAAAACATACTTAAAACTTTTTTCAATTATGAATA
ATCATATAAATAATATTTGCGATGTTAATCAAATTAAGTTAATATTTAAAAATAAAATCATAAATATAAG
AATTAATAATGGTTTTTTATTTCAAAAAAAAAATAATACTAAAATAATATGTAATTTTTATGAATTTTTA
TAATAAACATATATTAAATGATTTTTCTTTTAAAAAGTATGAAATTTTAACTTTATTTGAAATTAGTAAA
AAAAAAATAAAAAATTTTTTAAATAATAAAAATATTTGTATTTTAAATGATAAAAAATCATTAAGAACAA
TTAATTCACTAATTAATAGTTTTAATTATTTAAATATTAAATATTTGCAAATTTTAAATAATCATAATAT
TAAAAAAGAAAGTTTTAAAGATTTTTCAAGAACAATAGGTTTAAATTTTGATTATTTATATTATAGATGT
TTAAATGACAAAATATTAAAAATTATTGCAAAATATTCAAGTTTAATAATTGTAAACTTATTAAGTAATG
GATATCATCCAATTCAAGCATTAACTGATATTAATAGTTTTTTTTATAATAAAAAAGATGTTTTAATGTA
TATAGGAAATATAACTTCAAATGTAATTAGATCAATAATTATATTATTATCAAAGATAAATTATCTTGTT
GTTTTAATATCACCTATTAAATATTGGTTTAAATTTTTAATAAAAAAAATTTTTCCAAAAAAGAAAATAC
TTATAAGTGAAAAATTAATTTTATTTAAAAAAAAATATTATGTATATACAGATGTTTGGGAATCAATGAA
TAATAAAAATGTAAAAATAACTGATTTTTTAAACTTACAAATTAATAAAAAATTATTTGATTTAATTAAA
ATAAAAAAAGTATTACATTGTATGCCAAGATTTAATAAAAGTTATTTAGATTTTGAAATTTCAAATTTAG
TATTTGAATCAGATTACTTTTTAGTTAATAATTCGATAATTAAAAAAAATAAAATATTTAAAAGTTATAT
TTTTATTAGTAATTCATTTTTTTTTAAAATCATTTAGTTCTTTTAAATTAATATTATAAGATAGTTTGTT
TATATAATCAAAAATTTCATTTTTTTTATATTCAATAATTTTAATAATTTTTTTCATAAACTTAAAATAT
AATTTATTGCATGAAAATATCCATTCTCTTTCATACCTGAAATTACAACATTTGAATTAAGTGAAATTAT
AGAATTTTTTCTATAATATTCTCTATTGTATATATTTGATATATGTAATTCGATAATTTTACCTTTAAAA
ATTTTTATACAATCTAATAAAGCAATTGAATAATGACTATATGCACCTGGATTTATAATAATATAATTAA
AGTTTATATTTTTTTGAATAAAATTAATTATTTTTCCTTCGCAATTTGAATTATAAAATTTAATATTTAT
AATATTTTTTGAGTATTTTAAAATTTTTTTTTTTAATTTTTTAAAAGAAATTTTAGAATAAATTTTTTCT
CTTTTTTTTAAAAAATTAATATTTGGTCCATTTATTATTAATACATTTATAATTTTATTACAAACAAACA
TAGTTTAATTAAAAATTTTTTGTTTAAATTAGTTTTTTTTTTTAGTTCTAGTTCGTTACTAGAATATCCA
AATTTTTTTATGTTTAACACATACGTAAAGTATTTTTTATATTTATACCAAAAATCATCATTTGAAGATT
CAACAAAAATTATTTTTTTACAATTTAAAATTTTGTTTTTATATATTTTTTTTTGTTTATCAAATAATTT
ATTGCAGAATAATGAAATTATTTGAATAATAAAATATTTTTTTAAGAAAAAATAACATTCAAAACAGATT
TCTAAATCTGATCCATTAGAAACAATTATTAATTCTATTTTTTTTTTATAAAAACATGAATAAGTACCAG
TTATAATATTTTTAATATTATATATTTTAATAAAATTGTTTTTAAAATTTTGTCTTGATAAAATTAGAGA
CGAACAATTATTTAAAAATTTCAAAATTAATATCCAACATAGAATTAATTCTATATAATTATATGGTCTA
AATATATAATTTCTTGGTATTATTCTAATTGAATGTAATTGTTCAATTGGTTGATGTGATGGTCCATCTT
CACCAACTAAAATTGAATCATGTGTAAATATAAAAATATTTTTAAGTTTAGATAAACAAAAATTTCTTAT
TGCACTATACATATAATTTGAAAAAACTAAAAAAGTAGAACAATAATTTATTCCTATTTTATCAGAAGAT
AACCCGTAATTTATTAATCCCATTGTAAATTCTCGTACTCCATAATTTATATATCTATTTTTAAAATTTT
TATATCTAATAGAATTAATAAAATTGTTTTTTGTTAAGTTAGAATTTGTTAAATCTGCGCTTCCTCCAAA
TGTTTCATTTATTGCATATATATTTTTTAATATATTAGAACAAACAAATCTAGTAGACTTATTTAAATTT
ATTTTATAGTATTTAAAATATAATTTTAAAAAATTTATTTTTGGTATAATGTTATTAAAAATTCTTATTA
ACTCAAAAAAATATTTTTTATATTTTTTTTTGTAGTATATTAAATATTTTTTTTTATTATCAAAAAACAT
TTTTTTAACATAATCATATGTTAATGTAAAATTTTTTAAAATTTCTAAAAATTCAAATTTTGTAAAAATA
TTTCCATGAGAATTTTCATTATATGATTTACATGGAGAAATAAATCCTATTATAGTATTGTAAATTATAA
TTGTTGGAAAATAACTTTTTTTTGCTTTTAATAAAGATTTAATTATTGAAAAATAGCAATGTCCATTTAT
TGGTCCAATAACATTCCAATTTAATGAAATAAATTTTAACTTAATATTTTCATTAAAATAATTTTTAACA
TTTCCATCTATTGAAATATTATTACTATCATATAATAATATAATATTGTTAATATTATAGCATCCACAAA
AAGAACATGATTCGGAGGACACTCCTTCCATTAAACATCCATCTCCACAAAATATCCAAACTTTATTATT
GAATATATTAAAAAAATTATTAAATTTATTTTTATACTTTTTACTTTTTAAACCAATTCCAATTCCAATT
CCAATTCCTTGTCCTAATGGACCAGTTGAAGCATCAATAAAATTTCCAATTTCAGGATGACCTGGTGTAT
TAGAATTAAACCTTCTAAAATTTATTAAATCTTTTATTTTATATACATTGTATAAATAAAGTAATACATA
ATTTATAATTATTCCATGCCCATTTGAAATTATAAGTTTATCTTTATTAATTGATTTTAAATTGTTAAAA
TTTATTTTATAAAAATTTAAAAAAAAAATCGTAAATACATCACAAATTCCAAGAGGCATACCGGGATGTC
CAGAATTAGCTTTTGAAATTGATTTAATACAAATTAATCTAATATTATTTATTATGTTATATAACATTTT
AAAATTTAAAATTTTTTTTTTCAAAATTTATTCAATTTGTAATTATAAAACAAATACTTTTTCTATTTTA
AATAAAAAAATAAAATATAATTTTTTTTTGAACTTTATTAATTATTATATAAATTATTTAAATTATAACA
ATAAAAAAAAAATTGGAATTTTAATGTATTTTAAAGTATCAAAAGTAATTTCTTCTTTTAACATAGAAAA
AAATGGTATCTTTTTTTTTTCAAACAAGAATGTTTTTTTATATAAAATATTAAAAAATTATGATATAAAC
AATATTTATCACGTAATTAAAATAATTAAAATAAATAAAATAAAGTTTAACTTAAAAATTTTAAAAAAAA
TATTTACAAAAATTTTAAAAAAAAAAAGAAAAGAAGTATATGAAAAATTAGAAGAAAGATATTTAATTAC
AATACTATTAAATAATTTAAACGAAACAAAAAATAAGATTATTAATATTTATAAATCATTAATTAATTAT
AATACTAATAATTTTTTTTTAATTAATAAAGAATTTAACAAAGTATGTTCTTTACTGTATTTAAGTAAAA
ATGAAAGTTTGTCGAAAAAAATTCATTTAGGATTAATAAAAAATAATTTTAAAGAAGAAACTCCTTTTTA
TTTAAATTACATATTTAATTATTTCTTAAAATTTAATGAGCTAAAATTAACAATTTCAATTGAAATTTAT
AACTTAGATATTTTAAAAATAATTAAAACAATCAAAAAAAATAAAAAAATAAAAATTTTCATTAATGTTG
GTATAAATGATTTATTTTTTGAAAAAATTTTTAAAAAAAAAAAAATAATTTTATTTAATTCGTTTAAAAT
AAAAAAAGAATATGGTTATTACGTACAAAATTTTTTTGATGAATATGTTGGATATGGATCATTTAGAAAA
ATGTATTTTAAAATATTTAAAAACAAAAATATTTTTAAGATAAAAATTTGTGCTAAATATTTTTTTTTAA
AAATTTTAAAAACTAAAAATTTAAAAATTTATTTTTTAGATTCTTTAAACAGAAACAATTTAAATAAACA
TATTAGTAATTTACTTACTGGATTTTTTCATCCAAAAATATTTGATAAAAATAATTTTTTTAAAAAAAAA
TATTTTTTTTACAAAAACAATAATATTTTAATAAATAAAAATAATTCTTTTTATTTAGAAATAAAATTTT
TTGTAAATTTTAAAATTTGTAAATATATTAAAAAAAAAATTGTTTTTTTATATAAATTTTTTAACAAAGA
AAGTGAAAATTATATTATAAAAAAAGAAATAAATTTTTGTTTAAATTATCGAATAAAACCAATAACAATT
TATTTTCATGTAGTAAATAAAAAAGTTGAAGAATATATTAATTTTTTAATTTTACAAATTAATTGTAATT
TATCAAAGAAAAATAATTCATATTGTTGGTACTTTGGTAGTAATATTTATAATAGCAATTTTTTTTATAT
TAAAAAATATATATCAAAAAAATGGAATTTTATTATTAAGAAAATCATTTTATTTAAAATAAAAAATTCT
GTTTATTTAAATTTTAAAATTAAAAAAACAAATTTAAAACTAATATCATTAGATAATTTTTTATTAAAAT
TAATAATTAAAAATTGGCAAAAAAAAAATGAAAAATATTAGTTTTGAAATATTTCCTTGTAATAACATTA
AAGACTTATCTGTTTTAATAAATTATTTAAACAAAAATAAACCTAGTTTTGTTTCTGTAACATTTGGAAA
AATCAATAACTTAAAATTTGTTAAAAATATACAAAAACAGATTTCTACAAAAATAATACCACATTTAATA
TGTGATAATATATTTAATATTATTAATTATATAATTTATTTTATTAAAATAAAAATATTTAATTTTTTAA
TAATTACAGGAGACAAAAACAAAAATAATTCTATAAAATATATTTATTTTATTAGATTTTTGTTTGGTCA
TATAATTAAGATAATAACAGGATGTTATTTTGAAAATCACAAATTTTCTAAAAATTTTAAAAACGAAATT
TTATTTCATTATAAAAAAAATAAAATAGGAACTAATATGTGTATTACACAGTTTTTTTATAATTTTAACA
CAATAAAGTATTACATTAATATTATTAAAAAAACTGGTATTAGTAAAAATTTTATATTAGGAATAATTTC
AAAAAAAAATATAAAAGATATTTTAAATTATACTAATTTATGTAAAATAGATATTCCAATTTGGATAATT
AAAAATTATAAAGAATTTAATATTGAACTTTTTTTTGTTAAAAATTTAAAAAAATACAAAAATTTGCATT
TTTATACTTTTAACAATATTAATTTAATTAAAAATTATTTTAAATAAATTTTATTGTTATAAAATAAGTA
TACAAAATAATTAATAATAAAAAAAAATTTTTTATTAATAAAAAAAAAAATTTTTTTTATTAAAAAGTTT
CTAACAAAATTTAAAACATTTACTTTAATCATTTAAATTATTTTAAAAAAAAAAAAAATAAACAATTCAT
TATACTAAAAATAGTTAAAATTTAATTTTTAAATTACTTTATTAAACTTGATATTTTTAAAAAAAAAAAA

… and more

FASTA file of genes

>CRP_004 F0F1-type ATP synthase C subunit 
ATGAATAATTTATTAATATTATCTTCATCAATAATGATAGGATTATCATCTATTGGAACAGGTATAGGAT
TTGGAATTTTAGGAGGAAAACTTTTAGATTCCATATCAAGACAACCAGAATTAGATAATTTATTATTAAC
TAGAACTTTTTTAATGACAGGATTATTAGATGCTATTCCAATGATAAGCGTAGGTATAGGTTTATACTTA
ATATTTGTTTTATCAAATAAATAA
>CRP_005 putative F0F1-type ATP synthase B subunit 
ATGAATTTCAATTATACTATTATTAATGAATTTGTATCTTTTTTAATTTTTTTTTATGTTTCATTTAAAA
TTATATTTCCAGTTATATTAAAAAAAATAAATAATTTTTTAATAATTGATTATAAAAATTTTGTTTTTAA
CAATCAAGAAAAAATTATTAAAAAAAAATTATTAGATGAAATAGTTAAAAACGAAAATTTAACAAATAAG
AAATTTATATCTTTAATAGAAAAAATAAAAAAAAGTATTTTATTAGAAAAACAAAATTTTATTAATTTTA
TAAAATTAGAAAAAATAAACGTTCTAAAAATTTTTAAAAAAAAAATATTAAATAATAATATGTTAATTAT
TAAAAACTTTTTAATTGAGATTAAAAAATTGTTTATAAATAGCTTTAAAAATATTTTTAATGAAATTATT
TGTTATAACAATGAATTTATAATTAATTATGTTTAA
>CRP_006 hypothetical protein 
ATGTTTAAATTTATAAACAGGTTTTTAAATTTAAAAAAAAGATATTTTTATATTTTTTTAATAAATTTTT
TTTATTTTTTTAATAAATGTAATTTTATTAAAAAAAAAAAAATATATAAAAAAATAATTACTAAAAAATT
TGAAAATTATTTATTAAAATTAATTATTCAAAAATATGCTAAATGA
>CRP_007 F0F1-type ATP synthase alpha subunit 
ATGCTAAATGAAGGAATAATAAACAAAATTTATGATAGTGTAGTTGAAGTTCTTGGATTGAAAAATGCTA
AATATGGTGAAATGATTTTATTTAGTAAAAATATTAAAGGAATAGTATTCAGTTTAAACAAAAAAAATGT
AAATATAATTATATTAAATAATTATAACGAGTTAACACAAGGAGAAAAATGTTATTGCACAAACAAAATA
TTTGAAGTTCCTGTTGGAAAACAATTAATAGGTAGAATAATAAATTCTAGAGGAGAAACTCTCGATTTGT
TACCAGAAATTAAAATAAATGAATTTTCACCTATTGAAAAAATAGCACCAGGTGTTATGGATAGAGAAAC
AGTAAATGAGCCATTATTAACTGGAATAAAATCTATTGATTCAATGATTCCTATTGGAAAAGGACAACGA
GAATTAATTATTGGTGATAGACAAACTGGAAAAACTACAATTTGTATTGATACTATTATTAATCAAAAAA
ATAAAAATATTATTTGTGTTTATGTTTGTATAGGTCAAAAAATATCTTCTTTAATAAATATTATTAATAA
GCTTAAAAAATTTAATTGCTTAGAATATACAATTATTGTAGCTTCAACTGCCTCAGATAGTGCAGCGGAG
CAGTATATTGCTCCATATACTGGAAGCACAATAAGTGAATATTTTCGTGATAAAGGACAAGATTGCCTAA
TTGTTTATGATGATTTAACAAAACATGCTTGGGCATATAGACAAATTTCTTTACTATTAAGACGTCCACC
TGGTCGTGAAGCTTATCCTGGTGATGTATTTTATCTTCATTCAAGATTATTAGAAAGATCATCTAAAGTG
AACAAATTTTTTGTAAATAAAAAATCTAATATTTTAAAAGCAGGTTCTTTAACTGCATTTCCTATAATTG
AAACTTTAGAAGGAGACGTAACTTCTTTTATTCCAACAAATGTTATTTCTATAACTGATGGTCAAATTTT
TTTAGATACAAATTTATTTAATTCAGGAATTAGACCATCAATAAACGTTGGATTATCTGTTTCTAGAGTT
GGTGGCGCTGCTCAATATAAAATTATTAAAAAATTAAGTGGAGACATTAGAATTATGTTAGCTCAGTATA
GAGAATTAGAAGCATTTTCTAAATTTTCATCCGATCTTGATAGTGAAACTAAAAATCAATTAATAATTGG
AGAAAAAATAACAATATTAATGAAACAAAATATACATGATGTTTATGATATATTTGAATTAATATTAATA
TTATTGATAATTAAACATGATTTTTTTAGACTAATTCCAATAAACCAAGTTGAATATTTTGAAAATAAAA
TTATAAATTATTTAAGAAAAATTAAATTTAAAAATCAAATTGAAATTGACAACAAAAATTTAGAAAATTG
TTTAAACGAATTAATAAGTTTTTTTATATCAAACAGTATATTATGA
>CRP_008 F0F1-type ATP synthase gamma subunit 
ATGATTATTAAAGAAATAAATAGTAAAATAAAAATAACAACAAATATCAATAAATTAACTAATACTTTGA
GTATGATTTCATTGTCTAAAATGAATAAATATATAAATTTAATTAATAATTTAGATTATATTAACATTGA
ATTAAAAAAAATTTTAGAATATATTATTATTAACATTAAAAGTAACGTATTTTGTTTAATAATAATTACT
TCAAACAAAGGATTGTGTGGAAATTTAAATAATGAAATTATTAAATACTCGCTTAATTATATTAAAAACA
ATAAAAATTTAGATTTAATTTTAATAGGAAAAAAAGGAATAGATTTTTTTAATAAAAAAAATTTTTATAT
TAAAGAAAAAATAATTTTTAAAGACAATGAATTAAAAAATTTAGTTTTTAATAATAAAATTTTAAATGAT
TTAAAAAAATACGAAAATATTTTTTTTATTAGTTCAAAAATTATTAAAAATAACGTTAAAATAATAAAAA
CAGATTTGTATTTAAAAAAAAAATATAATTATTTAATAAAACATAATTTTAATTATGATTGTTTTTTAAA
AAATTTTTATAATTATAATTTAAAATGTTTGTATTTAAATAACTTGTTTTGTGAATTAAAATCTAGAATG
ATTACAATGAAGTCTGCTGCTGATAATTCAAAAAAAATAATTAAAGACATGAAATTAATAAAAAATAAAA
TTAGACAATTTAAAGTTACTCAAGATATGCTTGAAATAATAAATGGAAGTAATTTATGA
>CRP_009 F0F1-type ATP synthase beta subunit 
ATGATAGGAAGAATTGTACAAATTTTAGGTTCTATAGTAGACGTTGAATTTAAAAAAAACAATATTCCAT
ATATATATAATGCTTTATTTATTAAAGAATTTAATTTATATTTAGAAGTTCAACAACAAATTGGAAATAA
TATTGTAAGAACTATAGCTTTAGGTAGTACCTATGGATTAAAAAGATATCTTTTAGTAATAGATACTAAA
AAACCAATTTTAACTCCTGTTGGAAATTGTACTTTAGGACGTATATTGAATGTTTTAGGTAATCCCATTG
ATAATAATGGTGAAATTATTTCAAACAAAAAAAAACCAATACATTGTTCACCGCCAAAATTTTCAGATCA
AGTATTTTCAAATAATATATTAGAAACTGGAATAAAAGTAATAGATTTATTGTGTCCATTTTTAAGAGGA
GGAAAAATTGGTTTATTTGGTGGAGCAGGTGTTGGTAAAACTATAAATATGATGGAATTAATAAGAAATA
TTGCAATTGAACATAAAGGATGTTCTGTATTTATAGGAGTTGGTGAAAGAACTCGTGAAGGAAATGATTT
TTATTATGAAATGAAAGAATCAAATGTATTAGACAAAGTTTCTTTAATATATGGTCAAATGAATGAACCT
TCAGGTAATAGATTAAGAGTTGCATTAACTGGATTAAGTATAGCAGAAGAATTTAGAGAAATGGGTAAAG
ATGTACTTTTATTTATAGATAATATTTACAGATTTACGTTAGCAGGTACTGAAATTTCAGCATTATTGGG
AAGAATGCCTTCAGCTGTTGGATATCAGCCTACTTTAGCAGAAGAAATGGGAAAATTACAAGAAAGAATT
TCTTCAACAAAAAATGGAAGTATTACTTCAGTACAAGCTATATACGTACCTGCTGATGATTTAACAGATC
CATCTCCAAGTACTACTTTTACTCATTTAGATTCTACTATTGTTTTGTCTAGACAAATAGCGGAATTAGG
AATTTATCCTGCTATTGATCCATTAGAATCTTATTCTAAACAATTAGATCCTTATATAGTAGGAATTGAA
CATTATGAAATTGCTAATTCTGTAAAATTTTATTTACAAAAATATAAAGAATTAAAAGATACAATAGCTA
TTTTAGGAATGGACGAATTATCAGAAAATGATCAAATTATTGTTAAAAGAGCAAGAAAGTTGCAAAGATT
TTTTTCTCAACCTTTTTTTGTTGGTGAAATATTTACAGGAATAAAAGGAGAATATGTAAATATAAAAGAT
ACAATTCAATGTTTTAAAAATATTTTAAATGGTGAATTTGATAATATTAATGAAAAAAATTTTTATATGA
TAGGAAAAATATGA
>CRP_010 hypothetical protein 
ATGAATTTATTAATTTTAAGTATAAAAAATATTATAGAATATAAAAATGCTTCTATATTAAATGTAAAAA
CATACTTAAAACTTTTTTCAATTATGAATAATCATATAAATAATATTTGCGATGTTAATCAAATTAAGTT
AATATTTAAAAATAAAATCATAAATATAAGAATTAATAATGGTTTTTTATTTCAAAAAAAAAATAATACT
AAAATAATATGTAATTTTTATGAATTTTTATAA

FASTA file (amino acids)

>CRP_004 F0F1-type ATP synthase C subunit
MNNLLILSSSIMIGLSSIGTGIGFGILGGKLLDSISRQPELDNLLLTRTFLMTGLLDAIPMISVGIGLYL
IFVLSNK
>CRP_005 putative F0F1-type ATP synthase B subunit
MNFNYTIINEFVSFLIFFYVSFKIIFPVILKKINNFLIIDYKNFVFNNQEKIIKKKLLDEIVKNENLTNK
KFISLIEKIKKSILLEKQNFINFIKLEKINVLKIFKKKILNNNMLIIKNFLIEIKKLFINSFKNIFNEII
CYNNEFIINYV
>CRP_006 hypothetical protein
MFKFINRFLNLKKRYFYIFLINFFYFFNKCNFIKKKKIYKKIITKKFENYLLKLIIQKYAK
>CRP_007 F0F1-type ATP synthase alpha subunit
MLNEGIINKIYDSVVEVLGLKNAKYGEMILFSKNIKGIVFSLNKKNVNIIILNNYNELTQGEKCYCTNKI
FEVPVGKQLIGRIINSRGETLDLLPEIKINEFSPIEKIAPGVMDRETVNEPLLTGIKSIDSMIPIGKGQR
ELIIGDRQTGKTTICIDTIINQKNKNIICVYVCIGQKISSLINIINKLKKFNCLEYTIIVASTASDSAAE
QYIAPYTGSTISEYFRDKGQDCLIVYDDLTKHAWAYRQISLLLRRPPGREAYPGDVFYLHSRLLERSSKV
NKFFVNKKSNILKAGSLTAFPIIETLEGDVTSFIPTNVISITDGQIFLDTNLFNSGIRPSINVGLSVSRV
GGAAQYKIIKKLSGDIRIMLAQYRELEAFSKFSSDLDSETKNQLIIGEKITILMKQNIHDVYDIFELILI
LLIIKHDFFRLIPINQVEYFENKIINYLRKIKFKNQIEIDNKNLENCLNELISFFISNSIL
>CRP_008 F0F1-type ATP synthase gamma subunit
MIIKEINSKIKITTNINKLTNTLSMISLSKMNKYINLINNLDYINIELKKILEYIIINIKSNVFCLIIIT
SNKGLCGNLNNEIIKYSLNYIKNNKNLDLILIGKKGIDFFNKKNFYIKEKIIFKDNELKNLVFNNKILND
LKKYENIFFISSKIIKNNVKIIKTDLYLKKKYNYLIKHNFNYDCFLKNFYNYNLKCLYLNNLFCELKSRM
ITMKSAADNSKKIIKDMKLIKNKIRQFKVTQDMLEIINGSNL
>CRP_009 F0F1-type ATP synthase beta subunit
MIGRIVQILGSIVDVEFKKNNIPYIYNALFIKEFNLYLEVQQQIGNNIVRTIALGSTYGLKRYLLVIDTK
KPILTPVGNCTLGRILNVLGNPIDNNGEIISNKKKPIHCSPPKFSDQVFSNNILETGIKVIDLLCPFLRG
GKIGLFGGAGVGKTINMMELIRNIAIEHKGCSVFIGVGERTREGNDFYYEMKESNVLDKVSLIYGQMNEP
SGNRLRVALTGLSIAEEFREMGKDVLLFIDNIYRFTLAGTEISALLGRMPSAVGYQPTLAEEMGKLQERI
SSTKNGSITSVQAIYVPADDLTDPSPSTTFTHLDSTIVLSRQIAELGIYPAIDPLESYSKQLDPYIVGIE
HYEIANSVKFYLQKYKELKDTIAILGMDELSENDQIIVKRARKLQRFFSQPFFVGEIFTGIKGEYVNIKD
TIQCFKNILNGEFDNINEKNFYMIGKI
>CRP_010 hypothetical protein
MNLLILSIKNIIEYKNASILNVKTYLKLFSIMNNHINNICDVNQIKLIFKNKIINIRINNGFLFQKKNNT
KIICNFYEFL

Files in FASTA format are text files

  • one byte, one letter
  • no colors, fonts, or styles
  • universal and forever

Rules of FASTA files

  • File has one or more sequences
  • Each sequence starts with >
  • The first word after > is the sequence identifier
    • all sequences must have an identifier
  • The rest of the line contains comments, for humans
  • The sequence starts on the second line and continues until the next > or the end of the file
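To make these rules concrete, here is a minimal sketch of a FASTA reader written in base R. The function `parse_fasta` is hypothetical and is meant only to show how the rules translate into code. It assumes a small, well-formed file. The toy file uses two identifiers from the gene examples, with shortened sequences.

```r
# Write a tiny FASTA file to a temporary location
fasta_file <- tempfile(fileext = ".fasta")
writeLines(c(
  ">CRP_004 F0F1-type ATP synthase C subunit",
  "ATGAATAATTTATTAATATTATCTTCATCA",
  "ATAATGATAGGATTATCATC",
  ">CRP_006 hypothetical protein",
  "ATGTTTAAATTTATAAACAGG"
), fasta_file)

# Hypothetical helper: read a FASTA file into a named character vector
parse_fasta <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]            # ignore empty lines
  is_header <- startsWith(lines, ">")      # each sequence starts with ">"
  headers <- sub("^>", "", lines[is_header])
  ids <- sub(" .*$", "", headers)          # identifier: first word after ">"
  # Label every line with the number of the header that precedes it
  record <- factor(cumsum(is_header), levels = seq_along(ids))
  # Join the sequence lines of each record into one string
  seqs <- tapply(lines[!is_header], record[!is_header], paste, collapse = "")
  seqs <- as.character(seqs)
  names(seqs) <- ids
  seqs
}

genes <- parse_fasta(fasta_file)
print(names(genes))
print(nchar(genes))
```

The comment part of each header line is discarded here, since it is meant for humans. Only the identifier and the sequence are kept.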

Molecular Sequences
and how to find them

In most cases we get our sequences from NCBI

NCBI stores all public biological sequences at
https://www.ncbi.nlm.nih.gov/nuccore

Anybody can upload sequences, and they may be wrong

NCBI has a curation process to validate the sequences

If a sequence is good enough to be a reference, then it is stored in the RefSeq collection https://www.ncbi.nlm.nih.gov/refseq

It is easy if you know the Accession Number

Accession numbers are the best way to identify a biological sequence

  • Candidatus Carsonella ruddii PV DNA has accession AP009180
  • Escherichia coli str. K-12 substr. MG1655 has accession NC_000913

Different sequences can have the same name, but never the same accession
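When the accession number is known, the download can also be scripted. The sketch below is an assumption on our part, not a method described in these slides. It builds a URL for NCBI's E-utilities `efetch` service, asking for a nucleotide record in FASTA format. The helper name `fasta_url` and the destination file name are hypothetical.

```r
# Hypothetical helper: build an E-utilities "efetch" URL that asks for a
# nucleotide record in FASTA format, given its accession number
fasta_url <- function(accession) {
  paste0(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
    "?db=nuccore",
    "&id=", accession,
    "&rettype=fasta",
    "&retmode=text"
  )
}

carsonella_url <- fasta_url("AP009180")
print(carsonella_url)

# Uncomment to download the file (this needs an internet connection)
# download.file(carsonella_url, destfile = "AP009180.fasta")
```

Building the URL from the accession number, rather than from the organism name, follows the idea that the accession is the unambiguous way to point at a sequence.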

Accession numbers are identifiers

This is an important idea

Everything needs a name. A unique name.

We assign an identifier (or id) to each thing

If two things have the same id, then they are the same thing.
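A quick way to check that a set of identifiers really is unique is sketched below in base R. The identifiers come from the gene examples shown earlier. The second vector is a made-up case with a deliberate repetition.

```r
# Identifiers taken from the gene examples
ids <- c("CRP_004", "CRP_005", "CRP_006", "CRP_007")

# TRUE when no identifier appears more than once
ids_are_unique <- anyDuplicated(ids) == 0
print(ids_are_unique)

# A made-up list with a repeated identifier fails the same check
bad_ids <- c("CRP_004", "CRP_005", "CRP_004")
bad_are_unique <- anyDuplicated(bad_ids) == 0
repeated <- unique(bad_ids[duplicated(bad_ids)])
print(bad_are_unique)
print(repeated)
```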

When you find the correct sequence…

 

You download the FASTA file

Store it on your computer, and change the name

Our favorite organisms

In this course we use two sequences for most of the examples

  • Escherichia coli str. K-12 substr. MG1655
    • because it is the model organism for bacteria
  • Candidatus Carsonella ruddii PV DNA
    • because it has a small genome, so it is good for examples

Candidatus Carsonella ruddii has one of the smallest genomes known

We use this genome in classes because it is a small example

Candidatus Carsonella ruddii is an obligate symbiont of Pachypsylla venusta.

It is not clear if this is a living cell or simply an organelle. It is missing genes needed for living independently.

Published as “The 160-kilobase genome of the bacterial endosymbiont Carsonella.” https://www.ncbi.nlm.nih.gov/pubmed/17038615

For simple cases you can use data from my blog

Analyzing Biological Sequences in R

Reading FASTA in R

To handle sequence data in R, we use the seqinr library

You have to install it once.

install.packages("seqinr")

Then you have to load it in every session

library(seqinr)

Read FASTA formatted files

read.fasta(file, seqtype = "DNA", set.attributes = TRUE, ...)

Important inputs of this function

  • file: the name of the file from which the sequences in FASTA format are read
  • seqtype: the type of sequence, "DNA" or "AA". The default is "DNA"
  • set.attributes: if TRUE, gets extra data. We will choose FALSE
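The sketch below shows these inputs in use on a toy DNA file, so it does not depend on any download. The file content is made up for illustration (the first 20 bases of the example genome under a hypothetical identifier, `toy_1`). The call to `read.fasta` only runs if seqinr is installed.

```r
# Write a toy FASTA file to a temporary location
fasta_file <- tempfile(fileext = ".fasta")
writeLines(c(">toy_1 first 20 bases of the example genome",
             "ATGAATACTATATTTTCAAG"), fasta_file)

if (requireNamespace("seqinr", quietly = TRUE)) {
  genome <- seqinr::read.fasta(fasta_file, seqtype = "DNA",
                               set.attributes = FALSE)
  # The result is a list with one element per sequence, named by identifier
  print(names(genome))
  # Each element is a vector with one letter per position
  print(length(genome[[1]]))
  # By default read.fasta converts DNA letters to lower case
  print(table(genome[[1]]))
}
```

With `set.attributes = FALSE` each element is a plain character vector, which is easier to inspect and to pass to functions such as `table()`.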

Read FASTA formatted files

library(seqinr)
proteins <- read.fasta("AP009180.faa", seqtype="AA", set.attributes = FALSE)
proteins[1:10]
$`lcl|AP009180.1_prot_BAF35032.1_1`
  [1] "M" "N" "T" "I" "F" "S" "R" "I" "T" "P" "L" "G" "N" "G" "T" "L"
 [17] "C" "V" "I" "R" "I" "S" "G" "K" "N" "V" "K" "F" "L" "I" "Q" "K"
 [33] "I" "V" "K" "K" "N" "I" "K" "E" "K" "I" "A" "T" "F" "S" "K" "L"
 [49] "F" "L" "D" "K" "E" "C" "V" "D" "Y" "A" "M" "I" "I" "F" "F" "K"
 [65] "K" "P" "N" "T" "F" "T" "G" "E" "D" "I" "I" "E" "F" "H" "I" "H"
 [81] "N" "N" "E" "T" "I" "V" "K" "K" "I" "I" "N" "Y" "L" "L" "L" "N"
 [97] "K" "A" "R" "F" "A" "K" "A" "G" "E" "F" "L" "E" "R" "R" "Y" "L"
[113] "N" "G" "K" "I" "S" "L" "I" "E" "C" "E" "L" "I" "N" "N" "K" "I"
[129] "L" "Y" "D" "N" "E" "N" "M" "F" "Q" "L" "T" "K" "N" "S" "E" "K"
[145] "K" "I" "F" "L" "C" "I" "I" "K" "N" "L" "K" "F" "K" "I" "N" "S"
[161] "L" "I" "I" "C" "I" "E" "I" "A" "N" "F" "N" "F" "S" "F" "F" "F"
[177] "F" "N" "D" "F" "L" "F" "I" "K" "Y" "T" "F" "K" "K" "L" "L" "K"
[193] "L" "L" "K" "I" "L" "I" "D" "K" "I" "T" "V" "I" "N" "Y" "L" "K"
[209] "K" "N" "F" "T" "I" "M" "I" "L" "G" "R" "R" "N" "V" "G" "K" "S"
[225] "T" "L" "F" "N" "K" "I" "C" "A" "Q" "Y" "D" "S" "I" "V" "T" "N"
[241] "I" "P" "G" "T" "T" "K" "N" "I" "I" "S" "K" "K" "I" "K" "I" "L"
[257] "S" "K" "K" "I" "K" "M" "M" "D" "T" "A" "G" "L" "K" "I" "R" "T"
[273] "K" "N" "L" "I" "E" "K" "I" "G" "I" "I" "K" "N" "I" "N" "K" "I"
[289] "Y" "Q" "G" "N" "L" "I" "L" "Y" "M" "I" "D" "K" "F" "N" "I" "K"
[305] "N" "I" "F" "F" "N" "I" "P" "I" "D" "F" "I" "D" "K" "I" "K" "L"
[321] "N" "E" "L" "I" "I" "L" "V" "N" "K" "S" "D" "I" "L" "G" "K" "E"
[337] "E" "G" "V" "F" "K" "I" "K" "N" "I" "L" "I" "I" "L" "I" "S" "S"
[353] "K" "N" "G" "T" "F" "I" "K" "N" "L" "K" "C" "F" "I" "N" "K" "I"
[369] "V" "D" "N" "K" "D" "F" "S" "K" "N" "N" "Y" "S" "D" "V" "K" "I"
[385] "L" "F" "N" "K" "F" "S" "F" "F" "Y" "K" "E" "F" "S" "C" "N" "Y"
[401] "D" "L" "V" "L" "S" "K" "L" "I" "D" "F" "Q" "K" "N" "I" "F" "K"
[417] "L" "T" "G" "N" "F" "T" "N" "K" "K" "I" "I" "N" "S" "C" "F" "R"
[433] "N" "F" "C" "I" "G" "K"

$`lcl|AP009180.1_prot_BAF35033.1_2`
  [1] "M" "N" "I" "F" "N" "I" "I" "I" "I" "G" "A" "G" "H" "S" "G" "I"
 [17] "E" "A" "A" "I" "S" "A" "S" "K" "I" "C" "N" "K" "I" "K" "I" "I"
 [33] "T" "S" "N" "L" "E" "N" "L" "G" "I" "M" "S" "C" "N" "P" "S" "I"
 [49] "G" "G" "I" "G" "K" "S" "H" "L" "V" "K" "E" "L" "E" "L" "F" "G"
 [65] "G" "I" "M" "P" "E" "A" "S" "D" "Y" "S" "R" "I" "H" "S" "K" "L"
 [81] "L" "N" "Y" "K" "K" "G" "E" "S" "V" "H" "S" "L" "R" "Y" "Q" "I"
 [97] "D" "R" "I" "L" "Y" "K" "N" "Y" "I" "L" "K" "I" "L" "F" "L" "K"
[113] "K" "N" "I" "L" "I" "E" "Q" "N" "E" "I" "N" "K" "I" "I" "R" "F"
[129] "K" "K" "K" "I" "L" "I" "F" "N" "K" "L" "K" "F" "F" "N" "I" "A"
[145] "K" "I" "I" "I" "V" "C" "A" "G" "T" "F" "I" "N" "S" "K" "I" "Y"
[161] "I" "G" "K" "N" "I" "K" "A" "L" "N" "K" "A" "E" "K" "K" "S" "I"
[177] "S" "Y" "S" "F" "K" "K" "I" "N" "L" "F" "I" "S" "K" "L" "K" "T"
[193] "G" "T" "P" "P" "R" "L" "D" "L" "N" "Y" "L" "N" "Y" "K" "K" "L"
[209] "S" "V" "Q" "Y" "S" "D" "Y" "T" "I" "S" "Y" "G" "K" "N" "F" "N"
[225] "F" "N" "N" "N" "V" "K" "C" "F" "I" "T" "N" "T" "D" "N" "K" "I"
[241] "N" "N" "F" "I" "K" "K" "N" "I" "K" "N" "S" "S" "L" "F" "N" "L"
[257] "K" "F" "K" "S" "I" "G" "P" "R" "Y" "C" "P" "S" "I" "E" "D" "K"
[273] "I" "F" "K" "F" "P" "N" "N" "K" "N" "H" "Q" "I" "F" "L" "E" "P"
[289] "E" "S" "Y" "F" "S" "K" "E" "I" "Y" "V" "N" "G" "L" "S" "N" "S"
[305] "L" "S" "Y" "N" "I" "Q" "K" "K" "L" "I" "K" "K" "I" "L" "G" "I"
[321] "K" "K" "S" "Y" "I" "I" "R" "Y" "A" "Y" "N" "I" "Q" "Y" "D" "Y"
[337] "F" "D" "P" "R" "C" "L" "K" "I" "S" "L" "N" "I" "K" "F" "A" "N"
[353] "N" "I" "F" "L" "A" "G" "Q" "I" "N" "G" "T" "T" "G" "Y" "E" "E"
[369] "A" "S" "S" "Q" "G" "F" "V" "A" "G" "I" "N" "S" "A" "R" "K" "I"
[385] "L" "K" "L" "P" "L" "W" "K" "P" "K" "K" "W" "N" "S" "Y" "I" "G"
[401] "V" "L" "L" "Y" "D" "L" "T" "N" "F" "G" "I" "Q" "E" "P" "Y" "R"
[417] "I" "F" "T" "S" "K" "S" "D" "N" "R" "L" "F" "L" "R" "F" "D" "N"
[433] "A" "I" "F" "R" "L" "I" "N" "I" "S" "Y" "Y" "L" "G" "C" "L" "P"
[449] "I" "V" "K" "F" "K" "Y" "Y" "N" "S" "L" "I" "Y" "K" "F" "Y" "K"
[465] "N" "L" "I" "N" "I" "R" "K" "I" "K" "L" "F" "D" "N" "F" "Y" "L"
[481] "F" "K" "L" "I" "I" "I" "M" "S" "K" "Y" "Y" "G" "Y" "I" "K" "K"
[497] "K" "Y" "F" "K"

$`lcl|AP009180.1_prot_BAF35034.1_3`
  [1] "M" "V" "I" "L" "K" "K" "N" "I" "L" "N" "N" "F" "L" "N" "F" "K"
 [17] "I" "I" "D" "L" "N" "L" "I" "I" "L" "L" "L" "F" "I" "H" "L" "I"
 [33] "V" "F" "Y" "L" "L" "K" "N" "N" "N" "L" "M" "I" "L" "L" "S" "I"
 [49] "Y" "L" "N" "N" "F" "I" "K" "N" "S" "I" "N" "L" "N" "S" "R" "N"
 [65] "I" "I" "F" "F" "F" "S" "L" "V" "L" "F" "N" "I" "I" "L" "F" "S"
 [81] "N" "F" "I" "D" "L" "F" "P" "N" "N" "L" "I" "K" "N" "F" "L" "N"
 [97] "L" "K" "Q" "I" "E" "I" "V" "P" "T" "S" "N" "I" "N" "I" "T" "F"
[113] "C" "F" "S" "I" "I" "S" "F" "L" "I" "I" "I" "M" "L" "T" "H" "K"
[129] "K" "I" "G" "F" "K" "K" "Y" "I" "Y" "S" "F" "F" "I" "Y" "P" "I"
[145] "N" "T" "E" "Y" "L" "Y" "L" "F" "N" "F" "I" "I" "E" "S" "I" "S"
[161] "Y" "I" "M" "K" "P" "I" "S" "L" "S" "L" "R" "L" "F" "G" "N" "I"
[177] "F" "S" "S" "E" "I" "I" "F" "N" "I" "I" "N" "N" "M" "N" "V" "F"
[193] "I" "N" "S" "F" "L" "N" "L" "I" "W" "G" "I" "F" "H" "F" "I" "I"
[209] "L" "P" "L" "Q" "S" "F" "I" "F" "I" "T" "L" "V" "I" "I" "Y" "V"
[225] "S" "Q" "T" "L" "N" "H"

$`lcl|AP009180.1_prot_BAF35035.1_4`
 [1] "M" "N" "N" "L" "L" "I" "L" "S" "S" "S" "I" "M" "I" "G" "L" "S"
[17] "S" "I" "G" "T" "G" "I" "G" "F" "G" "I" "L" "G" "G" "K" "L" "L"
[33] "D" "S" "I" "S" "R" "Q" "P" "E" "L" "D" "N" "L" "L" "L" "T" "R"
[49] "T" "F" "L" "M" "T" "G" "L" "L" "D" "A" "I" "P" "M" "I" "S" "V"
[65] "G" "I" "G" "L" "Y" "L" "I" "F" "V" "L" "S" "N" "K"

$`lcl|AP009180.1_prot_BAF35036.1_5`
  [1] "M" "N" "F" "N" "Y" "T" "I" "I" "N" "E" "F" "V" "S" "F" "L" "I"
 [17] "F" "F" "Y" "V" "S" "F" "K" "I" "I" "F" "P" "V" "I" "L" "K" "K"
 [33] "I" "N" "N" "F" "L" "I" "I" "D" "Y" "K" "N" "F" "V" "F" "N" "N"
 [49] "Q" "E" "K" "I" "I" "K" "K" "K" "L" "L" "D" "E" "I" "V" "K" "N"
 [65] "E" "N" "L" "T" "N" "K" "K" "F" "I" "S" "L" "I" "E" "K" "I" "K"
 [81] "K" "S" "I" "L" "L" "E" "K" "Q" "N" "F" "I" "N" "F" "I" "K" "L"
 [97] "E" "K" "I" "N" "V" "L" "K" "I" "F" "K" "K" "K" "I" "L" "N" "N"
[113] "N" "M" "L" "I" "I" "K" "N" "F" "L" "I" "E" "I" "K" "K" "L" "F"
[129] "I" "N" "S" "F" "K" "N" "I" "F" "N" "E" "I" "I" "C" "Y" "N" "N"
[145] "E" "F" "I" "I" "N" "Y" "V"

$`lcl|AP009180.1_prot_BAF35037.1_6`
 [1] "M" "F" "K" "F" "I" "N" "R" "F" "L" "N" "L" "K" "K" "R" "Y" "F"
[17] "Y" "I" "F" "L" "I" "N" "F" "F" "Y" "F" "F" "N" "K" "C" "N" "F"
[33] "I" "K" "K" "K" "K" "I" "Y" "K" "K" "I" "I" "T" "K" "K" "F" "E"
[49] "N" "Y" "L" "L" "K" "L" "I" "I" "Q" "K" "Y" "A" "K"

$`lcl|AP009180.1_prot_BAF35038.1_7`
  [1] "M" "L" "N" "E" "G" "I" "I" "N" "K" "I" "Y" "D" "S" "V" "V" "E"
 [17] "V" "L" "G" "L" "K" "N" "A" "K" "Y" "G" "E" "M" "I" "L" "F" "S"
 [33] "K" "N" "I" "K" "G" "I" "V" "F" "S" "L" "N" "K" "K" "N" "V" "N"
 [49] "I" "I" "I" "L" "N" "N" "Y" "N" "E" "L" "T" "Q" "G" "E" "K" "C"
 [65] "Y" "C" "T" "N" "K" "I" "F" "E" "V" "P" "V" "G" "K" "Q" "L" "I"
 [81] "G" "R" "I" "I" "N" "S" "R" "G" "E" "T" "L" "D" "L" "L" "P" "E"
 [97] "I" "K" "I" "N" "E" "F" "S" "P" "I" "E" "K" "I" "A" "P" "G" "V"
[113] "M" "D" "R" "E" "T" "V" "N" "E" "P" "L" "L" "T" "G" "I" "K" "S"
[129] "I" "D" "S" "M" "I" "P" "I" "G" "K" "G" "Q" "R" "E" "L" "I" "I"
[145] "G" "D" "R" "Q" "T" "G" "K" "T" "T" "I" "C" "I" "D" "T" "I" "I"
[161] "N" "Q" "K" "N" "K" "N" "I" "I" "C" "V" "Y" "V" "C" "I" "G" "Q"
[177] "K" "I" "S" "S" "L" "I" "N" "I" "I" "N" "K" "L" "K" "K" "F" "N"
[193] "C" "L" "E" "Y" "T" "I" "I" "V" "A" "S" "T" "A" "S" "D" "S" "A"
[209] "A" "E" "Q" "Y" "I" "A" "P" "Y" "T" "G" "S" "T" "I" "S" "E" "Y"
[225] "F" "R" "D" "K" "G" "Q" "D" "C" "L" "I" "V" "Y" "D" "D" "L" "T"
[241] "K" "H" "A" "W" "A" "Y" "R" "Q" "I" "S" "L" "L" "L" "R" "R" "P"
[257] "P" "G" "R" "E" "A" "Y" "P" "G" "D" "V" "F" "Y" "L" "H" "S" "R"
[273] "L" "L" "E" "R" "S" "S" "K" "V" "N" "K" "F" "F" "V" "N" "K" "K"
[289] "S" "N" "I" "L" "K" "A" "G" "S" "L" "T" "A" "F" "P" "I" "I" "E"
[305] "T" "L" "E" "G" "D" "V" "T" "S" "F" "I" "P" "T" "N" "V" "I" "S"
[321] "I" "T" "D" "G" "Q" "I" "F" "L" "D" "T" "N" "L" "F" "N" "S" "G"
[337] "I" "R" "P" "S" "I" "N" "V" "G" "L" "S" "V" "S" "R" "V" "G" "G"
[353] "A" "A" "Q" "Y" "K" "I" "I" "K" "K" "L" "S" "G" "D" "I" "R" "I"
[369] "M" "L" "A" "Q" "Y" "R" "E" "L" "E" "A" "F" "S" "K" "F" "S" "S"
[385] "D" "L" "D" "S" "E" "T" "K" "N" "Q" "L" "I" "I" "G" "E" "K" "I"
[401] "T" "I" "L" "M" "K" "Q" "N" "I" "H" "D" "V" "Y" "D" "I" "F" "E"
[417] "L" "I" "L" "I" "L" "L" "I" "I" "K" "H" "D" "F" "F" "R" "L" "I"
[433] "P" "I" "N" "Q" "V" "E" "Y" "F" "E" "N" "K" "I" "I" "N" "Y" "L"
[449] "R" "K" "I" "K" "F" "K" "N" "Q" "I" "E" "I" "D" "N" "K" "N" "L"
[465] "E" "N" "C" "L" "N" "E" "L" "I" "S" "F" "F" "I" "S" "N" "S" "I"
[481] "L"

$`lcl|AP009180.1_prot_BAF35039.1_8`
  [1] "M" "I" "I" "K" "E" "I" "N" "S" "K" "I" "K" "I" "T" "T" "N" "I"
 [17] "N" "K" "L" "T" "N" "T" "L" "S" "M" "I" "S" "L" "S" "K" "M" "N"
 [33] "K" "Y" "I" "N" "L" "I" "N" "N" "L" "D" "Y" "I" "N" "I" "E" "L"
 [49] "K" "K" "I" "L" "E" "Y" "I" "I" "I" "N" "I" "K" "S" "N" "V" "F"
 [65] "C" "L" "I" "I" "I" "T" "S" "N" "K" "G" "L" "C" "G" "N" "L" "N"
 [81] "N" "E" "I" "I" "K" "Y" "S" "L" "N" "Y" "I" "K" "N" "N" "K" "N"
 [97] "L" "D" "L" "I" "L" "I" "G" "K" "K" "G" "I" "D" "F" "F" "N" "K"
[113] "K" "N" "F" "Y" "I" "K" "E" "K" "I" "I" "F" "K" "D" "N" "E" "L"
[129] "K" "N" "L" "V" "F" "N" "N" "K" "I" "L" "N" "D" "L" "K" "K" "Y"
[145] "E" "N" "I" "F" "F" "I" "S" "S" "K" "I" "I" "K" "N" "N" "V" "K"
[161] "I" "I" "K" "T" "D" "L" "Y" "L" "K" "K" "K" "Y" "N" "Y" "L" "I"
[177] "K" "H" "N" "F" "N" "Y" "D" "C" "F" "L" "K" "N" "F" "Y" "N" "Y"
[193] "N" "L" "K" "C" "L" "Y" "L" "N" "N" "L" "F" "C" "E" "L" "K" "S"
[209] "R" "M" "I" "T" "M" "K" "S" "A" "A" "D" "N" "S" "K" "K" "I" "I"
[225] "K" "D" "M" "K" "L" "I" "K" "N" "K" "I" "R" "Q" "F" "K" "V" "T"
[241] "Q" "D" "M" "L" "E" "I" "I" "N" "G" "S" "N" "L"

$`lcl|AP009180.1_prot_BAF35040.1_9`
  [1] "M" "I" "G" "R" "I" "V" "Q" "I" "L" "G" "S" "I" "V" "D" "V" "E"
 [17] "F" "K" "K" "N" "N" "I" "P" "Y" "I" "Y" "N" "A" "L" "F" "I" "K"
 [33] "E" "F" "N" "L" "Y" "L" "E" "V" "Q" "Q" "Q" "I" "G" "N" "N" "I"
 [49] "V" "R" "T" "I" "A" "L" "G" "S" "T" "Y" "G" "L" "K" "R" "Y" "L"
 [65] "L" "V" "I" "D" "T" "K" "K" "P" "I" "L" "T" "P" "V" "G" "N" "C"
 [81] "T" "L" "G" "R" "I" "L" "N" "V" "L" "G" "N" "P" "I" "D" "N" "N"
 [97] "G" "E" "I" "I" "S" "N" "K" "K" "K" "P" "I" "H" "C" "S" "P" "P"
[113] "K" "F" "S" "D" "Q" "V" "F" "S" "N" "N" "I" "L" "E" "T" "G" "I"
[129] "K" "V" "I" "D" "L" "L" "C" "P" "F" "L" "R" "G" "G" "K" "I" "G"
[145] "L" "F" "G" "G" "A" "G" "V" "G" "K" "T" "I" "N" "M" "M" "E" "L"
[161] "I" "R" "N" "I" "A" "I" "E" "H" "K" "G" "C" "S" "V" "F" "I" "G"
[177] "V" "G" "E" "R" "T" "R" "E" "G" "N" "D" "F" "Y" "Y" "E" "M" "K"
[193] "E" "S" "N" "V" "L" "D" "K" "V" "S" "L" "I" "Y" "G" "Q" "M" "N"
[209] "E" "P" "S" "G" "N" "R" "L" "R" "V" "A" "L" "T" "G" "L" "S" "I"
[225] "A" "E" "E" "F" "R" "E" "M" "G" "K" "D" "V" "L" "L" "F" "I" "D"
[241] "N" "I" "Y" "R" "F" "T" "L" "A" "G" "T" "E" "I" "S" "A" "L" "L"
[257] "G" "R" "M" "P" "S" "A" "V" "G" "Y" "Q" "P" "T" "L" "A" "E" "E"
[273] "M" "G" "K" "L" "Q" "E" "R" "I" "S" "S" "T" "K" "N" "G" "S" "I"
[289] "T" "S" "V" "Q" "A" "I" "Y" "V" "P" "A" "D" "D" "L" "T" "D" "P"
[305] "S" "P" "S" "T" "T" "F" "T" "H" "L" "D" "S" "T" "I" "V" "L" "S"
[321] "R" "Q" "I" "A" "E" "L" "G" "I" "Y" "P" "A" "I" "D" "P" "L" "E"
[337] "S" "Y" "S" "K" "Q" "L" "D" "P" "Y" "I" "V" "G" "I" "E" "H" "Y"
[353] "E" "I" "A" "N" "S" "V" "K" "F" "Y" "L" "Q" "K" "Y" "K" "E" "L"
[369] "K" "D" "T" "I" "A" "I" "L" "G" "M" "D" "E" "L" "S" "E" "N" "D"
[385] "Q" "I" "I" "V" "K" "R" "A" "R" "K" "L" "Q" "R" "F" "F" "S" "Q"
[401] "P" "F" "F" "V" "G" "E" "I" "F" "T" "G" "I" "K" "G" "E" "Y" "V"
[417] "N" "I" "K" "D" "T" "I" "Q" "C" "F" "K" "N" "I" "L" "N" "G" "E"
[433] "F" "D" "N" "I" "N" "E" "K" "N" "F" "Y" "M" "I" "G" "K" "I"

$`lcl|AP009180.1_prot_BAF35041.1_10`
 [1] "M" "N" "L" "L" "I" "L" "S" "I" "K" "N" "I" "I" "E" "Y" "K" "N"
[17] "A" "S" "I" "L" "N" "V" "K" "T" "Y" "L" "K" "L" "F" "S" "I" "M"
[33] "N" "N" "H" "I" "N" "N" "I" "C" "D" "V" "N" "Q" "I" "K" "L" "I"
[49] "F" "K" "N" "K" "I" "I" "N" "I" "R" "I" "N" "N" "G" "F" "L" "F"
[65] "Q" "K" "K" "N" "N" "T" "K" "I" "I" "C" "N" "F" "Y" "E" "F" "L"

Output of read.fasta()

A list of character vectors. Each element of the list is a sequence object.

The first sequence is

proteins[[1]]
  [1] "M" "N" "T" "I" "F" "S" "R" "I" "T" "P" "L" "G" "N" "G" "T" "L"
 [17] "C" "V" "I" "R" "I" "S" "G" "K" "N" "V" "K" "F" "L" "I" "Q" "K"
 [33] "I" "V" "K" "K" "N" "I" "K" "E" "K" "I" "A" "T" "F" "S" "K" "L"
 [49] "F" "L" "D" "K" "E" "C" "V" "D" "Y" "A" "M" "I" "I" "F" "F" "K"
 [65] "K" "P" "N" "T" "F" "T" "G" "E" "D" "I" "I" "E" "F" "H" "I" "H"
 [81] "N" "N" "E" "T" "I" "V" "K" "K" "I" "I" "N" "Y" "L" "L" "L" "N"
 [97] "K" "A" "R" "F" "A" "K" "A" "G" "E" "F" "L" "E" "R" "R" "Y" "L"
[113] "N" "G" "K" "I" "S" "L" "I" "E" "C" "E" "L" "I" "N" "N" "K" "I"
[129] "L" "Y" "D" "N" "E" "N" "M" "F" "Q" "L" "T" "K" "N" "S" "E" "K"
[145] "K" "I" "F" "L" "C" "I" "I" "K" "N" "L" "K" "F" "K" "I" "N" "S"
[161] "L" "I" "I" "C" "I" "E" "I" "A" "N" "F" "N" "F" "S" "F" "F" "F"
[177] "F" "N" "D" "F" "L" "F" "I" "K" "Y" "T" "F" "K" "K" "L" "L" "K"
[193] "L" "L" "K" "I" "L" "I" "D" "K" "I" "T" "V" "I" "N" "Y" "L" "K"
[209] "K" "N" "F" "T" "I" "M" "I" "L" "G" "R" "R" "N" "V" "G" "K" "S"
[225] "T" "L" "F" "N" "K" "I" "C" "A" "Q" "Y" "D" "S" "I" "V" "T" "N"
[241] "I" "P" "G" "T" "T" "K" "N" "I" "I" "S" "K" "K" "I" "K" "I" "L"
[257] "S" "K" "K" "I" "K" "M" "M" "D" "T" "A" "G" "L" "K" "I" "R" "T"
[273] "K" "N" "L" "I" "E" "K" "I" "G" "I" "I" "K" "N" "I" "N" "K" "I"
[289] "Y" "Q" "G" "N" "L" "I" "L" "Y" "M" "I" "D" "K" "F" "N" "I" "K"
[305] "N" "I" "F" "F" "N" "I" "P" "I" "D" "F" "I" "D" "K" "I" "K" "L"
[321] "N" "E" "L" "I" "I" "L" "V" "N" "K" "S" "D" "I" "L" "G" "K" "E"
[337] "E" "G" "V" "F" "K" "I" "K" "N" "I" "L" "I" "I" "L" "I" "S" "S"
[353] "K" "N" "G" "T" "F" "I" "K" "N" "L" "K" "C" "F" "I" "N" "K" "I"
[369] "V" "D" "N" "K" "D" "F" "S" "K" "N" "N" "Y" "S" "D" "V" "K" "I"
[385] "L" "F" "N" "K" "F" "S" "F" "F" "Y" "K" "E" "F" "S" "C" "N" "Y"
[401] "D" "L" "V" "L" "S" "K" "L" "I" "D" "F" "Q" "K" "N" "I" "F" "K"
[417] "L" "T" "G" "N" "F" "T" "N" "K" "K" "I" "I" "N" "S" "C" "F" "R"
[433] "N" "F" "C" "I" "G" "K"
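
Because the result is an ordinary list, the usual list tools apply to it. The sketch below uses a small hand-made list, toy, as a stand-in for proteins. The names and sequences in toy are invented for illustration, and a real result from read.fasta() also carries extra attributes. It shows how to count the sequences, get their names, and measure their lengths.

```r
# A hand-made stand-in for the output of read.fasta():
# a named list of character vectors (names and sequences are invented).
toy <- list(prot_1 = c("M", "N", "T", "I", "F", "S"),
            prot_2 = c("M", "N", "I", "F"),
            prot_3 = c("M", "V", "I", "L", "K"))

n_seqs   <- length(toy)          # how many sequences are in the list
seq_ids  <- names(toy)           # the identifier of each sequence
seq_lens <- sapply(toy, length)  # the number of letters in each sequence

# One element can be taken by position or by name
first_seq  <- toy[[1]]
second_seq <- toy[["prot_2"]]

print(n_seqs)
print(seq_lens)
```

The same three calls should work on proteins itself, where the names are the long identifiers shown after each $ sign in the output.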

Not all data is a data frame

Beyond vectors and data frames

Last semester we used two data structures

  • vectors
  • data frames, or tibbles

Now we introduce a new data type: lists

Lists

Lists are like vectors, but their elements can be of different kinds

people <- list(c(60, 72, 57, 90, 95, 72),
               c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91),
               c("Ali", "Deniz", "Fatma", "Emre",
                 "Volkan", "Onur"),
               TRUE, c(2017, 10, 10),
               factor(c("M","F","F","M","M","M")))

Notice that the elements can have different lengths

Result

people
[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91

[[3]]
[1] "Ali"    "Deniz"  "Fatma"  "Emre"   "Volkan" "Onur"  

[[4]]
[1] TRUE

[[5]]
[1] 2017   10   10

[[6]]
[1] M F F M M M
Levels: F M
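
A quick way to check what a list holds is to ask for the class and the length of each element. This sketch rebuilds the same people list so that it runs on its own. The names kinds and sizes are chosen here only for illustration.

```r
people <- list(c(60, 72, 57, 90, 95, 72),
               c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91),
               c("Ali", "Deniz", "Fatma", "Emre",
                 "Volkan", "Onur"),
               TRUE, c(2017, 10, 10),
               factor(c("M","F","F","M","M","M")))

kinds <- sapply(people, class)  # the class of each element
sizes <- lengths(people)        # the length of each element

print(kinds)
print(sizes)
```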

Visualization

Each list element starts with a number in double brackets

Each element can be a vector, another list, or any other object

When the element is a vector, we see a second number, in single brackets

[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91

Indexing Lists

  • Lists can be indexed in the same way as vectors
  • The result is a sublist

people[1:2]
[[1]]
[1] 60 72 57 90 95 72

[[2]]
[1] 1.75 1.80 1.65 1.90 1.74 1.91
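
The other styles of vector indexing also work on lists, and each of them returns a sublist. A small sketch, using a shorter list called info that is invented here for illustration:

```r
info <- list(c(60, 72, 57), c("Ali", "Deniz", "Fatma"), TRUE)

by_position <- info[c(1, 3)]                # elements 1 and 3
without_2   <- info[-2]                     # everything except element 2
by_logical  <- info[c(TRUE, FALSE, FALSE)]  # only the first element

# In every case the result is again a list
print(length(by_position))
print(class(by_logical))
```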

Elements versus sublists

This is a sublist (with one element):

people[1]
[[1]]
[1] 60 72 57 90 95 72

This is an element:

people[[1]]
[1] 60 72 57 90 95 72
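
The difference matters when the result is passed to a function. Functions such as mean() expect a vector, so they need the element taken with double brackets. A minimal sketch, using a two-element list called measures (a name chosen here for illustration) so that it runs on its own:

```r
measures <- list(c(60, 72, 57, 90, 95, 72),
                 c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91))

one_sublist <- measures[1]    # a list with one element
one_element <- measures[[1]]  # the numeric vector itself

print(class(one_sublist))     # a list
print(class(one_element))     # a numeric vector

avg <- mean(one_element)      # works, because one_element is numeric
print(avg)
```

The same rule applies to the sequences: proteins[[1]] is a character vector that can be handed to other functions, while proteins[1] is still a list.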

What have we learned?

  • Some organisms have been sequenced
  • We can read the sequences from FASTA files
  • read.fasta() returns a list of character vectors