Things to do
- Read the class slides
- Read the relevant papers
- Watch the videos of our classes and from NCBI
- Understand the methodology of this course
- Check the prerequisites and the syllabus
- Register in the Forum
This course teaches how to interpret and understand the results of bioinformatic analyses. Most molecular biologists will work with (or hire) bioinformatics teams, so even if they do not use the tools themselves, all molecular biologists need to understand what the results mean. It is important to speak the same language and to be aware of the key aspects that can lead to an experiment’s success or failure.
Classes
Here you will find the slides from the classes and other supplementary material. Note that some things are said in class but not written on the slides, so you should take good notes. We recommend taking notes with pen and paper using the Cornell Method.
- Class 1: Why do we care about Bioinformatics? (Oct 23, 2020). [Video],[Slides].
  What is and what is not Bioinformatics. What will we do here.
- Class 2: Finding data online. (Oct 23, 2020). [Video],[Slides].
  What is and what is not Bioinformatics. Finding data online.
- Class 3: NCBI Entrez. (Oct 30, 2020). [Video],[Slides].
  Looking inside the NCBI website. Complex queries.
  See also:
  - For the complete story, read the Entrez Help.
- Class 4: NCBI Taxonomy. (Oct 30, 2020). [Video],[Slides].
  Looking inside the NCBI website. Complex queries.
- Class 5: Searching and comparing sequences. (Nov 6, 2020). [Video],[Slides].
  Distance between sequences. Hamming and Levenshtein. Global and semi-global alignment.
  See also:
  - During the class we created this Google Sheet.
  - Be aware that Excel (or Google Sheets) is not recommended for handling data: UK Government loses data because of Excel mistake.
  - You may also want to see How to read text into Excel when “comma” is your decimal separator.
- Class 6: Scoring Global and Local Alignments. (Nov 13, 2020). [Video],[Slides].
  Looking for local matches is different from looking for global ones. We need to use scores. They make more biological sense.
  See also:
  - Needleman, Saul B., and Christian D. Wunsch. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.” Journal of Molecular Biology 48, no. 3 (1970): 443–53.
  - Smith, T. F., and M. S. Waterman. “Identification of Common Molecular Subsequences.” Journal of Molecular Biology 147, no. 1 (1981): 195–97. https://doi.org/10.1016/0022-2836(81)90087-5.
  - Dayhoff, Mo, and Rm Schwartz. “A Model of Evolutionary Change in Proteins.” In Atlas of Protein Sequence and Structure. Washington, DC: National Biomedical Research Foundation, 1978. https://doi.org/10.1.1.145.4315.
  - Henikoff, S, and J G Henikoff. “Amino Acid Substitution Matrices from Protein Blocks.” Proc Natl Acad Sci 89 (1992). https://doi.org/10.1073/pnas.89.22.10915.
- Class 7: BLAST. (Nov 13, 2020). [Video],[Slides].
  It is not like Google. Results depend on the options you choose. What are the options?
  See also:
  - Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. “Basic Local Alignment Search Tool.” Journal of Molecular Biology 215, no. 3 (October 5, 1990): 403–10. https://doi.org/10.1016/S0022-2836(05)80360-2.
  - NCBI BLAST topics.
  - NCBI BLAST documentation.
  - Five Teaching Examples Using BLAST. (29:37)
  - Using BLAST Well. (43:53)
  - BLAST Results: Expect Values, Part 1. (02:30)
  - BLAST Results: Expect Values, Part 2. (03:39)
  - Introducing a New Web BLAST Results Page. (03:13)
  - Getting the Most out of Web BLAST Tabular Format. (27:06)
  - A Practical Guide to NCBI BLAST. (1:22:09)
  - Using BLAST for Genomic Analysis.
- Class 8: Cost, Indices, Heuristics. (Nov 20, 2020). [Video],[Slides].
  Solving the easy question, not the correct one.
  See also:
  - Practice Using BLAST. (1:09:20)
- Class 9: Multiple Sequence Alignment. (Nov 27, 2020). [Video],[Slides].
  What is conserved among several sequences? What are the polymorphisms? How to find patterns without aligning.
- Class 10: Phylogenetic Trees. (Dec 4, 2020). [Video],[Slides].
  Building a time machine, and failing.
- Class 11: Finding Binding Sites. (Dec 25, 2020). [Video],[Slides].
  How to find patterns without aligning.
- Class 12: Assign Taxonomy without Aligning. (Dec 25, 2020). [Video],[Slides].
  Bioidentification of metagenomic samples without aligning.
- Class 13: DNA Sequencing and Genome Assembly. (Jan 8, 2021). [Video],[Slides].
  How can we know the genome of an organism?
  See also:
  - “Introduction to Bioinformatics”, Lecture by Yuzhen Ye, Indiana University Bloomington.
  - Network Algorithms for Molecular Biology lesson on “Introduction to (de novo) assembly”, by Blerina Sinaimeri, Université Lyon I.
  - “Foundations of Computational Systems Biology”, Lecture by David K. Gifford, MIT.
  - Chou, H.-H., and M. H. Holmes. “DNA Sequence Quality Trimming and Vector Removal.” Bioinformatics 17, no. 12 (2001): 1093–1104.
- Class 14: NGS Genome Assembly & Statistics. (Jan 15, 2021). [Video],[Slides].
  How can we know the genome of an organism?
- Class 15: Mapping reads to a reference. (Jan 15, 2021). [Video],[Slides].
  How can we know the genome of an organism?
- Class 16: Gene expression analysis. (Jan 22, 2021). [Video],[Slides].
  qPCR, microarrays, RNAseq, and why we need to learn statistics.
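As a small illustration of the "distance between sequences" topic from Class 5, here is a minimal Python sketch of the Hamming and Levenshtein distances. It is not taken from the class slides; the function names `hamming` and `levenshtein` and the example sequences are our own choices for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # previous[j] holds the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x from a
                current[j - 1] + 1,          # insert y into a
                previous[j - 1] + (x != y),  # match (cost 0) or substitute (cost 1)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(hamming("GATTACA", "GACTATA"))
    print(levenshtein("ACGT", "AGT"))
```

Hamming distance only compares sequences of the same length position by position, while Levenshtein distance also allows gaps, which is why it is the natural starting point for alignment.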
Bibliography
These are some of the papers we want to read and understand during this semester. The most important ones are marked in bold face. Start by reading those.
Alignment
Needleman, Saul B., and Christian D. Wunsch. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.” Journal of Molecular Biology 48, no. 3 (1970): 443–53.
Smith, T. F., and M. S. Waterman. “Identification of Common Molecular Subsequences.” Journal of Molecular Biology 147, no. 1 (1981): 195–97. https://doi.org/10.1016/0022-2836(81)90087-5.
Karlin, S, and S F Altschul. “Methods for Assessing the Statistical Significance of Molecular Sequence Features by Using General Scoring Schemes.” Proceedings of the National Academy of Sciences of the United States of America 87, no. 6 (1990): 2264–68. https://doi.org/10.1073/pnas.87.6.2264.
Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. “Basic Local Alignment Search Tool.” Journal of Molecular Biology 215, no. 3 (October 5, 1990): 403–10. https://doi.org/10.1016/S0022-2836(05)80360-2.
Karlin, S, and S F Altschul. “Applications and Statistics for Multiple High-Scoring Segments in Molecular Sequences.” Proceedings of the National Academy of Sciences of the United States of America 90, no. 12 (June 15, 1993): 5873–77.
Altschul, S. F., and W. Gish. “Local Alignment Statistics.” Methods in Enzymology 266, no. January (1996): 460–80. https://doi.org/10.1016/S0076-6879(96)66029-7.
Altschul, S F, T L Madden, A A Schäffer, J Zhang, Z Zhang, W Miller, and D J Lipman. “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs.” Nucleic Acids Research 25, no. 17 (September 1, 1997): 3389–3402.
The Statistics of Local Pairwise Sequence Alignment, Part 1 YouTube video.
The Statistics of Local Pairwise Sequence Alignment, Part 2 YouTube video.
Protein Alignment
Dayhoff, Mo, and Rm Schwartz. “A Model of Evolutionary Change in Proteins.” In Atlas of Protein Sequence and Structure. Washington, DC: National Biomedical Research Foundation, 1978. https://doi.org/10.1.1.145.4315.
Altschul, S F. “Amino Acid Substitution Matrices from an Information Theoretic Perspective.” J Mol Biol 219 (1991). https://doi.org/10.1016/0022-2836(91)90193-A.
Henikoff, S, and J G Henikoff. “Amino Acid Substitution Matrices from Protein Blocks.” Proc Natl Acad Sci 89 (1992). https://doi.org/10.1073/pnas.89.22.10915.
Henikoff, S, and J G Henikoff. “Performance Evaluation of Amino Acid Substitution Matrices.” Proteins 17 (1993): 49–61.
Zhang, Z, A A Schäffer, W Miller, T L Madden, D J Lipman, E V Koonin, and S F Altschul. “Protein Sequence Similarity Searches Using Patterns as Seeds.” Nucleic Acids Research 26, no. 17 (September 1, 1998): 3986–90.
R package Biostrings (part of Bioconductor), containing PAM and BLOSUM matrices.
Sequencing
Chou, H.-H., and M. H. Holmes. “DNA Sequence Quality Trimming and Vector Removal.” Bioinformatics 17, no. 12 (2001): 1093–1104. https://doi.org/10.1093/bioinformatics/17.12.1093.
Assembly
Staden, R. “A Strategy of DNA Sequencing Employing Computer Programs.” Nucleic Acids Research 6, no. 7 (1979): 2601–10. https://doi.org/10.1093/nar/6.7.2601.
Lander, E S, and M S Waterman. “Genomic Mapping by Fingerprinting Random Clones: A Mathematical Analysis.” Genomics 2, no. 3 (April 1, 1988): 231–39. https://doi.org/10.1016/0888-7543(88)90007-9.
Pevzner, P A, H Tang, and M S Waterman. “An Eulerian Path Approach to DNA Fragment Assembly.” Proceedings of the National Academy of Sciences of the United States of America 98, no. 17 (August 14, 2001): 9748–53.
Chaisson, M, D Brinza, and P Pevzner. “De Novo Fragment Assembly with Short Mate-Paired Reads: Does the Read Length Matter?” Genome Research, December 3, 2008, 25.
Sims, David, Ian Sudbery, Nicholas E. Ilott, Andreas Heger, and Chris P. Ponting. “Sequencing Depth and Coverage: Key Considerations in Genomic Analyses.” Nature Reviews Genetics 15, no. 2 (2014): 121–32. https://doi.org/10.1038/nrg3642.
Bankevich, Anton, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, et al. “SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing.” Journal of Computational Biology 19, no. 5 (2012): 455–77. https://doi.org/10.1089/cmb.2012.0021.
Li, Zhenyu, Yanxiang Chen, Desheng Mu, Jianying Yuan, Yujian Shi, Hao Zhang, Jun Gan, et al. “Comparison of the Two Major Classes of Assembly Algorithms: Overlap-Layout-Consensus and de-Bruijn-Graph.” Briefings in Functional Genomics 11, no. 1 (2012): 25–37. https://doi.org/10.1093/bfgp/elr035.
Nagarajan, Niranjan, and Mihai Pop. “Sequence Assembly Demystified.” Nature Reviews. Genetics 14, no. 3 (2013): 157–67. https://doi.org/10.1038/nrg3367.
Wick, Ryan R., Mark B. Schultz, Justin Zobel, and Kathryn E. Holt. “Bandage: Interactive Visualization of de Novo Genome Assemblies.” Bioinformatics 31, no. 20 (2015): 3350–52. https://doi.org/10.1093/bioinformatics/btv383.
Phillippy, Adam M. “New Advances in Sequence Assembly.” Genome Research 27, no. 5 (May 1, 2017): xi–xiii. https://doi.org/10.1101/gr.223057.117.
Metagenomics
Dina Fine Maron. “Dirty Money.” Scientific American, 2017. https://www.scientificamerican.com/article/dirty-money/.
Jeff Leach. “Going Feral: My One-Year Journey to Acquire the Healthiest Gut Microbiome in the World,” January 2014. http://humanfoodproject.com/going-feral-one-year-journey-acquire-healthiest-gut-microbiome-world-heard/.
Tyson, Gene W, Jarrod Chapman, Philip Hugenholtz, Eric E Allen, Rachna J Ram, Paul M Richardson, Victor V Solovyev, Edward M Rubin, Daniel S Rokhsar, and Jillian F Banfield. “Community Structure and Metabolism through Reconstruction of Microbial Genomes from the Environment.” Nature 428, no. 6978 (2004): 37–43. https://doi.org/10.1038/nature02340.
Qin, Junjie, Ruiqiang Li, Jeroen Raes, Manimozhiyan Arumugam, Kristoffer Solvsten Burgdorf, Chaysavanh Manichanh, Trine Nielsen, et al. “A Human Gut Microbial Gene Catalogue Established by Metagenomic Sequencing.” Nature 464, no. 7285 (March 4, 2010): 59–65. https://doi.org/10.1038/nature08821.
Ünal, Burcu. “Phylogenetic Analysis of Bacterial Communities in Kefir by Metagenomics.” Izmir Institute of Technology, Turkey, 2008.
Ünal, Burcu, and Alper Arslanoğlu. “Phylogenetic Identification of Bacteria within Kefir by Both Culture-Dependent and Culture-Independent Methods.” African Journal of Microbiology Research 7, no. 36 (2013): 4533–38. https://doi.org/10.5897/AJMR2013.6064.
Handelsman, Jo. “Metagenomics: Application of Genomics to Uncultured Microorganisms.” Microbiology and Molecular Biology Reviews 68, no. 4 (2004): 669–85. https://doi.org/10.1128/MMBR.68.4.669-685.2004.
Baker, Brett J., and Jillian F. Banfield. “Microbial Communities in Acid Mine Drainage.” FEMS Microbiology Ecology 44, no. 2 (2003): 139–52. https://doi.org/10.1016/S0168-6496(03)00028-X.
Wooley, John C., and Yuzhen Ye. “Metagenomics: Facts and Artifacts, and Computational Challenges.” Journal of Computer Science and Technology 25, no. 1 (2009): 71–81. https://doi.org/10.1007/s11390-010-9306-4.
Sharpton, Thomas J. “An Introduction to the Analysis of Shotgun Metagenomic Data.” Frontiers in Plant Science 5 (June 16, 2014): 209. https://doi.org/10.3389/fpls.2014.00209.
Hunter, Chris I, Alex Mitchell, Philip Jones, Craig McAnulla, Sebastien Pesseat, Maxim Scheremetjew, and Sarah Hunter. “Metagenomic Analysis: The Challenge of the Data Bonanza.” Briefings in Bioinformatics 13, no. 6 (November 1, 2012): 743–46. https://doi.org/10.1093/bib/bbs020.
Teeling, Hanno, and Frank Oliver Glöckner. “Current Opportunities and Challenges in Microbial Metagenome Analysis–a Bioinformatic Perspective.” Briefings in Bioinformatics 13, no. 6 (December 1, 2012): 728–42. https://doi.org/10.1093/bib/bbs039.
Mande, Sharmila S, Monzoorul Haque Mohammed, and Tarini Shankar Ghosh. “Classification of Metagenomic Sequences: Methods and Challenges.” Briefings in Bioinformatics 13, no. 6 (November 1, 2012): 669–81. https://doi.org/10.1093/bib/bbs054.
Others
Yates, Andrew, Kathryn Beal, Stephen Keenan, William McLaren, Miguel Pignatelli, Graham R.S. Ritchie, Magali Ruffier, Kieron Taylor, Alessandro Vullo, and Paul Flicek. “The Ensembl REST API: Ensembl Data for Any Language.” Bioinformatics 31, no. 1 (2015): 143–45. https://doi.org/10.1093/bioinformatics/btu613.
Zerbino, Daniel R., Premanand Achuthan, Wasiu Akanni, M. Ridwan Amode, Daniel Barrell, Jyothish Bhai, Konstantinos Billis, et al. “Ensembl 2018.” Nucleic Acids Research 46, no. D1 (2018): D754–61. https://doi.org/10.1093/nar/gkx1098.
Web references
NCBI Videos
These videos are complementary to our classes. They cover the same topics in more detail. Please watch them to better understand this course.
Sequences
- NCBI Minute: A Beginner’s Guide to Genes and Sequences at NCBI (33:44)
- NCBI Minute: How to Quickly Retrieve Sequences from NCBI (23:38)
- NCBI: Download a custom set of records (03:11)
- NCBI: Retrieve Sequences for an Organism (01:36)
- Obtain Genomic Sequence for a gene (02:47)
- Webinar: Accessing 1000 Genomes Data at NCBI (32:15)
- NCBI Minute: Important Changes Coming to the Sequence Databases - GI Numbers (24:26)
Visualization
Literature
- Webinar: Pubmed for Scientists (45:19)
- NCBI Minute: Tailor Your PubMed Search Experience with My NCBI (07:47)
- NCBI Minute: Keeping Current and Getting Help with NCBI Resources (14:22)
- NCBI Minute: On the NCBI Bookshelf, Textbooks for Free! (19:42)
- NCBI Minute: An Updated PubMed is on its Way! (25:30)
- Need the Full Text Article? (02:03)
- The NCBI Minute: PubMed Commons (12:06)
- NCBI Minute: Finding Genes in PubMed (11:50)
- The NCBI Minute: How You and Your Journal Club Can Contribute Using PubMed Commons (12:48)
- PubMed: Using the Advanced Search Builder (03:12)
Searching
- NCBI Minute: Finding Gene, Protein and Chemical Names, Aliases and Synonyms (15:17)
- NCBI Minute: How to Locate and Use Human Genomes and Annotations from the NCBI (09:08)
- Find in This Sequence (02:17)
- Save Search Results in Collections, including Favorites (02:57)
- NCBI Minute: Setting Up Alerts for New Data in My NCBI (07:46)
- NCBI Minute: Automate PubMed Searches & Save Citation Collections with My NCBI (12:55)
- My NCBI (02:30)
- PubMed Advanced Search Builder (02:27)
- PubMed: The Filters Sidebar (02:02)
- Use MeSH to Build a Better PubMed Query (03:03)
- E-Utilities Introduction (03:46)
BLAST
Methodology
We have a lot of content to learn, and only 3 hours of contact every week. Fortunately, this semester we will teach online, following the flipped-classroom methodology. Each week the learning process will have five parts:
- A video or slides with the essential content. Students must watch them before the scheduled class time.
- Extra material from online sources. We will not reinvent the wheel, so we will use papers and videos publicly available as part of the content.
- Homework that each student should deliver on time. There will be enough time to work, but no extensions. Students are encouraged to work in groups and discuss with their peers, but each submission must be individual. Copy-and-paste is not allowed.
- Online conversations on the forum. All students are encouraged to ask questions and answer them.
- Presentations during scheduled class time. After all the personal work, students will present their results and ask any remaining questions not answered in the forum.
Prerequisites
This course does not require knowledge of coding or programming, but knowing how to write a program will always be a strong advantage, both in this course and in professional life. You will need:
- A computer with internet access, for attending the classes and doing the homework. A camera and a microphone are required to participate in class. You can use a smartphone.
- To know how to handle files and folders on the computer: copying and moving files, and understanding the folder structure.
- To know the difference between text and binary files, and between text editors and word processors.
- To install a text editor (not a word processor). There are many, and you can use your favorite one. We recommend either Visual Studio Code or Atom.
Syllabus
We partially follow the plan proposed by Sayres et al. (2018).
Sayres, et al. “Bioinformatics Core Competencies for Undergraduate Life Sciences Education.” PLoS ONE 13, no. 6 (2018): 1–20. https://doi.org/10.1371/journal.pone.0196878.
At the end of the course, students should be able to:
- Understand the role of computation and data mining in hypothesis-driven processes within the life sciences
- Understand computational concepts used in bioinformatics
- Know the basic file types used in bioinformatics (FASTA, GBK, GFF, BLAST, FASTQ, SAM)
- Understand tree structures that are used to understand biological entities: phylogeny, taxonomy, ontology. Understand the difference between taxonomy and phylogeny
- Know how to access genomic data on the web
  - Access NCBI nucleotide, protein, GEO, SRA databases, Entrez query system, EBI databases.
- Know how to handle the basic file types used in bioinformatics
  - How to read them, how to understand them, how to transform one into another.
- Know how to visualize DNA sequences, partial genome assembly results, and protein domains
- Understand the results given by a bioinformatic tool
- Know the different types of pairwise alignments (global, semi-global, local) and when to use each one
- Know the biological hypotheses behind the alignment scores
- Understand the challenges of multiple alignment and how to use multiple alignments to find SNPs. Know how to build phylogenetic trees.
- Understand how database search works:
  - Understand the difference between algorithms and heuristics, and the role of indices
- Assigning putative functions to coding genes, using COG and Gene Ontology
- Assigning putative taxonomic identity, using alignment and alignment-free methods
- Understand the main DNA-assembly methodologies: Overlap-layout-consensus and De Bruijn graphs.
- Know how to design PCR primers and understand how to calculate the DNA melting temperature
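To make the last item concrete, here is a minimal Python sketch of one common rule of thumb for the melting temperature of short primers, the Wallace rule: Tm ≈ 2 °C × (A + T) + 4 °C × (G + C). This is an assumption for illustration only; the course may use a different method (for example, a nearest-neighbor model), and the rule is only a rough estimate for short oligonucleotides. The function name `wallace_tm` and the example primer are hypothetical.

```python
def wallace_tm(primer: str) -> int:
    """Estimate the melting temperature (in °C) of a short primer with the Wallace rule."""
    seq = primer.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("primer must be a non-empty string of A, C, G and T")
    at = seq.count("A") + seq.count("T")  # weakly bound bases, 2 °C each
    gc = seq.count("G") + seq.count("C")  # strongly bound bases, 4 °C each
    return 2 * at + 4 * gc


if __name__ == "__main__":
    # Hypothetical 16-nt primer with 8 A/T and 8 G/C bases.
    print(wallace_tm("ACGTACGTACGTACGT"))
```

The rule captures the idea that G-C pairs contribute more to duplex stability than A-T pairs, so primers with higher GC content melt at higher temperatures.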