My name is Andrés Aravena

I don’t speak Turkish (Türkçe bilmiyorum) 😟

I am

  • New Assistant Professor at the Molecular Biology and Genomics Department
  • Mathematical Engineer, U. of Chile
  • PhD Informatics, U Rennes 1, France
  • PhD Mathematical Modeling, U. of Chile
  • not a Biologist
  • but an Applied Mathematician who can speak “biologist language”

I will speak about

  • What I’ve done before so you can understand why I’m here
  • What I’m doing now at Istanbul University
  • What I foresee from my “outsider” point of view

The Past, Present and Future

Facts, opinion and guess

I’ve worked on

  • Big and small computers
  • Telecommunication Networks
  • Between 2003 and 2014 I was the chief research engineer
    • of the main bioinformatics group
    • in the top research center (CMM)
    • in the top university (University of Chile)
    • of my country

I come from Chile

Chile

Small country of ~17 million people

University rankings are similar to the Turkish ones

Spanish colony 500 years ago (so language is Spanish)

Independent Republic 200 years ago

First Latin American country to recognize the Turkish Republic

OECD member

Everyday life is very similar to Turkey’s

Chileans like Turkish soap operas

The most successful soap opera last year was Bin Bir Gece

Chilean Economy: Exports

World’s largest producer of copper

World’s second-largest producer of salmon

Fruits: peaches, grapes, apples, avocado

Wine: exported worldwide

Official data for 2014

How can we improve these industries

using Molecular Biology and Bioinformatics?

The natural question was

Fruits

Peach and Grapes

Gene expression analysis for industrial applications:

  • Peach: response to cold stress
  • Grape: development related to seed and grape size (Sultaniye)

Fruits

Peach

Fruits

Grapes

For wine and to eat as dessert, like Sultaniye

We want big grapes and small seeds

but grape size depends on hormones produced by the seed

Which genes are involved in seed and grape growth?

Strategy:

  • Gene expression analysis using microarrays,
  • Whole genome sequencing.

Fishes

Salmon

Salmon

Farmed salmon are fed cheap vegetable protein, but wild salmon eat animal protein

How is the salmon’s metabolism affected by the diet? Which genes change their expression because of the change in food?

  • Gene expression analysis using microarrays
  • Fish selection for breeding using microarrays (patent pending)

Fishes

Salmon Genomic Sequence

… and sequencing of the whole Salmo salar genome

(a 10-million-dollar project)

Wine

Chilean wine travels long distances to final markets

Any yeast contamination means big economic losses (people stop buying all Chilean brands)

Quality control is usually done by growing samples for 3 days, but time is expensive: there are penalties for shipping delays

We designed a qPCR method for rapid detection of yeast contamination

It is currently used by one major wine producer in Chile. It may be sold to Roche.

Mining industry

molecular biology to extract copper

A little chemistry: in the ore, copper is part of a compound with sulfur and iron. An acidic solution of ferric iron (Fe³⁺) separates it.

Cu₂S + 4Fe³⁺ ⟶ 2Cu²⁺ + 4Fe²⁺ + S

The resulting Cu²⁺ is soluble and is recovered.

But all the Fe³⁺ is transformed into Fe²⁺ and the reaction stops

There are bacteria that “eat” electrons (e⁻) and keep the reaction going

Fe²⁺ ⟶ Fe³⁺ + e⁻
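As a quick check, the two reactions above can be verified to conserve both atoms and electric charge. This is only a bookkeeping sketch: the species table and the function names (`totals`, `is_balanced`) are illustrative assumptions, not part of any chemistry library.

```python
# Each species maps to (element counts, electric charge).
# The electron is treated as a species with no atoms and charge -1.
SPECIES = {
    "Cu2S": ({"Cu": 2, "S": 1}, 0),
    "Fe3+": ({"Fe": 1}, 3),
    "Fe2+": ({"Fe": 1}, 2),
    "Cu2+": ({"Cu": 1}, 2),
    "S": ({"S": 1}, 0),
    "e-": ({}, -1),
}


def totals(side):
    """Sum atoms and charge over one side, given {species: coefficient}."""
    atoms, charge = {}, 0
    for name, coeff in side.items():
        elements, q = SPECIES[name]
        for element, n in elements.items():
            atoms[element] = atoms.get(element, 0) + coeff * n
        charge += coeff * q
    return atoms, charge


def is_balanced(left, right):
    """A reaction is balanced when atoms and charge agree on both sides."""
    return totals(left) == totals(right)


# Cu2S + 4 Fe3+ -> 2 Cu2+ + 4 Fe2+ + S
leaching = ({"Cu2S": 1, "Fe3+": 4}, {"Cu2+": 2, "Fe2+": 4, "S": 1})
# Fe2+ -> Fe3+ + e-
regeneration = ({"Fe2+": 1}, {"Fe3+": 1, "e-": 1})

print(is_balanced(*leaching))
print(is_balanced(*regeneration))
```

Four ferric ions are needed because copper gives up two electrons in total and sulfur two more.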

Why is it important?

The biological method is much better than the standard one

  • Reduced contamination
  • Cheaper

The goal is to understand and improve the bacteria involved, so this technology can be used extensively

It enables building new mines

It is like discovering petrol reserves for the country

Most of the results are still industrial secrets

We had a research contract with the main mining company

State owned, big enough to pay for long term research

Few papers, many patents

Bioidentification

Monitoring the presence of good bacteria

We need to control the “ecosystem” on the mine

Molecular Biology methods are fast, sensitive and reliable

They can be used in situ: a metagenomic approach, with no culturing

Key problem: Design probes that match a taxonomic branch, not a specific strain

The probes should be tolerant to mutations that occur in environmental samples with many strains
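To make "tolerant to mutations" concrete, here is a toy sketch that accepts a probe if it matches every sequence of the target branch with at most a few mismatches and matches no outgroup sequence. It is not the method used in the project: real probe design also has to consider hybridization thermodynamics, and the sequences, function names and mismatch threshold below are all hypothetical.

```python
def mismatches(a, b):
    """Hamming distance between two strings of equal length."""
    return sum(1 for x, y in zip(a, b) if x != y)


def probe_hits(probe, sequence, max_mismatches=1):
    """Start positions where the probe matches with at most max_mismatches."""
    m = len(probe)
    return [
        i
        for i in range(len(sequence) - m + 1)
        if mismatches(probe, sequence[i:i + m]) <= max_mismatches
    ]


def covers(probe, sequences, max_mismatches=1):
    """True if the probe hits every sequence in the list."""
    return all(probe_hits(probe, s, max_mismatches) for s in sequences)


def is_specific(probe, branch, outgroup, max_mismatches=1):
    """Probe must hit the whole branch and nothing in the outgroup."""
    hits_outgroup = any(probe_hits(probe, s, max_mismatches) for s in outgroup)
    return covers(probe, branch, max_mismatches) and not hits_outgroup


# Toy data: three strains of one branch and one unrelated sequence.
branch = ["ACGTTGCAAT", "ACGTAGCAAT", "TTACGTTGCA"]
outgroup = ["GGGCCCAAAT"]
probe = "ACGTTGCA"

print(is_specific(probe, branch, outgroup, max_mismatches=1))
print(is_specific(probe, branch, outgroup, max_mismatches=0))
```

With zero mismatches allowed the probe misses the second strain, which is why an exact-match probe fits a single strain but not a whole branch.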

Classical tools don’t work at large scale

Design of probes for complex samples

I designed and built a solution using a super-computer

The calculation took one day on 32 processors (about one processor-month)

Resulting probes worked as expected

They can be used in qPCR or in microarrays.

Automatic Interpretation of Results

using a Statistical Classification Model

Publications

The microarray was published in N. Ehrenfeld, A. Aravena, A. Reyes-Jara, N. Barreto, R. Assar, A. Maass, P. Parada, Design and use of oligonucleotide microarrays for identification of Biomining microorganisms. Advanced Materials Research 71-73 (2009) 155-158.

Patents

The method and the probes have been patented in

  • USA, Number: US 7 853 408 B2, Date: 14/12/2010;
  • South Africa, Number: 2006/06828, Date: 26/03/2008;
  • Australia, Number: 2006203551, Date: 15/09/2011;
  • Mexico, Number: PXMX 32/2006, Date: November 2012;
  • Peru, Number: PE 5838, Date: 29/10/2010;
  • China, Number: 200810095172.6, Date: 2013;
  • Chile, Number: DPI-660-2007, Date: 06/05/2013;
  • Argentina, Number: AR056179.

Functional genomics

How do the bacteria work?

To improve the process we need to see inside the black box. We sequenced the complete genome of 3 bacteria

  • Acidithiobacillus ferrooxidans
  • Acidithiobacillus thiooxidans
  • Leptospirillum ferrooxidans

We paid over USD $150K. Today it would cost USD $5K

Hint: Sequence assembly requires a big computer. It does not work on a regular PC

What do we learn from the DNA sequence?

We used Hidden Markov Models and Pattern Matching techniques to determine the genes and their functions

We learned that

  • Acidithiobacillus thiooxidans had all the machinery to build flagella
  • Acidithiobacillus ferrooxidans has a region where none of the genes have orthologs
  • We identified transcription factors and enzymes
  • which was not known before
  • It covers 10% of the genome

Modeling Metabolism

We predict which genes code for enzymes

Each enzyme catalyzes a reaction, with a known stoichiometry

Every reaction gives an equation

All the equations plus boundary conditions give a model to predict metabolite concentrations

We can predict how the cell adapts to environmental changes
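A minimal sketch of "every reaction gives an equation": the reactions become columns of a stoichiometric matrix S, and the rate of change of each metabolite is the product of S with the vector of reaction rates. The three-reaction pathway and the name `rate_of_change` are hypothetical, and looking for steady states (S·v = 0) is a common modeling assumption, not something stated in this talk.

```python
# Toy pathway (hypothetical): uptake -> A, A -> B, B -> export.
# Rows are metabolites (A, B); columns are reactions (v1, v2, v3).
S = [
    [1, -1, 0],   # A is produced by v1 and consumed by v2
    [0, 1, -1],   # B is produced by v2 and consumed by v3
]


def rate_of_change(S, v):
    """Return d[metabolite]/dt for each row of S, given reaction rates v."""
    return [sum(coeff * rate for coeff, rate in zip(row, v)) for row in S]


balanced = rate_of_change(S, [2.0, 2.0, 2.0])     # all rates equal
bottleneck = rate_of_change(S, [2.0, 1.0, 1.0])   # second enzyme is slower

print(balanced)
print(bottleneck)
```

In the second case A accumulates, which is the kind of response to a change that such a model is meant to predict.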

Modeling Regulation

From the genome sequence we can predict which genes code for transcription factors and where they bind

They form a putative regulatory network.

But current methods produce too many false positives

We expected ~4K regulations. We got 25K regulations.

I integrate this model with microarray data to find the “most probable” regulatory network using a parsimony criterion

Systems Biology

beyond Bioinformatics

A very active research area that aims to understand the cell as a system with complex interactions

The focus is not on the genes, it is on the genome

The key is to understand networks

  • regulatory
  • metabolic
  • signaling
  • protein-protein interaction

Why Computers in Molecular Biology and Genetics?

The present

DNA is digital information

All experimental values in science are measured with an observational error. (e.g. temperature is 10.2 ± 0.05°C, pressure is 101215 ± 125 Pa)

Except genetic sequences: Nucleotides are either A, C, T or G.

There is no “average” or “intermediate case”

So it is natural to use computers and information theory to model DNA
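A small sketch of what "digital" means here: with four possible letters, each nucleotide fits exactly in 2 bits, and the Shannon entropy of a sequence measures how much information each position carries. The encoding table and the function names are illustrative assumptions.

```python
import math
from collections import Counter

# Two bits per nucleotide; the letter order in LETTERS matches the codes.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
LETTERS = "ACGT"


def pack(seq):
    """Store a DNA string in a single integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def unpack(value, length):
    """Recover the DNA string; the length is needed to keep leading A's."""
    bases = []
    for _ in range(length):
        bases.append(LETTERS[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


def entropy_bits(seq):
    """Shannon entropy per position, in bits (at most 2 for DNA)."""
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in Counter(seq).values())


seq = "GATTACA"
packed = pack(seq)
print(packed, unpack(packed, len(seq)))
print(entropy_bits("ACGT"), entropy_bits(seq))
```

The round trip is exact: there is no measurement error to propagate, unlike the temperature or pressure examples.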

but there is another reason …

The sequencing of the human genome, announced by the President of the USA, captured everybody’s attention.

Science converges to Molecular Biology

Physicists, mathematicians, computer scientists and engineers turned their attention to molecular biology questions.

They come looking with new eyes and creating new theoretical and practical tools.

Molecular Biology has always interacted with other disciplines

Just consider the word “Biochemistry”

The Internet makes Molecular Biology theory accessible to more people

Before Internet times

  • top science was accessible only to researchers with money to
    • make complex experiments or
    • buy expensive books and journals
  • finding references took several weeks by regular mail
  • Professors had the only copy of the textbooks

Today

  • all journals are accessible on-line
  • references are downloaded in minutes at low cost
    • free when the article is Open Access
  • experimental results of each article are also free

Anyone can analyze this data

Structured data is easy to process to discover new knowledge.

The software for this meta-analysis is also Open Source

Scientists can adapt the programs’ source code to answer their specific questions

Anyone can download these programs without cost.

If the analysis requires big computational power you can rent it at low cost

You don’t need your own super-computer

You can rent Cloud computers

Companies like Amazon.com and Google sell their spare computer power at low prices

This enables researchers to carry out computations that would be impossible otherwise.

The World is Flat

This democratization of knowledge provides an exciting challenge.

Rich countries no longer have a monopoly on knowledge.

We can be players in the big leagues, on a level playing field.

We can read the same books and the same articles, use the same machines and the same programs.

Anyone could make the next scientific breakthrough, whether in New York, New Delhi or Istanbul.

But the same opportunity is open to everyone else.

There are more PhD students than ever

And many of them will be in Molecular Biology

Cyranoski et al. 2011. “Education: The PhD Factory.” Nature 472: 276–79.

More players come to the game

Emerging economies push up the number of researchers worldwide

India graduates more than a million engineers each year. Many of them in biotechnology

Egypt has 35,000 PhD students and Israel 10,000.

Many of them will find jobs in Molecular Biology companies or academia

and China, Korea, Ukraine

Hays, Thomas. 2011. “PhDs: Israel Also Trains Plenty.” Nature 473: 284.

How will we be different?

Success of Molecular Biology generates Big Data

Advances in molecular biology technology have produced

  • new generation sequencers
  • microarrays
  • mass spectrometers
  • real-time PCR.

They produce

  • reproducible experimental results
  • in big volumes
  • at low cost

Data production cost is falling

The first bacterial genomic sequence was published in the journal Science.

Today it would be just a short communication.

National Human Genome Research Institute. http://genome.gov/sequencingcosts

Extracting Information from Raw Data

Surviving the Data Tsunami

In a few years we went from a lack of data to an excess of it

We need to learn how to extract biological meaning from big volumes of data

Classical methods are not enough

What is significant? What is the “null hypothesis”?
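One way to see why the question matters is to test thousands of genes at once when nothing is really happening. The sketch below draws p-values under the null hypothesis (uniform between 0 and 1) and counts how many look "significant" by chance. The array size of 20,000 genes, the 0.05 threshold and the Bonferroni correction are assumptions chosen for illustration; they are not taken from the talk.

```python
import random


def count_false_positives(n_tests, alpha, seed=0):
    """Draw n_tests p-values under the null hypothesis and count
    how many fall below alpha by chance alone."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_tests) if rng.random() < alpha)


n_genes = 20000   # hypothetical microarray size
alpha = 0.05      # classical single-test threshold

expected = n_genes * alpha                    # expected hits by chance
observed = count_false_positives(n_genes, alpha)

# Bonferroni correction: divide the threshold by the number of tests.
bonferroni_alpha = alpha / n_genes
strict = count_false_positives(n_genes, bonferroni_alpha)

print(expected, observed)
print(bonferroni_alpha, strict)
```

About a thousand genes pass the classical threshold although none is truly different, so the single-experiment notion of significance has to be adapted to the volume of data.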

If we don’t fully analyze our own experimental data, someone else will

And they will publish it

The plan

what we will teach

Teaching “Introduction to Data Science”

The students will learn

  • how to handle experimental data
  • how to communicate with scientists of other data-oriented disciplines
  • how to produce publication quality reports with reproducible results
  • how to get raw data, extract the relevant information and filter it using several selection criteria
  • how to store and retrieve it in efficient and useful ways
  • how to transform, organize, categorize and display it, and how to understand the results

Teaching “Scientific Computing”

Teach Python and BioPython to analyze, model, evaluate and predict the behavior of genomic and molecular biology entities.
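As a taste of the kind of exercise such a course could start with, here is a short sketch in plain Python with no external libraries. Biopython offers ready-made equivalents; the function names and the example sequence below are only illustrative.

```python
# Translation table that maps each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read in the 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq):
    """Fraction of positions that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def count_kmers(seq, k):
    """Count every overlapping word of length k in the sequence."""
    counts = {}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        counts[word] = counts.get(word, 0) + 1
    return counts


dna = "ATGCGATATCGCAT"   # hypothetical sequence
print(reverse_complement(dna))
print(gc_content(dna))
print(count_kmers(dna, 2))
```

A dozen lines are enough to ask a biological question of a sequence, which is the point of teaching the language alongside the biology.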

The students should be able to interact with high-end servers, use command line tools and be comfortable in computing environments other than Microsoft Windows.

Tools include Unix command line tools, SQL and the R statistical package.

The students should be able to understand how computer networks work and what their limitations are.

The idea is not to be experts in computers, but to have the concepts and language to work in interdisciplinary groups

Let’s start learning Data Science

To test these ideas, next week we start an

Introduction to Data Science Workshop

The mathematical tools can be explored together with the biological context, so they make sense and are easier to learn.

I will give you a link at the end of this talk.

If you are interested visit the webpage and send an email.

after all, maybe I’m just crazy

Every normal student is capable of good mathematical reasoning if attention is directed to activities of his interest

Jean Piaget, 1976
Swiss psychologist and philosopher

A Secret

You can also learn at home

Everything we will show is available on the Internet

You just need to look for it

But it is in English

Translation takes too long

Translated science is obsolete science

The Future

My personal prediction

It is hard to make predictions, especially about the future
Danish proverb

Molecular Biology has become mainstream

Genomic tools are also used outside academia.

Several companies provide “personalized DNA services”.

  • 23andMe, partially owned by Google.
  • The Genographic project, created by the National Geographic Society and IBM.

Both offer to trace the ancestry and migrations of the human population. Anyone can find out what their true origins are.

Example

Molecular Biology will follow the path of computers

Today PCR thermocyclers are expensive devices found in universities and research centers, very much like desktop computers were in the 70’s and 80’s.

Nowadays computers are low-cost and found everywhere.

Will the same happen with PCR?

PCR future

Today only a few companies produce PCR thermocyclers, just as only a few companies make smartphones such as the iPhone and Samsung.

Nevertheless you can see smartphones everywhere.

And this is a big opportunity for creators of software applications.

The value is in the apps. Ask Nokia or Blackberry

A computer on every desk and in every home, all running Microsoft software

Bill Gates,
Microsoft’s founding mission.

PCR is the new PC

Gates set this goal in the late 70’s, when it was not obvious if people would even see a computer in their lives.

PCR technology is now where Personal Computers were in 1975. If PCR machines become inexpensive,

  • and there is “a PCR on every desk and home”,
  • in hospitals,
  • restaurants
  • and high schools,

then who will be making “software apps” for them?

If PCR machines are available everywhere

applications can be:

  • Determining ancestry (e.g. race horses, farm animals, fishes)
  • Detection of unwanted organisms
  • Marker-assisted breeding
  • Food quality control (e.g. in a university canteen)
  • Security and control of Genetically Modified Organisms
  • Polymorphism detection
  • Clinical diagnosis
  • Personalized medicine
  • Police forensic analysis

Software for PCR

the specific parameters of an application

  • DNA extraction protocols
  • Primer design
  • Amplification protocols
  • Detection methods

I think we should prepare our students to make these “apps”.

They should have easy access to low-cost thermocyclers, and use them frequently and creatively.

Then, like in the computer industry, they may create completely new applications that we cannot foresee now.

New tools for new science

New Instruments trigger advances in Molecular Biology

and in other sciences

They are usually named after their inventor

  • Galileo created modern science when he made his own telescope
  • Newton also invented a new kind of telescope, still used today
  • Bunsen enabled spectrometry analysis with his burner
  • Svedberg ultracentrifuge (16S)
  • Sanger DNA sequencing method
  • Southern blot method for specific DNA detection
  • PCR to amplify DNA samples

Notice that most of these inventors got Nobel prizes for their contributions.

Scientific Instrumentation

I propose to create a course on “Scientific Instrumentation”, initially using software tools.

Making instruments is now “software”, not craftsmanship.

We can understand this with a biological analogy.

  • Designs in digital files are like genes.
  • 3D printers are like ribosomes, producing physical versions of the design.
  • Online collaboration is like evolution: designs are changed to improve their fitness.

It is not rocket science

It is not heart surgery