20 October 2016

My name is Andrés Aravena

I don't speak Turkish 😟
I am

  • Assistant Professor at IU Mol. Biology and Genomics Dpt.
  • Mathematical Engineer, U. of Chile
  • PhD Informatics, U Rennes 1, France
  • PhD Mathematical Modeling, U. of Chile
  • not a Biologist
  • but an Applied Mathematician who can speak “biologist language”

I come from Chile

Nearly 17 million people

Spanish colony 500 years ago (so language is Spanish)

Independent Republic 200 years ago

Everyday life very similar to Turkey

Chilean Economy: Exports

1st world producer of copper

2nd world producer of salmon

Fruits
peaches, grapes, apples, avocado
Wine
exported worldwide

Biotechnology can improve all these industries

My main work was on bio-mining

Official data for 2014. Banco Central de Chile

Copper is melted

to separate it from other compounds

This is
very expensive

... and polluting

(this smoke is sulphuric acid)

Solution: Bioleaching

The use of bacteria to extract elements from ore

Bioleaching is much better than melting copper

  • Reduced contamination
  • Cheaper

The goal is to understand and improve the bacteria involved, so that this technology can be used extensively

Enables building new mines

It is like discovering petrol reserves for the country

Bioleaching bacteria

We had a research contract with the main mining company

State owned, big enough to pay for long term research

We focused mainly on 2 questions:

  • Monitoring the microbial community in the mine
  • Understanding how these bacteria do “mining”

Monitoring Environmental Community

We developed models and tools to design

  • qPCR primers
  • oligos for microarrays
  • statistical models
  • practical software tools

that enable quick and precise detection and quantification of the complex metagenome

Results

N. Ehrenfeld, A. Aravena, A. Reyes-Jara, N. Barreto, R. Assar, A. Maass, P. Parada, Design and use of oligonucleotide microarrays for identification of Biomining microorganisms. Advanced Materials Research 71-73 (2009) 155-158.

Patents

  • Method for the design of oligonucleotides for molecular biology techniques.
    • Australia, Mexico, South Africa, USA
  • DNA fragments array from biomining microorganisms and method for detection of them.
    • Argentina, Australia, Mexico, Peru, South Africa, USA
  • Array of nucleotidic sequences for the detection and identification of genes that codify proteins with activities relevant in biotechnology present in a microbiological sample, and method for using this array.
    • Australia, Chile, China, Mexico, Peru, South Africa, USA

Patents (2)

  • Method and array for detection and identification of microorganisms present in a sample by using genomic regions coding for different tRNA-synthetases.
    • Argentina, Australia, Chile, Mexico
  • Method for the identification and quantification of microorganisms useful in biomining processes.
    • Australia, Chile, Mexico, South Africa, USA

Today

Istanbul University

“Dry Lab”

  • Genomics and bioinformatics
  • Metagenomic analysis of ancient DNA
    • Collaboration with METU and Sweden
  • Very enthusiastic post-doc

Understanding Biomining Bacteria

We identified the 3 most abundant species

  • Acidithiobacillus ferrooxidans
  • Acidithiobacillus thiooxidans
  • Leptospirillum ferrooxidans

We sequenced their genomes, measured their gene expression and metabolite concentrations

  • genomics
  • transcriptomics
  • metabolomics

We had several questions.

  • One of the key ones was: understanding transcriptional regulation

  • Limitations:
    • Cell modification is not feasible
    • Knock out is not feasible
    • Cells live at pH 1.6

Traditional experimental approaches are not viable

Our Approach

Modeling regulation by integrating genomic and transcriptomic data.

  • Microarray results for several stress conditions
    • identification of co-expressed genes
  • Annotated genomic sequence
    • Identify putative Transcription Factors and Binding Sites.

Using E.coli for model evaluation

  • Genomic sequence available
    • 4523 genes
  • Differential expression data available:
    • 907 arrays at Many Microbial Microarray Database
  • Several experimentally validated regulations described in the literature
    • RegulonDB 8.1 describes 2650 E.coli operons.
    • Describes 1652 regulations between operons.

Co-expression

Identification of sets of genes that behave similarly across different environmental conditions, by

  • Linear Correlation, or
  • Mutual Information (many methods)

Result: Big influence graphs where 2 genes are connected when they are “similar”. Millions of edges

Problem: Confusion between direct and indirect relationships
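
A minimal sketch of the linear-correlation variant, assuming each gene has one expression value per condition. The gene names, the expression values and the 0.9 threshold are made up for illustration; they are not the data or the cut-off used in this work.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_edges(profiles, threshold=0.9):
    """Connect two genes when |correlation| across conditions >= threshold."""
    edges = set()
    for g1, g2 in combinations(sorted(profiles), 2):
        if abs(pearson(profiles[g1], profiles[g2])) >= threshold:
            edges.add((g1, g2))
    return edges


# Toy expression values over five conditions (hypothetical numbers).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],  # follows geneA closely
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],  # unrelated pattern
}
edges = coexpression_edges(profiles)
print(edges)
```

Every pair above the threshold becomes an edge, whether the relationship is direct or not, which is why the resulting graph is so large.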

Significant co-expression

Several approaches exist to separate direct and indirect relationships:

Relevance Networks, ARACNe, C3NET, MRNET

Network size is reduced 10 to 20 times

Pairs of co-expressed operons

We assume that all genes in each operon are co-expressed. This simplifies the analysis

We used Maximal Relevance/Minimal Redundancy criterion (MRNET) to determine co-expressed operons

Result: Influence network with 61,506 edges. 6 of them are validated regulations

How to explain the other co-expressions?

Physical Interaction networks

A transcriptional regulatory network (TRN) is a physical model of the interactions

  • from regulators: genes coding for Transcription Factors
  • to target genes: those having a Binding Site for the TF in the promoter region

These interactions modulate the global expression of genes through regulatory cascades.

Model: Explaining co-expression

Co-expression is explained by the existence of a common regulator acting on them directly or indirectly through a regulatory cascade. Either:

  • There is a directed path from one gene to the other. The first is regulating the last by a regulatory cascade.
  • None of the genes is regulating the other but both are co-regulated by a third gene.
    • This case is represented in the network by a v-shape: two paths from a common regulator to each co-regulated gene.

V-shapes

  • For a given pair of co-regulated genes A and B, we want to find the possible explanations for their co-regulation.
  • We call an explanation of A and B any path from A to B or from B to A, or any set of arcs forming a v-shape between them.
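
The definition above can be sketched as a reachability check on a directed graph. This is only an illustration: the graph `trn`, the gene names and the function names are hypothetical, and the check ignores arc confidence.

```python
from collections import deque


def reachable_from(graph, source):
    """Set of genes reachable from `source` by following arcs (BFS)."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for target in graph.get(node, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def explanation_kind(graph, a, b):
    """Return 'path', 'v-shape' or None for the co-expressed pair (a, b)."""
    # Case 1: one gene regulates the other through a cascade.
    if b in reachable_from(graph, a) or a in reachable_from(graph, b):
        return "path"
    # Case 2: a third gene reaches both of them.
    for regulator in graph:
        if regulator in (a, b):
            continue
        reach = reachable_from(graph, regulator)
        if a in reach and b in reach:
            return "v-shape"
    return None


# Toy network: regulator -> list of target genes (hypothetical).
trn = {"R": ["X", "A"], "X": ["B"], "A": ["C"]}
print(explanation_kind(trn, "A", "C"))
print(explanation_kind(trn, "A", "B"))
print(explanation_kind(trn, "C", "Z"))
```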

Experimental regulations explain few co-expressions

The network of experimentally validated regulations described in RegulonDB only explains 3,990 (6.5%) of the 61,506 observed co-expressions.

  • Only a few co-expressions were explained by a single validated arc
  • The rest could only be explained through regulatory cascades.

TRN can be predicted in silico

  • Using patterns from experimental data
  • Plus some probabilistic models
  • Sites with low p-value are putative binding sites
  • Transcription factors are predicted by homology

Building a putative TRN

Predicted TRN can explain most co-expressions

A putative TRN was built using the E.coli genomic sequence and patterns from the Prodoric database of transcription factors and binding sites.

We found that this putative TRN explained 91.1% of the pairs of co-expressed operons.

Putative TRNs are usually huge

Putative TRNs are usually huge, due to the low specificity of methods based on the sequence.

  • Putative TRN has 25,604 regulations
  • Predicted regulations may not be real
  • But contains regulations that explain 91.1% of co-expressions
  • A realistic subnetwork can be chosen in a biologically meaningful way

This is the main motivation of our model

Lombarde

our model

Graphical Illustration

Overview of LOMBARDE

For the studied organism, the LOMBARDE method requires the following input:

  • a putative TRN represented by a weighted directed graph \(\mathcal G\), with vertices corresponding to genes and arcs connecting regulator genes to regulated ones.
    • An arc connects two genes if the first gene codes for a transcription factor that presumably binds in the promoter region of the second gene.
    • The \(p\)-value \(p_i\) associated with this arc reflects the confidence level of this prediction.
  • a set \(\mathcal O\) of pairs of co-expressed genes.

Overview of LOMBARDE

  • In a first stage, LOMBARDE assigns each arc a discrete cost \(w_i\) such that more confident arcs have lower cost: \[w_i = F(p_i)\]
  • LOMBARDE discretizes the \(p\)-values into \(k\) categories.
  • This allows us to define the function \(Cost(S)\) for any subgraph \(S\) as the sum of the costs of its arcs: \[Cost(S)=\sum_{i\in S}w_i\]

Costs of arcs

To avoid “shortcuts” we use costs that grow exponentially

Better ten “good” steps at cost 1 than one “weak” step at cost 10
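
A possible shape for \(F\), as a sketch only. The category boundaries in `THRESHOLDS` and the growth factor `BASE` are hypothetical choices for illustration; the slides do not state the actual values.

```python
from bisect import bisect_left

# Hypothetical p-value boundaries, from most to least confident.
# Three boundaries give k = 4 categories.
THRESHOLDS = [1e-8, 1e-6, 1e-4]
BASE = 10  # hypothetical growth factor between consecutive categories


def arc_cost(p_value):
    """w = F(p): cost BASE**category, so confident arcs are cheap."""
    category = bisect_left(THRESHOLDS, p_value)
    return BASE ** category


def subgraph_cost(p_values):
    """Cost(S): sum of the costs of the arcs in S."""
    return sum(arc_cost(p) for p in p_values)


# Three very confident steps against one less confident step.
cascade = subgraph_cost([1e-9, 1e-9, 1e-9])
shortcut = subgraph_cost([1e-7])
print(cascade, shortcut)
```

With costs that grow this fast, a cascade of several confident arcs stays cheaper than a single arc from a weaker category.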

Overview of LOMBARDE

  • LOMBARDE explains the co-expression of the pair \((gene_{1}, gene_{2})\in \mathcal O\) by identifying a common regulator \(gene_{3}\) which is connected to both \(gene_{1}\) and \(gene_{2}\) via regulatory cascades of high confidence.

  • In graph terms, a subgraph \(S\) is a v-shape for the pair \((gene_{1}, gene_{2})\) if \(S\) is the union of two independent paths from \(gene_{3}\) (the common regulator) to \(gene_{1}\) and to \(gene_{2}\).

Confident explanations

  • A v-shape for \((gene_{1}, gene_{2})\) is said to be confident if it is of minimum cost among all the explanations for the pair.

  • Our model transforms a parsimony criterion into a graph minimization problem.

  • The output of LOMBARDE is a subgraph \(\mathcal L\) of \(\mathcal G\) built as the union of all confident explanations for each co-expressed pair in \(\mathcal O\).
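
One way to search for a minimum-cost explanation, sketched under simplifying assumptions: it runs Dijkstra's algorithm on the reversed arcs from each gene of the pair and adds the two cheapest cascade costs for every candidate regulator. It does not check that the two paths are independent and it returns a single optimum rather than all of them, so it is not the LOMBARDE implementation itself. Arc costs and gene names are hypothetical.

```python
import heapq


def reverse_distances(arcs, target):
    """Cheapest cascade cost from every gene to `target` (Dijkstra on reversed arcs)."""
    incoming = {}
    for (src, dst), cost in arcs.items():
        incoming.setdefault(dst, []).append((src, cost))
    dist = {target: 0}
    heap = [(0, target)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for src, cost in incoming.get(node, ()):
            nd = d + cost
            if nd < dist.get(src, float("inf")):
                dist[src] = nd
                heapq.heappush(heap, (nd, src))
    return dist


def cheapest_common_regulator(arcs, gene1, gene2):
    """Regulator minimizing the summed cost of its cascades to both genes."""
    d1 = reverse_distances(arcs, gene1)
    d2 = reverse_distances(arcs, gene2)
    common = set(d1) & set(d2)
    if not common:
        return None
    best = min(common, key=lambda r: (d1[r] + d2[r], r))
    return best, d1[best] + d2[best]


# Toy weighted TRN: (regulator, target) -> cost (hypothetical).
arcs = {
    ("R1", "A"): 1, ("R1", "M"): 1, ("M", "B"): 1,  # confident cascade
    ("R2", "A"): 10, ("R2", "B"): 1,                # weaker shortcut
}
result = cheapest_common_regulator(arcs, "A", "B")
print(result)
```

A directed path from one gene to the other is covered as a special case, because each gene is at distance 0 from itself.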

Results

Biased towards validated regulations

  • The putative TRN for E.coli contains 25,604 arcs, 444 of which are experimentally validated.
  • After applying LOMBARDE most of its arcs are discarded, keeping only 4,922 (19.0%).
  • However, among the validated arcs, LOMBARDE is less aggressive, keeping 295 (66.4%) of them.

Biased towards validated regulations

  • This shows that the output of LOMBARDE is biased towards experimentally validated regulations.
    • A hypergeometric test confirms this bias, with an enrichment \(p\)-value under \(10^{-105}\).
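
A sketch of how such an enrichment test can be computed with exact integer arithmetic, using the arc counts quoted above (25,604 arcs, 444 validated, 4,922 kept, 295 validated among the kept ones). The function name is hypothetical and the exact set-up of the original test is an assumption, so the printed value need not match the reported one digit for digit.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` items without replacement."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    # Exact ratio of big integers, converted to float only at the end.
    return float(Fraction(favourable, total))


# Arcs in the putative TRN, validated arcs, arcs kept, validated arcs kept.
p_value = hypergeom_upper_tail(25604, 444, 4922, 295)
print(f"{p_value:.2e}")
```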

LOMBARDE can complete a partially experimental TRN

  • We also considered an extended TRN combining all E.coli validated regulations and all arcs in the putative TRN
  • Nearly 30% of non-validated arcs are replaced by a set of similar size where almost all arcs are validated.
  • There is a core of regulations preserved in LOMBARDE output

Even without experimental data results are good

Venn Diagram

Summary of Results

| Network | Explained co-expressions | Num. Vertices | Num. Arcs | Num. Arcs in E.coli |
|---|---|---|---|---|
| TRN built from E.coli | 3,990 (6.5%) | 823 | 1,652 | 1,652 |
| E.coli ab initio \(\mathcal G\) | 56,044 (91.1%) | 2,390 | 25,604 | 444 |
| Lombarde output \(\mathcal L\) | 56,044 (91.1%) | 2,336 | 4,922 | 295 |
| E.coli extended \(\mathcal G_e\) | 56,789 (92.3%) | 2,434 | 26,812 | 1,652 |
| Lombarde output \(\mathcal L_e\) | 56,789 (92.3%) | 2,370 | 4,374 | 1,520 |

LOMBARDE produces a topologically realistic TRN

  • Average degree (number of interactions per operon) of the putative TRN was 10.7
  • The value suggested in literature is in the range 1.5 to 2.0.
  • Average degree of LOMBARDE output was 2.1.
  • This is also close to the average degree in the network of validated regulations for E.coli, 2.0.
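
The three values above are consistent with dividing the number of arcs by the number of vertices in the Summary of Results table. That reading of "average degree" is an assumption; the sketch below only redoes the arithmetic with the counts from the table.

```python
# (vertices, arcs) taken from the Summary of Results table.
networks = {
    "validated E.coli TRN": (823, 1652),
    "putative TRN": (2390, 25604),
    "LOMBARDE output": (2336, 4922),
}


def average_degree(vertices, arcs):
    """Average number of arcs per vertex (assumed definition)."""
    return arcs / vertices


degrees = {name: round(average_degree(v, a), 1) for name, (v, a) in networks.items()}
for name, degree in degrees.items():
    print(f"{name}: {degree}")
# validated E.coli TRN: 2.0
# putative TRN: 10.7
# LOMBARDE output: 2.1
```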

Degree distribution

  • The degree distribution (proportion of operons for each degree) in LOMBARDE output is similar to the network of validated regulations, meaning that they share some structural properties.

Global relevance of regulators can be evaluated using centrality indices

The network produced by LOMBARDE also contains most of the global regulators described for E.coli

Using the radiality index, we could rank the regulators on LOMBARDE output. Among the most relevant regulators in this network we recovered 10 of the known global regulators.

When LOMBARDE was applied to the extended input, the result recovers 18 of the known global regulators, 14 of them among the most relevant ones.

Core of predicted E.coli regulators

Ranking of predicted E.coli regulators

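| Gene name | Ranking in literature | Ranking in Lombarde for \(\mathcal G\) | Ranking in Lombarde for \(\mathcal G_e\) |
|---|---|---|---|
| crp | 1 | 25 | 1 |
| ihfA | 2 | 14 | 4 |
| ihfB | 3 | 16 | 5 |
| fnr | 4 | 1 | 6 |
| fis | 5 | 63 | 2 |
| arcA | 6 | 13 | 7 |
| lrp | 7 | 34 | 87 |
| hns | 8 | | 14 |
| narL | 9 | 121 | 126 |
| ompR | 10 | 143 | 96 |
| fur | 11 | 7 | 8 |
| phoB | 12 | 9 | 25 |
| cpxR | 13 | 80 | 22 |
| soxR | 14 | 69 | 49 |
| soxS | 15 | 109 | 18 |
| mtfA | 16 | | |
| cspA | 17 | | 42 |
| rob | 18 | 30 | 95 |
| purR | 19 | 39 | 47 |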

Results for A.ferrooxidans

  • 64 regulators identified
  • 19 of them have no known function
  • Enrichment of nitrogen-related regulators
    • Nitrogen fixation has been identified as a relevant factor in bioleaching (Levican et al, 2008)

Published in BMC Bioinformatics

Acuña, Vicente, Andres Aravena, Carito Guziolowski, Damien Eveillard, Anne Siegel, and Alejandro Maass. 2016.

“Deciphering transcriptional regulations coordinating the response to environmental changes.” BMC Bioinformatics 17 (1)

Perspectives

  • Instead of p-values use better indices for regulation confidence
  • Biological networks arise from evolution. We can derive extra restrictions from this fact.

Conclusion

LOMBARDE produces networks with realistic degree distributions, recovering and giving a central role to most of the global regulators described in literature.

In other words, LOMBARDE shapes the resulting network towards the structural characteristics of a true regulatory network.

Thanks!