BMC Bioinformatics 2015, Volume 16 Suppl 3
Open Access
Highlights from the Third International Society for
Computational Biology (ISCB) European Student
Council Symposium 2014
Strasbourg, France. 6 September 2014
Published: 13 February 2015
These abstracts are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S3
Highlights from the Third European International Society for
Computational Biology (ISCB) Student Council Symposium 2014
Margherita Francescatto1, Susanne MA Hermans2, Sepideh Babaei3,
Esmeralda Vicedo4, Alexandre Borrel5,6,7, Pieter Meysman8,9*
1Department of Genome Biology for Neurodegenerative Diseases, German
Center for Neurodegenerative Diseases (DZNE), Tübingen, Germany;
2Computational Discovery and Design (CDD) group, Centre for Molecular
and Biomolecular Informatics (CMBI), Radboudumc, Nijmegen, The
Netherlands; 3Delft Bioinformatics Lab, Delft University of Technology, The
Netherlands; 4Department for Bioinformatics and Computational Biology,
Institut für Informatik, TU München, Munich, Germany; 5INSERM, UMRS-973,
MTi, Paris, France; 6University Paris Diderot, Sorbonne Paris Cité, UMRS-973,
MTi, Paris, France; 7University of Helsinki, Division of Pharmaceutical
Chemistry, Faculty of pharmacy, Finland; 8Department of Mathematics and
Computer Science, University of Antwerp, Antwerp, Belgium; 9Biomedical
Informatics Research Center Antwerp (biomina), University of Antwerp/
Antwerp University Hospital, Edegem, Belgium
BMC Bioinformatics 2015, 16(Suppl 3):A1
In this meeting report, we give an overview of the
talks, presentations and posters presented at the third
European Symposium of the International Society for
Computational Biology (ISCB) Student Council. The
event was organized as a satellite meeting of the
13th European Conference for Computational Biology
(ECCB) and took place in Strasbourg, France on
September 6th, 2014.
Introduction: The ISCB Student Council (SC) is the student organization of
the International Society for Computational Biology. Its members are
typically PhD students in the fields of bioinformatics or computational
biology, but also include scientists at different stages of their careers. They
come from all around the world and share a passion for bioinformatics and
computational biology. The mission of the SC is to support the development
of the next generation of computational biologists. This is achieved through
the provision of scientific events, networking opportunities, soft-skills
training, educational resources and career advice, while attempting to
influence policy processes affecting science and education.
The European Student Council Symposium (ESCS) is one of the activities
organized by the SC as a satellite meeting accompanying the European
Conference for Computational Biology (ECCB). It is therefore the European
spin-off of the Student Council Symposium (SCS), which celebrated its
10th anniversary this year [1] and is a satellite meeting of the annual
Intelligent Systems for Molecular Biology (ISMB) conference. The ESCS has
been organized every two years, when ECCB was not conjoined with
ISMB, since 2010.
Scope and format of the meeting: This year, the 3rd ESCS took place in
Strasbourg, France on September 6th in conjunction with the 13th ECCB
conference. The main goal of the meeting was to create opportunities for
young researchers to meet and discuss with peers from all over the
world, so that ideas could be exchanged and networks built. In addition,
three highly successful principal investigators were invited to deliver
inspiring keynote talks.
We received more than 30 abstract submissions from students who wished
to present their work at the symposium. These submissions were peer-reviewed
by an independent program committee, and eight abstracts were
selected for oral presentations. Another eighteen abstracts were selected to
be presented as a poster. Thanks to the generous contributions of our
sponsors, we were able to provide four travel fellowships to support student
attendance at ESCS. Overall, almost 30 delegates from 13 different countries
attended the symposium and the program included three inspiring keynote
lectures, eight contributed student presentations, and a lively poster session.
The oral presentations were divided into three themed sessions, namely
Modeling, Systems Biology, and Networks and Statistics. For the first time in
an event organized by the SC, the five delegates with the best posters were
given the opportunity to present their work in a flash presentation. This
ensured that all attendees had the chance to hear and see the top poster
selection unconstrained by the population limits intrinsic to a normal poster
presentation. All abstracts of the accepted oral presentations are included in
this meeting report. Abstracts of the poster presentations can be found
online in the symposium booklet http://escs2014.iscbsc.org/escs-booklet.
Keynotes: The consistent theme of the ESCS keynotes was the different
aspects of dealing with and interpreting the massive amounts of biological data
that are nowadays available, often publicly and without restrictions on use. In
the morning, Dr. Lennart Martens introduced the concept of ‘Saprotrophics’,
a field that many bioinformaticians might be working in without realising it
and that has even piqued the interest of the social sciences [2]. The central
idea behind Saprotrophics is that, with the appropriate methods, new
knowledge can be obtained from massive amounts of public data, in
directions that go far beyond the original intention and purpose. Although
such analyses come with their own set of unique challenges, these can be
overcome with proper approaches. Dr. Martens gave an overview of such
challenges, of possible ways to tackle them and in addition he showed
some interesting applications.
© 2015 various authors, licensee BioMed Central Ltd. All articles published in this supplement are distributed under the terms of the
Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.
The second keynote, by Dr. Jeroen de Ridder, underlined the critical
importance of scale in biological data sets. Depending on the scale used to
analyze and interpret data, the features and patterns that emerge can
change quite dramatically; this is comparable to the change in perception
we have of a landscape when we are flying over it or walking in it. Through
an array of working examples Dr. de Ridder guided the audience into
understanding that meaningful new insights in molecular data analyses can
be achieved by accounting for the importance of scale and using scale-aware
analysis tools.
Finally, the keynote from Dr. Lars Juhl Jensen concerned the efforts needed
to collect and combine data from different sources into a single, biologically
meaningful, network. In his talk, Dr. Jensen detailed the efforts and
techniques that were necessary to construct the STRING database [3]. This
database combines data derived from different curated databases, applying
refined automatic text mining techniques and computational prediction
approaches. Several of these methods have been integrated into web-based
resources, which can be used to construct other databases and are
extremely valuable for systems biology applications.
Student presentations: From all abstracts submitted to the symposium,
the best eight were selected for oral presentations, which were divided
into three sessions.
Session 1: Modeling: Information about the DNA replication mechanisms is
scarce or absent for many viruses. Kazlauskas et al. [4] reported an analysis of
DNA replication genes across more than 1500 viral genomes. This analysis
allowed Kazlauskas and colleagues to identify previously unknown replication
components in these genomes.
Conformation alterations are often a critical step for the functionality of a
variety of proteins. Narunsky et al. [5] introduced ConTemplate, a web
server able to suggest potential conformations for proteins with an
established molecular structure based on structural similarity to other
proteins with known conformations.
Session 2: Systems biology: Proteochemometrics is the modelling of the
bioactivity of ligands against different targets. Cortes et al. [6] demonstrated
that a Bayesian inference scheme can be successfully applied to this
problem within the contexts of isoform-selective cyclooxygenase inhibition
and large-scale cancer cell line drug sensitivity.
Understanding the manner in which small compounds inhibit protein-protein
interactions would greatly help in the design of the next
generation of therapeutic compounds. Kuenemann et al. [7] studied small
molecules and protein-protein interactions of such inhibitors to identify
new putative 3D characteristics that support inhibition.
While rich information sources exist for protein interaction data, their
adaptive nature remains poorly understood. Using advanced pattern mining
techniques, Naulaerts et al. [8] discovered dynamic interaction patterns in
lists of differentially expressed proteins that could be related to cancer.
Session 3: Networks and statistics: DNA methylation is an important
epigenetic marker that has been shown to be involved in gene silencing.
Döring et al. [9] modeled the differences in sequence bias that exist for
methylation determination through microarray hybridization and bisulfite
sequencing.
The identification of critical residues is of great interest for the field of
protein engineering. Armenta-Medina et al. [10] introduced a hybrid
approach called ANMA.SCA to determine the importance of a residue in
proteins, based on coevolution and cross-correlation of simulated atomic
Gene duplications are notoriously hard to correctly position in
phylogenetic reconstructions of the genomic evolutionary history. Peres
et al. [11] have developed a new method to improve the positioning of
gene duplication in gene trees produced by TreeBest.
Award Winners: At ESCS, four awards were given to the best presenters
of the day, namely two for oral presentations and two for poster
presentations. The attendees determined the winners by scoring the
different oral presentations based on presentation style, novelty of the
work presented, slide layout and clarity of the message. The best
presentation award went to Mélaine Kuenemann, while the runner-up
prize went to Isidro Cortes. The best posters were selected during the
noon poster session based on preferences expressed by the symposium
attendees through stickers. The five top-scoring posters were given the
chance to give a five-minute flash presentation during the main meeting.
From these five flash presentations, the award winners were determined
by an independent jury. Poster presentation first place went to Jakob
Jespersen, and second place to Aurélie Pirayre.
Conclusions: Like previous editions, the third ESCS was a great success,
characterized by talks of high profile and quality, both at the level of
keynotes and submitted work. This is confirmed by the results of an online
survey that participants were asked to fill in. Most participants agree that
the quality of the symposium was high to excellent, and that the
equilibrium between keynotes and submitted talks was good. This year, we
noted a decrease in the number of participants in comparison to the ESCS of
two years ago, similar to what was observed at this year’s SCS [1]. An informal
survey among students attending the main conference who did not register
for ESCS showed that the main reasons for not attending were either
conflicting workshops taking place on the same day or unfamiliarity with
the Student Council and its activities. Considering this, we recommend that the
organizers of future symposia implement targeted strategies to improve the
dissemination of announcements concerning the event in order to reach a
larger pool of potential delegates. We also observed that we received far
more applications for the ESCS travel fellowships than we were able to
provide. This, together with the explicit declaration in some of the
applications that attending the symposium would only be possible if a
travel fellowship were awarded, suggests that the lack of funding also contributed
to the drop in the number of delegates and underlines the importance
of maintaining and possibly expanding the Travel Fellowship program
from ISCB and its SC. Overall, we received very positive responses from
all attendees, with many comments on the high quality of the oral
presentations, both from keynotes and students.
Future perspectives: Next year the ISMB and ECCB conferences will be
co-organised in Dublin, Ireland, from July 10th to 14th. This meeting will
serve as the location for the 11th SCS and therefore the next ESCS will
only take place in 2016. For information on the Student Council and
other events we organize for students in computational biology and
bioinformatics, please visit our website: http://www.iscbsc.org.
Acknowledgements: The success of an event the size of the European
Student Council Symposium depends on the commitment of many. We are
greatly indebted to ECCB 2014 conference chairs Marie-Dominique Devignes
and Yves Moreau for giving us the opportunity to have the 3rd European
Student Council Symposium in Strasbourg. We are especially thankful for
the logistical support and invaluable advice of the ECCB organizing
committee; specifically the Workshops and Tutorials chairs Olivier Poch and
Mario Albrecht, and our ECCB intermediary Magali Michaut. We deeply
appreciate their continued support of the ISCB Student Council and the
symposium. Further, we would like to acknowledge the support of the ISCB
Board of Directors and their trust in our vision. The Student Council would
also like to thank our keynote speakers, Dr. Martens, Dr. de Ridder and Dr.
Jensen, for volunteering their time to contribute to the success of the
symposium and to promote the next generation of computational
biologists. Furthermore, we would like to thank everyone on the organizing
committee; without them, there would have been no symposium. Also, we
would like to thank the SCS2014 chairs, Farzana Rahman and Tomas Di
Domenico, for the synergetic symposium collaboration. In addition, we
would like to thank the BMC Bioinformatics editorial office for their help in
publishing this report. We are also extremely grateful for the financial
support that we received from our sponsors. This year ESCS was supported
by GdrBIM, IMGT, Syngenta, Novartis, BASF and Roche. Without their support
many of the opportunities that we offered to the delegates at the 3rd
European Student Council Symposium would not have been possible.
1. Rahman F, Di Domenico T: Highlights from the Tenth International
Society for Computational Biology (ISCB) Student Council Symposium
2014. BMC Bioinformatics 2015, 16(Suppl 2):A1.
2. Mackenzie A, McNally R: Living Multiples: How Large-scale Scientific Data-mining Pursues Identity and Differences. Theory, Culture & Society 2013,
3. Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Lin J,
Minguez P, Bork P, von Mering C, Jensen LJ: STRING v9.1: protein-protein
interaction networks, with increased coverage and integration. Nucleic
acids research 2013, 41 Database: D808-15.
4. Kazlauskas D, Venclovas C: Viral DNA replication: new insights and
discoveries from large scale computational analysis. BMC Bioinformatics
2015, 16(Suppl 3):A2.
5. Narunsky A, Ashkenazy H, Kolodny R, Ben-Tal N: Using ConTemplate and
the PDB to explore conformational space: On the detection of rare
protein conformations. BMC Bioinformatics 2015, 16(Suppl 3):A3.
6. Cortes-Ciriano I, van Westen G, Murrell D, Lenselink E, Bender A, Malliavin D:
Applications of Proteochemometrics - From Species Extrapolation to Cell
Line Sensitivity Modelling. BMC Bioinformatics 2015, 16(Suppl 3):A4.
7. Kuenemann MA, Bourbon LML, Labbé CM, Villoutreix BO, Sperandio O: An
exploration of the 3D chemical space has highlighted a specific shape
profile for the compounds intended to inhibit protein-protein
interactions. BMC Bioinformatics 2015, 16(Suppl 3):A5.
8. Naulaerts S, Meysman P, Vanden Berghe W, Laukens K: Mining the human
proteome for conserved mechanisms. BMC Bioinformatics 2015,
16(Suppl 3):A6.
9. Döring M, Gasparoni G, Gries J, Nordstrom K, Lutsik P, Walter J, Pfeifer N:
Identification and Analysis of Methylation Call Differences between
Bisulfite Microarray and Bisulfite Sequencing Data with Statistical
Learning Techniques. BMC Bioinformatics 2015, 16(Suppl 3):A7.
10. Armenta-Medina D, Perez-Rueda E: Hybrid approaches for the detection
of networks of critical residues involved in functional motions in protein
families. BMC Bioinformatics 2015, 16(Suppl 3):A8.
11. Peres A, Roest Crollius H: Improving duplicated nodes position in
vertebrate gene trees. BMC Bioinformatics 2015, 16(Suppl 3):A9.
Viral DNA replication: new insights and discoveries from large scale
computational analysis
Darius Kazlauskas*, Česlovas Venclovas
Institute of Biotechnology, Vilnius University, Lithuania
BMC Bioinformatics 2015, 16(Suppl 3):A2
Background: The ability to replicate is essential for all living entities.
Duplication of genetic information is carried out by replication proteins.
DNA replication has been well studied in T7, T4 phages and herpes viruses;
however, the information about replication mechanisms from other groups
of viruses is either scarce or missing altogether. Double-stranded (ds) DNA
viruses infect cells from all domains of life; they evolve fast and are very
diverse. Their genome size varies from 5 to 2,500 kbp.
Results and conclusions: To better understand viral DNA replication, we
identified replication proteins in dsDNA viruses using current state-of-the-art
homology detection methods. Over 150,000 proteins from 1,574
genomes were analyzed. We found that the composition of replication
machinery depends on the virus genome size. Small viruses (<40 kbp)
use protein-primed DNA replication or rely on replication proteins from
the host. Large viruses (>140 kbp) have their own RNA-primed replication
apparatus often supplemented with processivity factors and DNA
topoisomerases to increase replication speed and efficiency. This insight
led us to search for “missing” replication components in large genomes
and resulted in the discovery of single-stranded DNA binding (SSB)
proteins in larger eukaryotic viruses. Surprisingly, these proteins turned
out to be homologs of SSB proteins previously thought to be specific for
T7-like phages. Additionally, by analyzing the herpesviral helicase-primase
complex, we found that one of its components, UL8, is a highly
diverged inactivated B-family DNA polymerase.
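The size dependence reported above can be written down as a simple lookup. The following is only an illustrative sketch: the two thresholds (40 kbp and 140 kbp) are taken from the abstract, while the function name, the returned labels, and the handling of genomes between the two thresholds are assumptions made for this example, since the abstract does not characterize that intermediate range.

```python
# Thresholds quoted in the abstract; everything else here is a hypothetical helper.
SMALL_MAX_KBP = 40    # "small viruses (<40 kbp)"
LARGE_MIN_KBP = 140   # "large viruses (>140 kbp)"


def expected_replication_strategy(genome_size_kbp: float) -> str:
    """Return the replication strategy the abstract associates with a genome size."""
    if genome_size_kbp <= 0:
        raise ValueError("genome size must be positive")
    if genome_size_kbp < SMALL_MAX_KBP:
        return "protein-primed replication or host replication proteins"
    if genome_size_kbp > LARGE_MIN_KBP:
        return "own RNA-primed replication apparatus"
    # The abstract makes no statement about this range (assumption: report it as such).
    return "intermediate size (not characterized in the abstract)"


for size in (5, 100, 2500):
    print(size, "kbp ->", expected_replication_strategy(size))
```

Such a rule is a summary of a trend across 1,574 genomes, not a predictor for any individual virus.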
Using ConTemplate and the PDB to explore conformational space: on
the detection of rare protein conformations
Aya Narunsky1*, Haim Ashkenazy2, Rachel Kolodny3, Nir Ben-Tal1
1Department of Biochemistry and Molecular Biochemistry, George S. Wise
Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Israel; 2The
Department of Cell Research and Immunology, George S. Wise Faculty of
Life Sciences, Tel Aviv University, Ramat Aviv 69978, Israel; 3Department of
Computer Science, University of Haifa, Mount Carmel, Haifa 31905, Israel
BMC Bioinformatics 2015, 16(Suppl 3):A3
Background: Conformational changes mediate important protein functions,
such as opening and closing of channel gates, activation and inactivation of
enzymes, etc. The entire conformational repertoire of a given query protein
may not be known; however, it may be possible to infer unknown
conformations from other proteins. We developed the ConTemplate method
to exploit the richness of the Protein Data Bank (PDB)[1] for this purpose.
ConTemplate uses a three-step process to suggest alternative conformations
for a query protein with one known conformation [2]. First, ConTemplate
uses GESAMT to scan the PDB for proteins that share structural similarity
with the query [3]. Next, for each of the collected proteins, additional known
conformations are detected using BLAST [4], and clustered into a predefined
number of clusters [5]. Finally, MODELLER [6] builds models of the query in
various conformations, each representative of a cluster.
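The three-step workflow described above can be illustrated with a toy sketch. This is not ConTemplate's implementation: the structure names, similarity values, and the single "conformational coordinate" per structure are invented, the structural search (GESAMT) and homolog search (BLAST) are replaced by hard-coded dictionaries, the clustering is a plain single-linkage agglomerative scheme chosen for brevity, and the final modelling step (MODELLER) is reduced to picking one representative template per cluster. The default cutoffs (RMSD of at most 1.2 Å, Q-score of at least 0.4) are the ones quoted in the Figure 1 caption.

```python
# Step 1 input (hypothetical): similarity of PDB entries to the query as (RMSD in Å, Q-score).
STRUCTURAL_HITS = {
    "protA_closed": (0.8, 0.7),
    "protB_closed": (1.1, 0.5),
    "protC_closed": (3.0, 0.2),  # too dissimilar, filtered out in step 1
}

# Step 2 input (hypothetical): known conformations of each hit, each summarised by
# one invented conformational coordinate (for example an inter-helix angle in degrees).
ALTERNATIVE_CONFORMATIONS = {
    "protA_closed": {"protA_closed": 10.0, "protA_open": 85.0},
    "protB_closed": {"protB_closed": 12.0, "protB_half": 40.0},
}


def find_similar(hits, max_rmsd, min_q):
    """Step 1: keep entries that are structurally similar to the query."""
    return sorted(n for n, (rmsd, q) in hits.items() if rmsd <= max_rmsd and q >= min_q)


def cluster(values, k):
    """Step 2: single-linkage agglomerative clustering until k clusters remain."""
    clusters = [[name] for name in sorted(values)]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(values[a] - values[b]) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def representative(members, values):
    """Pick the medoid: the member closest to all others (ties broken by name)."""
    return min(members, key=lambda m: (sum(abs(values[m] - values[o]) for o in members), m))


def suggest_templates(hits, alternatives, k, max_rmsd=1.2, min_q=0.4):
    """Return one template per cluster; step 3 would build a model from each."""
    pool = {}
    for name in find_similar(hits, max_rmsd, min_q):
        pool.update(alternatives.get(name, {}))
    k = min(k, len(pool))
    return sorted(representative(c, pool) for c in cluster(pool, k))


print(suggest_templates(STRUCTURAL_HITS, ALTERNATIVE_CONFORMATIONS, k=2))
print(suggest_templates(STRUCTURAL_HITS, ALTERNATIVE_CONFORMATIONS, k=3))
```

In this toy data set, raising k from 2 to 3 separates the half-open structure into its own cluster, which mirrors the role the number of clusters plays in the method.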
Results: We demonstrate the application of ConTemplate with S100A6, a
member of the S100 family of Ca2+ binding proteins. The vast majority of
proteins in this family bind Ca2+ through helix-loop-helix EF-hand motifs.
The structure of the protein includes four helices connected by three
loops. Calcium binding is coupled to a conformational change, in which
helix 3 changes its orientation with respect to helix 4 (Figure 1A and 1B)
[7]. Helix 2 also changes its positioning with respect to the rest of the
protein upon calcium binding, but the change is not as dramatic. The
RMSD between the Ca2+-bound and -free conformations is 4.46 Å. The
EF-hand motif is found in many PDB entries. Yet, known structures of the
Ca2+-free conformation are relatively rare. These features make the
protein an interesting example for examining how the performance of
ConTemplate is affected by the distribution of conformations in the PDB:
the highly abundant Ca2+-bound conformation may populate a very
large cluster, which could mask the Ca2+-free conformation. Thus, finding
the latter conformation could be challenging.
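For readers unfamiliar with the RMSD values quoted in this abstract, the quantity can be sketched as follows. This minimal example assumes that the two structures have the same atoms in the same order and have already been optimally superimposed; it does not perform the superposition itself, and the coordinates are invented. The abstract does not state which atoms were used for its RMSD values.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) tuples.

    Assumes the structures are already superimposed (no fitting is done here).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Invented coordinates in Å: the two atoms are displaced by 3 Å and 4 Å respectively.
conformation_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conformation_2 = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
print(round(rmsd(conformation_1, conformation_2), 3))
```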
Starting from the Ca2+-free conformation as a query, it is sufficient to set the
number of clusters at 2 to retrieve both the Ca2+-bound and -free
conformations. ConTemplate reproduces the Ca2+-bound conformation with
an RMSD of 1.6 Å (Figure 1C). This is based on the query’s structural similarity to
the Ca2+-free conformation of another member of the family, the S100A2
protein [8], and the bound conformation of this protein [9]. The sequence
identity between the two proteins is 47%. When the number of clusters is
set to be larger than 2, each cluster represents either the Ca2+-bound or the
Ca2+-free conformation. On the other hand, using the abundant Ca2+-bound
conformation as a query, even with up to three clusters, the process
retrieves only variants of the (initial) bound conformation. Only when the
number of clusters is four or larger do we obtain at least one cluster
representing the Ca2+-free conformation. In general, the ability to predict
the other conformation improves as the number of clusters increases. For
example, with 17 clusters, 4 clusters represent the rare conformation, and
ConTemplate reproduces the Ca2+-free conformation with an RMSD of 2.43 Å
(Figure 1D). This is based on the query’s structural similarity to the bound
conformation of another member of the family, the S100A12 protein [10],
and the known free conformation of this protein [11]. The sequence identity
between the query and the template is 42%.
Conclusions: ConTemplate suggests putative conformations for a query
protein with at least one known structure, based on the query’s structural
similarity to other proteins. In principle, the clustering method enables the
detection of distinct conformations, including local conformational changes.
However, it may be necessary to adjust ConTemplate’s parameters to reveal
such changes, especially when looking for rare conformations. When
ConTemplate suggests models that are similar to the query, and the clusters
are very large, this may indicate that less-common conformations of the
query are masked by highly-abundant conformations. Increasing the number
of clusters may enable the rarer conformations to be detected. When the
additional conformation is not known, it is not trivial to detect the “correct”
conformation among the suggested models. A careful examination of the
similar proteins and their conformational changes can be useful towards
selecting the most probable conformations for the query. In addition, if the
number of clusters is large enough, a pathway between the query
conformation and a putative conformation may be found, with other models
serving as intermediates. Identification of such a pathway could provide
insight into the physiological relevance of a newly-detected conformation.
Acknowledgements: A.N. and H.A. are funded in part by the Edmond J.
Safra Center for Bioinformatics at Tel Aviv University.
1. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H,
Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000,
2. Narunsky A, Ben-Tal N: ConTemplate: exploiting the protein databank to
propose ensemble of conformations of a query protein of known
structure. BMC Bioinformatics 2014, 15(Suppl 3):A5.
Figure 1 (abstract A3) ConTemplate results demonstrated using the S100A6 Ca2+ binding protein. The Ca2+-free (A) and -bound (B) conformations
are shown in the upper panels; helix 3 is marked in red, and the calcium ions in magenta. C. Reproducing the Ca2+-bound conformation, starting from
the Ca2+-free conformation as a query. The maximal RMSD between the query and similar proteins is set to 1.2 Å, the minimal Q-score to 0.4, and the
number of clusters is set to 2. D. Reproducing the Ca2+-free conformation, starting from the Ca2+-bound conformation as a query. The similarity cutoffs
are the same as in C, and the number of clusters is set to 17.
3. Krissinel E: Enhanced fold recognition using efficient short fragment
clustering. J Mol Biochem 2012, 1(2):76-85.
4. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment
search tool. J Mol Biol 1990, 215(3):403-410.
5. Choi IG, Kwon J, Kim SH: Local feature frequency profile: a method to
measure structural similarity in proteins. Proc Natl Acad Sci USA 2004,
6. Sali A, Blundell TL: Comparative protein modelling by satisfaction of
spatial restraints. J Mol Biol 1993, 234(3):779-815.
7. Otterbein LR, Kordowska J, Witte-Hoffmann C, Wang CL, Dominguez R:
Crystal structures of S100A6 in the Ca(2+)-free and Ca(2+)-bound states:
the calcium sensor mechanism of S100 proteins revealed at atomic
resolution. Structure 2002, 10(4):557-567.
8. Koch M, Diez J, Fritz G: Crystal structure of Ca2+ -free S100A2 at 1.6-A
resolution. J Mol Biol 2008, 378(4):933-942.
9. Koch M, Fritz G: The structure of Ca2+-loaded S100A2 at 1.3-A resolution.
FEBS J 2012, 279(10):1799-1810.
10. Moroz OV, Antson AA, Grist SJ, Maitland NJ, Dodson GG, Wilson KS,
Lukanidin E, Bronstein IB: Structure of the human S100A12-copper
complex: implications for host-parasite defence. Acta Crystallogr D Biol
Crystallogr 2003, 59(Pt 5):859-867.
11. Moroz OV, Blagova EV, Wilkinson AJ, Wilson KS, Bronstein IB: The crystal
structures of human S100A12 in apo form and in complex with zinc:
new insights into S100A12 oligomerisation. J Mol Biol 2009,
Applications of proteochemometrics - from species extrapolation to cell
line sensitivity modelling
Isidro Cortes-Ciriano1*, Gerard JP van Westen2, Daniel S Murrell3,
Eelke B Lenselink4, Andreas Bender3, Therese E Malliavin1
1Institut Pasteur, Unité de Bioinformatique Structurale; CNRS UMR 3825;
Département de Biologie Structurale et Chimie, 25, rue du Dr Roux, 75015,
Paris, France; 2ChEMBL Group, European Molecular Biology Laboratory
European Bioinformatics Institute, Wellcome Trust Genome Campus, CB10
1SD, Hinxton, Cambridge, UK; 3Centre for Molecular Science Informatics,
Department of Chemistry, University of Cambridge, Cambridge, UK; 4Division
of Medicinal Chemistry, Leiden Academic Center for Drug Research, Leiden,
The Netherlands
BMC Bioinformatics 2015, 16(Suppl 3):A4
Background: Proteochemometrics (PCM) is a predictive bioactivity
modelling method which simultaneously models the bioactivity of
multiple ligands against multiple targets. PCM permits exploration of the
selectivity and promiscuity of ligands on biomolecular systems of
different complexity. This includes proteins and even cell-line models
[1,2]. The suitability of PCM to predict compound polypharmacology has
been validated both retrospectively and in prospective experimental
validation [1,2]. In practice, each ligand-target interaction is encoded by
the concatenation of ligand and target descriptor vectors used to train a
single machine learning model. The inclusion of both chemical and target
information enables both extrapolation and interpolation in the chemical and
the biological space. Therefore, PCM permits the prediction of compound
bioactivities on targets not present in the training phase [3].
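The encoding described above, in which each ligand-target pair becomes one row formed by concatenating the two descriptor vectors, can be sketched as follows. All descriptor values, names, and bioactivities are invented for illustration; real PCM models use much richer descriptors, and this sketch stops before any model is trained.

```python
# Hypothetical toy descriptors (assumption: three ligand and two target features).
LIGAND_DESCRIPTORS = {
    "ligand_1": [0.2, 1.5, 3.0],
    "ligand_2": [0.9, 0.1, 2.2],
}
TARGET_DESCRIPTORS = {
    "target_A": [1.0, 0.0],
    "target_B": [0.0, 1.0],
    "target_C": [0.5, 0.5],  # no measurements: absent from the training set
}
# Invented bioactivity values for measured ligand-target pairs.
MEASUREMENTS = {
    ("ligand_1", "target_A"): 6.1,
    ("ligand_1", "target_B"): 4.8,
    ("ligand_2", "target_A"): 7.3,
}


def encode_pair(ligand, target):
    """Concatenate ligand and target descriptors into one feature vector."""
    return LIGAND_DESCRIPTORS[ligand] + TARGET_DESCRIPTORS[target]


def build_training_set(measurements):
    """One row per measured pair; a single model is then trained on all rows."""
    X, y = [], []
    for (ligand, target), activity in sorted(measurements.items()):
        X.append(encode_pair(ligand, target))
        y.append(activity)
    return X, y


X_train, y_train = build_training_set(MEASUREMENTS)
# A pair involving a target without training data can still be encoded,
# which is what allows prediction on targets not seen during training.
x_unseen = encode_pair("ligand_2", "target_C")
print(X_train[0], y_train[0])
print(x_unseen)
```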
Results: In this contribution, we show a methodological advancement in
the field [4], namely how Bayesian inference (Gaussian Processes) can be
successfully applied in the context of PCM for (i) the prediction of
compound bioactivity along with the error estimation of the prediction; (ii)
the determination of the applicability domain of a PCM model; and (iii) the
inclusion of experimental uncertainty of bioactivity measurements. We
illustrate how the application of PCM can be useful in medicinal chemistry
to concomitantly optimize compound selectivity and potency, in the
context of two application scenarios: (a) modelling isoform-selective
cyclooxygenase inhibition; and (b) large-scale cancer cell line drug sensitivity
prediction, where we benchmark the predictive signal of basal gene
expression, gene copy-number variation, exome sequencing, and protein
abundance data. We present the R package Chemically Aware Model
Builder (camb) [5], which is able to perform the above-mentioned modelling
tasks. camb is an open source platform for the generation of Structure-Activity
and Structure-Property models. The functionalities of camb include:
(i) standardisation of chemical structure representation, (ii) calculation of 905
one-dimensional descriptors and 14 fingerprints for small molecules, (iii) 8
types of amino acid descriptors, (iv) 13 whole protein sequence descriptors,
and (v) training, validation and visualization of predictive models.
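To illustrate why a Gaussian Process yields an error estimate along with each prediction, and how experimental uncertainty can enter the model, here is a minimal pure-Python sketch. It is not the authors' method and does not use the camb package: it assumes a zero-mean prior, a squared-exponential kernel with fixed hyperparameters, and a single noise variance standing in for measurement uncertainty; the training data are invented.

```python
import math


def rbf(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between two feature vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return variance * math.exp(-0.5 * sq / length_scale ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_predict(X_train, y_train, x_new, noise_var=1e-2):
    """Posterior mean and variance of a zero-mean GP at x_new.

    noise_var is added to the kernel diagonal and plays the role of the
    experimental uncertainty of the measurements (an assumption of this sketch).
    """
    n = len(X_train)
    K = [[rbf(X_train[i], X_train[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x, x_new) for x in X_train]
    mean = sum(k * a for k, a in zip(k_star, solve(K, y_train)))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, max(var, 0.0)


# Invented one-dimensional training data.
X = [[0.0], [1.0], [2.0]]
y = [5.0, 6.0, 7.0]
mean_near, var_near = gp_predict(X, y, [1.0])    # inside the training region
mean_far, var_far = gp_predict(X, y, [100.0])    # far outside it
print(mean_near, var_near)
print(mean_far, var_far)
```

The predictive variance is small near the training data and returns to the prior variance far from it, which is the behaviour that makes it usable as an applicability-domain indicator.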
Conclusions: Overall, the application of PCM in these two case scenarios lets
us conclude that PCM is a suitable technique, on these data, to model the
activity of ligands exhibiting diverse bioactivity profiles across a panel of
targets, which can range from protein binding sites (a), to cancer cell-lines
(b). The camb package constitutes a platform encompassing all steps for the
generation of predictive models from chemical structures and their
associated bioactivities/properties, which will provide reproducibility and
simplify the generation of predictive bioactivity/property models.
1. van Westen GJP, Wegner JK, Ijzerman AP, van Vlijmen HWT, Bender A:
Proteochemometric Modeling as a Tool to Design Selective Compounds
and for Extrapolating to Novel Targets. Med Chem Commun 2011, 2:16-30.
2. Cortes-Ciriano I, Ain QU, Subramanian V, Lenselink EB, Mendez-Lucio O,
Ijzerman AP, Wohlfahrt G, Prusis P, Malliavin TE, van Westen GJP, Bender A:
Polypharmacology Modelling Using Proteochemometrics (PCM): Recent
Methodological Developments, Applications to Target Families, and
Future Prospects. Med Chem Commun in press.
3. van Westen GJP, Wegner JK, Geluykens P, Kwanten L, Vereycken I,
Peeters A, Ijzerman AP, van Vlijmen HWT, Bender A: Which Compound to
Select in Lead Optimization? Prospectively Validated Proteochemometric
Models Guide Preclinical Development. PLoS ONE 2011, 6:e27518.
4. Cortes-Ciriano I, van Westen GJP, Lenselink EB, Murrell DS, Bender A,
Malliavin TE: Proteochemometric Modelling in a Bayesian framework. J
Cheminf 2014, 6:35.
5. Murrell DS, Cortes-Ciriano I, van Westen GJP, Stott IP, Bender A, Malliavin TE,
Glen RC: Chemically Aware Model Builder (camb): An R package for
property and bioactivity modeling of small molecules. [http://www.
An exploration of the 3D chemical space has highlighted a specific
shape profile for the compounds intended to inhibit protein-protein
interactions
Mélaine A Kuenemann1,2, Laura ML Bourbon1,2, Céline M Labbé1,2,3,
Bruno O Villoutreix1,2,3, Olivier Sperandio1,2,3*
1Université Paris Diderot, Sorbonne Paris Cité, UMRS 973 Inserm, Paris 75013,
France; 2Inserm, U973, Paris 75013, France; 3CDithem, Faculté de Pharmacie,
1 rue du Prof Laguesse, 59000 Lille, France
BMC Bioinformatics 2015, 16(Suppl 3):A5
Background: The vital role of protein-protein interactions (PPI) for life
makes them the subject of a growing number of drug discovery projects.
Yet the specific properties of PPI interfaces (often described as flat, large and
hydrophobic) require a dramatic paradigm shift in the way we design the
small compounds meant to modulate them for therapeutic purposes.
To this end, successful inhibitors of PPI targets (iPPI) may be used to
discover which distinctive properties make this type of inhibitor capable of
binding to such intricate surfaces. Among the properties from which
lessons could be learnt, the 3D characteristics of iPPI have been
pinpointed as essential. Understanding the putative shape profile of iPPI
could help the design of a new generation of inhibitors.
Results: In an attempt to identify 3D characteristics, we collected the
bioactive conformations of 84 orthosteric iPPI and compared them to those
of 1282 inhibitors of conventional targets (e.g., enzymes), all drawn from
several databases (2P2I [1], PDBbind [2], PDB). Because the known heavier
and more hydrophobic character of iPPI could conceal other characteristics,
we required that none of the identified descriptors correlate
with the hydrophobicity or the size of the compound. Four 3D
characteristics were highlighted (Figure 1). They describe either the shape of
the compounds (globularity) or the 3D distributions of the hydrophobic and
hydrophilic interacting regions of the compounds (IW4, EDmin3, CW2:
VolSurf descriptors [3]). More specifically, the most essential property
revealed in the analysis (EDmin3) illustrates how iPPI manage to bind to the
hydrophobic patches often present at the core of PPI targets. The newly
identified properties were further confirmed as characteristic of iPPI using
much larger datasets, including our iPPI-DB [4], e-Drug3D [5] and a
representative subset of BindingDB [6].
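The requirement that retained descriptors do not correlate with compound size or hydrophobicity can be illustrated with a small filter. This is only a sketch under stated assumptions: the cut-off of 0.5 on the absolute Pearson correlation, the descriptor names and the toy values are all hypothetical, and the abstract does not say how the correlation was actually assessed.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def uncorrelated_descriptors(descriptors, confounders, max_abs_r=0.5):
    """Keep descriptors whose |r| with every confounder is below max_abs_r.

    max_abs_r is an assumed cut-off, not a value taken from the abstract.
    """
    kept = []
    for name, values in descriptors.items():
        strongest = max(abs(pearson(values, c)) for c in confounders.values())
        if strongest < max_abs_r:
            kept.append(name)
    return kept

# Toy data for six hypothetical compounds.
confounders = {
    "mol_weight": [300.0, 350.0, 400.0, 450.0, 500.0, 550.0],
    "log_p": [1.0, 2.0, 3.0, 2.0, 4.0, 3.0],
}
descriptors = {
    "size_like": [30.0, 36.0, 39.0, 46.0, 50.0, 54.0],   # tracks molecular weight
    "shape_like": [0.20, 0.05, 0.25, 0.10, 0.22, 0.08],  # largely independent of size
}
kept = uncorrelated_descriptors(descriptors, confounders)
```

In this toy example only the shape-like descriptor survives, because the size-like one is almost perfectly correlated with molecular weight.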
Figure 1(abstract A5) Bioactive conformation of compound 1MQ as co-crystallized with Mdm2 (PDB code 4JVE). The compound is represented as a
transparent molecular surface and molecular sticks. The values of the highlighted descriptors are: EDmin3 = -3.18 kcal/mol (represented by the green
molecular field calculated using MOE 2012.10 at an energy level of -2.4 kcal/mol using a dry probe), IW4 = 4.13 (represented by the pink
molecular field calculated using MOE 2012.10 at an energy level of -5.5 kcal/mol using a water probe), glob = 0.20 (represented by the
molecular surface), and CW2 = 1.90 (represented by the proportion of pink surface over the full molecular surface).
Conclusions: Identifying low-molecular-weight iPPI is known to be a
difficult task. This has usually translated into designing compounds
of larger size and with higher aromaticity and hydrophobicity. Yet lessons are being
learnt from iPPI bioactive conformations in an attempt to circumvent this
trend. In this analysis, we showed that the capacity to bind a
protein-protein interface relies in part on the combination of several
structural and electrostatic features, including the globularity and the
distribution of hydrophilic regions but, most importantly, of hydrophobic
interacting regions. More distinctively, iPPI seem to be characterized by a
significantly higher efficiency in binding the hydrophobic patches often present
at PPI interfaces. The absence of correlation of this type of property with the
hydrophobicity and the size of the compounds could open new ways to
design iPPI with improved ligand and lipophilic efficiencies and may allow
the scientific community to anticipate an era of more drug-like iPPI.
1. Basse MJ, Betzi S, Bourgeas R, Bouzidi S, Chetrit B, Hamon V, Morelli X,
Roche P: 2P2Idb: a structural database dedicated to orthosteric
modulation of protein-protein interactions. Nucleic acids research 2013,
41(Database issue):D824-827.
2. Wang R, Fang X, Lu Y, Wang S: The PDBbind database: collection of
binding affinities for protein-ligand complexes with known three-dimensional structures. Journal of medicinal chemistry 2004,
3. Cruciani G, Pastor M, Guba W: VolSurf: a new tool for the
pharmacokinetic optimization of lead compounds. European journal of
pharmaceutical sciences : official journal of the European Federation for
Pharmaceutical Sciences 2000, 11(Suppl 2):S29-39.
4. Labbé CM, Laconde G, Kuenemann MA, Villoutreix BO, Sperandio O: iPPI-DB: a
manually curated and interactive database of small non-peptide inhibitors
of protein-protein interactions. Drug discovery today 2013, 18(19-20):958-968.
5. Pihan E, Colliandre L, Guichou JF, Douguet D: e-Drug3D: 3D structure
collections dedicated to drug repurposing and fragment-based drug
design. Bioinformatics 2012, 28(11):1540-1541.
6. Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK: BindingDB: a web-accessible
database of experimentally determined protein-ligand binding affinities.
Nucleic acids research 2007, 35(Database issue):D198-201.
Mining the human proteome for conserved mechanisms
Stefan Naulaerts1,2*, Pieter Meysman1,2, Wim Vanden Berghe3, Kris Laukens1,2
1ADReM research group, Department of Mathematics and Computer
Science, University of Antwerp, Belgium; 2Biomedical Informatics Research
Center Antwerp (biomina), University of Antwerp/Antwerp University
Hospital, Belgium; 3Laboratory of Protein Science, Proteomics and Epigenetic
Signaling (PPES), Department of Biomedical Sciences, University of Antwerp,
BMC Bioinformatics 2015, 16(Suppl 3):A6
Background: All cells are subject to ever-changing environments to which
they have to adapt, using their sensory system to provide input for the
regulatory systems that integrate the information and trigger the eventual
effectors. These cascades constitute a very complex cellular wiring that is
highly relevant due to its medical importance. The omnipresent
application of high-throughput analysis techniques has resulted in an
unprecedented level of available detail about gene expression and various
aspects of cellular proteins, such as abundance, function and localization,
often captured in well-curated compendia that are publicly available.
Although these information-rich inventories exist, the adaptive nature of
protein complexes and signalling cascades remains poorly understood, as the
currently predominant approaches are not always suited to describing the
associations between proteins. For example, binary protein interactions do not
necessarily occur in vivo as the proteins could be expressed in different
compartments of the cell or at different time points. This severely complicates
the analysis of any protein interaction data. It thus remains a challenge to find
out how biological entities cooperate to regulate cellular response to stimuli.
Methods: We used an integrative method, relying on advanced pattern
mining approaches, to gain a deeper understanding of protein network
dynamics. To this end, we created a compendium consisting of a large
number of proteomics papers for Homo sapiens that report differentially
expressed proteins in cell lines. Next, we analysed this collection with
frequent itemset mining to identify proteins that are often co-occurring
in publications and used these patterns as the backbone structure of our
further analysis. These patterns of co-occurring proteins were enriched
with additional attributes, such as gene expression correlation, protein
localization and functional coherence metrics derived from the Gene
Ontology tree [1] and used as a filter on top of an integrated binary
protein interaction network, obtained by fusing several of the most
popular resources.
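The core step of the Methods, finding proteins that are often reported together across publications, can be sketched as follows. This is a minimal illustration that enumerates small itemsets exhaustively rather than using an optimized miner such as Apriori or FP-growth; the protein lists, the support threshold and the maximum itemset size are toy assumptions, not values from the study.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Return protein sets reported together in at least min_support papers.

    Each transaction is the set of differentially expressed proteins
    reported by one publication. Itemsets are returned as sorted tuples
    mapped to the number of publications that contain them.
    """
    counts = Counter()
    for proteins in transactions:
        unique = sorted(set(proteins))
        for size in range(1, max_size + 1):
            for itemset in combinations(unique, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

# Toy compendium: one set of reported proteins per (hypothetical) paper.
papers = [
    {"TP53", "MYC", "HSPA8"},
    {"TP53", "MYC", "VIM"},
    {"TP53", "MYC", "HSPA8", "VIM"},
    {"HSPA8", "ACTB"},
]
patterns = frequent_itemsets(papers, min_support=2)
```

Here `patterns[("MYC", "TP53")]` is 3 because the pair appears in three of the four toy papers. In the approach described above, such patterns would then be annotated with further attributes (expression correlation, localization, functional coherence) before being used as a filter on the interaction network.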
Results: We found that several proteins and GO-functions, such as
transcriptional regulation, are consistently reported and deemed
significant regardless of the research topic. Furthermore, we were able to
find associations across the various “omics” levels that are conserved in a
wide range of human cancers, and we identified lists of frequently
occurring patterns that can be used to discriminate between pre- and post-metastatic tumour development.
Conclusions: Pattern-based analysis on multiple “omics” levels can be
used to identify the cellular logic circuits and holds many promising
applications in the biotechnological and biomedical areas.
1. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J, Davis A,
Dolinski K, Dwight S, Eppig J, et al: Gene ontology: tool for the unification
of biology. Nat Genet 2000, 25(1):25-29.
Furthermore, the hybrid weighted degree kernel (r = 0.234) outperformed
the weighted degree kernel with shifts (r = 0.22) by also considering the
frequencies of individual bases in addition to the consensus sequences.
Non-sequence features were less predictive of the outcome than the
sequence, e.g., RBF kernels on base quality and depth of coverage attained
only correlations of r = 0.057 and r = 0.003 with the outcome, respectively.
Conclusion: To our knowledge, this is the first approach indicating that
differences between methylation measurements from bisulfite sequencing
and the Infinium HumanMethylation450 microarray are predictable from
the reads. The results suggest that features other than the sequence play
only a minuscule role in the emergence of inconsistent methylation
measurements. We were able to show that, in this scenario, set kernels
and hybrid string kernels provide well-suited similarity measures. Further
work is necessary to validate the model’s generalizability for data from
other cell lines and to evaluate its practical merit.
Acknowledgements: Gilles Gasparoni and Karl Nordström were funded
by the BMBF project 01KU1216F (DEEP). Pavlo Lutsik was funded by the
European Union’s Seventh Framework Programme (FP7/2007-2013) grant
agreement No. 267038 (NOTOX)
1. Dedeurwaerder S, Defrance M, Calonne C, Denis H, Sotiriou C, Fuks F:
Evaluation of the Infinium Methylation 450K technology. Epigenomics
2011, 3(6):771-784.
2. Liu Y, Siegmund KD, Laird PW, Berman BP, et al: Bis-SNP: Combined DNA
methylation and SNP calling for Bisulfite-seq data. Genome Biol 2012,
3. Assenov Y, Müller F, Lutsik P, Walter J, Lengauer T, Bock C: Comprehensive
Analysis of DNA Methylation Data with RnBeads. Nat Methods in press.
4. Teschendorff AE, et al: A beta-mixture quantile normalization method for
correcting probe design bias in Illumina Infinium 450K DNA methylation
data. Bioinformatics 2013, 29(2):189-196.
5. Sonnenburg S, Rätsch G, Schäfer G: Learning interpretable SVMs for
biological sequence classification. Research in Computational Molecular
Biology Springer 2005, 389-407.
6. Rätsch G, Sonnenburg S, Schölkopf B: RASE: recognition of alternatively
spliced exons in C. elegans. Bioinformatics 2005, 21(suppl 1):i369-i377.
7. Meinicke P, Tech M, Morgenstern B, Merkl R: Oligo kernels for datamining
on biological sequences: a case study on prokaryotic translation
initiation sites. BMC Bioinformatics 2004, 5(1):169.
8. Gärtner T, Flach PA, Kowalczyk A, Smola AJ: Multi-Instance Kernels.
Proceedings of the 19th International Conference on Machine Learning San Mateo,
CA: Morgan Kaufmann 2002, 179-186, Edited by Sammut C, Hoffmann A.
Identification and analysis of methylation call differences between
bisulfite microarray and bisulfite sequencing data with statistical
learning techniques
Matthias Döring1*, Gilles Gasparoni2, Jasmin Gries2, Karl Nordström2,
Pavlo Lutsik2, Jörn Walter2, Nico Pfeifer1
1Department of Computational Biology and Applied Algorithmics, Max
Planck Institute for Informatics, Campus E1 4, 66123 Saarbrücken, Germany;
2Department of Genetics/Epigenetics, Saarland University, Saarbrücken,
BMC Bioinformatics 2015, 16(Suppl 3):A7
Background: DNA methylation is an epigenetic modification known to play
a prime role in gene silencing and is an important topic in epigenetic
research. However, due to technology-dependent errors there are
inconsistencies between methylation measurements from different methods
[1]. Incorrect methylation calls could result in the discovery of spurious
associations between methylation patterns and specific phenotypes in
epigenome-wide association studies (EWAS). We worked towards assigning
a measure of confidence to individual CpGs in order to down-weight or exclude
positions with inconsistent measurements in such studies. We used
methylation measurements from the Infinium HumanMethylation450
microarray (b450K) and whole genome bisulfite sequencing (bWGBS) to
evaluate whether locus-specific measurement differences, Δb = b450K −
bWGBS, are predictable using statistical learning techniques.
Methods: Methylation for Illumina WGBS data from HepaRGd7R2 was
called with Bis-SNP [2], while methylation for Infinium 450K data from the
same cell line was determined using RnBeads [3] and normalized with
BMIQ [4]. For a uniform feature representation, we considered windows
of reads overlapping with CpGs on the microarray (Figure 1). As
predictors we examined sets of read sequences, their consensus
sequences (with and without base frequencies), and non-sequence
features such as base quality and depth of coverage. To obtain a
predictive model independent of the methylation state, we masked CpG
positions by introducing gaps or zeroing base frequencies.
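The feature construction described in this paragraph (consensus sequence, position-specific base frequencies and masking of CpG positions) might look roughly like the sketch below. It assumes that reads are already cut to a common window and padded with "-" where a read does not cover a position; the function names and the toy reads are illustrative and are not taken from the authors' pipeline.

```python
BASES = "ACGT"

def base_frequencies(reads):
    """Position-specific base frequencies for equal-length, window-aligned reads.

    The character '-' marks positions a read does not cover and is ignored.
    """
    profile = []
    for pos in range(len(reads[0])):
        counts = {base: 0 for base in BASES}
        for read in reads:
            if read[pos] in counts:
                counts[read[pos]] += 1
        covered = sum(counts.values())
        profile.append({b: (counts[b] / covered if covered else 0.0) for b in BASES})
    return profile

def consensus(profile):
    """Most frequent base per position, or '-' where no read covers it."""
    return "".join(
        max(BASES, key=lambda b: column[b]) if any(column.values()) else "-"
        for column in profile
    )

def mask_cpg(sequence, profile, cpg_positions):
    """Hide the methylation state: gap the sequence and zero the frequencies."""
    masked_seq = list(sequence)
    masked_profile = [dict(column) for column in profile]
    for pos in cpg_positions:
        masked_seq[pos] = "-"
        masked_profile[pos] = {b: 0.0 for b in BASES}
    return "".join(masked_seq), masked_profile

# Toy window with a CpG at positions 1-2; the third read carries a C-to-T conversion.
reads = ["ACGTA", "ACGTA", "ATGT-"]
profile = base_frequencies(reads)
masked_seq, masked_profile = mask_cpg(consensus(profile), profile, cpg_positions=[1, 2])
```

Masking both representations ensures that a model trained on these features cannot simply read the methylation state off the C/T ratio at the CpG itself.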
To predict Δb, we built support vector regression models based on
Illumina WGBS data. Read similarity was measured with numerical, string
[5-7], and set kernels [8]. We introduced the notion of hybrid string
kernels to afford a similarity measure for both numeric and string input
simultaneously. These kernels are based on scaling the motif similarity
scores of two sequences according to the similarity of their base
frequency profiles.
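One way to picture a hybrid string kernel and a read-based set kernel is sketched below. It rests on several assumptions that the abstract does not confirm: the string part is a weighted degree kernel without shifts (the reported results used the variant with shifts), the frequency profiles are compared with an RBF kernel, the two are combined by a plain product, and the set kernel averages over all read pairs. All function names and parameter defaults are hypothetical.

```python
import math

def weighted_degree_kernel(x, y, max_degree=3):
    """Weighted degree kernel without shifts for equal-length sequences.

    Counts substrings of length 1..max_degree that match at the same
    position in both sequences, weighting degree d by
    2 * (max_degree - d + 1) / (max_degree * (max_degree + 1)).
    """
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    total = 0.0
    for d in range(1, max_degree + 1):
        weight = 2.0 * (max_degree - d + 1) / (max_degree * (max_degree + 1))
        matches = sum(x[i:i + d] == y[i:i + d] for i in range(len(x) - d + 1))
        total += weight * matches
    return total

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian similarity between two flattened base frequency profiles."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def hybrid_kernel(seq_x, freq_x, seq_y, freq_y, gamma=1.0):
    """Assumed hybrid: string similarity scaled by profile similarity."""
    return weighted_degree_kernel(seq_x, seq_y) * rbf_kernel(freq_x, freq_y, gamma)

def set_kernel(reads_a, reads_b, base_kernel):
    """Average base-kernel value over all pairs of reads from two sets."""
    total = sum(base_kernel(a, b) for a in reads_a for b in reads_b)
    return total / (len(reads_a) * len(reads_b))

same = hybrid_kernel("ACGT", [1.0, 0.0], "ACGT", [1.0, 0.0])
scaled = hybrid_kernel("ACGT", [1.0, 0.0], "ACGT", [0.5, 0.5])
between_sets = set_kernel(["ACGT", "ACGA"], ["ACGT"], weighted_degree_kernel)
```

Because the product of two positive semi-definite kernels is itself positive semi-definite, a kernel built this way can be passed as a precomputed Gram matrix to a standard support vector regression solver.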
Results: For a read-based set kernel utilizing the weighted degree kernel
with shifts [6], we found that the predicted values of Δb correlated
significantly with the observed outcomes (r = 0.37, p-value < 2.2 × 10⁻¹⁶).
Hybrid approaches for the detection of networks of critical residues
involved in functional motions in protein families
Dagoberto Armenta-Medina1*, Ernesto Perez-Rueda1,2
1Departamento de Ingeniería Celular y Biocatálisis, Instituto de Biotecnología,
UNAM Av. Universidad 2001, Cuernavaca, Morelos CP 62210, México;
2Unidad Multidisciplinaria de Docencia e Investigación, Sisal Facultad de
Ciencias, UNAM, Sisal, Yucatán, México
BMC Bioinformatics 2015, 16(Suppl 3):A8
Background: Currently there is great interest in identifying critical residues
in proteins, to improve our understanding and allow for the engineering of
protein families. Diverse approaches combine sequence information,
structural data, dynamics analysis and functional description to determine
the importance of amino acids with regard to protein function. In this work,
we propose a hybrid approach for the identification of critical residues in
proteins, combining the use of evolutionary information (co-evolution),
cross-correlation of atomic fluctuations derived from Anisotropic Normal
Mode Analysis (ANMA) simulations [1] and network analysis. Subsequently,
we compared this method to existing approaches.
Results: By combining the information of the covariance matrix derived
from Statistical Coupling Analysis (SCA) [2] and the cross-correlation matrix
of atomic fluctuations derived from ANMA, it was possible to identify a
network of evolutionarily coupled residues involved in relevant motions in
protein families. The outstanding sites revealed by our hybrid approach
(ANMA.SCA) showed a high correspondence with experimental data,
confirming the critical role of these sites in the functional mobility of
proteins. In addition, our approach was found to be complementary to
previous approaches. It maintained a good correspondence with
approaches derived from extensive molecular dynamics, while being faster
and less expensive in terms of computational resources [3].
Figure 1(abstract A7) Data preprocessing. (1) Only reads overlapping with a CpG on the Infinium 450K chip are retained. (2) Windows are extended to
the left and right of each CpG according to the maximum read length, yielding a uniform feature representation. (3) For each CpG, a consensus
sequence is formed from its corresponding set of reads. Additionally, the position-specific frequency of each base is extracted. (4) Finally, CpG positions
are masked by introducing gaps in the sequence or zeroing frequencies.
Conclusions: The hybrid approach ANMA.SCA opens a wide range of
possibilities in the study of functional motion within protein families. By
means of detecting networks of critical sites and their topology it is able
to reveal the hidden aspects of protein dynamics.
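The idea of combining an SCA covariance matrix with an ANMA cross-correlation matrix and reading a residue network off the result can be sketched as below. The abstract does not state how the two matrices are combined, so the element-wise product of absolute values, the cut-off of 0.5 and the use of node degree as a topology measure are all assumptions made purely for illustration, as are the toy matrices.

```python
def combine_matrices(sca, anm):
    """Element-wise product of absolute SCA couplings and absolute
    ANMA cross-correlations (an assumed combination rule)."""
    n = len(sca)
    return [[abs(sca[i][j]) * abs(anm[i][j]) for j in range(n)] for i in range(n)]

def residue_network(combined, cutoff):
    """Edges between residue pairs whose combined score reaches the cutoff."""
    n = len(combined)
    return {(i, j) for i in range(n) for j in range(i + 1, n) if combined[i][j] >= cutoff}

def degree(edges, n_residues):
    """Number of network neighbours of each residue."""
    deg = [0] * n_residues
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg

# Toy symmetric matrices for four residues.
sca = [
    [1.0, 0.8, 0.1, 0.7],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.3],
    [0.7, 0.1, 0.3, 1.0],
]
anm = [
    [1.0, 0.9, -0.2, -0.8],
    [0.9, 1.0, 0.1, 0.3],
    [-0.2, 0.1, 1.0, 0.2],
    [-0.8, 0.3, 0.2, 1.0],
]
edges = residue_network(combine_matrices(sca, anm), cutoff=0.5)
degrees = degree(edges, n_residues=4)
```

In this toy case residue 0 ends up as the hub, since it is both evolutionarily coupled to and dynamically correlated (or anti-correlated) with residues 1 and 3.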
Acknowledgements: DA-M acknowledges the PhD fellowship (35083)
from CONACYT and (IN-204714) DGAPA. EP-R was supported by a grant
(IN-204714) from DGAPA and (155116) from CONACYT
1. Eyal E, Yang LW, Bahar I: Anisotropic network model: systematic
evaluation and a new web interface. Bioinformatics 2006,
2. Socolich M, Lockless SW, Russ WP, Lee H, Gardner KH, Ranganathan R:
Evolutionary information for specifying a protein fold. Nature 2005,
3. Armenta-Medina D, Perez-Rueda E, Segovia L: Identification of functional
motions in the adenylate kinase (ADK) protein family by computational
hybrid approaches. Proteins 2011, 79(5):1662-1671.
Improving duplicated nodes position in vertebrate gene trees
Amélie Peres*, Hugues Roest Crollius
Ecole Normale Supérieure, Institut de Biologie de l’ENS, IBENS, France
BMC Bioinformatics 2015, 16(Suppl 3):A9
Background: While gene phylogenies are essential for many biological
evolutionary studies, phylogenetic reconstructions are difficult to model,
especially when they include gene duplications. In this study, we have
developed a method to improve the positions of duplications in gene
trees produced by TreeBest, a widely used method at the core of the
“Ensembl compara” pipeline[1].
Results: In order to automatically identify incorrectly positioned
duplications, we investigated a method that relies on the confidence
score, a measure between 0 and 1 introduced by TreeBest that is
assigned to each duplication node. This score reflects the ratio between
the number of species with a duplicated gene and the total number of
species derived from this node. A well-supported duplication will thus
have a score closer to 1.
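Read literally, the score described above is the number of species that carry the gene in both copies divided by the number of species found anywhere below the node. A minimal sketch of that reading follows; the function names and species lists are illustrative, and treating a score below the threshold as "poorly supported" is an assumption about how the cut-off is applied.

```python
def duplication_confidence(species_left, species_right):
    """Share of species below a duplication node that retain both copies.

    species_left and species_right are the species found in the two
    child subtrees of the duplication node.
    """
    left, right = set(species_left), set(species_right)
    union = left | right
    if not union:
        raise ValueError("a duplication node needs at least one descendant species")
    return len(left & right) / len(union)

def is_well_supported(score, threshold=0.3):
    """Assumed decision rule: scores below the threshold are poorly supported."""
    return score >= threshold

strong = duplication_confidence(
    ["human", "mouse", "rat", "dog"], ["human", "mouse"]
)
weak = duplication_confidence(
    ["human", "mouse", "rat", "dog"], ["human", "chicken"]
)
```

The first toy node scores 0.5 (two of four species kept both copies), whereas the second scores 0.2 and would be a candidate for editing under the rule assumed here.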
With our method, if a duplication node is considered to be poorly
supported, it is replaced by a speciation node, and the duplication is
moved to the following node which is tested using the same method. If
the new duplication node passes the test, the duplication is maintained
at this new position in the tree.
To test our method comprehensively, we ran it on all 20 194 phylogenetic
trees available in the Ensembl compara database version 71. The
resulting 20 194 edited gene trees were then compared with the
original Ensembl gene trees by feeding both sets to AGORA [2], an
algorithm developed in our laboratory to reconstruct ancestral gene
orders. This tool allowed us to assess the quality of the new gene trees
because its performance is very sensitive to the quality of the input gene
trees: in particular, the length of the reconstructed ancestral
chromosomal regions varies substantially with the quality of the
input gene trees.
With the Ensembl gene trees, the number of ancestral genes increases
and decreases sharply over time, whereas with the edited gene trees the
number of genes is more constant (Figure 1), which is more plausible from
an evolutionary perspective. Additionally, in some cases the number of
ancestral genes is more reasonable. Such is the case for Boreoeutheria,
the common ancestor of primates and rodents: its genome
reconstructed with the Ensembl gene trees has 30 000 genes, whereas its
genome reconstructed with our edited gene trees has only 20 000 genes.
The latter value is much closer to what one would expect because
all modern Boreoeutheria descendant genomes contain between 20 000
and 25 000 genes.
We also computed the N50, which is the size of the ancestral
block such that 50% of genes are in blocks of at least that size, for all
reconstructed ancestral genomes. A higher N50 indicates a better ancestral genome
reconstruction. Gene trees edited using our confidence score method
significantly improve the N50, most notably with a threshold of 0.3
that was obtained empirically (Figure 2).
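The N50 used here follows the usual definition and can be computed from a list of block sizes, as in the short sketch below. The block sizes are made-up numbers chosen only to show the behaviour of the measure.

```python
def n50(block_sizes):
    """Smallest block size such that blocks of at least that size
    together contain at least half of all genes."""
    sizes = sorted(block_sizes, reverse=True)
    half = sum(sizes) / 2
    running = 0
    for size in sizes:
        running += size
        if running >= half:
            return size
    raise ValueError("block_sizes must not be empty")

# Two hypothetical reconstructions of the same 30 genes.
contiguous = n50([10, 8, 5, 4, 3])
fragmented = n50([3] * 10)
```

The more contiguous reconstruction yields a higher N50 than the fragmented one, which is why the measure serves as a proxy for reconstruction quality.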
Figure 1(abstract A9) Number of genes in ancestral genomes obtained with the original Ensembl gene trees database (in blue) and with
edited gene trees with the confidence score method and a threshold of 0.3 (in red)
Figure 2(abstract A9) N50 measurement for the Boreoeutheria genome reconstruction with the original Ensembl gene trees database (in blue)
and with our edited gene trees with the confidence score method (in red). Edited trees significantly improve the N50. The optimal threshold is 0.3.
Results are similar for all other ancestral genomes
Conclusions: We find that using the confidence score method significantly
improves the positions of duplications within gene trees when compared to
the initial Ensembl gene tree database. The optimal value is obtained with a
threshold score of 0.3, at which 39% of the 197 894 duplication nodes of the
Ensembl gene tree database are edited, resulting in an increase in the N50
length for the ancestral reconstruction of the 58 vertebrate ancestors. These
results suggest that our improved gene trees are more reliable.
1. Flicek P, Amode MR, Barrell D, Beal K, Brent S, et al: Ensembl 2012. Nucleic
Acids Res 2011, 40:D84-90.
2. Muffato M: Reconstruction de génomes ancestraux chez les Vertébrés.
PhD Thesis 2010.
Cite abstracts in this supplement using the relevant abstract number,
e.g.: Peres and Crollius: Improving duplicated nodes position in
vertebrate gene trees. BMC Bioinformatics 2015, 16(Suppl 3):A9