The aim of bioinformatics is to apply computational approaches and
information technology to facilitate the organisation and analysis of
biological information. One of the most exciting aspects of
bioinformatics is that it is by definition an interdisciplinary field
of science, combining molecular biology, information technology, and
mathematics into a single discipline. I initially trained in
mathematics and computer science, but since the beginning of my work
in bioinformatics at the EMBL in Heidelberg and afterwards at the
IGBMC in Strasbourg, I have been lucky enough to be able to work in
close day-to-day contact with expert molecular biologists. My interest
has focused on the analysis of protein sequences, not as a series of
individual letters, but for what they represent: proteins with
complex three-dimensional structures, biological functions, and
interactions with other proteins.
One of the cornerstones of modern bioinformatics is the comparison
or alignment of protein sequences. Pairwise comparisons are used to
search the sequence databases for homologous proteins that share a
common evolutionary history and often have similar 3D structures and
biological functions. With the aid of multiple sequence alignments of
the various members of a complete protein family, scientists are able
to study the sequence patterns conserved through evolution and the
ancestral relationships between different organisms. A close
collaboration between biologists and computer scientists is one of the
main reasons for the success of our multiple sequence alignment
programs, ClustalW and ClustalX, and of our more recent developments, in
particular DbClustal and NorMD. Only expert biologists working in the
field can evaluate the results produced by automatic methods in terms
of their biological accuracy and significance. Often experimental
evidence is required to confirm the predictions made by computer
programs; otherwise, they remain only unproved hypotheses. Thus, the
association of bioinformatics groups with structural biologists and
biochemists is essential. But mathematics and information technology
can contribute different, complementary skills, such as rigorous
mathematical theories and computer engineering techniques for program
design and development. Although software engineering is sometimes
considered to be of secondary importance by theoretical mathematicians,
computer programs
will not be used by most biologists if they are not easy to install,
robust, and user-friendly. Another crucial aspect of the development
of a successful program is the use of a suitable test system to
validate the results. We developed the BAliBASE benchmark alignment
database specifically for the evaluation of multiple alignment
methods. It has been a crucial factor in the development and
validation of our new methods and has become a standard reference for
others working in the field.
Recent advances in genome sequencing technologies and gene
expression analysis using microarrays have led to an explosive growth
in the amount of biological information publicly available. There is
now an urgent requirement for the creation and management of
information databases and automatic tools to analyse and interpret the
various types of data. In particular, the multiple sequence alignment
problem has been the subject of renewed interest from computer
scientists and mathematicians, and a number of different algorithms
have been exploited in the search for more accurate alignments,
including iterative refinement techniques, genetic algorithms, and
Hidden Markov Models. Nevertheless, it is generally acknowledged that
a human expert is still required to validate and correct the automatic
alignments in more difficult cases. No single algorithm yet exists
that can cope with the highly complex proteins detected by today’s
database search programs, and it is clear that an accurate,
biologically meaningful alignment can no longer be constructed from
the primary sequence data alone. In order to fully understand the
functions and molecular interactions of a particular gene, it will be
essential to assemble, validate, and classify such diverse information
as cellular location, degradation and modification, three-dimensional
structures, mutations, and their associated illnesses. Much of this
data is itself predicted by computerised methods and is therefore
inherently unreliable. Validation of this information will be of
primary importance to the successful development of new
sequence-analysis tools in the years to come. Current research in our
laboratory is therefore moving towards a co-operative, knowledge-based
approach bringing together pertinent information and complementary
algorithmic techniques into a single, integrated system.
Dr. Julie Thompson-Maaloum
Institut de Génétique et de Biologie Moléculaire et Cellulaire
Illkirch, France