How did you get started working in bioinformatics and protein
engineering, considering how new the field itself is?
Well, originally I was a biologist. But my master’s project
was in bioinformatics, which was very unusual in the early
1990s. The term "bioinformatics," in the sense it has today,
hadn’t really been coined at that time. My personal motivation
was that I liked biology, but I also liked working on computers,
and I preferred to do something where I could use the computer
as a tool instead of the lab bench.
What prompted you to take on this identification of signal
peptides and the prediction of cleavage sites that later became your
1997 Protein Engineering paper?
This was the main part of my doctoral thesis, but also a
direct continuation of my master’s thesis. And the answer is
pretty simple. My supervisors, Søren Brunak and Jacob
Engelbrecht, had a list of suggestions of
projects that they wanted done. I looked through the list and
chose this project because the problem sounded interesting. And,
in contrast to some of the other problems, it sounded realistic.
There seemed to be enough data available to do it. Lack of data
could be a real problem for bioinformatics at that time, because
databases were far smaller than they are today.
So what did you set out to accomplish in this paper, and how did
you go about doing it?
Let me answer that in two parts. First, I’ll explain what the
problem is biologically and then I’ll talk about what I did on
the computer to deal with it. Biologically the situation is that
some proteins are needed outside the cell, and this is the case
both in eukaryotic and prokaryotic cells. It is a very general
phenomenon. Those proteins that are destined to go outside the
cell have what’s called a signal peptide that serves as a
tag—you could call it a zip code—the function of which is to
signify that this protein is going to be exported across the
membrane. During export, the signal peptide is cleaved off, and
it’s not part of the finished or mature protein.
The problem is, given a sequence of amino acids, how do we
predict whether this protein is destined for the outside of the
cell and, if it is, what’s the signal peptide and where is this
peptide cleaved from the remainder of the protein? So there are
two questions to be answered here. One, in effect, is whether
there is a signal peptide there at all. And, two, if there is a
signal peptide, where exactly will it be cleaved off?
So how did you go about solving these problems?
First I used a dataset that was manually compiled—extracted
from scientific papers, sequence by sequence—by Gunnar von
Heijne, who later became my doctoral thesis advisor. Then I
realized that a lot of signal peptides had been investigated
since he put together that dataset, so I also compiled a dataset
extracted from the Swiss-Prot database, an international protein
sequence database that’s now part of the UniProt database.
Swiss-Prot contained amino acid sequences and annotations of
various kinds. One of those annotations would be whether there
is a signal peptide and how long it is. So that gave me the
sequence data on the one hand and the information about signal peptides
and cleavage sites on the other. Then the computational task was
to find the correlation between those two. And I did that using
a computational method known as artificial neural networks,
which my supervisors had earlier used with success to predict,
e.g., intron splice sites.
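As an illustration of how sequences and their annotations can be paired up for a learning method, here is a minimal sketch. It is not the encoding used in the 1997 paper, which the interview does not describe. The function name `label_windows`, the window size, the padding character "X", and the toy sequence are all assumptions made for this example.

```python
from typing import List, Tuple


def label_windows(sequence: str, signal_length: int,
                  half_window: int = 2) -> List[Tuple[str, int]]:
    """Cut a sequence into overlapping windows, one per residue.

    A window gets label 1 when it is centred on the last residue of the
    annotated signal peptide (the residue just before the cleavage site),
    and 0 otherwise. signal_length == 0 means "no signal peptide".
    """
    # Pad both ends so that every residue can sit at the centre of a window.
    padded = "X" * half_window + sequence + "X" * half_window
    examples = []
    for i in range(len(sequence)):
        window = padded[i:i + 2 * half_window + 1]
        label = 1 if signal_length > 0 and i == signal_length - 1 else 0
        examples.append((window, label))
    return examples


# Toy sequence invented for illustration, annotated with a 4-residue signal peptide.
examples = label_windows("MKALAQE", signal_length=4)
positive = [window for window, label in examples if label == 1]
print(positive[0])  # prints: KALAQ
```

Each (window, label) pair could then be turned into numbers and presented to a classifier such as a neural network.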
What was the most challenging part of this project, the biggest
obstacle to pulling it off?
There were two really challenging aspects. One was to verify
that the dataset was of sufficiently high quality, because the
annotations in a database like Swiss-Prot were not always
correct. So I had to invent various kinds of quality control to
get only the best annotations. One aspect of that is that I also
realized it was necessary to reduce homology in my dataset. That
means if I train my prediction methods on part of the dataset
and then test whether the prediction works on another part of my
dataset, and those two parts of my dataset contain sequences
that are very closely related, I may actually be cheating. In
that case, I haven’t shown that my network is able to pull out
the general features of the problem. I’ve only shown that it’s
capable of recognizing the same examples that it had seen during
training. So it was necessary to clean out the dataset so that
it didn’t contain pairs of very similar sequences.
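The idea of cleaning out near-duplicate sequences can be sketched as a simple greedy filter. This is only an illustration under stated assumptions: the interview does not say which similarity measure or threshold was used, so the sketch substitutes Python's `difflib.SequenceMatcher` ratio for a proper sequence alignment score, and the 0.8 threshold and the toy sequences are invented.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a stand-in for an alignment-based score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def reduce_redundancy(sequences: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a sequence only if it is not too similar to any sequence already kept."""
    kept: List[str] = []
    for seq in sequences:
        if all(similarity(seq, other) < threshold for other in kept):
            kept.append(seq)
    return kept


# Toy sequences invented for illustration; the second differs from the
# first by a single residue and is therefore filtered out.
toy = ["MKKLLAVAALLGAASA", "MKKLLAVAALLGAASS", "MRTPWQEDNHCYFGIK"]
print(reduce_redundancy(toy))
```

After such a filter, splitting the remaining sequences into training and test parts no longer places near-identical sequences on both sides of the split.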
The second challenging aspect was getting the neural network
to behave as I wanted it to. In principle, a neural network
learns its examples automatically, just by having them presented
to it. You present all the examples several times to the
network, and then it works to generalize. But there are a lot of
choices involved in this. There are many free parameters
concerning how the network is built—what’s called the
architecture—and also how the network is trained. You can train
it fast or slow; you can train it in one go or in several steps;
you can train it for a long time or stop the training early. All
those choices give different results and I had to come up with a
way to choose between these options.
How did you ultimately choose?
Well, in the end I compared the performance on datasets that
had not been involved in the training. It was as a part of that
process that I became aware that my comparisons would not be
valid unless I reduced my dataset by cleaning out the homologous
proteins. I had to find out about this reduction and then do the
whole comparison over again.
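Choosing among training options by their performance on data kept out of training can be sketched generically. This is a hypothetical outline, not the procedure from the paper: `select_configuration`, `train`, and `evaluate_held_out` are names invented here, and the caller has to supply the actual training and scoring functions.

```python
from typing import Any, Callable, Iterable, Tuple


def select_configuration(
    configs: Iterable[Any],
    train: Callable[[Any], Any],
    evaluate_held_out: Callable[[Any], float],
) -> Tuple[Any, float]:
    """Train one model per configuration and return the configuration
    whose model scores highest on data not used for training."""
    candidates = list(configs)
    if not candidates:
        raise ValueError("at least one configuration is required")
    best_config, best_score = candidates[0], float("-inf")
    for config in candidates:
        model = train(config)             # fit using the training part only
        score = evaluate_held_out(model)  # score on the held-out part only
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

A configuration here could be anything that describes one choice of architecture and training schedule, for example a dictionary of settings. The comparison is only meaningful if the held-out data contain no close relatives of the training sequences.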
Were you aware of how remarkably well cited this project would
become while you were doing it?
No, I wasn’t. When I started the project, I saw it mainly as a
basic research project and didn’t think very much about
applications. In late 1996, we put this method on the Internet,
as both a web server and a mail server. You could write your
amino acid sequences in an email and mail it to a specific
address and then after a half-hour or so you would get a
prediction back by email. That was a common way to do it at that
time, when web browsers were not something everybody naturally
had. So when we put this method on the Internet, far more users
than I had ever imagined began submitting sequences and actually
using my method. Only then did I realize that there had been a
very strong need for such a method.
Did you have any active competitors at the time? Have any more
come around in the decade since then?
My greatest competitor was actually my Ph.D. supervisor,
Gunnar von Heijne, but his method was already 10 years old at
that time. The paper, "A new method for predicting signal
sequence cleavage sites," had been published in 1986 in
Nucleic Acids Research (14 [11]: 4683-4690, 11 June 1986),
and it had also been a very highly cited paper. The problem was
that his method was not publicly available. His 1986 paper only
described the algorithm and the weight matrix used to implement
it, but then it was up to users to actually do the
implementation. So it required that the users have programming
skills. By putting my system on the Internet in 1996, I
eliminated this need for users to be able to program. Since then
I’ve had other competitors, but never anybody who could
consistently show that their performance was better than mine.
Since 1997, you’ve revised the program twice. What do these
revisions accomplish and how much more accurate is the latest
version compared to the first version?
We’ve accomplished some improvements in predictive
performance. The magnitude of the improvements varies for
different groups of organisms, but the most drastic improvement
was a seventeen percent better cleavage site location for
Gram-positive bacteria in version three, which was primarily
made by Jannick D. Bendtsen, at that time a Ph.D. student in my
group. One important improvement in version two was that we also
became able to deal with a limited number of transmembrane
proteins.
One difficult aspect of this problem is that those signal
peptides that signify export from the cell are quite similar to
those portions of transmembrane proteins that are not cleaved, but
are left hanging in the membrane as part of the membrane
embedded protein. This means that when you take a whole
proteomic set and try to predict signal peptides from that, you
get poor discrimination between secretory proteins—those that
have their function outside the cell—and integral membrane
proteins, which stay as a part of the membrane. We partly solved
this problem in version two but, when I say "partly," it’s
because we concentrated only on a subgroup of membrane proteins.
This is the problem I’m attacking in the next version of the
program, version four, which I’ll be working on this autumn.
So you think you know how to solve it?
Well, it will be far from 100%, but it will be much better
than we’re capable of doing now.
Do you also do research that is unrelated to this particular
problem of identifying signal peptides and cleavage sites?
Not a lot. Most of my other projects have always been related
to this one, because they’ve touched on other aspects of protein
sorting or protein localization. For instance, I have
participated in the prediction of which proteins go into
mitochondria or go into chloroplasts. Those have their own zip
codes. I’ve also been involved in a general approach for
predicting protein function from sequence, but my role in that
project was to see what role the subcellular location of a
protein might play in the prediction of the functional class to
which it belongs. The two exceptions to this are papers I
published concerning the prediction of eukaryotic start codons
and evolution of introns.
Were there any unexpected or serendipitous events that arose in
the course of your research? Was there any way that you just got
lucky?
I would say the only way I just got lucky was in choosing
this project without knowing beforehand how important it was.
That was serendipitous.
Henrik Nielsen, Ph.D.
Center for Biological Sequence Analysis
BioCentrum-DTU
Technical University of Denmark
Lyngby, Denmark