in-cites, September 2007
 http://www.in-cites.com/papers/HenrikNielsen.html

An interview with:
Dr. Henrik Nielsen
           

This month, in-cites correspondent Gary Taubes talks with Dr. Henrik Nielsen about his paper, "Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites" (Nielsen H, et al., Protein Eng. 10[1]: 1-6, January 1997). With 3,144 citations to date, this paper ranks at #6 among Biology & Biochemistry papers in Essential Science Indicators, and at #22 over all fields in the database. Dr. Nielsen’s work can be found in the fields of Biology & Biochemistry, Molecular Biology & Genetics, Clinical Medicine, and Computer Science. Dr. Nielsen is a Senior Researcher at the Technical University of Denmark.

  How did you get started working in bioinformatics and protein engineering, considering how new the field itself is?

Well, originally I was a biologist. But my master’s project was in bioinformatics, which was very unusual in the early 1990s. The term bioinformatics, covering what it means today, wasn’t even really coined at that time. My personal motivation was that I liked biology, but I also liked working on computers, and I preferred to do something where I could use the computer as a tool instead of the lab bench.

  What prompted you to take on this identification of signal peptides and the prediction of cleavage sites that later became your 1997 Protein Engineering paper?


This was the main part of my doctoral thesis, but also a direct continuation of my master’s thesis. And the answer is pretty simple. My supervisors, Søren Brunak and Jacob Engelbrecht, had a list of suggestions of projects that they wanted done. I looked through the list and chose this project because the problem sounded interesting. And, in contrast to some of the other problems, it sounded realistic. There seemed to be enough data available to do it. That could be a real problem for bioinformatics at that time, because databases were far smaller than they are today.

  So what did you set out to accomplish in this paper, and how did you go about doing it?

Let me answer that in two parts. First, I’ll explain what the problem is biologically and then I’ll talk about what I did on the computer to deal with it. Biologically the situation is that some proteins are needed outside the cell, and this is the case both in eukaryotic and prokaryotic cells. It is a very general phenomenon. Those proteins that are destined to go outside the cell have what’s called a signal peptide that serves as a tag—you could call it a zip code—the function of which is to signify that this protein is going to be exported across the membrane. During export, the signal peptide is cleaved off, and it’s not part of the finished or mature protein.

The problem is, given a sequence of amino acids, how do we predict whether this protein is destined for the outside of the cell and, if it is, what’s the signal peptide and where is this peptide cleaved from the remainder of the protein? So there are two questions to be answered here. One, in effect, is whether there is a signal peptide there at all. And, two, if there is a signal peptide, where exactly will it be cleaved off?

  So how did you go about solving these problems?

First I used a dataset that was manually compiled—extracted from scientific papers, sequence by sequence—by Gunnar von Heijne, who later became my doctoral thesis advisor. Then I realized that a lot of signal peptides had been investigated since he put together that dataset, so I also compiled a dataset extracted from the Swiss-Prot database, an international protein sequence database that’s now part of the UniProt database. Swiss-Prot contained amino acid sequences and annotations of various kinds. One of those annotations would be whether there is a signal peptide and how long it is. So that gave me the data sequences on one hand and the information about signal peptides and cleavage sites on the other. Then the computational task was to find the correlation between those two. And I did that using a computational method known as artificial neural networks, which my supervisors had earlier used with success to predict, e.g., intron splice sites.
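A neural network of this kind takes fixed-length numeric input, so each sequence position is typically presented as a window of neighboring residues. The interview does not describe the encoding actually used; the sketch below assumes a common scheme (one-hot encoding over the 20 standard amino acids, with an odd window size and `X` padding at the ends). The names `one_hot` and `windows` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue):
    """Return a 20-element 0/1 list; unknown residues (e.g. 'X') map to all zeros."""
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]


def windows(sequence, size):
    """Yield (center_index, flat one-hot vector) for every position.

    Assumes `size` is odd. The sequence is padded with 'X' so that
    positions near either end still get a full-length window.
    """
    half = size // 2
    padded = "X" * half + sequence + "X" * half
    for i in range(len(sequence)):
        window = padded[i:i + size]
        vector = [bit for residue in window for bit in one_hot(residue)]
        yield i, vector


# Each vector would be paired with a label taken from the database
# annotation, such as "cleavage site here" or "inside a signal peptide".
encoded = list(windows("MKTLLLTLVV", 5))
```

Each window of size 5 becomes a vector of 100 values, which is the form a feed-forward network can consume.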

  What was the most challenging part of this project, the biggest obstacle to pulling it off?

There were two really challenging aspects. One was to verify that the dataset was of sufficiently high quality, because the annotations in a database like Swiss-Prot were not always correct. So I had to invent various kinds of quality control to get only the best annotations. One aspect of that is that I also realized it was necessary to reduce homology in my dataset. That means if I train my prediction methods on part of the dataset and then test whether the prediction works on another part of my dataset, and those two parts of my dataset contain sequences that are very closely related, I may actually be cheating. In that case, I haven’t shown that my network is able to pull out the general features of the problem. I’ve only shown that it’s capable of recognizing the same examples that it had seen during training. So it was necessary to clean out the dataset so that it didn’t contain pairs of very similar sequences.
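The homology reduction described here can be illustrated with a greedy filter: a sequence is kept only if it is not too similar to any sequence already kept. This is a minimal sketch, not the procedure used in the paper. In particular, `difflib.SequenceMatcher` is only a stand-in for a proper alignment-based similarity measure, and the `threshold` value and function names are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Rough similarity in [0, 1]; a stand-in for an alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def reduce_redundancy(sequences, threshold=0.8):
    """Greedily keep sequences whose similarity to every kept one is below threshold."""
    kept = []
    for seq in sequences:
        if all(similarity(seq, k) < threshold for k in kept):
            kept.append(seq)
    return kept


# Toy sequences, made up for illustration: the second differs from the
# first by a single residue, so only one of the pair survives.
dataset = [
    "MKTLLLTLVVVTIVCLDLGYT",
    "MKTLLLTLVVVTIVCLDLGYA",
    "MRSSPGNMERIVICLMVIFLGTLV",
]
non_redundant = reduce_redundancy(dataset)
```

The greedy pass depends on input order, so a different ordering can keep a different representative from each group of near-duplicates.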

The second challenging aspect was getting the neural network to behave as I wanted it to. In principle, a neural network learns its examples automatically, just by having them presented to it. You present all the examples several times to the network, and then it works to generalize. But there are a lot of choices involved in this. There are many free parameters concerning how the network is built—what’s called the architecture—and also how the network is trained. You can train it fast or slow; you can train it in one go or in several steps; you can train it for a long time or stop the training early. All those choices give different results and I had to come up with a way to choose between these options.
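One of the choices mentioned, stopping the training early, is often made by watching the error on examples held out from training and keeping the network from the epoch where that error was lowest. The sketch below assumes a simple "patience" rule; the interview does not say which rule was used, and `best_epoch` and the error values are hypothetical.

```python
def best_epoch(validation_errors, patience=3):
    """Return (epoch, error) for the lowest validation error seen.

    Scanning stops once `patience` epochs have passed without a new
    minimum, which mimics halting the training run at that point.
    """
    best_index, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_index, best_error = epoch, error
        elif epoch - best_index >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_index, best_error


# Made-up validation errors, one per training epoch.
errors = [0.40, 0.31, 0.27, 0.29, 0.30, 0.33, 0.20]
stop_at = best_epoch(errors, patience=3)
```

With a patience of 3 the run stops before reaching the last value, which shows the trade-off: a short patience saves training time but can miss a later improvement.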

  How did you ultimately choose?

Well, in the end I compared the performance on datasets that had not been involved in the training. It was as a part of that process that I became aware that my comparisons would not be valid unless I reduced my dataset, cleaned out the homologous proteins. I had to find out about this reduction and then do the whole comparison over again.

  Were you aware of how remarkably well cited this project would become while you were doing it?

No, I wasn’t. When I started the project, I saw it mainly as a basic research project and didn’t think very much about applications. In late 1996, we put this method on the Internet, as both a web server and a mail server. You could write your amino acid sequences in an email and mail them to a specific address and then after a half-hour or so you would get a prediction back by email. That was a common way to do it at that time, when web browsers were not something everybody naturally had. So when we put this method on the Internet, far more users than I had ever imagined began submitting sequences and actually using my method. Only then did I realize that there had been a very strong need for such a method.

  Did you have any active competitors at the time? Have any more come around in the decade since then?

My greatest competitor was actually my Ph.D. supervisor, Gunnar von Heijne, but his method was already 10 years old at that time. The paper, "A new method for predicting signal sequence cleavage sites," had been published in 1986 in Nucleic Acids Research (14 [11]: 4683-4690, 11 June 1986), and it had also been a very highly cited paper. The problem was that his method was not publicly available. His 1986 paper only described the algorithm and the weight matrix used to implement it, but then it was up to users to actually do the implementation. So it required that the users have programming skills. By putting my system on the Internet in 1996, I eliminated this need for users to be able to program. Since then I’ve had other competitors, but never anybody who could consistently show that their performance was better than mine.
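To give a sense of what implementing a weight-matrix method involves, the sketch below builds a log-odds matrix from aligned example windows and slides it along a sequence to find the best-scoring position. It is a generic illustration, not the 1986 algorithm: the training windows are made up, the uniform background frequency and the pseudocount are assumptions, and the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(aligned_windows, pseudocount=1.0):
    """Return one dict per window position mapping residue -> log-odds weight.

    Weights are relative to a uniform background (an assumption).
    """
    width = len(aligned_windows[0])
    background = 1.0 / len(AMINO_ACIDS)
    matrix = []
    for pos in range(width):
        column = [w[pos] for w in aligned_windows]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        matrix.append({
            aa: math.log(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def best_site(sequence, matrix):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(matrix)
    scored = [
        (sum(matrix[i].get(sequence[start + i], 0.0) for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    score, start = max(scored)
    return start, score


# Made-up training windows with small residues at the first and last position.
matrix = build_weight_matrix(["ALA", "AQA", "ASA", "VLA"])
start, score = best_site("MKKLLLAQAEE", matrix)
```

Even a toy version like this requires choosing a background model, handling unseen residues, and writing the scanning loop, which is the programming burden a public server removes.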

  Since 1997, you’ve revised the program twice. What do these revisions accomplish and how much more accurate is the latest version compared to the first version?

We’ve accomplished some improvements in predictive performance. The magnitude of the improvements varies for different groups of organisms, but the most drastic improvement was a seventeen percent better cleavage site location for Gram-positive bacteria in version three, which was primarily made by Jannick D. Bendtsen, at that time a Ph.D. student in my group. One important improvement in version two was that we also became able to deal with a limited number of transmembrane proteins.

One difficult aspect of this problem is that those signal peptides that signify export from the cell are quite similar to the portions of transmembrane proteins that are not cleaved, but are left hanging in the membrane as part of the membrane-embedded protein. This means that when you take a whole proteomic set and try to predict signal peptides from that, you get poor discrimination between secretory proteins (those that have their function outside the cell) and integral membrane proteins, which stay as a part of the membrane. We partly solved this problem in version two but, when I say "partly," it’s because we concentrated only on a subgroup of membrane proteins. This is the problem I’m attacking in the next version of the program, version four, which I’ll be working on this autumn.

  So you think you know how to solve it?

Well, it will be far from 100%, but it will be much better than we’re capable of doing now.

  Do you also do research that is unrelated to this particular problem of identifying signal peptides and cleavage sites?

Not a lot. Most of my other projects have always been related to this one, because they’ve touched on other aspects of protein sorting or protein localization. For instance, I have participated in the prediction of which proteins go into mitochondria or go into chloroplasts. Those have their own zip codes. I’ve also been involved in a general approach for predicting protein function from sequence, but my role in that project was to see what role the subcellular location of a protein might play in the prediction of the functional class to which it belongs. The two exceptions to this are papers I published concerning the prediction of eukaryotic start codons and the evolution of introns.

  Were there any unexpected or serendipitous events that arose in the course of your research? Was there any way that you just got lucky?

I would say the only way I just got lucky was in choosing this project without knowing beforehand how important it was. That was serendipitous.

End of interview

Henrik Nielsen, Ph.D.
Center for Biological Sequence Analysis
BioCentrum-DTU
Technical University of Denmark
Lyngby, Denmark

Dr. Henrik Nielsen's most-cited paper with 3,144 cites to date:
Nielsen H, et al., "Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites," Protein Eng. 10(1): 1-6, January 1997. Source: Essential Science Indicators.
 
