
in-cites, January 2004
 http://www.in-cites.com/papers/DrNormanBreslow.html

An interview with:
Dr. Norman Breslow

Below, in-cites correspondent Gary Taubes talks with Dr. Norman Breslow about his highly cited paper, "Approximate inference in generalized linear mixed models," (J. Amer. Statist. Assn. 88[421]:9-25, March 1993). According to the ISI Essential Science Indicators Web product, this paper is currently the most-cited paper in the field of Mathematics for the past decade, with 543 citations to date. Dr. Breslow’s record in this field includes 10 papers cited a total of 703 times to date. Dr. Breslow hails from the University of Washington in Seattle, where he is a Professor of Biostatistics in the School of Public Health. He is also a Member of the Biostatistics Program at the Fred Hutchinson Cancer Research Center and Adjunct Professor at the Institute of Social and Preventive Medicine at the University of Geneva.

What was the context in which you came to the work on generalized linear mixed models, which you published in the highly cited 1993 JASA article?

That’s easy. I went on sabbatical in 1990-91 to Cambridge in the U.K., where I worked with David Clayton. I had known him since he was a student. David was trained initially in psychology, but he had very strong computing and mathematics skills. Just a brilliant guy. I hadn’t really thought much about what we might work on in Cambridge, but I knew I wanted to work with David. Both of us then went to a meeting in Rome, organized by the World Health Organization, and the meeting had to do in part with small area statistics, with application to mapping of disease rates. I was called upon to comment on a paper presented by Julian Besag, who at that time was a professor of statistics in the U.K. and is now on the faculty here at Washington. He gave a very nice paper on Bayesian Markov chain Monte Carlo approaches. His original area of application had been in image restoration, but he then became interested in whether the same techniques might be useful in epidemiology. Besag wondered whether his methodology might have applications in analyzing clusters of cancer cases near point sources of environmental pollution. I didn’t know anything about this field. I had hardly ever worked with correlated data, which is what it involved. I was pretty much a classical epidemiologist and statistician. So I tried to understand as much as I could about this Monte Carlo approach, but found it to be very complicated. When I got back to Cambridge, I met with David and remarked to him that there must be a simpler way to do it and he said yes, there is. We started talking about an approximate computational method he had independently come up with to map childhood leukemia rates in the U.K., and that was the start of it.

What is it about the paper that has made it so influential, so highly cited?

A major reason is that it constituted a synthesis of two important branches of statistical methodology, generalized linear models and linear mixed models, into what we termed generalized linear mixed models. This jargon seems to have entered the vocabulary of the field. We also were the first to use the phrase "penalized quasi-likelihood," or PQL. PQL is not a methodology. It’s an ad hoc way of approximating a maximum likelihood solution in generalized linear mixed models and it doesn’t always work well. But it caught on.

Were you aware while you were working on the paper how significant it might be?

I hadn’t the slightest idea. I wrote it largely to teach myself about an area I previously knew virtually nothing about. I taught courses in linear mixed models from a theoretical point of view. But I never really used them in applications. This was a way of having a sabbatical and learning about a completely new field. Once I got into it, there was a lot of challenging computing to do. Part of the paper is theory, part is computing, and a lot of it is applications. A big section of the paper contains illustrative examples, in which we apply the PQL algorithm and related methodology to six different data sets.

What was the most challenging aspect of writing the paper for you?

Probably the hardest part was just getting the algorithm to work, the computing. It was the most challenging programming that I had ever had to do. I took some work by Nick Longford on maximum likelihood estimation of variance components in linear mixed models and generalized it to "restricted" maximum likelihood, or REML. Then I had to write the software, get it to work, and apply it to these data sets. That’s what took most of the time.

Why do you think the model you and Clayton developed is now so widely used in so many different fields?

Well, some of my colleagues here in public health might say these models are being overused. One point is that you need something like this to analyze complicated data structures in which you have multiple sources of statistical variation. For example, this problem arises when you have repeated measures on individual subjects, which is also called longitudinal data, like the famous Framingham study. They take people in for physical exams and bring them back every few years for more measurements. The same thing is done for people infected with HIV: bring them back periodically for CD4 cell counts. If you want to model some outcome that is repeatedly measured on an individual, you need to account not only for the experimental or measurement error within the individual, but also for systematic variations between individuals. These models are also known as variance component models. In an ordinary least squares regression, or for that matter even in a generalized linear model, there is only a single variance component, a single source of random variation. Mixed models include multi-level models, which are appropriate in situations where you do the sampling in stages. Let’s say you draw a sample of schools; then within each school you draw a sample of teachers, and each teacher would have a class, from which you draw a sample of students. There already you have three levels of variability and the possibility of correlations between them.

A second point is that the PQL algorithm has been implemented in a number of commercially available computer packages for multi-level modeling of discrete statistical outcomes, such as responder vs. non-responder or cell counts. All of these computer packages reference our paper, which is probably another reason why it’s so widely cited. People want to fit models with multiple components of variation. They hadn’t been able to do so previously with discrete outcome data. They learn how useful these models are, that they explain features of the data they couldn’t explain before, and now they have software to fit them.
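
As a rough illustration of the variance-components idea described above (not code from the 1993 paper, and not the PQL algorithm), the following Python sketch simulates repeated measurements on subjects and separates between-subject from within-subject variance. It uses the classical one-way ANOVA moment estimators for a balanced linear model with a normal outcome, which is a much simpler setting than the discrete outcomes discussed in the interview. The function names, sample sizes, and parameter values are assumptions chosen for illustration.

```python
import random
from statistics import mean


def simulate(num_subjects, visits, mu, sd_between, sd_within, seed=0):
    """Simulate y_ij = mu + b_i + e_ij for a balanced repeated-measures design.

    b_i is a subject-level random effect (between-subject variation) and
    e_ij is measurement error on each visit (within-subject variation).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(num_subjects):
        subject_effect = rng.gauss(0.0, sd_between)
        row = [mu + subject_effect + rng.gauss(0.0, sd_within)
               for _ in range(visits)]
        data.append(row)
    return data


def variance_components(data):
    """Return (between-subject variance, within-subject variance) estimates.

    Uses the one-way ANOVA method of moments, which assumes every subject
    has the same number of visits.
    """
    m = len(data)       # number of subjects
    n = len(data[0])    # visits per subject
    subject_means = [mean(row) for row in data]
    grand_mean = mean(subject_means)

    # Mean square within subjects: spread of visits around each subject's mean.
    ss_within = sum((y - sm) ** 2
                    for row, sm in zip(data, subject_means)
                    for y in row)
    ms_within = ss_within / (m * (n - 1))

    # Mean square between subjects: spread of subject means around the grand mean.
    ms_between = n * sum((sm - grand_mean) ** 2 for sm in subject_means) / (m - 1)

    # E[ms_between] = var_within + n * var_between, so solve for var_between.
    var_between = max(0.0, (ms_between - ms_within) / n)
    return var_between, ms_within
```

Calling variance_components(simulate(2000, 5, 10.0, 2.0, 1.0)) should return estimates near the true variances of 4.0 (between subjects) and 1.0 (within subjects), up to sampling error. An ordinary least squares fit to the same data would lump both sources into a single residual variance.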

Do you think your models have a long shelf life?

That’s what I’m trying to evaluate in a paper I recently wrote called "Whither PQL." Generalized linear mixed models are clearly here to stay. I’d say our computational algorithm, PQL, is still holding its own. It will probably continue to be used for a certain class of problems of intermediate complexity. But there’s a new development in statistical computing that is applicable to the simpler problems. It’s kind of a rediscovery, called adaptive Gaussian quadrature, which is generally more accurate and is now being used in preference to PQL, as it should be. Monte Carlo methods for maximum likelihood estimation are also under development.

What do you consider your most satisfying work?

That’s a more difficult one. I have to say it’s in two areas and neither has anything to do with what we’ve been talking about. First, I am principal investigator on an NIH grant, doing a long-term follow-up study of a cohort of children with Wilms’ tumor, a rare childhood kidney tumor. I have been the statistician for the National Wilms’ Tumor Study Group since 1969, and we just finished our last therapeutic trial. Over the course of 34 years, we have collected data on a cohort of nearly 10,000 children with this disease. The great majority now survive since the treatments are so effective. Perhaps 90% will survive into their teenage years and 85% into adulthood. We’re now trying to determine the long-term consequences of treatment for childhood cancer, in terms of secondary tumors, congestive heart failure, renal failure, and the like. We’re also interested in the genetics of Wilms’ tumor and so we’re following up on the offspring of our survivors. It’s very exciting work.

The second area is related to my lifelong interest in statistical methods in epidemiology. In the last few years, I’ve moved away from mixed models and become quite interested in complex two-phase sampling designs for epidemiological studies. This is a very important area for epidemiology, because the designs are extremely efficient. They enable one to use all the information available in a way that standard case-control or cohort sampling designs and related analyses do not. They are particularly useful now that we’re getting into the era of molecular epidemiology, where we are sending blood and tumor tissue samples for selected patients off to labs for elaborate microarray studies.

And what makes this so satisfying?

There are two reasons. First of all, with the new two-phase study designs and analyses, epidemiologists will be able to maximize resources, to get the most accurate estimates of parameters of interest at the least possible cost. The other reason is that I’m fascinated by the mathematics; I’m intrinsically fascinated by the whole field of semi-parametric inference on which the theory is based. It uses a lot of math I tried to learn as a graduate student and thought would never be very useful at the time.

Dr. Norman Breslow
Department of Biostatistics
University of Washington
Seattle, WA, USA
  
