Below,
in-cites correspondent Gary Taubes talks with Dr. Norman
Breslow about his highly cited paper, "Approximate
inference in generalized linear mixed models," (J.
Amer. Statist. Assn. 88[421]:9-25, March 1993). According
to the ISI
Essential Science Indicators
Web product, this paper is currently the most-cited paper in
the field of Mathematics for the past decade, with 543
citations to date. Dr. Breslow’s record in this field
includes 10 papers cited a total of 703 times to date. Dr.
Breslow hails from the University of Washington in Seattle,
where he is a Professor of Biostatistics in the School of
Public Health. He is also a Member of the Biostatistics
Program at the Fred Hutchinson Cancer Research Center and
Adjunct Professor at the Institute of Social and Preventive
Medicine at the University of Geneva.
What was the context in which you came to the work on generalized
linear mixed models, which you published in the highly cited 1993 JASA
article?
That’s easy. I went on sabbatical in 1990-91 to Cambridge in
the U.K., where I worked with David Clayton. I had known him since
he was a student. David was trained initially in psychology, but he
had very strong computing and mathematics skills. Just a brilliant
guy. I hadn’t really thought much about what we might work on in
Cambridge, but I knew I wanted to work with David. Both of us then
went to a meeting in Rome, organized by the World Health
Organization, and the meeting had to do in part with small area
statistics, with application to mapping of disease rates. I was
called upon to comment on a paper presented by Julian Besag, who at
that time was a professor of statistics in the U.K. and is now on
the faculty here at Washington. He gave a very nice paper on
Bayesian Markov chain Monte Carlo approaches. His original area of
application had been in image restoration, but he then became
interested in whether the same techniques might be useful in
epidemiology. Besag wondered whether his methodology might have
applications in analyzing clusters of cancer cases near point
sources of environmental pollution. I didn’t know anything about
this field. I had hardly ever worked with correlated data, which is
what it involved. I was pretty much a classical epidemiologist and
statistician. So I tried to understand as much as I could about this
Monte Carlo approach, but found it to be very complicated. When I
got back to Cambridge, I met with David and remarked to him that
there must be a simpler way to do it and he said yes, there is. We
started talking about an approximate computational method he had
independently come up with to map childhood leukemia rates in the
U.K., and that was the start of it.
What is it about the paper that has made it so influential, so highly
cited?
A major reason is that it constituted a synthesis of two
important branches of statistical methodology, generalized linear
models and linear mixed models, which we termed generalized linear
mixed models. This jargon seems to have entered the vocabulary of
the field. We also were the first to use the phrase "penalized
quasi-likelihood," or PQL. PQL is not a methodology. It’s an ad
hoc way of approximating a maximum likelihood solution in
generalized linear mixed models and it doesn’t always work well.
But it caught on.
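For readers who want the notation behind the term, a generalized linear mixed model is commonly written as below. This is the standard textbook form, not a formula quoted from the interview or the paper; the symbols (link function g, fixed effects \beta, random effects b, covariance matrix D(\theta)) are chosen here for illustration.

```latex
g\bigl(\mathbb{E}[y_i \mid b]\bigr) = x_i^{\top}\beta + z_i^{\top} b,
\qquad b \sim N\bigl(0,\, D(\theta)\bigr)

L(\beta, \theta) = \int \prod_{i} f\bigl(y_i \mid b;\, \beta\bigr)\,
\phi\bigl(b;\, 0,\, D(\theta)\bigr)\, db
```

The integral over the random effects generally has no closed form for non-Gaussian outcomes, which is why approximations such as PQL are used in place of exact maximum likelihood.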
Were you aware while you were working on the paper how significant it
might be?
I hadn’t the slightest idea. I wrote it largely to teach
myself about an area I previously knew virtually nothing about. I
taught courses in linear mixed models from a theoretical point of
view. But I never really used them in applications. This was a way
of having a sabbatical and learning about a completely new field.
Once I got into it, there was a lot of challenging computing to do.
Part of the paper is theory, part is computing, and a lot of it is
applications. A big section of the paper contains illustrative
examples, in which we apply the PQL algorithm and related
methodology to six different data sets.
What was the most challenging aspect of writing the paper for you?
Probably the hardest part was just getting the algorithm to work,
the computing. It was the most challenging programming that I had
ever had to do. I took some work by Nick Longford on maximum likelihood
estimation of variance components in linear mixed models and
generalized it to "restricted" maximum likelihood, or REML.
Then I had to write the software, get it to work, and apply it to
these data sets. That’s what took most of the time.
Why do you think the model you and Clayton developed is now so widely
used in so many different fields?
Well, some of my colleagues here in public health might say these
models are being overused. One point is that you need something like
this to analyze complicated data structures in which you have
multiple sources of statistical variation. For example, this problem
arises when you have repeated measures on individual subjects, which
is also called longitudinal data, like the famous Framingham study.
They take people in for physical exams and bring them back every few
years for more measurements. The same thing is done for people
infected with HIV—bring them back periodically for CD4 cell
counts. If you want to model some outcome that is repeatedly
measured on an individual, you need to account not only for the
experimental or measurement error within the individual, but also
for systematic variations between individuals. These models are also
known as variance component models. In an ordinary least squares
regression, or for that matter even in a generalized linear model,
there is only a single variance component, a single source of random
variation. Mixed models include multi-level models, which are
appropriate in situations where you do the sampling in stages. Let’s
say you draw a sample of schools; then within each school you draw a
sample of teachers, and each teacher would have a class, from which
you draw a sample of students. There already you have three levels
of variability and the possibility of correlations between them.
A second point is that the PQL algorithm has been implemented in
a number of commercially available computer packages for multi-level
modeling of discrete statistical outcomes, such as responder vs.
non-responder or cell counts. All of these computer packages
reference our paper, which is probably another reason why it’s so
widely cited. People want to fit models with multiple components of
variation. They hadn’t been able to do so previously with discrete
outcome data. They learn how useful these models are, that they
explain features of the data they couldn’t explain before, and now
they have software to fit them.
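The repeated-measures situation described in this answer can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not the method of the 1993 paper: it uses a Gaussian outcome with one random intercept per subject, and recovers the between-subject and within-subject variance components with the classical one-way ANOVA (method of moments) estimator. All function names and parameter values are hypothetical.

```python
import numpy as np


def simulate_repeated_measures(n_subjects, n_visits, sigma_between,
                               sigma_within, seed=0):
    """Simulate y[i, j] = subject effect i + measurement error (i, j)."""
    rng = np.random.default_rng(seed)
    # One systematic offset per subject, shared by all of that subject's visits.
    subject_effect = rng.normal(0.0, sigma_between, size=(n_subjects, 1))
    # Independent measurement error for every visit.
    noise = rng.normal(0.0, sigma_within, size=(n_subjects, n_visits))
    return subject_effect + noise


def variance_components(y):
    """One-way ANOVA estimates of (between-subject, within-subject) variance."""
    n_subjects, n_visits = y.shape
    subject_means = y.mean(axis=1)
    # Mean square within subjects estimates the within-subject variance.
    ms_within = ((y - subject_means[:, None]) ** 2).sum() / (
        n_subjects * (n_visits - 1))
    # Mean square between subjects estimates
    # within variance + n_visits * between variance.
    ms_between = n_visits * ((subject_means - y.mean()) ** 2).sum() / (
        n_subjects - 1)
    var_between = max((ms_between - ms_within) / n_visits, 0.0)
    return var_between, ms_within


if __name__ == "__main__":
    y = simulate_repeated_measures(2000, 5, sigma_between=2.0, sigma_within=1.0)
    var_b, var_w = variance_components(y)
    # Share of total variance due to differences between subjects.
    icc = var_b / (var_b + var_w)
    print(f"between={var_b:.2f} within={var_w:.2f} intraclass corr={icc:.2f}")
```

An ordinary regression would treat all 10,000 simulated measurements as independent; the nonzero between-subject component is what makes measurements on the same person correlated. For discrete outcomes no such closed-form estimator exists, which is where approximations like PQL come in.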
Do you think your models have a long shelf life?
That’s what I’m trying to evaluate in a paper I recently
wrote called "Whither PQL." Generalized linear mixed
models are clearly here to stay. I’d say our computational
algorithm, PQL, is still holding its own. It will probably continue
to be used for a certain class of problems of intermediate
complexity. But there’s a new development in statistical computing
that is applicable to the simpler problems. It’s kind of a
rediscovery, called adaptive Gaussian quadrature, which is generally
more accurate and is now being used in preference to PQL, as it
should be. Monte Carlo methods for maximum likelihood estimation are
also under development.
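To give a sense of what quadrature does in this setting, the sketch below evaluates the marginal likelihood of one cluster of binary outcomes under a random-intercept logistic model. It uses ordinary (non-adaptive) Gauss-Hermite quadrature; the adaptive version mentioned in the answer additionally re-centers and rescales the nodes for each cluster, which is omitted here. The model, function name, and parameters are illustrative assumptions, not taken from the interview.

```python
import numpy as np


def cluster_marginal_likelihood(y, x, beta0, beta1, sigma, n_nodes=20):
    """Integrate the random intercept b ~ N(0, sigma^2) out of a logistic model.

    y: 0/1 outcomes for one cluster; x: one covariate value per outcome.
    """
    y = np.asarray(y)
    x = np.asarray(x, dtype=float)
    # Nodes and weights for integrals of the form  int exp(-t^2) f(t) dt.
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # Change of variables b = sqrt(2) * sigma * t maps the N(0, sigma^2)
    # density onto the exp(-t^2) weight, up to a factor 1 / sqrt(pi).
    b = np.sqrt(2.0) * sigma * nodes
    # Linear predictor for every (node, observation) pair.
    eta = beta0 + beta1 * x[None, :] + b[:, None]
    p = 1.0 / (1.0 + np.exp(-eta))
    # Likelihood of the whole cluster conditional on each node value of b.
    cond_lik = np.prod(np.where(y[None, :] == 1, p, 1.0 - p), axis=1)
    return float(np.sum(weights * cond_lik) / np.sqrt(np.pi))


if __name__ == "__main__":
    lik = cluster_marginal_likelihood([1, 0, 1, 1], [0.0, 1.0, 2.0, 3.0],
                                      beta0=-0.5, beta1=0.4, sigma=1.0)
    print(f"marginal likelihood of the cluster: {lik:.6f}")
```

A full fit would sum the log of this quantity over clusters and maximize over the fixed effects and sigma. With a single random effect the one-dimensional integral is cheap, which fits the remark that quadrature suits the simpler problems; the cost grows quickly as the number of random effects per cluster increases.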
What do you consider your most satisfying work?
That’s a more difficult one. I have to say it’s in two areas
and neither has anything to do with what we’ve been talking
about. First, I am principal investigator on an NIH grant, doing a
long-term follow-up study of a cohort of children with Wilms’
tumor, a rare childhood kidney tumor. I have been the statistician
for the National Wilms’ Tumor Study Group since 1969, and we just
finished our last therapeutic trial. Over the course of 34 years, we
have collected data on a cohort of nearly 10,000 children with this
disease. The great majority now survive since the treatments are so
effective. Perhaps 90% will survive into their teenage years and 85%
into adulthood. We’re now trying to determine the long-term
consequences of treatment for childhood cancer, in terms of
secondary tumors, congestive heart failure, renal failure, and the
like. We’re also interested in the genetics of Wilms’ tumor and
so we’re following up on the offspring of our survivors. It’s
very exciting work.
The second area is related to my lifelong interest in statistical
methods in epidemiology. In the last few years, I’ve moved away
from mixed models and become quite interested in complex two-phase
sampling designs for epidemiological studies. This is a very
important area for epidemiology, because the designs are extremely
efficient. They enable one to use all the information available in a
way that standard case-control or cohort sampling designs and
related analyses do not. They are particularly useful now that we’re
getting into the era of molecular epidemiology, where we are sending
blood and tumor tissue samples for selected patients off to labs for
elaborate microarray studies.
And what makes this so satisfying?
There are two reasons. First of all, with the new two-phase study
designs and analyses, epidemiologists will be able to maximize
resources, to get the most accurate estimates of parameters of
interest at the least possible cost. The other reason is that I’m
fascinated by the mathematics—I’m intrinsically fascinated by
the whole field of semi-parametric inference on which the theory is
based. It uses a lot of math I tried to learn as a graduate student
and thought would never be very useful at the time.
Dr. Norman Breslow
Department of Biostatistics
University of Washington
Seattle, WA, USA