Thursday, January 04, 2007

An Overview of Statistical Models and Statistical Thinking

Preface
Models are central to (almost) all statistical work.
This short course aims to give an overview of some of the
most prominent statistical models, and associated methods
for inference, interpretation and criticism, used in social
research.
The focus will be on models of dependence.
Much will be treated very superficially. The aim is an
overview: more detailed understanding will require further
reading.
Computer lab sessions will be used to illustrate some of the
less standard models/methods.
Purposes of Modelling
e.g., multiple linear regression
yi = β0 + β1 xi + β2 zi + ei
Explanation: How does (the mean of) y change with x and z?
Prediction: What is the expected value of y, and how much
variation is there in the value of y, at a
particular combination of x and z?
Accounting: What is the average value of y in the
population? [This is a special application of
prediction, really.]
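These three purposes can be sketched with an ordinary least-squares fit of the model above. The data and coefficient values here are simulated and purely illustrative:

```python
import numpy as np

# Hypothetical data: response y, two explanatory variables x and z.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 1.0 + 2.0 * x - 0.5 * z + rng.normal(scale=0.3, size=n)

# Design matrix with an intercept column; least-squares fit of
# y = b0 + b1*x + b2*z + e.
X = np.column_stack([np.ones(n), x, z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta

# Explanation: b1 and b2 describe how the mean of y changes with x and z.
# Prediction: the expected y at a particular (x, z) combination.
y_hat = b0 + b1 * 1.0 + b2 * 2.0
```

Accounting (the average of y in the population) is then just this prediction averaged over the population distribution of (x, z).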
Some General Principles
To be useful, a statistical model should:
- embody the research questions of interest, via (functions of) parameters, (conditional) independence relationships, etc.
- take account of research design (e.g., cluster sampling)
- reflect all important structure — systematic and haphazard variation — in the data
- not be more complicated than necessary
What is a statistical model?
A statistical model is a family of probability distributions.
For example, N(μ, σ²) is a distribution. The parameters μ
and σ together index a family of such distributions: each
different (μ, σ) combination corresponds to a different
normal distribution.
In the linear model
yi = β0 + β1 xi + β2 zi + ei,
if we assume ei ~ N(0, σ²), the family is a set of normal
distributions indexed by β0, β1, β2, σ.
Univariate/Multivariate
This terminological distinction varies between disciplines.
In discussing models of dependence, univariate means that
there is a single response or outcome variable (y). The
number of explanatory variables (predictor variables,
covariates) may be none, one, or many.
By contrast, multivariate means modelling the distribution
of two or more response variables (y1,y2, . . .) jointly. Again,
the number of explanatory variables may be none, one or
many.
Time Series and Panel Data
Here the univariate/multivariate distinction is less clear.
Same variable (e.g., income) measured at a series of
timepoints:
yt is the value at time t.
Time series: typically means a long series of observations,
e.g., a monthly exchange rate over 20 years. Univariate
time-series models relate to the study of such a series in
isolation; multivariate models relate that series to others, such
as the monthly interest rate for the same timepoints.
Panel data: several short time series, one for each unit of
analysis. Aim is often to relate mean level or trend to
explanatory variables. Sometimes univariate models suffice;
sometimes multivariate models are needed.
Level of Assumptions
The family of distributions may be parametric,
nonparametric,
or ‘semiparametric’.
Broadly speaking:
parametric: the family of distributions is fully specified,
up to a small number of unknown
parameters. For example,
y ~ N(β0 + β1x, σ²).
semiparametric:
some aspects of the distribution are
described by parameters, but others are
left unspecified. For example,
E(y) = β0 + β1x, var(y) = σ²; here the
distributional shape is unspecified.
nonparametric:
no distributional assumptions other than
known consequences of the sampling
scheme.
A nonparametric ‘model’ is really no more than a
description of the sampling scheme and a definition of the
quantity of interest. This has appeal in regard to robustness
to failure of assumptions (no assumptions are made!).
In practice, nonparametric models are used only with simple
research designs involving one or two variables, for which
there is well-developed statistical theory.
Parametric models, backed up by thorough diagnostic
checking of assumptions, are much more widely used in
social research. Likelihood provides a unified framework for
inference.
Semiparametric models provide a middle road, in situations
where there are parameters of primary interest and other
aspects that do not need to be modelled in detail. This
admits the use of ‘robust’ methods of inference (partial
likelihood, quasi-likelihood, GEE, etc.) which work for the
primary parameters under weak conditions on the
unmodelled aspects. Examples include:
- the Cox proportional hazards model for duration data
- overdispersed models for binomial/count data
- marginal models for clustered/panel data
Linearity
The ‘linear’ in ‘linear model’ refers to linearity in the
parameters.
Thus
E(y) = β0 + β1x + β2x²
is a linear model for the mean of y, but
E(y) = β0 + β1x^β2
is nonlinear.
Linear models are easier to fit, have nicer statistical theory,
and are simpler to interpret.
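Because linearity refers to the parameters, the quadratic model above is still fitted by ordinary least squares. A small sketch on simulated data, with illustrative coefficients:

```python
import numpy as np

# E(y) = b0 + b1*x + b2*x^2 is linear in (b0, b1, b2), so ordinary
# least squares applies even though the mean is a curve in x.
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=300)
y = 0.5 + 1.5 * x - 0.8 * x**2 + rng.normal(scale=0.2, size=300)

# The x^2 column is just another regressor: the model stays linear.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

By contrast, E(y) = β0 + β1x^β2 has no such representation and would need an iterative nonlinear optimizer.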
Transformation to Linearity
Sometimes a nonlinear model can be transformed into a
linear model, by applying a suitable mathematical function to
y and/or explanatory variables.
A common case is the transformation of a multiplicative
model such as
yi = β0 xi^β1 zi^β2 ei
(with E(ei) = 1) into the linear form with additive error
log yi = β0′ + β1 log xi + β2 log zi + ei′
where β0′ = log β0 and ei′ = log ei.
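A short sketch of this device, with simulated data and illustrative parameter values: the multiplicative model is generated directly, then recovered by least squares on the log scale.

```python
import numpy as np

# Hypothetical multiplicative model y = b0 * x^b1 * z^b2 * e, E(e) ≈ 1.
# Taking logs gives: log y = log b0 + b1*log x + b2*log z + log e.
rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0.5, 3.0, size=n)
z = rng.uniform(0.5, 3.0, size=n)
e = rng.lognormal(mean=0.0, sigma=0.1, size=n)   # positive error near 1
y = 2.0 * x**1.5 * z**-0.7 * e

# Linear fit on the log scale recovers the exponents b1 and b2.
X = np.column_stack([np.ones(n), np.log(x), np.log(z)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
b1_hat, b2_hat = coef[1], coef[2]
```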
Generalized Linear Model
Nonlinear models of the particular form
g[E(y)] = β0 + β1x + β2z + ...
still (typically) require iterative fitting, but they inherit much
of the nice statistical theory and ease of interpretation.
Here g is a specified, smooth, monotone transformation
known as the link function. On the scale of the link function,
E(y) is described by the familiar kind of linear predictor.
GLM versus Response Transformation
e.g., log-link generalized linear model
log E(y) = β0 + β1x
versus log-transformation of the response variable
E(log y) = β0 + β1x.
The second is a linear model for the mean of log y. This may
or may not be appropriate; e.g., if y is income, perhaps we
are really interested in the mean income of population
subgroups, in which case E(y) and not E(log y) is the right
thing to model.
The second also has technical problems if any y = 0.
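The gap between the two targets is Jensen's inequality: for a positive, skewed y, E(log y) < log E(y). A quick numerical illustration with simulated lognormal "incomes":

```python
import numpy as np

# For a skewed positive response (e.g. income), the two models target
# different quantities: E(log y) < log E(y) by Jensen's inequality.
rng = np.random.default_rng(3)
y = rng.lognormal(mean=10.0, sigma=1.0, size=100_000)

mean_log = np.log(y).mean()   # what a linear model for log y estimates
log_mean = np.log(y.mean())   # what a log-link GLM for E(y) estimates
gap = log_mean - mean_log     # for a lognormal, ≈ sigma^2 / 2 = 0.5
```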
Additive models with flexible functional dependence allow
exploration of the shape of ‘main-effect’ aspects of
dependence.
At the opposite extreme is flexible exploration of interaction.
In some contexts, interaction effects may be far from linear.
One tool for exploring ‘unstructured’ interaction effects is
the regression tree model (also known as recursive
partitioning).
Regression trees can be hard to interpret, and can be very
non-smooth (hence unrealistic). But they are a useful
exploratory device for complex dependences.
Response-variable type
The following response (y-variable) types are commonly
encountered:
- ‘continuous’ (measurements etc.)
- counts (of events etc.)
- categorical
  - binary
  - nominal
  - ordinal
- durations (survival or event-history data)
A mixture of continuous and discrete (e.g., where the
response is either zero or a positive amount of something)
sometimes occurs, and demands special care.
Response type: continuous
If a model is required for E(y), consider a GLM with
suitably-chosen link function.
Alternatively, use a linear model, possibly after a nonlinear
transformation of y.
A GLM has the advantage of allowing the variance to depend
on the mean in a specified way. For example, with
homogeneous multiplicative errors, variance = φ[E(y)]².
In a GLM (or GAM) the link function is chosen to achieve
linearity (additivity) of the right hand side.
Often (but not necessarily) this means linking the mean in
such a way that g[E(y)] can take any real value. For
example, if E(y) > 0, g(μ) = log μ will often be a candidate.
Response type: counts
e.g., numbers of arrests made by different police forces in
different time periods.
Interest is most often in the rate of occurrence per unit of
exposure, where ‘exposure’ might be amount of time,
population at risk, person-hours of effort, or a composite.
Most natural starting point is a Poisson model with log link:
yi ~ Poisson(μi), with, say,
log μi = log ti + β0 + β1 xi + β2 zi
where ti is the known exposure quantity for the ith count.
The term log ti here, with no unknown coefficient attached to
it, is called an offset. It ensures that the other effects are all
interpretable as rate-multipliers.
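A minimal sketch of fitting this model by iteratively reweighted least squares (the standard GLM fitting algorithm), on simulated data; the coefficient values and exposure distribution are illustrative assumptions:

```python
import numpy as np

# Poisson regression with log link and an offset log(t):
# log mu = log t + b0 + b1*x, fitted by IRLS.
rng = np.random.default_rng(4)
n = 1000
x = rng.normal(size=n)
t = rng.uniform(1.0, 5.0, size=n)      # known exposures
mu = t * np.exp(0.2 + 0.5 * x)         # rate exp(0.2 + 0.5x) per unit exposure
y = rng.poisson(mu)

X = np.column_stack([np.ones(n), x])
offset = np.log(t)                      # fixed term, no coefficient
beta = np.zeros(2)
for _ in range(25):                     # iteratively reweighted least squares
    eta = offset + X @ beta
    m = np.exp(eta)
    z = eta - offset + (y - m) / m      # working response (offset removed)
    W = m                               # Poisson working weights
    XtW = X.T * W
    beta = np.linalg.solve(XtW @ X, XtW @ z)
```

Note that exp(β1) is then directly interpretable as a rate-multiplier per unit change in x.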
Response type: nominal categories
Write πij for the probability that the ith respondent is in
category j (j = 1, . . . , k). Aim to model the dependence of
(πi1, πi2, . . . , πik) on xi, zi, etc.
Some possibilities:
- ‘multinomial logit’ model: separate linear predictors for each of the logits log(πij/πi1) (j = 2, . . . , k). The choice of ‘reference’ category is arbitrary; it does not affect the model.
- ‘nested logit’ model: e.g., if k = 4, find separate linear predictors for logit(πi1), logit[πi4/(1 − πi1)] and logit[πi2/(πi2 + πi3)]. The particular logits chosen will depend on the context (the research questions of interest). Maybe fewer than k − 1 will be needed.
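A sketch of how multinomial-logit linear predictors map to category probabilities, and of why the reference category is immaterial. The function name and the η values are purely illustrative:

```python
import numpy as np

# Multinomial logit: separate linear predictors eta_j for
# log(pi_j / pi_1), j = 2..k, with category 1 as reference (eta_1 = 0).
def multinomial_probs(etas):
    """etas: linear predictors for categories 2..k; returns (pi_1..pi_k)."""
    eta = np.concatenate([[0.0], np.asarray(etas, dtype=float)])
    expn = np.exp(eta)
    return expn / expn.sum()    # softmax: probabilities sum to one

p = multinomial_probs([0.7, -0.3, 1.1])   # k = 4 categories
```

Switching the reference category merely shifts all the η's by a constant, which cancels in the softmax, leaving the probabilities unchanged.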
Response type: ordered categories
Methods for nominal categories fail to exploit the extra
information in the ordering.
Some better approaches:
- ‘cumulative link’ models: with γij = πi1 + . . . + πij (j = 1, . . . , k − 1), model
g(γij) = αj − β1 xi − β2 zi
‘Ordered logit’ (aka ‘proportional odds’) and ‘ordered probit’ models are examples.
  - invariant under merging of categories
  - assumes that the dependence is the same at every ‘cut’ of the response scale.
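A sketch of the proportional-odds (ordered logit) version of this model; the threshold and slope values below are illustrative assumptions, not estimates:

```python
import numpy as np

# Ordered logit with k = 4 categories:
# logit(gamma_j) = alpha_j - beta*x, where gamma_j = P(Y <= j).
# The same beta applies at every 'cut' of the response scale.
def category_probs(x, alphas=(-1.0, 0.0, 1.5), beta=0.8):
    """alphas: thresholds; beta: slope. Illustrative values only."""
    alphas = np.asarray(alphas)
    gamma = 1.0 / (1.0 + np.exp(-(alphas - beta * x)))   # cumulative probs
    gamma = np.concatenate([gamma, [1.0]])               # gamma_k = 1
    return np.diff(gamma, prepend=0.0)                   # per-category probs

p_low, p_high = category_probs(-2.0), category_probs(2.0)
```

With β > 0, larger x shifts mass toward the higher categories, as the comparison of `p_low` and `p_high` shows.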

Response type: ordered categories (continued)
- use category scores: attach a numeric score sj to each category (e.g., sj = j), and let yij = sj when respondent i is in category j. Then construct a linear or generalized linear model for the category scores, treating them as quantitative responses.
  - scores are arbitrary. Different scores may yield different conclusions. Check sensitivity to the choice of scores.
  - advantage is availability of linear models and familiar summaries/diagnostics.
In practice this will always be a sensible first analysis (and,
depending on the results, may be conclusive).
Response type: duration data
Response yi is time to an event (e.g., time to death, to
employment...)
Aim is to describe the dependence of the distribution of yi
on explanatory variables.
A complication is censoring: yi is not fully observed; we
know only that yi > si.
Models for E(yi) do not deal easily with censoring.
Usually better to model the hazard (or ‘force of mortality’),
hi(t) = fi(t)/[1 − Fi(t)], which in general depends on t.
The proportional hazards model assumes that
hi(t) = h0(t) exp(β1 xi + β2 zi)
Random Effects
Crossed: the grouping involves a cross-classification. For
example, pupils j within (classes c × teachers t) within
schools s; or pupils j within (schools s × home
neighborhoods h).
For example, a model with random intercepts determined by
additive school and neighborhood effects:
g[E(yshj)] = (β0 + bs + ch) + β1 xshj + β2 zshj,
with, say, bs ~ N(0, σb²) and ch ~ N(0, σc²) independently.
Random Effects
Random slopes: dependence on grouping is not restricted
to intercepts.
For example,
g[E(yij)] = (β0 + bi) + β1 xi + (β2 + ci) zij
allows the effect of pupil-specific zij to depend on school (i).
Care is needed in specifying the joint distribution of the
random effects bi, ci. Most often the same model should
hold regardless of the choice of origin for zij (e.g., it should
not matter whether we use ‘age since birth’ or ‘years since
age 4’) — in which case it makes no sense to assume that bi
and ci are uncorrelated, since that can only be the case for
one choice of origin. So here three parameters are needed to
describe the distribution of (bi, ci): they are σb², σc² and σbc.
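The origin argument can be made concrete: if zij is shifted to zij − a, the intercept absorbs a times the slope, so the random intercept becomes b* = b + a·c and cov(b*, c) = cov(b, c) + a·var(c). A simulated check, with illustrative variances:

```python
import numpy as np

# If cov(b, c) = 0 at one origin, shifting z -> z - a gives a new
# intercept b* = b + a*c with cov(b*, c) = a*var(c) != 0.
rng = np.random.default_rng(5)
b = rng.normal(scale=1.0, size=200_000)   # random intercepts, indep. of c
c = rng.normal(scale=0.5, size=200_000)   # random slopes, var(c) = 0.25

a = 4.0                                   # e.g. 'age since birth' vs 'years since age 4'
b_star = b + a * c                        # intercepts under the shifted origin
cov_shifted = np.cov(b_star, c)[0, 1]     # ≈ a * var(c) = 1.0
```

So zero intercept–slope correlation can hold for at most one choice of origin, which is why σbc must be estimated.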
Random Effects
Nonlinear: for example, in repeated responses on the scale
very left / left / neutral / right / very right
subjects may have both a leaning (to left or right) and/or a
tendency to prefer the extreme categories (or not).
So if items i with attributes xi, zi are presented to subjects s,
a suitable random-effects formulation of the cumulative logit
model might be
logit(γisj) = cs(αj + as) − β1 xi − β2 zi.
Threshold parameters αj are shifted, and squeezed (cs < 1) or
spread (cs > 1), for each subject, to reflect their personal
interpretation of the response scale.
Here as and cs might be assumed independent, with
E(as) = 0 and E(cs) = 1.
Random or Fixed?
In a random effects model, we replace (say) school-specific
parameters with a description of the population of schools.
If there are, say, 15 schools in the study, and lots of
information on each school-specific parameter, a
fixed-effects model will be best for most purposes. Fifteen is
not a large enough sample to allow accurate estimation of a
population variance.
Random effects models are most effective when the number
of groups is large and there is relatively little information per
group.
Use of random effects models where the groups represent
unique cases of interest (e.g., 15 European nations) seems
fundamentally misguided.
Interdependent responses
Standard generalized linear models assume responses to be independent given the linear predictor.
Common situations where this assumption is inadequate:
- units are matched pairs
- units are in clusters, or clusters of clusters
- repeated measurements (longitudinal, panel data)
- responses close in time
- responses close in space
- multivariate responses

Interdependent responses
Some general considerations:
- failure of the independence assumption should not be neglected. It will most often result in spurious apparent precision (spurious significance of effects, standard errors misleadingly small, etc.)
- testing for independence is misguided if the design implies non-independence. Failure to find ‘significant’ error correlation is failure; it is not evidence of independence.
- detailed modelling of the dependence is often unnecessary.
- sometimes analysis of one or more derived summaries suffices
- sometimes ‘marginal models’ suffice
Derived summaries
e.g., repeated measurements yi1, . . . , yik (where k is often
small, between 2 and 10 say):
level summary: ȳi = k⁻¹(yi1 + . . . + yik)
trend summary: b̂i = regression coefficient of
{yij : j = 1, . . . , k} on time.
Modelling one or other of these may suffice to answer the
research question(s) of interest.
With ordered pair data (e.g., before/after measurements), the
‘trend’ summary would usually be yi2 − yi1 or yi2/yi1,
depending on the nature of y and the questions of interest.
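Both summaries are one-line computations. A sketch on a hypothetical n × k array of repeated measurements at equally spaced times:

```python
import numpy as np

# Hypothetical repeated measurements: n units, k timepoints 0..k-1,
# generated with common level 2.0 and common trend 0.5.
rng = np.random.default_rng(6)
n, k = 100, 5
times = np.arange(k)
Y = 2.0 + 0.5 * times + rng.normal(scale=0.3, size=(n, k))

level = Y.mean(axis=1)              # per-unit mean level

# Per-unit slope from regressing each row on (centred) time.
tc = times - times.mean()
trend = (Y @ tc) / (tc @ tc)
```

Each of `level` and `trend` is then a single response per unit, so standard (independent-observations) models can be applied to it.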
Marginal regression
e.g., clustered data, individuals j within clusters i.
An alternative to the random-intercepts model
g[E(Yij)] = (β0 + bi) + β1 xij + β2 zij,
where the Yij are assumed independent given bi, is a
marginal model
g[E(Yij)] = β0 + β1 xij + β2 zij,
in which the Yij are allowed to be correlated within groups.
When g is the identity link (i.e., no link transformation), these
are equivalent. More generally they are not: the coefficients
then have different interpretations.
Inference in marginal models is handled by the method of
‘generalized estimating equations’.
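The non-equivalence under a nonlinear link can be seen directly. With a log link and bi ~ N(0, s²), averaging over the random intercept gives E(Yij) = exp(β0 + s²/2 + β1 xij + ...), so the marginal intercept differs from the conditional one. A simulated check, with illustrative parameter values:

```python
import numpy as np

# Log link: averaging exp(b0 + b + b1*x) over b ~ N(0, 1) inflates the
# marginal mean by exp(1/2), since E[exp(b)] = exp(var(b)/2).
rng = np.random.default_rng(7)
b = rng.normal(scale=1.0, size=1_000_000)   # cluster random intercepts
b0, b1, x = 0.2, 0.5, 1.0

conditional_mean = np.exp(b0 + b1 * x)            # for a cluster with b = 0
marginal_mean = np.mean(np.exp(b0 + b + b1 * x))  # population-averaged mean
```

Here the slope on the log scale happens to survive averaging; with a logit link even the slope is attenuated in the marginal model, which is why conditional and marginal coefficients answer different questions.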
Specialized Models
All applications are specialized, some are more specialized
than others...
The models introduced so far should be considered as
generic templates for statistical thinking, rather than ‘off the
shelf’ solutions for particular research problems.
In some applications, even the templates are specialized.
Examples include:
- special log-linear models (quasi-independence, quasi-symmetry) for the analysis of square contingency tables (e.g., social mobility, rater agreement, pair comparisons)
- ‘log-multiplicative’ interaction structure to measure the strength of association between categorical variables.


David Firth, An Overview of Statistical Models and Statistical Thinking, ESRC Oxford Spring School, March 2006