Quantitative Method
Modelling Discrete Data: An Overview
Discreteness
Categories: e.g., single, married, divorced
Counts: e.g., number of children
Rounded measurement: e.g., earnings to the nearest $1000.
The last of these should only really be regarded as discrete if the rounding is very coarse. (It is ‘in principle’ continuous.)
Models
(a) Multivariate: describe economically the joint distribution of several variables
(b) Dependence: describe economically the conditional distribution of one or more variables Y given fixed values of other (explanatory, or predictor) variables X.
Many research questions are answered via (b).
Models of kind (a) are often exploratory, suggestive of research questions.
Counting
(A) ‘Pure’ counts: number of ‘events’ (e.g., children, criminal arrests, squash
courts) in a given amount of ‘exposure’ (years, police-officer-years, km2).
Interest is in the rate per unit of exposure, and how that rate depends on other
variables.
(B) Category counts: number of individuals (e.g., men, constituencies, sporting
events) falling into categories of a cross classification (father’s class by own class, party of MP, sport by period by nation).
Interest is in the interdependence of the cross-classifying variables.
Counting of type (B) converts qualitative data to quantitative, for statistical analysis.
Variation
• from time to time
• from place to place
• from sample to sample
etc.
Variation may be:
systematic: in which case it is either the object of interest, or needs to be taken into account to avoid biased conclusions
random: sometimes of substantive interest, more often a nuisance to be allowed for in reporting the precision of conclusions (e.g., sampling error)
Statistical models represent both kinds of variation.
Distributions
• ways of describing random variation
• for counts, the most important are
– Poisson (for ‘pure’ counts)
– binomial, multinomial (for category counts)
Others (e.g., negative binomial, beta-binomial) may be used where there is overdispersion relative to the Poisson or binomial.
Poisson distribution
Consider events occurring in time, or space, or whatever:
• singly
• independently
• at a constant rate (λ, say)
Number of events Y in time t has distribution
Y ~ Poisson(λt).
Mean and variance are E(Y) = var(Y) = λt.
Large counts vary more than small counts. But the coefficient of variation is sd(Y)/E(Y) = 1/√(λt), so large counts are more informative. (Obviously!)
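These properties are easy to check by simulation. A minimal sketch using numpy (the rate λ = 2.5 and exposure t = 4 are illustrative, not from any real data):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, t = 2.5, 4.0                       # illustrative rate and exposure

y = rng.poisson(lam * t, size=200_000)  # simulated event counts

print(y.mean(), y.var())                # both close to lam * t = 10
print(y.std() / y.mean())               # close to 1/sqrt(lam * t), about 0.316
```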
Binomial distribution
When m independent individuals are allocated at random to one of two categories, the number Y in category 1 has the binomial distribution:
Y ~ binomial(m; π)
where π is the (assumed constant) probability of being allocated to category 1. Often π is interpreted as the population proportion that would be allocated to category 1.
Mean and variance are E(Y) = mπ, var(Y) = mπ(1 − π).
Multinomial: if there are k categories,
(Y1, . . . , Yk) ~ multinomial(m; π1, . . . , πk)
with π+ = π1 + . . . + πk = 1.
The binomial is simply the special case with k = 2.
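A quick simulation check of these moments, and of the fact that a single cell of a multinomial is itself binomial (the values m = 50 and π = 0.3 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, pi = 50, 0.3                          # illustrative m and pi

y = rng.binomial(m, pi, size=200_000)
print(y.mean())                          # close to m * pi = 15
print(y.var())                           # close to m * pi * (1 - pi) = 10.5

# one cell of a multinomial, taken on its own, is binomial(m; pi_i)
ym = rng.multinomial(m, [0.3, 0.5, 0.2], size=200_000)
print(ym[:, 0].mean(), ym[:, 0].var())   # same moments as above
```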
Some relationships
1. Poisson variables conditional on their total: if
Yi ~ Poisson(μi) (i = 1, . . . , k) independently, then
(Y1, . . . , Yk) | (Y+ = m) ~ multinomial(m; π1, . . . , πk)
with πi = μi/μ+ (a subscript + denotes summation over that index).
So there is formal equivalence between some Poisson and multinomial models. This is sometimes exploited to fit multinomial models in software designed for (univariate-response) generalized linear models; more modern software provides more explicit facilities for multinomial models.
2. Subtotals: if (Y1, . . . , Yk) ~ multinomial(m; π1, . . . , πk), and Y* = Σ{i∈S} Yi for some subset S of the categories, then
Y* ~ binomial(m; π*)
where π* = Σ{i∈S} πi.
3. Conditional multinomial: e.g., for t < k,
(Y1, . . . , Yt) | (Y1 + . . . + Yt = m*) ~ multinomial(m*; π*1, . . . , π*t)
where π*i = πi/(π1 + . . . + πt).
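Relationship 1 can be illustrated by simulation: draw independent Poisson counts, keep only the draws whose total equals m, and compare the conditional cell means with those of the corresponding multinomial. (The means μ = (2, 5, 3) below are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([2.0, 5.0, 3.0])          # illustrative Poisson means; mu_+ = 10
m = 10                                  # condition on the total Y_+ = m

y = rng.poisson(mu, size=(500_000, 3))
cond = y[y.sum(axis=1) == m]            # keep draws with Y_+ = m

# conditionally the cells are multinomial(m; mu/mu_+), so the
# conditional cell means should be m * mu / mu_+ = (2, 5, 3)
print(cond.mean(axis=0))
```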
Poisson response
Suppose Yi is number of events in time (or other exposure quantity) ti. Aim to
relate the distribution of Yi to explanatory variables xi1, . . . , xip.
Distribution is Yi ~ Poisson(λiti) — determined entirely by the mean, λiti. The rate λi is usually the object of interest.
The most standard model is log-linear:
log λi = xi1β1 + . . . + xipβp.
Interpretation: exp(βr) is the factor by which the rate of occurrence λi is multiplied when xir increases by one unit (with the other explanatory variables held constant).
Log-linear models for Poisson counts thus embody the notion that effects are multiplicative on the rate of occurrence. This is very natural for many applications.
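The multiplicative interpretation follows directly from the formula; a small numeric sketch (the coefficients are made up for illustration):

```python
import math

# hypothetical coefficients for log(lambda_i) = b0 + b1*x1 + b2*x2
b0, b1, b2 = -0.5, 0.8, -0.3

def rate(x1, x2):
    return math.exp(b0 + b1 * x1 + b2 * x2)

# raising x1 by one unit multiplies the rate by exp(b1), whatever x2 is
print(rate(2.0, 1.0) / rate(1.0, 1.0))  # equals exp(0.8)
print(rate(2.0, 7.5) / rate(1.0, 7.5))  # same ratio again
```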
It will not always be appropriate to assume that effects are multiplicative. For
example, it may be that there is a ‘background’ rate which is additive:
λi = exp(α) + λ*i , say,
where λ*i perhaps satisfies the log-linear model above. That is, most of the effects are rate-multipliers, but the background effect is additive.
The appropriate specification to use in any particular application demands some thinking about the data-generating process. The choice may sometimes need to be informed by fitting competing specifications to data, followed by suitable diagnostics.
The log-linear model is an example of a generalized linear model (more later).
The mixed additive-multiplicative model is not.
Logits: pros and cons
1. In practice with binomial-response data there is very little difference between
using logit and probit link functions. Coefficients are on a different scale, but conclusions will be similar.
2. Logit is symmetric (π and 1 − π can be interchanged, and only the sign of the coefficients is affected). So is probit. The log-log links are not (and this provides some flexibility if needed: one of the log-log links may fit better than the other).
3. Can yield nonsensical predictions (fitted values), namely probabilities implausibly close to 0 or 1. (A criticism more usually levelled at linear probability models, but logit-linear models are not necessarily better).
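Point 2 can be verified numerically; a small sketch using the standard formulae for the logit and complementary log-log links:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def cloglog(p):
    return math.log(-math.log(1 - p))   # complementary log-log link

p = 0.3
# symmetry: swapping p and 1 - p just flips the sign of the logit
print(logit(p), -logit(1 - p))          # equal

# the log-log links lack this symmetry
print(cloglog(p), -cloglog(1 - p))      # not equal
```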
Generalized linear models
The notion of relating the expected value of a response (such as a count or binomial proportion) to explanatory variables, through a link function, is very general.
We have seen that it may not always be the right thing to do (there may be nonlinearity to account for). But still it provides a useful starting-point for thinking about dependence in situations where the standard linear model is problematic.
The generalized linear model is most simply thought of in terms of mean and variance:
E(Yi) = μi = g⁻¹(Σr xirβr),  var(Yi) = φV(μi),
where g is the link function, V is the variance function, and φ is a dispersion parameter.
Why is it useful to think of Poisson loglinear models and logit/probit/etc regressions as generalized linear models?
• It’s not essential. A good understanding of those models can be had without the full GLM framework.
• BUT generalized linear models all have various useful features in common:
– linear predictor
– efficient algorithm for computing maximum likelihood estimates (iterative weighted least squares)
– ‘analysis of deviance’ for model screening and model choice
– definitions of residuals and other diagnostics
These common features are especially well exploited in good software programs, which provide the same interface to all GLMs both in terms of model specification and model criticism. The earliest and most famous example was GLIM, introduced in the 1970s. A good modern example is glm() in R or S-Plus.
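To make the 'iterative weighted least squares' idea concrete, here is a minimal sketch of IWLS for a Poisson log-linear model fitted to simulated data (the design and coefficients are invented for illustration; real software such as glm() adds convergence checks, offsets, and so on):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
beta_true = np.array([0.5, 0.3])                       # invented coefficients
y = rng.poisson(np.exp(X @ beta_true))

# IWLS for the log link: working response z = eta + (y - mu)/mu,
# working weights W = mu; repeat weighted least squares until convergence
beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu
    beta = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (mu * z))

print(beta)  # maximum likelihood estimate, close to beta_true
```

At convergence the score equations X'(y − μ̂) = 0 are satisfied, which is what characterises the Poisson maximum likelihood estimate.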
Overdispersion
The standard distributions all have variances determined by the mean, e.g.,
Poisson has variance = mean.
Often in practice (most often, in fact) the residual variance after fitting a model is larger than it should be under the relevant (e.g., Poisson or binomial) model. A standard measure of such overdispersion is X²/(n − p) = φ̃, say, where n − p is the residual degrees of freedom for the model and
X² = Σ(yi − μ̂i)²/V(μ̂i)
is the 'Pearson chi-squared statistic', the sum of squared 'Pearson' residuals. An alternative to X² here is the model deviance D (sometimes labelled G²).
The cause of overdispersion is either positive correlation of responses, or missing explanatory variables (and these alternative causes are hard or impossible to distinguish).
The effect of overdispersion is to make the 'usual' reports of precision—i.e., standard errors, etc.—too small. An approximate remedy is to multiply all standard errors by √φ̃.
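A sketch of the calculation for a Poisson model, where V(μ) = μ (all the numbers below, counts, fitted means and naive standard errors, are invented for illustration):

```python
import numpy as np

# hypothetical fitted Poisson model: n = 10 observations, p = 2 coefficients
y      = np.array([6, 0, 9, 0, 12, 0, 14, 1, 9, 0])                    # observed counts
mu_hat = np.array([3.1, 1.2, 5.0, 2.4, 6.8, 1.5, 8.0, 2.9, 4.6, 1.1])  # fitted means
se     = np.array([0.10, 0.05])                                        # naive std. errors
n, p = len(y), 2

# Pearson X^2 = sum of squared Pearson residuals; V(mu) = mu for Poisson
X2 = np.sum((y - mu_hat) ** 2 / mu_hat)
phi_tilde = X2 / (n - p)                # overdispersion estimate (> 1 here)

se_adjusted = se * np.sqrt(phi_tilde)   # inflate the naive standard errors
print(phi_tilde, se_adjusted)
```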
If φ̃ is appreciably less than 1 it is an indication of underdispersion, a less common phenomenon in social-scientific work, usually caused by inhibition of events by other events, or by regularity. (E.g., the number of buses passing my house each hour is, thankfully, severely underdispersed relative to the Poisson distribution.)
The approximate ‘fix factor’ √φ̃ is justified theoretically by the notion of quasi-likelihood, based on the simple assumption that φ in the GLM formulation takes a value different from 1.
A more elaborate approach is to use a ‘robust’ or ‘sandwich’ estimator of the standard errors of estimated coefficients, and this is implemented in many common software systems. However, such estimators are rather unstable (non-robust!) except with very large datasets.
Source: ESRC Oxford Spring School in Quantitative Method for Social Research
http://springschool.politics.ox.ac.uk/springschool/archive.asp