Thursday, March 08, 2007

Research Methods

Introduction to Bayesian inference and computation for social science data analysis


• Bayesian methods have been widely applied in many areas
– medicine / epidemiology / genetics
– ecology / environmental sciences
– finance
– archaeology
– political and social sciences, ………
• Motivations for adopting Bayesian approach vary
– natural and coherent way of thinking about science and learning
– pragmatic choice that is suitable for the problem in hand

• Medical context: FDA draft guidance www.fda.gov/cdrh/meetings/072706-bayesian.html:
“Bayesian statistics…provides a coherent method for learning from evidence as it accumulates”
• Evidence can accumulate in various ways:
– Sequentially
– Measurement of many ‘similar’ units (individuals, centres, sub-groups, areas, periods…..)
– Measurement of different aspects of a problem
• Evidence can take different forms:
– Data
– Expert judgement
• Bayesian approach also provides formal framework for propagating uncertainty
– Well suited to building complex models by linking together multiple sub-models
– Can obtain estimates and uncertainty intervals for any parameter, function of parameters or predictive quantity of interest
• Bayesian inference doesn’t rely on asymptotics or analytic approximations
– Arbitrarily wide range of models can be handled using same inferential framework
– Focus on specifying realistic models, not on choosing analytically tractable approximation


Bayesian Inference

• Distinguish between
x : known quantities (data)
θ : unknown quantities (e.g. regression coefficients, future outcomes, missing observations)
• Fundamental idea: use probability distributions to represent uncertainty about unknowns
• Likelihood – model for the data: p(x | θ)
• Prior distribution – representing current uncertainty about unknowns: p(θ)
• Applying Bayes theorem gives the posterior distribution:
p(θ | x) = p(x | θ) p(θ) / p(x) ∝ p(x | θ) p(θ)

Conjugate Bayesian inference

• Example: election poll (from Franklin, 2004*)
• Imagine an election campaign where (for simplicity) we have just a Government/Opposition vote choice.
• We enter the campaign with a prior distribution for the proportion supporting Government. This is p(θ)
• As the campaign begins, we get polling data. How should we change our estimate of Government’s support?
Data and likelihood
• Each poll consists of n voters, x of whom say they will vote for Government and n - x will vote for the opposition.
• If we assume we have no information to distinguish voters in their probability of supporting Government, then we have a binomial distribution for x:

p(x | θ) = (n choose x) θ^x (1 − θ)^(n−x)

This binomial distribution is the likelihood p(x | θ)

Prior
• We need to specify a prior that
– expresses our uncertainty about the election (before it begins)
– conforms to the nature of the θ parameter, i.e. is continuous but bounded between 0 and 1
• A convenient choice is the Beta(a, b) distribution, with density p(θ) ∝ θ^(a−1) (1 − θ)^(b−1)
Posterior
• Combining a Beta(a, b) prior with the binomial likelihood gives a posterior distribution that is again Beta: θ | x ~ Beta(a + x, b + n − x)
• When prior and posterior come from same family, the prior is said to be conjugate to the likelihood
• Occurs when prior and likelihood have the same ‘kernel’
• Suppose I believe that Government only has the support of half the population, and I think that estimate has a standard deviation of about 0.05
• This is approximately a Beta(50, 50) distribution, which has mean 0.5 and standard deviation ≈ 0.05
• We observe a poll with 200 respondents, 120 of whom (60%) say they will vote for Government
• This produces a posterior which is a
Beta(120+50, 80+50) = Beta(170, 130) distribution
A harder problem
• What is the probability that Government wins?
– It is not .57 or .60. Those are expected vote shares, not the probability of winning. How do we answer this?
• Frequentists have a hard time with this one. They can obtain a p-value for testing H0: θ = 0.5 against θ > 0.5, but this isn’t the same as the probability that Government wins
– (it’s actually the probability of observing data at least as extreme as 120 out of 200 if H0 is true)
• Easy from the Bayesian perspective – calculate Pr(θ > 0.5 | x, n), the posterior probability that θ > 0.5
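Under the Beta(170, 130) posterior above, this tail probability has a closed form that any statistics library can evaluate. A minimal sketch in Python using scipy (the slides themselves used WinBUGS; the values in the comments are approximate):

```python
from scipy.stats import beta

# Posterior from the poll example: Beta(50 + 120, 50 + 80) = Beta(170, 130)
posterior = beta(170, 130)

print(posterior.mean())   # expected vote share, ≈ 0.567
print(posterior.std())    # posterior standard deviation, ≈ 0.029
print(posterior.sf(0.5))  # Pr(θ > 0.5 | x, n), ≈ 0.99
```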

Bayesian computation
• All Bayesian inference is based on the posterior distribution
• Summarising posterior distributions involves integration
• Except for conjugate models, integrals are usually analytically intractable
• Use Monte Carlo (simulation) integration to approximate them

• Can also use samples to estimate posterior tail area probabilities, percentiles, variances etc.
• Difficult to generate independent samples when posterior is complex and high dimensional
• Instead, generate dependent samples from a Markov chain having p(θ | x) as its stationary distribution → Markov chain Monte Carlo (MCMC)
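As a toy illustration (my sketch, not part of the original slides), a random-walk Metropolis sampler targeting the Beta(170, 130) posterior from the poll example; in practice MCMC earns its keep on models with no closed-form posterior:

```python
import numpy as np

def log_post(theta, x=120, n=200, a=50, b=50):
    """Log posterior kernel: Beta(a + x, b + n - x), up to a constant."""
    if not 0 < theta < 1:
        return -np.inf
    return (a + x - 1) * np.log(theta) + (b + n - x - 1) * np.log(1 - theta)

rng = np.random.default_rng(1)
theta, samples = 0.5, []
for _ in range(20000):
    prop = theta + rng.normal(0, 0.05)  # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop                    # accept; otherwise keep current value
    samples.append(theta)

draws = np.array(samples[2000:])           # discard burn-in
print(draws.mean(), (draws > 0.5).mean())  # ≈ 0.567 and ≈ 0.99
```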

Borrowing strength

• Bayesian learning → borrowing “strength” (precision) from other sources of information
• Informative prior is one such source
– “today’s posterior is tomorrow’s prior”
– relevance of prior information to current study must be justified

Informative Prior

Example 1: Western and Jackman (1994)*
• Example of regression analysis in comparative research
• What explains cross-national variation in union density?
– Union density is defined as the percentage of the work force belonging to a labour union
• Two issues
– Philosophical: data represent all available observations from a population → conventional (frequentist) analysis based on long-run behaviour of repeatable data mechanism not appropriate
– Practical: small, collinear dataset yields imprecise estimates of regression effects

• Competing theories
– Wallerstein: union density depends on the size of the civilian labour force (LabF)
– Stephens: union density depends on industrial concentration (IndC)
– Note: These two predictors correlate at -0.92.
• Control variable: presence of a left-wing government (LeftG)
• Sample: n = 20 countries with a continuous history of democracy since World War II
• Fit linear regression model to compare theories:
union density_i ~ N(μ_i, σ²), μ_i = b0 + b1 LeftG_i + b2 log(LabF_i) + b3 IndC_i

• Results with non-informative priors on regression coefficients (numerically equivalent to OLS analysis)

Motivation for Bayesian approach with informative priors
• Because of small sample size and multicollinear variables, not able to adjudicate between theories
• Data tend to favour Wallerstein (union density depends on labour force size), but neither coefficient estimated very precisely
• Other historical data are available that could provide further relevant information
• Incorporation of prior information provides additional structure to the data, which helps to uniquely identify the two coefficients

Prior distributions for regression coefficients
Wallerstein
• Believes in negative labour force effect
• Comparison of Sweden and Norway in 1950:
→ doubling of labour force corresponds to 3.5-4% drop in union density
→ on log scale, labour force effect size ≈ -3.5/log(2) ≈ -5
• Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0: b2 ~ N(−5, 2.5²)

Prior distributions for regression coefficients
Stephens
• Believes in positive industrial concentration effect
• Decline in industrial concentration in UK in 1980s:
→ drop of 0.3 in industrial concentration corresponds to about 3% drop in union density
→ industrial concentration effect size ≈ 3/0.3 = 10
• Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0: b3 ~ N(10, 5²)

Prior distributions for regression coefficients
Wallerstein and Stephens
• Both believe left-wing gov’ts assist union growth
• Assuming 1 year of left-wing gov’t increases union density by about 1% translates to effect size of 0.3
• Confidence in direction of effect represented by prior SD giving 95% interval that excludes 0: b1 ~ N(0.3, 0.15²)
• Vague prior b0 ~ N(0, 100²) assumed for intercept
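The original analyses were fitted in WinBUGS; a rough sketch of the same model in Python with the PyMC library, using synthetic stand-in data (the real 20-country dataset is in Western and Jackman, 1994, and the slides run Wallerstein’s and Stephens’ priors in separate analyses rather than combined as here):

```python
import numpy as np
import pymc as pm

# Synthetic stand-ins for the predictors (illustration only, not the real data)
rng = np.random.default_rng(0)
n = 20
left_g = rng.uniform(0, 30, n)    # years of left-wing government
log_labf = rng.normal(8, 1, n)    # log civilian labour force size
ind_c = rng.uniform(0.2, 0.8, n)  # industrial concentration
y = 90 + 0.3 * left_g - 5 * log_labf + 10 * ind_c + rng.normal(0, 5, n)

with pm.Model():
    b0 = pm.Normal("b0", mu=0, sigma=100)     # vague intercept prior
    b1 = pm.Normal("b1", mu=0.3, sigma=0.15)  # left-wing government effect
    b2 = pm.Normal("b2", mu=-5, sigma=2.5)    # Wallerstein's labour-force prior
    b3 = pm.Normal("b3", mu=10, sigma=5)      # Stephens' concentration prior
    sigma = pm.HalfNormal("sigma", sigma=10)  # residual SD
    mu = b0 + b1 * left_g + b2 * log_labf + b3 * ind_c
    pm.Normal("union_density", mu=mu, sigma=sigma, observed=y)
    idata = pm.sample(2000, tune=1000)        # posterior draws for all coefficients
```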

• Effects of LabF and IndC estimated more precisely
• Both sets of prior beliefs support inference that labour-force size decreases union density
• Only Stephens’ prior supports conclusion that industrial concentration increases union density
• Choice of prior is subjective – if no consensus, can we be satisfied that data have been interpreted “fairly”?
• Sensitivity analysis
– Sensitivity to priors (e.g. repeat analysis using priors with increasing variance)
– Sensitivity to data (e.g. residuals, influence diagnostics)

Hierarchical Priors

• Hierarchical priors are another widely used approach for borrowing strength
• Useful when data available on many “similar” units (individuals, areas, studies, subgroups,…)
• Data x_i and parameters θ_i for each unit i = 1,…,N
• Three different assumptions:
– Independent parameters: units are unrelated, and each θ_i is estimated separately using data x_i alone
– Identical parameters: observations treated as coming from same unit, with common parameter θ
– Exchangeable parameters: units are “similar” (labels convey no information) → mathematically equivalent to assuming θ_i’s are drawn from common probability distribution with unknown parameters
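A minimal PyMC sketch of the exchangeable case (my illustration, with hypothetical data): each θ_i is drawn from a common Normal population distribution whose mean and spread are themselves estimated, so poorly measured units borrow strength from the rest:

```python
import numpy as np
import pymc as pm

# Hypothetical observed values for N = 8 "similar" units (illustration only)
x = np.array([2.1, 1.4, 3.0, 2.6, 0.9, 2.2, 1.8, 2.5])

with pm.Model():
    mu = pm.Normal("mu", mu=0, sigma=10)  # population mean
    tau = pm.HalfNormal("tau", sigma=5)   # between-unit SD
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=len(x))  # exchangeable unit parameters
    pm.Normal("x", mu=theta, sigma=1.0, observed=x)  # within-unit SD fixed at 1 for simplicity
    idata = pm.sample(2000, tune=1000)
```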

Accounting for data quality

• Bayesian approach also provides formal framework for propagating uncertainty about different quantities in a model
• Natural tool for explicitly modelling different aspects of data quality
– Measurement error
– Missing data
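As a small illustration of the measurement-error case (mine, not from the slides): the true covariate values are treated as unknown quantities with their own prior, the noisy measurements enter through a measurement model, and the regression uses the true values:

```python
import numpy as np
import pymc as pm

# Hypothetical data: covariate z observed with known measurement error SD 0.5
z_obs = np.array([1.2, 0.7, 2.3, 1.9, 1.1])
y = np.array([2.5, 1.4, 4.8, 3.9, 2.2])

with pm.Model():
    z_true = pm.Normal("z_true", mu=0, sigma=10, shape=len(y))  # unknown true covariate
    pm.Normal("z_obs", mu=z_true, sigma=0.5, observed=z_obs)    # measurement model
    a = pm.Normal("a", mu=0, sigma=10)
    b = pm.Normal("b", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=5)
    pm.Normal("y", mu=a + b * z_true, sigma=sigma, observed=y)  # outcome model uses true values
    idata = pm.sample(2000, tune=1000)
```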

Model uncertainty

• Model uncertainty can be large for observational data studies
• In regression models:
– What is the ‘best’ set of predictors for response of interest?
– Which confounders to control for?
– Which interactions to include?
– What functional form to use (linear, non-linear,….)?
• Example 5: Predictors of crime rates in US States (adapted from Raftery et al, 1997)
• Ehrlich (1973) developed and tested the theory that the decision to commit crime is a rational choice based on costs and benefits
• Costs of crime related to probability of imprisonment and average length of time served in prison
• Benefits of crime related to income inequalities and aggregate wealth of community
• Net benefits of other (legitimate) activities related to employment rate and education levels in community
• Ehrlich analysed data from 47 US states in 1960, focusing on relationship between crime rate and the 2 prison variables
• Up to 13 candidate control variables also considered
• y = log crime rate in 1960 in each of 47 US states
• Z1, Z2 = log prob. of prison, log av. time in prison
• X1,…, X13 = candidate control variables
• Fit Normal linear regression model
• Results sensitive to choice of control variables
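Raftery et al (1997) address this sensitivity with Bayesian model averaging. A toy sketch of the idea (my illustration, using the common BIC approximation to posterior model probabilities on synthetic data, not their exact method): enumerate candidate control sets, weight each model, and rank or average rather than committing to one specification:

```python
import itertools
import numpy as np

def bic(y, X):
    """BIC of an OLS fit with intercept."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = np.sum((y - X1 @ beta) ** 2)
    return n * np.log(rss / n) + X1.shape[1] * np.log(n)

# Synthetic stand-ins: y = log crime rate, Z = the 2 prison variables (always kept),
# X = 4 candidate controls (the real analysis has 13)
rng = np.random.default_rng(0)
n, k = 47, 4
Z = rng.normal(size=(n, 2))
X = rng.normal(size=(n, k))
y = Z @ np.array([-0.3, -0.2]) + 0.5 * X[:, 0] + rng.normal(0, 1, n)

# Enumerate all subsets of controls; exp(-BIC/2) approximates each model's posterior weight
models = [s for r in range(k + 1) for s in itertools.combinations(range(k), r)]
bics = np.array([bic(y, np.column_stack([Z, X[:, list(s)]])) for s in models])
w = np.exp(-0.5 * (bics - bics.min()))
w /= w.sum()
for s, wi in sorted(zip(models, w), key=lambda t: -t[1])[:5]:
    print(s, round(wi, 3))  # top 5 control sets by approximate posterior probability
```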

Discussion

• Bayesian approach provides coherent framework for combining many sources of evidence in a statistical model
• Formal approach to “borrowing strength”
– Improved precision/effective sample size
– Fully accounts for uncertainty
• Relevance of different pieces of evidence is a judgement – must be justifiable
• Bayesian approach forces us to be explicit about model assumptions
• Sensitivity analysis to assumptions is crucial
• Bayesian calculations are computationally intensive, but:
– Provides exact inference; no asymptotics
– MCMC offers huge flexibility to model complex problems
• All examples discussed here were fitted using free WinBUGS software: www.mrc-bsu.cam.ac.uk
• Want to learn more about using Bayesian methods for social science data analysis?
– Short course: Introduction to Bayesian inference and WinBUGS, Sept 19-20, Imperial College
See www.bias-project.org.uk for details

Source: http://remiss.politics.ox.ac.uk