Thursday, December 08, 2005

Research method: Data Collection

Observation has been the standard method of data collection for millennia. Famous theories, such as Newton’s Theory of Gravitation and Einstein’s Theory of Relativity, have their roots in numerical data collected by careful observation. On a more mundane level, decisions concerning local traffic flow - such as, would it help to replace the crossroads by a roundabout? - are based on observations of flow made by video cameras or teams of observers.

The collection of data of a scientific nature (e.g. physical, chemical or biological data) relies almost exclusively on observation. However, in the last two centuries there has been increasing interest in the social sciences, for which other methods of data collection are relevant.

Methods for Sampling a Population

Sampling frame: the list of population members

The simple random sample
Most sampling methods endeavour to give every member of the population the same probability of being included in the sample. If each member of the sample is selected by the equivalent of drawing lots, then the sample selected is described as being a simple random sample.
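
Drawing lots can be imitated on a computer. The following is a minimal sketch, assuming a made-up sampling frame of 500 numbered members; the function name and the seed value are illustrative only.

```python
import random

# Hypothetical sampling frame: ID numbers for a population of 500 members.
sampling_frame = list(range(1, 501))

def simple_random_sample(frame, size, seed=None):
    """Draw `size` distinct members; every member has the same chance of inclusion."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(frame, size)  # sampling without replacement

sample = simple_random_sample(sampling_frame, 20, seed=1)
print(sorted(sample))
```

Fixing the seed is only a convenience for checking the result; leaving it as None gives a fresh draw each time.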

Cluster sampling:
1- Choosing a region at random
2- Choosing individuals at random from that region

Stratified sampling
Most populations contain identifiable strata, which are distinctive, non-overlapping subsets of the population. For example, for human populations, useful strata might be ‘males’ and ‘females’, or ‘receiving education’, ‘working’ and ‘retired’, or combinations such as ‘retired female’. With stratified sampling we ensure that the proportions of the population falling into these different categories are reproduced by the sample.
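
As a rough illustration of reproducing the population proportions, the sketch below samples each stratum in proportion to its size. The population of 100 people, the stratum sizes and the function name are all assumptions made up for the example.

```python
import random

def stratified_sample(strata, total_size, seed=None):
    """Sample each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation. With awkward proportions the rounded
        # stratum sizes may not add up exactly to total_size.
        stratum_size = round(total_size * len(members) / population_size)
        sample[name] = rng.sample(members, stratum_size)
    return sample

# Hypothetical population of 100 people, identified by number.
strata = {
    "receiving education": list(range(0, 30)),
    "working": list(range(30, 80)),
    "retired": list(range(80, 100)),
}
chosen = stratified_sample(strata, total_size=10, seed=1)
for name, members in chosen.items():
    print(name, members)
```

Within each stratum the members are picked by simple random sampling, which is one of the two options for the final selection.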

Systematic sampling
Both cluster sampling and stratified sampling subdivide the population into components. In both cases the final stage consists of selecting a random sample from a portion of the population. One possible method of doing the final selection is by simple random sampling. An alternative is to use systematic sampling:
1- choosing one individual at random
2- choosing every kth individual thereafter, returning to the beginning of the list when the end is reached. The value of k is not crucial, but should be chosen beforehand.
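
The two steps can be sketched in a few lines. This is only an illustration: the list of 100 members, the choice k = 7 and the function name are assumptions, not part of the original notes.

```python
import random

def systematic_sample(frame, size, k, seed=None):
    """Pick a random start, then take every k-th member, wrapping around."""
    rng = random.Random(seed)
    start = rng.randrange(len(frame))  # step 1: one individual at random
    # Step 2: every k-th individual thereafter; the modulo returns to the
    # beginning of the list when the end is reached.
    # Caution: if k and len(frame) share a factor, a large sample can wrap
    # onto members that were already chosen.
    return [frame[(start + i * k) % len(frame)] for i in range(size)]

frame = list(range(100))  # hypothetical list of 100 members
picked = systematic_sample(frame, size=10, k=7, seed=3)
print(picked)
```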

Quota sampling
This is the method often used for street interviews. The interviewer is given a series of targets. For example, he or she might be instructed to interview equal numbers of men and women, of whom one quarter should be aged over 60 and one third should be in low-paid jobs. The instructions would be more detailed than these, with the idea being that each interviewer will select a representative cross-section of the population. It is easy to see that an interviewer might have some difficulty in completing his or her quota - as night falls the search for an elderly red-bearded giant might still be on! The results of quota sampling must always be viewed with a little suspicion, since the interviewees are not chosen at random.

Pseudo-random numbers
Numbers, created using a mathematical formula, that appear indistinguishable from genuinely random numbers.
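
One simple formula of this kind is a linear congruential generator. The sketch below is an illustration only; the constants are a commonly quoted set (usually attributed to Numerical Recipes) and are an assumption here, not something taken from the notes.

```python
def lcg(seed, count, a=1664525, c=1013904223, m=2**32):
    """Generate `count` pseudo-random numbers in [0, 1) from a fixed formula."""
    x = seed
    values = []
    for _ in range(count):
        x = (a * x + c) % m   # the whole "randomness" is this one formula
        values.append(x / m)  # scale to the interval [0, 1)
    return values

print(lcg(seed=42, count=5))
print(lcg(seed=42, count=5))  # same seed, same sequence
```

The second call prints the same five numbers as the first, which shows that the sequence is completely determined by the seed even though it looks random.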


Self selection
However bad quota sampling may be, it is wonderful by comparison with self-selection! The latter is exemplified by radio or TV phone-ins where listeners or viewers record their vote. The views of the apathetic majority are seriously under-represented, though maybe they don’t have any to represent!

Method of data collection by questionnaire - (or survey)
The most common method for collecting social science data is by means of a questionnaire, which consists of a series of questions concerning the facts of someone’s life or their opinions on some subject. The recipient of a questionnaire is usually referred to as the respondent. A questionnaire may be administered:
1- by face to face interview
2- by post or email
3- by phone

Questionnaire design

To ask someone a series of questions might seem to be a ridiculously simple task, but this is certainly not the case. It is easy to create unanswerable questions by accident, while small changes to the wording can make a difference to the answer obtained. Even the order of questions needs careful thought.
The pilot study uses the entire questionnaire with a small number of people who need not be chosen in any scientific way. The aim is simply to find and overcome any difficulties before using the real questionnaire.



Probability

Probable impossibilities are to be preferred to improbable possibilities - Aristotle

Relative frequency
Suppose we roll a die and are interested in the outcome 6. As we increase the number of rolls the number of 6s will increase, but the proportion of 6s - the relative frequency - will stabilise.
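
The stabilising effect can be seen in a small simulation. This is a sketch only, using Python's pseudo-random generator in place of a real die; the function name and the seed are illustrative.

```python
import random

def relative_frequency_of_six(num_rolls, seed=0):
    """Roll a simulated fair die num_rolls times and return the proportion of 6s."""
    rng = random.Random(seed)
    sixes = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) == 6)
    return sixes / num_rolls

# The count of 6s keeps growing, but the proportion settles down near 1/6.
for n in (10, 100, 1000, 10000, 100000):
    print(n, relative_frequency_of_six(n))
```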

Preliminary definitions
A statistical experiment is one in which there are a number of possible outcomes and we have no way of predicting which outcome will actually occur. Sometimes the experiment may have already taken place, but we remain ignorant of the outcome.
The sample space, S, is the set of all possible outcomes of the experiment.
An event is any set of possible outcomes of the experiment, that is, an event is a subset of S.


The probability scale
Assigned to the event E is a number, known as the probability of the event E, which takes a value in the range 0 to 1 (inclusive). The number is denoted by P(E).

Probability with equally likely outcomes
Suppose that the sample space, S, consists of n(S) possible outcomes, and suppose that each is equally likely. Suppose that the number of outcomes in the event E is n(E). Then P(E), the probability that the event E occurs, is given by the equation:

P(E) = n(E) / n(S)

This clearly satisfies the requirement that 0 ≤ P(E) ≤ 1.

The complementary event, E’

An event E either occurs or it does not! We cannot have events half occurring. Each of the possible equally likely outcomes therefore corresponds to the event occurring or to the event not occurring. If n(E) is the number of outcomes for which E occurs and n(S) is the size of the sample space, then n(S) - n(E) is the number of outcomes corresponding to the event ‘E does not occur’, which is called the complementary event, and is denoted by E’. Dividing by n(S) gives P(E’) = 1 - P(E).
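
A small worked example may help here. The sketch below counts outcomes for one roll of a fair die; the event chosen (a number greater than 4) is an assumption made up for illustration. It computes P(E) as n(E) / n(S) and checks that the probability of the complement comes out as 1 - P(E).

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}          # sample space for one roll of a fair die
E = {x for x in S if x > 4}     # illustrative event: the number is greater than 4

def probability(event, sample_space):
    """P(E) = n(E) / n(S) for equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

p_E = probability(E, S)
E_complement = S - E            # the outcomes for which E does not occur
p_not_E = probability(E_complement, S)

print(p_E, p_not_E)
```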


Unions and intersections of events

Suppose A and B are two events associated with a particular statistical experiment. We now consider the events:
A U B: A or B, that is, at least one of A and B occurs. This is called the union of A and B.
A n B: both A and B occur. This is called the intersection of A and B.

The number of outcomes in A is n(A) and the number of outcomes in B is n(B). Also, a total of n(A n B) outcomes is in both A and B. The outcomes in A U B include all those in A and all those in B but no others. However, if we simply add together n(A) and n(B) we will overstate the number in A U B because we will have counted those in A n B twice.

Hence:
n(A U B) = n(A) + n(B) - n(A n B)

Dividing throughout by n(S) we get:
P(A U B) = P(A) + P(B) - P(A n B)
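
The counting argument can be checked directly with sets. In this sketch the two events for a single roll of a die are chosen purely for illustration.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # illustrative event: the number is even
B = {4, 5, 6}   # illustrative event: the number is greater than 3

def P(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

lhs = P(A | B)                    # | is set union, A U B
rhs = P(A) + P(B) - P(A & B)      # & is set intersection, A n B
print(lhs, rhs)
```

Adding P(A) and P(B) alone would give 1, because the outcomes 4 and 6 would be counted twice.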

Mutually exclusive events

Events A, B, ... are said to be mutually exclusive if the occurrence of one of them implies that none of the others can occur. If D and E are two mutually exclusive events then P(D n E) = 0.

Note: all simple events are mutually exclusive.

The addition rule: if the events A and B are mutually exclusive, then the formula for P(A U B) (Equation (4.3) in the source text) simplifies, since P(A n B) = 0, to give:

P(A U B) = P(A) + P(B)

which is known as the addition rule.
Note: the addition rule only applies to mutually exclusive events.

Exhaustive events

Two events are said to be exhaustive if it is certain that at least one of them occurs. For example, when rolling a die it is certain that at least one of the events A: ‘the number obtained is either 1, 2, 3 or 5’ and B: ‘the number obtained is even’ will occur. In this example, if a 2 is obtained then both A and B occur. If the events A and B are exhaustive then:
P(A U B) = 1

Notes:
- any event A and its complement, A’, are both exhaustive and mutually exclusive:
P(A U A’) = 1, P(A n A’) = 0

- the events A, B, ..., N are said to be exhaustive if it is certain that at least one of them occurs:
P(A or B or ... or N) = P(A U B U ... U N) = 1

Thus the simple events that make up the sample space, S, are mutually exclusive and exhaustive.
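
These definitions can be checked mechanically for the die example. The helper names below are hypothetical and the sketch assumes each event is written as a set of outcomes.

```python
S = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3, 5}   # the number obtained is either 1, 2, 3 or 5
B = {2, 4, 6}      # the number obtained is even

def are_exhaustive(events, sample_space):
    """True if at least one of the events must occur."""
    return set().union(*events) == sample_space

def are_mutually_exclusive(events):
    """True if no two of the events can occur together."""
    events = list(events)
    return all(
        events[i].isdisjoint(events[j])
        for i in range(len(events))
        for j in range(i + 1, len(events))
    )

print(are_exhaustive([A, B], S), are_mutually_exclusive([A, B]))

# An event and its complement, and the simple events, satisfy both properties.
simple_events = [{x} for x in S]
print(are_exhaustive([A, S - A], S), are_mutually_exclusive([A, S - A]))
print(are_exhaustive(simple_events, S), are_mutually_exclusive(simple_events))
```

A and B are exhaustive but not mutually exclusive, since the outcome 2 belongs to both.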

Probability trees

Probability trees are diagrams that help us to see what is happening! Consider the following problem. A fair coin is tossed three times. Determine P(exactly two heads are obtained).

Each time we toss the coin the number of distinguishable outcomes increases:
After first toss: either H or T
After second toss: the sequence of outcomes must be HH, HT, TH or TT
After third toss: either HHH, HHT, HTH, HTT, THH, THT, TTH or TTT

This can be represented in a tree diagram in which the final column lists the entire sample space.
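
The final column of the tree can also be generated by enumeration. The sketch below lists the eight equally likely sequences for three tosses of a fair coin and counts those with exactly two heads; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Every branch of the tree: all sequences of H and T of length 3.
outcomes = ["".join(seq) for seq in product("HT", repeat=3)]
two_heads = [o for o in outcomes if o.count("H") == 2]

p_two_heads = Fraction(len(two_heads), len(outcomes))
print(outcomes)
print(two_heads, p_two_heads)
```

Three of the eight sequences (HHT, HTH and THH) contain exactly two heads, so the required probability is 3/8.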


Sample proportions and probability

So far the probability to be associated with an event has been expressed in terms of the numbers of simple events in a sample space in which all the possible outcomes are equally likely. An alternative view of probability is a consequence of the general idea that a sample of observations gives information about the population from which it is derived. The bigger the sample, the more reliable is the information.

We have to adapt this approach when the outcomes in the sample space are no longer equally likely. For example, if we are interested in the probability that a bent penny comes down heads, then an obvious approach is to toss the penny a number of times (our sample) and see what proportion of the time a head is obtained:

Determine the sample proportion, then use it to estimate the population probability.

As the sample size increases, so the observed sample proportion of occasions on which the event E occurs will vary. However, the variations will generally decrease in magnitude, and we expect that the observed sample proportion will approach a value that we will take to be the probability of E and will denote by P(E).

Consider the following two situations:
Experiment: a fair die is rolled. Event A: a 6 is obtained.
Experiment: a car is chosen at random. Event B: the car is white.

For event A it seems reasonable that if we were to roll a fair die a huge number of times then the event A would occur on about one-sixth of occasions: P(A) = 1/6.
There is no need to do any real sampling - we need only think about it!
For event B, however, there is no alternative to real sampling. To have any idea of the value of P(B), we need to examine a large sample of cars to find out what (roughly) is the proportion of cars that are white.

Unequally likely possibilities

The results so far have been obtained while considering equally likely simple events. However, this restriction is artificial, and Equations (4.2) to (4.5) of the source text (Upton and Cook) hold equally well for unequally likely events.


Physical independence

By physical independence we mean that the outcome of one component (in the case of coin tossing, the first toss) can have no possible influence on the outcome of any other component.

The multiplication rule

If A and B are two events relating to physically independent situations then:

P(A n B) = P(A) x P(B)

More generally, if A, B, ..., N all relate to physically independent situations (for example, N separate tosses of a coin) then:

P(A n B n ... n N) = P(A) x P(B) x ... x P(N)

This very useful result is known as the multiplication rule.
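
The rule can be checked by brute force in a simple case. The sketch below assumes two separate rolls of a fair die, with event A concerning only the first roll and event B only the second; the events are made up for illustration.

```python
from fractions import Fraction
from itertools import product

die = range(1, 7)
pairs = list(product(die, repeat=2))   # 36 equally likely (first, second) results

def P(condition):
    """Proportion of the 36 outcomes for which the condition holds."""
    favourable = [pair for pair in pairs if condition(pair)]
    return Fraction(len(favourable), len(pairs))

p_A = P(lambda pair: pair[0] == 6)        # A: the first roll is a 6
p_B = P(lambda pair: pair[1] % 2 == 0)    # B: the second roll is even
p_A_and_B = P(lambda pair: pair[0] == 6 and pair[1] % 2 == 0)

print(p_A, p_B, p_A_and_B)
```

Counting gives P(A n B) = 3/36 = 1/12, which agrees with P(A) x P(B) = 1/6 x 1/2.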

Orderings

Consider the following problem
Four markers are arranged in line. The markers are labelled A, B, C and D. Assuming that all possible arrangements are equally likely, determine the probability that the markers are in the order ABCD.

A systematic list of the possible arrangements shows that there are 24 possible orderings of the markers. Since each ordering is equally likely, the required probability is 1/24.
The problem with this sort of approach is that frequently the number of elementary events in the sample space is so large that we may miss a few out! What is needed is a formula that allows us to count the possibilities without actually making a list.

In general, if there were n objects, the number of possible orderings would be:

n x (n - 1) x (n - 2) x ... x 3 x 2 x 1

This is tedious to write out, so we use the notation:
n! = n x (n - 1) x (n - 2) x ... x 3 x 2 x 1

The quantity n! is read as n factorial.

Notes: (n + 1)! = (n + 1) x n!
For convenience, 0! is defined to be equal to 1.
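
For the four markers, the list and the formula can be compared directly. This is a small sketch using Python's standard library; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

# Systematic list of every ordering of the four markers.
orderings = ["".join(p) for p in permutations("ABCD")]
p_abcd = Fraction(orderings.count("ABCD"), len(orderings))

print(len(orderings), factorial(4), p_abcd)

# The two notes about factorials.
print(factorial(5) == 5 * factorial(4), factorial(0))
```

The list has 4 x 3 x 2 x 1 = 24 entries, matching 4!, and the order ABCD appears exactly once.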

Permutations and combinations

A pack of 52 playing cards is shuffled. Determine the probability that the top card in the pack is the Ace of Spades, the next is the Ace of Hearts and the next is the Ace of Diamonds.

Now any one of the 52 cards could have been at the top of the pack. This leaves 51 cards, any one of which might have been next. Similarly there are 50 possibilities for the third card. There are therefore a total of 52 x 51 x 50 = 132600 possibilities for the first three cards in order. Only one of these corresponds to the event described, so the probability of that event is 1/132600.

The number of ordered arrangements of r objects chosen from a collection of n objects, is denoted by n P r, and each ordering is called a permutation of the selected objects.
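
The card example can be reproduced with the standard result n P r = n! / (n - r)!, which is not stated in these notes but follows from the same counting argument. The sketch assumes Python 3.8 or later, where math.perm is available.

```python
from fractions import Fraction
from math import factorial, perm

n, r = 52, 3
ordered_arrangements = perm(n, r)   # 52 x 51 x 50
p_three_aces_in_order = Fraction(1, ordered_arrangements)

print(ordered_arrangements, p_three_aces_in_order)
print(factorial(n) // factorial(n - r))   # the same count from n! / (n - r)!
```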

Sampling with replacement

The situation is one of physical independence and we can use the addition and multiplication rules in probability trees. Here is a typical problem. A pack of cards consists of the queens of Spades, Hearts, Diamonds and Clubs together with the Ace, King and Jack of Spades. The pack is shuffled and a card is chosen at random. After its identity has been noted, the card is replaced in the pack, which is again shuffled. This is repeated on two further occasions. Determine the probability that a queen is chosen on only one occasion.

On each occasion the probability that a queen is chosen is 4/7. Using Q to denote a queen and R to denote one of the other cards, the possibilities that include exactly one queen are RRQ, RQR and QRR. For each of these possibilities, the probability is the product of 3/7, 3/7 and 4/7, so the overall probability is

3 x (3/7)^2 x (4/7) = 108/343

which is about 0.315.
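
The answer can be confirmed by listing every possible sequence of three draws with replacement. In this sketch the two-letter card labels are an assumed encoding (rank followed by suit).

```python
from fractions import Fraction
from itertools import product

# The seven-card pack: four queens plus the Ace, King and Jack of Spades.
pack = ["QS", "QH", "QD", "QC", "AS", "KS", "JS"]

# Sampling with replacement: each of the three draws can be any of the 7 cards.
draws = list(product(pack, repeat=3))
one_queen = [d for d in draws if sum(card.startswith("Q") for card in d) == 1]

p_enumerated = Fraction(len(one_queen), len(draws))
p_formula = 3 * Fraction(3, 7) ** 2 * Fraction(4, 7)

print(p_enumerated, p_formula, round(float(p_formula), 3))
```

Both routes give 108 favourable sequences out of 343.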

Graham Upton and Ian Cook, Introducing Statistics, Oxford University Press, 1998 (B 873), pp. 60-79.