Random Samples

A teacher has four students who have done well in their homework, and she wants to reward them by assigning them a special task that they enjoy. The problem is that the task requires only two students. What is a fair way to choose the two students who receive the reward? When this question is posed to the students themselves, they are quite likely to say something like, "Put the four names on individual slips of paper in a box, mix them up, and have someone draw out two names, sight unseen."

The students are describing a random sample of two names from the four. Labeling the names as A, B, C, and D, the six possible samples of size two are (A, B), (A, C), (A, D), (B, C), (B, D), and (C, D). Because there is no reason to suspect that one sample is any more likely than another, the appropriate probability model for this random sampling assigns each of the six a probability of 1/6. This leads to a definition of a simple random sample for the general case in which n names are selected from the N names in the box, generally called the population. A simple random sample of n objects selected from a population of N distinct objects is a sample chosen so that all possible samples of size n have an equal chance of being selected.

It follows from basic rules of probability that the probability that person A gets selected, written P(A), is given by P(A) = 1/2, because three of the six equally likely samples contain A. In the general case, the probability of any one object ending up in the sample is n/N. Thus the definition implies that, in simple random sampling, each individual has the same chance of being selected as any other individual. The reverse statement is not true, however: a method of sampling that gives each individual the same chance of being selected is not necessarily a simple random sample. For example, suppose A and B are male and C and D are female. Then selecting one male at random and one female at random gives each person the same chance (1/2) of being selected, but no sample would ever contain two males or two females.
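The counting argument above can be checked by brute force. The following minimal sketch, written in Python for illustration, enumerates all six samples of size two from the four names and confirms that three of them contain A, so P(A) = 3/6 = 1/2, matching n/N = 2/4.

```python
from itertools import combinations

# Enumerate every possible simple random sample of size n = 2
# from the population {A, B, C, D} and count those containing "A".
population = ["A", "B", "C", "D"]
samples = list(combinations(population, 2))

print(len(samples))                        # 6 equally likely samples
contains_a = [s for s in samples if "A" in s]
print(len(contains_a) / len(samples))      # 3/6 = 0.5, matching n/N = 2/4
```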

If N and n were large, physically drawing names from a box would be impractical, if not impossible. Most often, simple random samples are selected by numbering each of the N objects, generating n distinct random numbers between 1 and N with a calculator or computer, and taking as the sample the objects whose numbers were generated.
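A minimal sketch of this numbering approach, assuming Python's standard random module and purely illustrative values of N and n, might look like the following.

```python
import random

# Label the N population members 1..N and let the computer
# pick n distinct labels at random (sampling without replacement).
N, n = 40000, 1000                       # illustrative sizes, not from the text
selected_labels = random.sample(range(1, N + 1), n)

print(len(selected_labels), len(set(selected_labels)))   # n distinct labels
```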

SAMPLING WITHOUT AND WITH REPLACEMENT

The sampling scheme described above is referred to as sampling without replacement. Drawing the two names is mathematically the same as drawing one name at random and then drawing a second name from those that remain in the box. Using basic rules of probability, it follows that

P(selecting A and then selecting B)

= P(A on first draw) · P(B on second draw given that A is already selected)

= (1/4)(1/3) = 1/12.

To get the probability of the sample (A, B), this has to be doubled, because B could have been selected first and A second. Thus P(A, B) = 1/6, as shown above.

Suppose, however, that the two students receiving the reward could perform the task separately and at different times, so the same person could be selected twice. This could be accomplished by selecting one name at random, placing that name back in the box, and selecting the second name at random from the same set of four names. Under this scheme of sampling with replacement,

P(selecting A and then selecting B)

= P(A on first draw) · P(B on second draw given that A was selected and replaced)

= (1/4)(1/4) = 1/16.

Because the probability of selecting B after A is the same as the probability of selecting B on the first draw, the events "select A" and "select B" are said to be independent. Under independence (sampling with replacement), the probability of selecting two specific students differs from its value under sampling without replacement. Suppose, however, that N were 40,000 instead of 4. Then removing one object would not appreciably change the probability on the second draw, and the counterparts of the two probabilities displayed above would be practically equal.
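The effect of replacing (or not replacing) the first name can be made concrete with a small sketch; the function below is a hypothetical helper that multiplies the two draw-by-draw probabilities for any population size N.

```python
# Without replacement the second factor is 1/(N - 1); with replacement it is 1/N.
def prob_a_then_b(N, replace):
    first = 1 / N
    second = 1 / N if replace else 1 / (N - 1)
    return first * second

for N in (4, 40000):
    without = prob_a_then_b(N, replace=False)
    with_ = prob_a_then_b(N, replace=True)
    print(N, without, with_, with_ / without)   # ratio approaches 1 as N grows
```

For N = 4 the ratio of the two probabilities is 0.75, but for N = 40,000 it is 0.999975, illustrating why the replacement distinction matters little for large populations.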

This fact leads to a second definition of a random sample, one based on independence. A random sample of size n is a set of n objects selected independently and at random from the same set of N objects.

If N is large compared to n, the second definition results in a probability structure for the sample that is approximately the same as that of the first definition. Moreover, the second definition makes the statistical theory of sampling much easier to work out. That fact, coupled with the fact that the most common uses of random sampling, sample surveys and opinion polls, generally involve large populations, makes the second definition more useful in practice.

Populations have been discussed thus far as if they were well-defined entities that the sampler could actually see or list. Consider a classical die toss using a balanced die. The population of possible outcomes of die tosses is infinite and conceptual, but it can readily be modeled as if the possible outcomes, 1 through 6, were represented in equal numbers because it seems logical to think of each possible outcome as having probability 1/6. Using the second definition then, a set of n outcomes generated from independent tosses of a balanced die can be considered as a random sample, each outcome being randomly (through the tossing) and independently (assuming the second toss is not influenced by the first) selected from this large, conceptual population.
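As an illustration of sampling from this conceptual population, the following sketch simulates a number of independent tosses of a balanced die and checks that each face occurs in roughly one-sixth of the tosses; the number of tosses and the random seed are arbitrary choices made here.

```python
import random
from collections import Counter

# Each toss is an independent draw from the conceptual population of
# outcomes 1 through 6, each with probability 1/6.
random.seed(1)                              # seed chosen only for reproducibility
tosses = [random.randint(1, 6) for _ in range(6000)]

counts = Counter(tosses)
for face in range(1, 7):
    print(face, counts[face] / len(tosses))  # each proportion should be near 1/6
```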

RANDOM SAMPLING AND STATISTICAL INFERENCE

Random sampling forms the probabilistic basis for statistical inference. Consider the common opinion poll in which n people are randomly sampled from a large population of N people. Because N is usually large, the independence model provides a good approximation, even though the samples are selected without replacement. The goal of such a poll is to estimate a population proportion, p, such as the proportion of voters who favor a certain candidate. The estimate of p is the corresponding sample proportion, often designated p̂, of voters in the sample who favor the candidate. Because of the random sampling and the mathematics of probability, many facts can be established about p̂. If n is large, p̂ has a small chance of being far from the true p it is estimating. In fact, p̂ will be within a distance of approximately 2√(p(1 − p)/n) from p about 95 percent of the time in repeated sampling. (This is why a poll of 1,000 persons is said to have a margin of error of about 3 percent.) If a poll is done many times, it will give results close to p most of the time but can miss by a large amount on occasion (about 5 percent of the time).
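The 3 percent figure can be verified directly. The sketch below evaluates the margin of error 2√(p(1 − p)/n) for a poll of n = 1,000 respondents at p = 0.5; using p = 0.5, which maximizes p(1 − p), is an illustrative assumption made here.

```python
from math import sqrt

# Margin of error 2 * sqrt(p(1 - p)/n), evaluated at p = 0.5 for n = 1,000.
def margin_of_error(p, n):
    return 2 * sqrt(p * (1 - p) / n)

print(margin_of_error(0.5, 1000))   # about 0.032, i.e. roughly 3 percent
```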

The facts stated above are related to the result that the values of p̂ will have an approximately normal distribution (a mound-shaped, symmetric distribution) when random samples of the same size are taken repeatedly from the same population. The figure below shows a simulated distribution of 200 p̂ values for samples of size 50 taken from a population with p = 0.40. The theoretical margin of error is approximately 2√((0.40)(0.60)/50) ≈ 0.14, indicating that a p̂ of 0.54 or more would be an extremely rare occurrence. Such an outcome occurred only 7 times out of 200 in this simulation (see figure 1).
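A simulation along the lines of the one described above can be reproduced in a few lines; the sketch below draws 200 samples of size 50 from a population with p = 0.40 and counts how many sample proportions reach 0.54 or more. The seed is arbitrary, so the exact count will differ from the 7 reported for the figure.

```python
import random
from math import sqrt

# 200 samples of size 50 from a population with p = 0.40,
# recording the sample proportion p-hat each time.
random.seed(2)                        # seed chosen only for reproducibility
p, n, reps = 0.40, 50, 200
p_hats = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]

margin = 2 * sqrt(p * (1 - p) / n)    # about 0.14
print(round(p + margin, 2))           # about 0.54
print(sum(ph >= 0.54 for ph in p_hats), "of", reps, "p-hats at or above 0.54")
```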

THEORY VERSUS PRACTICE

All of the above presents a neat theory of random sampling, but obtaining a truly random sample in practice is nearly impossible for most sampling situations. Consider the relatively simple situation of selecting a sample of students from a college. First, the population to be sampled needs a clear definition (but what constitutes a student?). Second, a list of students will be needed (but this changes almost every day). Third, even with a good definition and a good list, the sampled students may not be able to be found or, if found, may not be willing to respond (or respond correctly) to a survey. These three concerns, population definition, population dynamics, and nonresponse, cover many of the practical difficulties that occur in nearly all sampling problems. They are compounded by others in surveys of subjects who are hard to find anyway, such as victims of war atrocities. In many situations the difficulties can be mitigated by a more complex sampling design, the most common features of which involve stratification (divide the population into nonoverlapping but relatively large groups and take a random sample from each) and clustering (divide the population into many relatively small nonoverlapping groups and take a random sample of the groups). A national survey that selects respondents by state is an example of stratification. A city survey that samples city blocks, rather than households, and then interviews someone in each household in every sampled block is an example of clustering.
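To make the distinction concrete, the following sketch contrasts the two designs on a made-up population of four named groups; the group labels, group sizes, and sample sizes are all illustrative assumptions, not part of the discussion above.

```python
import random

# A toy population split into four groups of 100 members each.
population = {g: [f"{g}-{i}" for i in range(100)]
              for g in ["North", "South", "East", "West"]}

# Stratification: take a random sample from every group.
stratified = [person for group in population.values()
              for person in random.sample(group, 10)]

# Clustering: take a random sample of groups, then keep everyone in each chosen group.
chosen_groups = random.sample(list(population), 2)
clustered = [person for g in chosen_groups for person in population[g]]

print(len(stratified), len(clustered))
```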

SEE ALSO Central Limit Theorem; Probability; Research, Survey; Sample Attrition; Sampling; Survey

BIBLIOGRAPHY

Levy, Paul S., and Stanley Lemeshow. 1999. Sampling of Populations. New York: Wiley.

Lohr, Sharon L. 1999. Sampling: Design and Analysis. Pacific Grove, CA: Duxbury.

Scheaffer, Richard L., William Mendenhall III, and R. Lyman Ott. 2006. Elementary Survey Sampling. Belmont, CA: Thomson Brooks/Cole.

Richard L. Scheaffer