You may have heard the saying "You can prove anything with statistics," which implies that statistical analysis cannot be trusted and that the conclusions drawn from it are so vague and ambiguous as to be meaningless. Yet the opposite is also true: statistical analysis can be reliable, and its results can be trusted, if the proper conditions are established.
What Is Statistical Analysis?
Statistical analysis uses inductive reasoning and the mathematical principles of probability to assess the reliability of a particular experimental test. Mathematical techniques have been devised that allow the reliability (or fallibility) of an estimate to be determined from the data (the sample, or "N") without reference to the original population. This is important because researchers typically do not have access to information about the whole population, so a sample (a subset of the population) is used instead.
Statistical analysis uses a sample drawn from a larger population to make inferences about the larger population. A population is a well-defined group of individuals or observations of any size having a unique quality or characteristic. Examples of populations include first-grade teachers in Texas, jewelers in New York, nurses at a hospital, high school principals, Democrats, and people who go to dentists. Corn plants in a particular field and automobiles produced by a plant on Monday are also populations. A sample is the group of individuals or items selected from a particular population. A random sample is taken in such a way that every individual in the population has an equal opportunity to be chosen. A random sample is also known as an unbiased sample.
Most mail surveys, mall surveys, political telephone polls, and other similar data-gathering techniques do not meet the proper conditions for a random, unbiased sample, so their results cannot be trusted. These are "self-selected" samples because the subjects choose whether to participate in the survey, and the subjects may be picked based on the ease of their availability (for example, whoever answers the phone and agrees to the interview).
Selecting a Random Sample
The most important criterion for trustworthy statistical analysis is correctly choosing a random sample. For example, suppose you have a bucket full of 10,000 marbles and you want to know how many of the marbles are red. You could count all of the marbles, but that would take a long time. So, you stir the marbles thoroughly and, without looking, pull out 100 marbles. Now you count the red marbles in your random sample. There are 10. Thus you could conclude that approximately 10 percent of the original marbles are red. This is a trustworthy conclusion, but it is not likely to be exactly right. You could improve your accuracy by counting a larger sample, say 1,000 marbles. Of course, if you counted all the marbles, you would know the exact percentage, but the point is to pick a sample large enough (for example, 100 or 1,000) to give you an answer accurate enough for your purposes.
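The marble example can be tried out as a small simulation. The sketch below is only an illustration: the bucket contents (exactly 1,000 red marbles out of 10,000), the seed, and the function name are assumptions made here, not part of the example above.

```python
import random

def estimate_red_fraction(bucket, sample_size, rng):
    """Draw a simple random sample without replacement and return the share of red marbles."""
    sample = rng.sample(bucket, sample_size)
    return sum(1 for marble in sample if marble == "red") / sample_size

# Hypothetical bucket: 10,000 marbles, exactly 1,000 (10 percent) of them red.
bucket = ["red"] * 1_000 + ["other"] * 9_000
rng = random.Random(42)  # fixed seed so that the run is repeatable

estimate_100 = estimate_red_fraction(bucket, 100, rng)
estimate_1000 = estimate_red_fraction(bucket, 1_000, rng)
print(f"n = 100:  estimated red fraction = {estimate_100:.3f}")
print(f"n = 1000: estimated red fraction = {estimate_1000:.3f}")
```

Repeating the draw with different seeds should show the estimates from the larger sample clustering more tightly around 0.10 than those from the smaller one.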
Suppose the 100 marbles you pulled out of the bucket were all red. Would this be proof that all 10,000 marbles in the bucket were red? In science, statistical analysis is used to test a hypothesis. In the example we are testing, the hypothesis would be "all the marbles in the bucket are red."
Statistical inference makes it possible for us to state, given a sample size (100) and a population size (10,000), how often false hypotheses will be accepted and how often true hypotheses will be rejected. Statistical analysis cannot conclusively tell us whether a hypothesis is true; only the examination of the entire population can do that. So "statistical proof" is a statement of just how often we will get "the right answer."
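To see how often the false hypothesis "all the marbles are red" could survive a sample of 100 red marbles, one can compute the chance of such a sample when some marbles are not red. The sketch below assumes sampling without replacement; the counts of red marbles tried are hypothetical values chosen for illustration.

```python
from math import comb

def prob_all_red(population, red_in_population, sample_size):
    """Probability that a random sample drawn without replacement contains only red marbles."""
    return comb(red_in_population, sample_size) / comb(population, sample_size)

# Hypothetical buckets of 10,000 marbles with different numbers of red ones.
for red_count in (10_000, 9_900, 9_500):
    p = prob_all_red(10_000, red_count, 100)
    print(f"{red_count} red marbles: P(sample of 100 is all red) = {p:.4f}")
```

The fewer red marbles the bucket really holds, the less often an all-red sample of 100 appears, which is what makes such a sample evidence, though not proof, for the hypothesis.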
Using Basic Statistical Concepts
Statistics is applicable to all fields of human endeavor, from economics to education, politics to psychology. Procedures worked out for one field are generally applicable to the other fields. Some statistical procedures are used more often in some fields than in others.
Example 1. Suppose the Wearemout Pants Company wants to know the average height of adult American men, an important piece of information for a clothing manufacturer producing pants. The population is all men over the age of 25 who live in the United States. It is logistically impossible to measure the height of every man who lives in the United States, so a random sample of around 1,000 men is chosen. If the sample is correctly chosen, all ethnic groups, geographic regions, and socioeconomic classes will be adequately represented. The individual heights of these 1,000 men are then measured. An average height is calculated by dividing the sum of these individual heights by the total number of subjects (N = 1,000). Imagine that, by doing so, we calculate an average height of 1.95 meters (m) for this sample of adult males in the United States. If a representative sample was selected, then this figure can be generalized to the larger population.
The random sample of 1,000 men probably included some very short men and some very tall men. The difference between the shortest and the tallest is known as the "range" of the data. Range is one measure of the "dispersion" of a group of observations. A better measure of dispersion is the "standard deviation." The standard deviation is the square root of the sum of the squares of the differences between each observation and the mean, divided by one less than the number of observations:
$$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$$
In this equation, $x_i$ is an observed value, $\bar{x}$ is the arithmetic mean, and $N$ is the number of observations.
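The range and the standard deviation described above (squared differences from the mean, summed, divided by one less than the number of observations, then square-rooted) can be sketched as follows. The six heights are hypothetical values used only for illustration, and the result is cross-checked against `statistics.stdev` from the Python standard library.

```python
import math
import statistics

def sample_standard_deviation(values):
    """Square root of the summed squared deviations from the mean, divided by N - 1."""
    mean = sum(values) / len(values)
    squared_differences = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_differences / (len(values) - 1))

# Hypothetical heights in meters, not taken from any real survey.
heights_m = [1.62, 1.70, 1.75, 1.78, 1.81, 1.90]

data_range = max(heights_m) - min(heights_m)  # tallest minus shortest
s = sample_standard_deviation(heights_m)

print(f"range = {data_range:.2f} m")
print(f"standard deviation = {s:.3f} m")
print("matches statistics.stdev:", math.isclose(s, statistics.stdev(heights_m)))
```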
In our example, if a smaller height interval is used (1.10 m, 1.11 m, 1.12 m, 1.13 m, and so on) and the number of men in each height interval is plotted as a function of height, a smooth curve can be drawn that has a characteristic shape, known as a "bell" curve or "normal frequency distribution." A normal frequency distribution can be stated mathematically as
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
where μ is the mean of the distribution. The value of sigma (σ) is a measure of how "wide" the distribution is. Not all samples will have a normal distribution, but many do, and these distributions are of special interest.
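As a rough illustration of how σ controls the width of the bell curve, the sketch below evaluates the standard normal density formula, f(x) = exp(-(x - μ)² / (2σ²)) / (σ√(2π)), for two values of σ. The mean of 1.75 m and both σ values are hypothetical; `NormalDist` from the Python standard library is used only as a cross-check.

```python
import math
from statistics import NormalDist

def normal_density(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

mu = 1.75                      # hypothetical mean height in meters
narrow, wide = 0.07, 0.14      # hypothetical standard deviations

peak_narrow = normal_density(mu, mu, narrow)
peak_wide = normal_density(mu, mu, wide)
print(f"peak height with sigma = {narrow}: {peak_narrow:.3f}")
print(f"peak height with sigma = {wide}: {peak_wide:.3f}")

# Cross-check one value against the standard library implementation.
agrees = math.isclose(normal_density(1.80, mu, narrow), NormalDist(mu, narrow).pdf(1.80))
print("agrees with NormalDist.pdf:", agrees)
```

Doubling σ halves the height of the peak, because the same total area of 1 is spread over a wider range of heights.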
The following figure shows three normal probability distributions. Because there is no skew, the mean, median, and mode are the same. The mean of curve (a) is less than the mean of curve (b), which in turn is less than the mean of (c). Yet the standard deviation, or spread, of (c) is least, whereas that of (a) is greatest. This is just one illustration of how the parameters of distributions can vary.
Example 2. One of the most common uses of statistical analysis is in determining whether a certain treatment is efficacious. For example, medical researchers may want to know if a particular medicine is effective at treating the type of pain resulting from extraction of third molars (known as "wisdom" teeth). Two random samples of approximately equal size would be selected. One group would receive the pain medication while the other group received a "placebo," a pill that looked identical but contained only inactive ingredients. The study would need to be a "double-blind" experiment, which is designed so that neither the recipients nor the persons dispensing the pills knew which was which. Researchers would know who had received the active medicines only after all the results were collected.
Example 3. Suppose a student, as part of a science fair project, wishes to determine if a particular chemical compound (Chemical X) can accelerate the growth of tomato plants. In this sort of experimental design, the hypothesis is usually stated as a null hypothesis: "Chemical X has no effect on the growth rate of tomato plants." In this case, the student would reject the null hypothesis if she found a significant difference. It may seem odd, but that is the way most statistical tests are set up. In this case, the independent variable is the presence of the chemical and the dependent variable is the height of the plant.
The next step is experiment design. The student decides to use height as the single measure of plant growth. She purchases 100 individual tomato plants of the same variety and randomly assigns them to 2 groups of 50 each. Thus the population is all tomato plants of this particular type and the sample is the 100 plants she has purchased. They are planted in identical containers, using the same kind of potting soil and placed so they will receive the same amount of light and air at the same temperature. In other words, the experimenter tries to "control" all of the variables, except the single variable of interest. One group will be watered with water containing a small amount of the chemical while the other will receive plain water. To make the experiment double-blind, she has another student prepare the watering cans each day, so that she will not know until after the experiment is complete which group was receiving the treatment. After 6 weeks, she plans to measure the height of the plants.
The next step is data collection. The student measures the height of each plant and records the results in data tables. She determines that the control group (which received plain water) had an average (arithmetic mean) height of 1.3 m (meters), while the treatment group had an average height of 1.4 m.
Now the student must determine whether this small difference is significant or whether this amount of variation would be expected even without any treatment. In other words, what is the probability that 2 groups of 50 tomato plants each, grown under identical conditions, would show a height difference of 0.1 m after 6 weeks of growth? If this probability is less than or equal to a certain predetermined value, then the null hypothesis is rejected. Two commonly used values of probability are 0.05 and 0.01. However, these are completely arbitrary choices determined mostly by the widespread use of previously calculated tables for each value. Modern computer analysis techniques allow the selection of any value of probability.
The simplest test of significance is to determine how "wide" the distribution of heights is for each group. If there is a wide dispersion in heights, that is, a large standard deviation (say, σ = 25), then small differences in mean are not likely to be significant. On the other hand, if the dispersion is narrow (for example, if all the plants in each group were close to the same height, so that σ = 0.1), then the difference would probably be significant.
There are several different tests the student could use. Selecting the right test is often a tricky problem. In this case, the student can reject several tests outright. For example, the chi-square test is suitable for nominal scales (yes or no answers are one example of nominal scales), so it does not work here. The F-test compares variability or dispersion (variances) rather than means, so it too is not suitable for this comparison. Other statistical tests can also be rejected as inappropriate for various reasons.
In this case, since the student is interested in comparing means, the best choice is a t-test. The t-test compares two means using this formula:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_{\bar{x}_1 - \bar{x}_2}}$$
In this case, the null hypothesis assumes that μ1 − μ2 = 0 (no difference in the sample groups), so that we can say:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}}$$
The quantity $s_{\bar{x}_1 - \bar{x}_2}$ is known as the standard error of the mean difference. When the sample sizes are the same, $s_{\bar{x}_1 - \bar{x}_2} = \sqrt{s_{\bar{x}_1}^2 + s_{\bar{x}_2}^2}$. That is, the standard error of the mean difference is the square root of the sum of the squares of the standard errors of the means for each group. The standard error of the mean for each group is easily calculated from $s_{\bar{x}} = s / \sqrt{N}$. Here N is the sample size, 50, and the student can calculate the standard deviation s by the formula for standard deviation given above.
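The calculation described here can be sketched in a few lines: each group's standard error of the mean is its standard deviation divided by the square root of N, the standard error of the mean difference is the square root of the sum of their squares, and t is the difference in means divided by that quantity. The means (1.3 m and 1.4 m) and group size (50) come from the example; the two standard deviations are hypothetical, since the example does not report them.

```python
import math

def standard_error_of_mean(standard_deviation, n):
    """Standard error of a group mean: s divided by the square root of N."""
    return standard_deviation / math.sqrt(n)

def t_statistic(mean_1, mean_2, sd_1, sd_2, n):
    """t for two groups of equal size n under the null hypothesis of no difference."""
    se_1 = standard_error_of_mean(sd_1, n)
    se_2 = standard_error_of_mean(sd_2, n)
    se_difference = math.sqrt(se_1 ** 2 + se_2 ** 2)
    return (mean_1 - mean_2) / se_difference

# Means and group size from the tomato example; standard deviations are assumed.
t = t_statistic(mean_1=1.4, mean_2=1.3, sd_1=0.25, sd_2=0.20, n=50)
print(f"t = {t:.2f}")
```

The resulting t would then be compared with a tabulated critical value (about 1.98 for a two-sided test at the 0.05 level with 98 degrees of freedom). Larger standard deviations shrink t, which is why widely dispersed heights make a 0.1 m difference less convincing.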
The final experimental step is to determine sensitivity. Generally speaking, the larger the sample, the more sensitive the experiment is. The choice of 50 tomato plants for each group implies a high degree of sensitivity.
Students, teachers, psychologists, economists, politicians, educational researchers, medical researchers, biologists, coaches, doctors, and many others use statistics and statistical analysis every day to help them make decisions. To make trustworthy and valid decisions based on statistical information, it is necessary to: be sure the sample is representative of the population; understand the assumptions of the procedure and use the correct procedure; use the best measurements available; keep clear what is being looked for; and avoid statements of causal relations if they are not justified.
see also Central Tendency, Measures of; Data Collection and Interpretation; Graphs; Mass Media, Mathematics and the.
Huff, Darrell. How to Lie With Statistics. New York: W. W. Norton & Company, 1954.
Kirk, Roger E. Experimental Design, Procedures for the Behavioral Sciences. Monterey, CA: Brooks/Cole Publishing Company, 1982.
Paulos, John Allen. Innumeracy: Mathematical Illiteracy and Its Consequences. New York: Hill & Wang, 1988.
Tufte, Edward R. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 1983.
MEAN, MEDIAN, AND MODE
The average in the clothing example is known as the "arithmetic mean," which is one of the measures of central tendency. The other two measures of central tendency are the "median" and the "mode." The median is the number that falls in the mid-point of an ordered data set, while the mode is the most frequently occurring value.
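The three measures of central tendency can be compared on a small data set using Python's standard `statistics` module. The seven heights below are hypothetical values chosen so that the three measures differ.

```python
import statistics

# Hypothetical heights in centimeters, already in ascending order.
heights_cm = [170, 172, 172, 175, 178, 181, 184]

mean_height = statistics.mean(heights_cm)      # sum divided by the number of values
median_height = statistics.median(heights_cm)  # middle value of the ordered data
mode_height = statistics.mode(heights_cm)      # most frequently occurring value

print("mean:", mean_height)
print("median:", median_height)
print("mode:", mode_height)
```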
Classical Statistical Analysis
Classical statistical analysis seeks to describe the distribution of a measurable property (descriptive statistics) and to determine the reliability of a sample drawn from a population (inferential statistics). Classical statistical analysis is based on repeatedly measuring properties of objects and aims at predicting the frequency with which certain results will occur when the measuring operation is repeated at random or stochastically.
Properties can be measured repeatedly on the same object or only once per object. In the latter case, however, one must measure a number of sufficiently similar objects. Typical examples are tossing a coin or rolling a die repeatedly and counting the occurrences of the possible outcomes, and measuring the chemical composition of the next hundred or thousand pills produced on the production line of a pharmaceutical plant. In the former case the same object (one and the same die) is “measured” several times (with respect to the question of which number it shows). In the latter case many distinguishable but similar objects are measured with respect to their composition, which in the case of pills is expected to be more or less identical, so that the repetition is not with the same object but with the next available similar object.
One of the central concepts of classical statistical analysis is the empirical frequency distribution, which yields the absolute or relative frequency of the occurrence of each of the possible results of the repeated measurement of a property of an object or a class of objects when only a finite number of different outcomes is possible (discrete case). If one thinks of an infinitely repeated and arbitrarily precise measurement where every outcome is (or can be) different (as would be the case if the range of the property is the set of real numbers), then the relative frequency of a single outcome would not be very instructive. Instead, in this (continuous) case one uses the distribution function, which, for every numerical value x of the measured property, yields the absolute or relative frequency of the occurrence of all values smaller than x. This function is usually denoted F(x), and its derivative F′(x) = f(x) is called the frequency density function.
If one wants to describe an empirical distribution, the complete function table is seldom instructive. This is why the empirical frequency or distribution functions are often represented by a few parameters that describe the essential features of the distribution. The so-called moments of the distribution represent the distribution completely, and the lower-order moments represent the distribution at least in a satisfactory manner. Moments are defined as follows:
$$m_k = \frac{1}{n} \sum_{i=1}^{n} (x_i - c)^k$$
where $x_i$ is the i-th measured value, k is the order of the moment, n is the number of repetitions or objects measured, and c is a constant that is usually either 0 (moment about the origin) or the arithmetic mean (moment about the mean), the first-order moment about the origin being the arithmetic mean.
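An empirical moment of order k about a constant c is the average of (x − c) raised to the power k over the n measurements. The sketch below uses eight hypothetical measurements chosen so that the results are round numbers. Note that this definition divides by n, not by n − 1 as the sample standard deviation does.

```python
def empirical_moment(values, k, c=0.0):
    """Moment of order k about c: the mean of (x - c) ** k over all measurements."""
    n = len(values)
    return sum((x - c) ** k for x in values) / n

# Hypothetical measurements, chosen only for illustration.
measurements = [2, 4, 4, 4, 5, 5, 7, 9]

mean = empirical_moment(measurements, 1)               # first-order moment about the origin
variance = empirical_moment(measurements, 2, c=mean)   # second-order moment about the mean

print("mean:", mean)
print("variance:", variance)
```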
In the frequentist interpretation of probability, frequency can be seen as the realization of the concept of probability: It is quite intuitive to believe that if the probability of a certain outcome is some number between 0 and 1, then the expected relative frequency of this outcome would be the same number, at least in the long run. From this, one of the concepts of probability is derived, yielding probability density and distribution functions as models for their empirical correlates. These functions are usually also denoted f(x) and F(x), respectively, and their moments are defined much as in the above formula, but with a difference that takes into account that there is no finite number n of measurement repetitions:
$$\mu_k = \sum_{x} (x - c)^k f(x) \qquad \text{and} \qquad \mu_k = \int_{-\infty}^{\infty} (x - c)^k f(x)\, dx$$
where the first equation can be applied to discrete numerical variables (e.g., the results of counting), while the second equation can be applied to continuous variables. Again, the first-order moment about 0 is the mean, and the other moments are usually calculated about this mean. In many important cases one would be satisfied to know the mean (as an indicator for the central tendency of the distribution) and the second-order moment about the mean, namely the variance (as the most prominent indicator for the variation). For the important case of the normal or Gaussian distribution, these two parameters are sufficient to describe the distribution completely.
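For a discrete variable, a theoretical moment weights each possible value by its probability instead of averaging over n measurements. As a minimal sketch, the moments of a fair six-sided die (an assumed model in which every face has probability 1/6) can be computed exactly with fractions.

```python
from fractions import Fraction

def theoretical_moment(distribution, k, c=0):
    """Moment of order k about c for a discrete distribution given as {value: probability}."""
    return sum((x - c) ** k * p for x, p in distribution.items())

# Assumed model: a fair die, each face with probability 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

zero_order = theoretical_moment(fair_die, 0)        # total probability
mean = theoretical_moment(fair_die, 1)              # first-order moment about 0
variance = theoretical_moment(fair_die, 2, c=mean)  # second-order moment about the mean

print("zero-order moment:", zero_order)
print("mean:", mean)
print("variance:", variance)
```

The zero-order moment evaluates to 1, as it must for any probability distribution.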
If one models an empirical distribution with a theoretical distribution (any non-negative function for which the zero-order moment evaluates to 1, as this is the probability for the variable to have any arbitrary value within its domain), one can estimate its parameters from the moments of the empirical distribution calculated from the finite number of repeated measurements taken in a sample. This is especially useful where the normal distribution is a satisfactory model of the empirical distribution, because in this case the mean and variance allow the calculation of all interesting values of the probability density function f(x) and of the distribution function F(x).
Empirical and theoretical distributions need not be restricted to the case of a single property or variable; they are also defined for the multivariate case. Given that empirical moments can always be calculated from the measurements taken in a sample, these moments are also results of a random process, just like the original measurements. In this respect, the mean, variance, correlation coefficient, or any other statistical parameter calculated from the finite number of objects in a sample is also the outcome of a random experiment (measurement taken from a randomly selected set of objects instead of exactly one object). For these derived measurements theoretical distributions are also available, and these models of the empirical moments allow one to estimate the probability with which the respective parameter can be expected to fall into a specified interval in the next sample to be taken.
If, for instance, one has a sample of 1,000 interviewees of whom 520 answered they were going to vote for party A in the upcoming election and 480 announced they were going to vote for party B, then the parameter πA (the proportion of A-voters in the overall population) could be estimated to be 0.52. This estimate would, however, be a stochastic variable that approximately obeys a normal distribution with mean 0.52 and variance 0.0002496 (or standard deviation 0.0158). From this result one can conclude that another sample of another 1,000 interviewees from the same overall population would lead to another estimate whose value would lie within the interval [0.489, 0.551] (that is, 0.52 ± 1.96 × 0.0158) with a probability of 95 percent. This is the so-called 95 percent confidence interval, which in the case of the normal distribution is centered about the mean with a width of 3.92 standard deviations. Or, to put it in other words, the probability of finding more than 551 A-voters in another sample of 1,000 interviewees from the same population is 0.025. Bayesian statistics, as opposed to classical statistics, would argue from the same numbers that the probability is 0.95 that the population parameter falls within the interval [0.489, 0.551].
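The figures in this example can be reproduced with a short calculation. The sketch below assumes the usual normal approximation for a sample proportion, in which the variance of the estimate is p(1 − p)/n; the variable names are illustrative.

```python
import math

n = 1_000        # interviewees in the sample
votes_a = 520    # interviewees who intend to vote for party A

p_hat = votes_a / n                       # estimated proportion of A-voters
variance = p_hat * (1 - p_hat) / n        # variance of the estimate
standard_deviation = math.sqrt(variance)

z = 1.96                                  # normal quantile for a 95 percent interval
lower = p_hat - z * standard_deviation
upper = p_hat + z * standard_deviation

print(f"estimate = {p_hat:.2f}")
print(f"variance = {variance:.7f}")
print(f"standard deviation = {standard_deviation:.4f}")
print(f"95 percent interval = [{lower:.3f}, {upper:.3f}]")
```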
SEE ALSO Bayesian Statistics; Descriptive Statistics; Inference, Bayesian; Inference, Statistical; Sampling; Variables, Random
Hoel, Paul G. 1984. Introduction to Mathematical Statistics. 5th ed. Hoboken, NJ: Wiley.
Iversen, Gudmund. 1984. Bayesian Statistical Inference. Beverly Hills, CA: Sage.
Klaus G. Troitzsch
Throughout conflicts, apologists for the side in power often excuse atrocities committed by their side with the claim that "violations are being committed on all sides of the conflict." The objective of such a statement is to render the parties morally equivalent, thereby relieving observers of the responsibility or duty to make a judgment about whether one side is the aggressor and the other is acting in self-defense. Even when the greater historical narrative involves more than these labels imply, in situations of massive human rights violations the perpetrators are rarely balanced in power. Although it may be literally true that all parties to a conflict have committed at least one violation, often the number of violations each party commits differs by a factor of ten or more relative to their opponents. In some cases quantitative analysis may offer a method for assessing claims about moral responsibility for crimes against humanity, including genocide. Statistics provide a way to measure crimes of policy—massive crimes that result from institutional or political decisions.
Although all parties may be guilty, they are rarely guilty in equal measure. Only with quantitative arguments can the true proportions of responsibility be understood. In this way one can transcend facile claims about "violations on all sides" in favor of an empirically rich view of responsibility for atrocities. Did the monthly number of killings increase or decrease in the first quarter of 1999? Were there more violations in Province A or in Province B? Were men more affected than women, or adults relative to children? These simple quantitative evaluations may be important questions when linked to political processes. Perhaps a new government took power and one needs to assess its impact on the state's respect for human rights. Or a military officer may move from Province A to Province B, and one may wish to determine if he is repeating the crimes he committed in Province A. Simple descriptive statistics based on properly gathered data can address these questions more precisely than the kinds of casual assessments that nonquantitative observers often make.
There are three areas in which nonquantitative analysts most often make statistical mistakes: estimating the total magnitude of violations; understanding how bias may have affected the data collection or interpretation; and comparing the relative proportions of responsibility among perpetrators. Poor information management and inappropriate statistical analysis can lead to embarrassing reversals of findings once proper methods are applied.
The use of statistical methods that demonstrably control biases and enable estimates of total magnitude can give analysts a rigorous basis for drawing conclusions about politically important questions. One such method, multiple systems estimation, uses three or more overlapping lists of some event (such as killings) to make a statistical estimate of the total number of events, including those events excluded from all three lists. "Overlapping" in this sense means events that are documented on two or more lists. The estimate made by this technique can control for several biases that might affect the original reporting which led to the lists of events.
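The idea behind estimating undocumented events from overlapping lists can be illustrated with its simplest relative, the two-list Lincoln-Petersen estimate. This is a deliberate simplification and not the three-or-more-list method described above: it assumes that the two lists are independent of each other, an assumption that estimates based on more lists are designed to relax. All counts below are hypothetical and are not real data.

```python
def two_list_estimate(count_a, count_b, count_both):
    """Lincoln-Petersen estimate of the total number of events from two overlapping lists."""
    if count_both == 0:
        raise ValueError("the lists must overlap to support an estimate")
    return count_a * count_b / count_both

# Hypothetical counts, chosen only for illustration.
documented_a = 300      # events on list A
documented_b = 200      # events on list B
documented_both = 60    # events that appear on both lists

estimated_total = two_list_estimate(documented_a, documented_b, documented_both)
documented_total = documented_a + documented_b - documented_both
estimated_undocumented = estimated_total - documented_total

print("documented on at least one list:", documented_total)
print("estimated total:", estimated_total)
print("estimated events on neither list:", estimated_undocumented)
```

The smaller the overlap relative to the sizes of the lists, the larger the estimated number of events that neither list captured.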
For example, among the most important questions the Guatemalan Commission for Historical Clarification (CEH is the Spanish acronym) had to answer was whether the army had committed acts of genocide against the Maya. Using qualitative sources and field investigation, the CEH identified six regions in which genocide might have occurred. Data were collated from testimonies given to three sources: nongovernmental organizations (NGOs), the Catholic Church, and the CEH.
If genocide has been committed, then at least two statistical indicators should be clear. First, the absolute magnitude of the violations should be large. Second, there should be a big difference in the rate of killing between those who are in the victim group and those people in the same region who are not in the victim group. It is inadequate to argue that some large number of people of specific ethnicities have been killed, because it might have been that they were simply unfortunate enough to live in very violent areas. Killing in an indiscriminate pattern might be evidence of some other crime, but if genocide occurred, a substantial difference in killing rates between targeted and nontargeted groups should exist. Thus, to find statistical evidence consistent with genocide, it is not enough to show that certain people were killed at a high rate; one must also show that other nearby people were killed at much lower rates.
The CEH analysts conducted a multiple systems estimate of the total deaths of indigenous people and nonindigenous people between 1981 and 1983 in the six regions identified. For each group in each region, the estimated total number of deaths was divided by the Guatemalan government's census figures for indigenous and nonindigenous people in 1981. The CEH showed that the resulting proportions were consistent with the genocide hypothesis. In each region indigenous people were killed at a rate five to eight times greater than nonindigenous people. This statistical finding was one of the bases of the CEH's final conclusion that the Guatemalan army committed acts of genocide against the Maya.
Other human rights projects have incorporated statistical reasoning. Sociologists and demographers have testified at the trial of Slobodan Milosevic and others tried before the International Criminal Tribunal for the Former Yugoslavia. They have provided quantitative insights on ethnic cleansing, forced migration, and the evaluation of explanatory hypotheses.
In the early twenty-first century, the statistical analysis of human rights violations is just beginning, and much work remains. New techniques should be developed, including easier methods for conducting random probability sampling in the field, richer demographic analysis of forced migration, and more flexible techniques for rapidly creating lots of graphical views of data. Human rights advocacy and analysis have benefited tremendously from the introduction of better statistical methods. The international community needs to continue to find new ways to employ existing methods, and to further research on new methods, so that human rights reporting becomes more rigorous. Statistics help establish the evidentiary basis of human rights allegations about crimes of policy.
Ball, Patrick (2000). "The Guatemalan Commission for Historical Clarification: Inter-Sample Analysis." In Making the Case: Investigating Large Scale Human Rights Violations Using Information Systems and Data Analysis, ed. Patrick Ball, Herbert F. Spirer, and Louise Spirer. Washington, D.C.: AAAS.
Ball, P., W. Betts, F. Scheuren, J. Dudukovic, and J. Asher (2002). Killings and Refugee Flow in Kosovo March–June 1999. Washington, D.C.: AAAS.
Brunborg, H., H. Urdal, and T. Lyngstad (2001). "Accounting for Genocide: How Many Were Killed in Srebrenica?" Paper presented at the Uppsala Conference on Conflict Data, Uppsala, June 8–9, 2001. Available from http://www.pcr.uu.se/conferenses/Euroconference/paperbrunborg.doc.
Ward, K. (2000). "The United Nations Mission for the Verification of Human Rights in Guatemala." In Making the Case: Investigating Large Scale Human Rights Violations Using Information Systems and Data Analysis, ed. Patrick Ball, Herbert F. Spirer, and Louise Spirer. Washington, D.C.: AAAS.