Standardized testing as a gateway to higher education was first established in the United States with the development of the College Entrance Examination Board in 1900. This board created a test designed to standardize admissions to elite universities in the northeastern United States and to encourage the development of a common curriculum among elite boarding schools (Chandler 1999; Lemann 1999). The original test consisted of essays and was not designed for mass testing. The College Board, however, created a broader test of IQ in 1926, the Scholastic Aptitude Test, commonly known today as the SAT I. This test was intended to help elite schools identify high-achieving students in nonelite high schools. In the early years, it also distinguished between white-collar students who were suitable for college and blue-collar students presumed to be ill prepared for such an education (Blau et al. 2003). By the mid-1950s, the demand for college education soared, spawning the development of the American College Testing Program (currently known as the ACT) in 1959. This test is the main alternative to the SAT. The origins of the SAT and ACT clarify their differing approaches. The SAT test was originally based on Army IQ tests as a measure of intelligence, whereas the ACT was historically designed to measure achievement rather than intelligence or aptitude.
Despite these differences in intent, the tests are similar in structure. The SAT I (also known as the SAT Reasoning Test) is designed to measure students’ critical thinking and problem-solving skills. The test consists of three sections. The critical-reading section includes questions on analytical reading, reading comprehension, and sentence completion. The writing section evaluates students’ ability to write clearly, concisely, and competently. It also assesses students’ ability to critically assess sentence and paragraph structure, as well as grammar. Finally, the mathematics section includes questions covering algebra, geometry, statistics, and advanced data analysis.
The ACT is similar to the SAT, but it has four broad required sections and an optional writing test. The English section evaluates writing and rhetorical skills. The mathematics section includes questions on algebra, geometry, and trigonometry. The reading section measures reading comprehension. The science section tests scientific skills including reasoning, analysis, and problem solving. Finally, the optional writing test assesses writing skills.
The SAT and ACT are widely utilized among students and colleges. In 2006 about 1.5 million high school seniors took the SAT and approximately 1.2 million students took the ACT. Most colleges accept either the SAT or ACT for admissions since standards for comparing these scores are easily accessible.
The importance to higher education of standardized testing persists into graduate school, but testing tools are more diverse for graduate admissions than for undergraduate admissions. Professional schools require standardized tests that emphasize skills required by their specific disciplines. These include the Law School Admission Test (LSAT), the Medical College Admission Test (MCAT), and the Graduate Management Admission Test (GMAT). A more general and widely used standardized testing tool is the Graduate Record Examination (GRE). In 2005 nearly 500,000 persons took the GRE, accounting for 35 percent of persons with bachelor’s degrees (National Center for Education Statistics 2006). The GRE has three sections. The verbal-reasoning section tests the respondent’s ability to recognize concepts and to analyze information and relationships among parts of sentences. The quantitative-reasoning section tests algebra, geometry, data analysis, and quantitative reasoning. Finally, the analytical-writing section assesses the respondent’s ability to write clearly, effectively, logically, coherently, and competently.
Key criticisms of standardized testing that have generated widespread sociological interest are: (1) the neglect of environmental differences among students, particularly those associated with cultural and racial differences; and (2) testing bias and validity. Criticisms of cultural and racial bias abound within the literature. One notable example is Tukufu Zuberi’s Thicker than Blood (2001). Zuberi contends that the IQ test, the predecessor of modern standardized testing, developed out of the eugenics movement. This movement was committed to identifying biological differences between the races and classes. Proponents posited that racial inequalities in society were biologically determined because whites were perceived to be genetically superior. According to Zuberi, IQ tests provided statistical support for eugenics because white students scored higher on these tests than black and immigrant students.
During this period, many scholars argued that IQ tests, which measured math and verbal skills, accurately reflected biological differences in intelligence. Scholars influenced by this tradition purport that differences in test scores between blacks and whites reflect inherent biological differences between the races (Herrnstein and Murray 1994). However, sufficient data have not been provided to support this hypothesis. Today, most scholars acknowledge that standardized testing is biased and reflects more than biological differences between students.
Christopher Jencks, an influential scholar in this debate, has identified multiple biases in standardized testing. First, he argues that standardized tests neglect environmental differences between students, which creates bias. Comparisons of scores among racial groups are problematic because the IQ test was originally designed to compare the mental ability of students who were raised in comparable environments with similar levels of educational opportunity. Yet mass testing neglects environmental differences between students. This proposition has received widespread empirical support. William Rodgers and William Spriggs (1996) offer one of the most methodologically sophisticated assessments of environmental background factors by showing that a consideration of family and educational background reduces racial differences in test scores. However, they also find that the impact of the environment on test scores varies by race. Furthermore, racial biases exist in the measurement of standardized tests because components of these tests have different long-term effects on individuals’ wages, depending on race and gender. Thus, Rodgers and Spriggs argue that standardized tests are racially biased because they measure different factors for different races.
Standardized tests are also biased in content (Jencks 1998). This is obvious when considering the language of the test. For students whose primary language is not English, standardized testing measures both English proficiency and scholastic achievement. As a result, these tests do not accurately reflect the achievement or readiness for college of language-minority students (LaCelle-Peterson 2000). Less-obvious content biases are prevalent in vocabulary words and essay topics.
Standardized tests are also biased methodologically if they claim (or are assumed) to measure ability because groups historically subjugated in society, including blacks, women, and the poor, are disadvantaged in this situation (Jencks 1998). Indeed, researchers have found that these tests create anxiety among African Americans and students of low socioeconomic status, who underperform on tests perceived to measure intellectual ability (Croizet and Dutrevis 2004; Steele and Aronson 1998). Additionally, women underperform when gender stereotypes are made salient (Benbow 1988).
Judith Blau and colleagues argue that blacks and whites place different significance on achievement tests. Whites believe that these tests measure ability, while blacks perceive unfair discrimination in testing practices. Thus, they conclude that black students and their parents place less weight on standardized test scores when considering postsecondary educational goals. Blau finds that test scores are a better predictor of educational attainment for white students than for black students. Furthermore, low-scoring black students are more likely than low-scoring white students to pursue postsecondary education. Thus, low scores are more likely to discourage white students, suggesting cultural differences in the value placed on tests (Blau et al. 2003). Further research is needed to determine how Blau’s theory applies to gender and class issues. Preliminary research suggests that females place less value on mathematical portions of standardized test scores due to stereotype threats (Lesko and Corpus 2006).
Differences in the ability of standardized tests to predict future outcomes highlight an additional criticism of standardized testing: The tests are not valid because they are not accurate predictors of students’ success in college or graduate school (Jencks 1998). Indeed, many scholars have found that standardized test scores do not predict grade point average in college (Gandara and Lopez 1998; Fleming 2000, 2002) or in graduate school (Oldfield 1994, 1996), and they do not predict success in the labor market (Blackburn 2004; Rodgers and Spriggs 1996).
The effect of the debates on standardized testing is evident. The title of the SAT has changed multiple times from the Scholastic Aptitude Test (a test of ability) to the Scholastic Assessment Test (this more general term suggests that the test measures more than ability) and finally to simply the SAT. In addition, the College Board has altered testing questions on the SAT to reduce cultural bias introduced from disparate knowledge and interests between groups in society. Furthermore, it has cut sections of the test to reduce reliance on vocabulary and increase reliance on verbal problem-solving skills. Even with these changes, however, racial and gender disparities persist. In 2004 the average SAT verbal score was 508 for college-bound high school seniors, ranging from 430 for black seniors to 451 and 528 for Mexican American and white seniors, respectively. Similarly, mathematics scores ranged from an average of 427 for black students to 531 for white students. ACT scores also vary by race. Average English ACT scores were 20.4 in 2004, ranging from 17.2 for black students to 18.3 for Mexican-American students and 22.5 for white students (Freeman and Fox 2005).
The persistence of the race gap is attributable to differences in family background and educational opportunity. Black students are generally raised in families with fewer resources than white students. Indeed, according to Melvin Oliver and Thomas Shapiro (1995), 63 percent of black households have zero or negative financial assets, meaning that their debt outweighs their assets. Only 28 percent of white families have negative financial assets. Furthermore, white median net worth (defined as the sum of all assets minus debt) is nearly twelve times black median net worth. This has important implications for test scores because students raised in families with greater wealth have the financial resources to prepare for standardized testing and attend college. Indeed, parental wealth and education are the two most important predictors of college attendance (Conley 1999).
The racial gap in test scores also persists because black students have fewer opportunities to prepare for the test. Schooling in the United States is highly segregated by race and socioeconomic status. Roslyn Mickelson (2006) found that predominantly black schools offer fewer SAT prep courses than integrated or predominantly white schools. Furthermore, even when black and white students study in the same schools (i.e., in integrated schools), they are offered different educational opportunities because they are grouped into classes by ability. Black students are more likely to be assigned to “lower-ability” classes than white students with the same grades and test scores. These classes are often taught by less-experienced teachers, and the courses offer a more general education rather than a college-preparatory education. Thus, socioeconomic resources and educational opportunities explain the existing gap in standardized test scores by race.
As for gender, the standardized test score disparity is not uniform. Historically, boys and girls had equivalent verbal scores, but boys scored higher in math (Benbow 1988). The math score gap has diminished over time, in part because girls’ educational opportunities have expanded, and they are taking more advanced math courses in high school. In 2004 boys and girls scored 538 and 504, respectively, on the math section of the SAT. Much of this remaining gender gap in test scores develops during high school because women continue to study in less rigorous math courses, and they are less likely than boys to participate in mathematically oriented extracurricular activities (Leahey and Guo 2001; Pallas and Alexander 1983; Vogt Yuan 2005).
It is important to understand what standardized tests measure because standardized testing has gained national prominence with the passage of the No Child Left Behind Act in 2002. This policy initiative requires standardized testing for students in the third through eighth grades and at least once during high school. The primary goal of the legislation is to reduce achievement gaps between students, particularly by race, poverty status, disability, ethnicity, and English proficiency. This act magnifies the significance of standardized testing. If the environmental differences, testing biases, and validity issues discussed here are neglected, standardized testing will be of limited use to educators and policymakers as they seek to close achievement gaps.
SEE ALSO Education, USA; Race and Education
Benbow, Camilla Persson. 1988. Sex Differences in Mathematical Reasoning Ability in Intellectually Talented Preadolescents: Their Nature, Effects, and Possible Causes. Behavioral and Brain Sciences 11: 169–183.
Blackburn, M. L. 2004. The Role of Test Scores in Explaining Race and Gender Differences in Wages. Economics of Education Review 23: 555–576.
Blau, Judith, Stephanie Moller, and Lyle V. Jones. 2003. Going to College. In Race in the Schools: Perpetuating White Dominance? ed. Judith R. Blau, 177–202. Boulder, CO: Lynne Rienner.
Chandler, Michael, dir. 1999. Frontline: Secrets of the SAT. Boston. WGBH Educational Foundation. http://www.pbs.org/wgbh/pages/frontline/shows/sats/.
College Board. http://www.collegeboard.com.
Conley, Dalton. 1999. Being Black, Living in the Red: Race, Wealth, and Social Policy in America. Berkeley: University of California Press.
Croizet, Jean-Claude, and Marion Dutrevis. 2004. Socioeconomic Status and Intelligence: Why Test Scores Do Not Equal Merit. Journal of Poverty 8: 91–107.
Educational Testing Service: GRE—Graduate Record Examinations. http://www.ets.org/gre.
Fleming, Jacqueline. 2000. Affirmative Action and Standardized Test Scores. Journal of Negro Education 69: 27–37.
Fleming, Jacqueline. 2002. Who Will Succeed in College? When the SAT Predicts Black Students’ Performance. Review of Higher Education 25: 281–296.
Freeman, Catherine, and Mary Ann Fox. 2005. Status and Trends in the Education of American Indians and Alaska Natives. NCES 2005–108. Washington, DC: National Center for Education Statistics, Department of Education.
Gandara, Patricia, and Elias Lopez. 1998. Latino Students and College Entrance Exams: How Much Do They Really Matter? Hispanic Journal of Behavioral Sciences 21: 17–38.
Herrnstein, Richard J., and Charles Murray. 1994. The Bell Curve: Intelligence and Class Structure in American Life. New York: Free Press.
Jencks, Christopher. 1998. Racial Bias in Testing. In The Black-White Test Score Gap, eds. Christopher Jencks and Meredith Phillips, 55–85. Washington, DC: Brookings Institution Press.
LaCelle-Peterson, Mark. 2000. Choosing Not to Know: How Assessment Policies and Practices Obscure the Education of Language Minority Students. In Assessment: Social Practice and Social Product, ed. Ann Filer, 27–42. London: Routledge.
Leahey, Erin, and Guang Guo. 2001. Gender Differences in Mathematical Trajectories. Social Forces 80: 713–732.
Lemann, Nicholas. 1999. The Big Test: The Secret History of the American Meritocracy. New York: Farrar, Straus and Giroux.
Lesko, Alexandra, and Jennifer H. Corpus. 2006. Discounting the Difficult: How High Math-Identified Women Respond to Stereotype Threat. Sex Roles: A Journal of Research 54: 113–125.
Mickelson, Roslyn. 2006. Segregation and the SAT. Ohio State Law Journal 67: 157–199.
National Center for Education Statistics. 2006. Digest of Education Statistics: 2005. NCES 2006–030. Washington, DC: Department of Education. http://nces.ed.gov/programs/digest/d05.
National Center for Education Statistics. 2005. Trends in Educational Equity of Girls and Women: 2004. NCES 2005–016. Washington, DC: Department of Education. http://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2005016.
Oldfield, Kenneth. 1994. On the Importance of Informing Students about the Potential Risk Associated with Taking the Graduate Record Exam. Journal of Thought 29: 61–70.
Oldfield, Kenneth. 1996. The Political and Economic Reasons the Graduate Record Examination Persists Despite Its Generally Low Predictive Validity. Journal of Thought 31: 55–68.
Oliver, Melvin L., and Thomas M. Shapiro. 1995. Black Wealth/White Wealth: A New Perspective on Racial Inequality. New York: Routledge.
Pallas, Aaron M., and Karl L. Alexander. 1983. Sex Differences in Quantitative SAT Performance: New Evidence on the Differential Coursework Hypothesis. American Educational Research Journal 20: 165–182.
Rodgers, William M., and William E. Spriggs. 1996. What Does the AFQT Really Measure: Race, Wages, Schooling, and the AFQT Score. Review of Black Political Economy 24: 13–47.
Steele, Claude, and Joshua Aronson. 1998. Stereotype Threat and the Test Performance of Academically Successful African Americans. In The Black-White Test Score Gap, eds. Christopher Jencks and Meredith Phillips, 401–430. Washington, DC: Brookings Institution Press.
Vogt Yuan, Anastasia. 2005. Sex Differences in School Performance During High School: Puzzling Patterns and Possible Explanations. Sociological Quarterly 46: 299–321.
Zuberi, Tukufu. 2001. Thicker than Blood: How Racial Statistics Lie. Minneapolis: University of Minnesota Press.
"Standardized Tests." International Encyclopedia of the Social Sciences. Encyclopedia.com. (October 18, 2017). http://www.encyclopedia.com/social-sciences/applied-and-social-sciences-magazines/standardized-tests
Standardized tests are administered in order to measure the aptitude or achievement of the people tested. A distribution of scores for all test takers allows individual test takers to see where their scores rank among others. Well-known examples of standardized tests include "IQ" (intelligence quotient) tests, the PSAT (Preliminary SAT) and SAT (originally the Scholastic Aptitude Test) taken by high school students, the GRE (Graduate Record Examination) taken by college students applying to graduate school, and the various admission tests required for business, law, and medical schools.
The "Normal" Curve
The mathematics behind the distribution of scores on standardized tests comes from the fields of probability theory and mathematical statistics. A cornerstone of this mathematical theory is the "Central Limit Theorem," which states that the sum or average of a large number of independent observations will approximately follow the bell-shaped normal probability curve illustrated below. Because a test score is built up from the results on many individual questions, the scores of a large group of test takers tend to be roughly bell-shaped. This means that most of the scores will cluster symmetrically around the mean or average value of all the scores, with fewer scores farther away from the mean value.
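One way to see the bell shape emerge is a small simulation. The sketch below is only an illustration under stated assumptions: each hypothetical test taker answers 100 items, and each item earns a random credit between 0 and 1. The function names, the number of items, and the scoring rule are all invented for the example and do not describe any real test.

```python
import random
import statistics

def simulate_scores(n_takers=10_000, n_items=100, seed=0):
    """Simulate total scores. Assumption: each item earns a uniform 0..1 credit."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n_items)) for _ in range(n_takers)]

def share_within(scores, k):
    """Fraction of scores within k sample standard deviations of the sample mean."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    return sum(abs(s - mean) <= k * sd for s in scores) / len(scores)

scores = simulate_scores()
for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(scores, k):.3f}")
```

Although no single item is normally distributed, the totals should cluster around their mean in roughly the proportions expected of a normal curve.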
One measure of the spread or dispersion of the observations is called the standard deviation. According to the statistical theory illustrated above, about 68 percent of all observations will lie within plus or minus one standard deviation of the mean; about 95 percent will lie within plus or minus two standard deviations of the mean (see graph below); and about 99.7 percent will lie within plus or minus three standard deviations of the mean. Standardized test scores are examples of observations that have this property.
Consider, for example, a standardized test for which the mean score is 500 and the standard deviation is 100. This means that about 68 percent of all test takers will have scores that fall between 400 and 600; 95 percent will have scores between 300 and 700; and virtually all of the scores will fall between 200 and 800. In fact, many standardized tests, including the PSAT and SAT, have just such a scale on which 200 and 800 are the minimum and maximum scores, respectively, that will be given.
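These percentages can be checked numerically. The following sketch uses the `NormalDist` class from Python's standard `statistics` module and assumes the scores are exactly normal with mean 500 and standard deviation 100; the helper name `share_between` is introduced here only for illustration.

```python
from statistics import NormalDist

# Assumption: scores follow an exact normal curve with mean 500, sd 100.
scores = NormalDist(mu=500, sigma=100)

def share_between(dist, low, high):
    """Fraction of a normal distribution lying between low and high."""
    return dist.cdf(high) - dist.cdf(low)

for k in (1, 2, 3):
    low, high = 500 - k * 100, 500 + k * 100
    pct = 100 * share_between(scores, low, high)
    print(f"{low} to {high}: {pct:.1f} percent")
```

The three ranges come out at about 68.3, 95.4, and 99.7 percent, which matches the rounded figures quoted above.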
The "standardized" in standardized tests means that similar scores must represent the same level of performance from year to year. Statisticians and test creators work together to ensure that if, for example, a student scores 650 on one version of the SAT as a junior and 700 on a different version as a senior, the difference truly represents a gain in achievement rather than a difference in difficulty between the two versions.
By "embedding" some questions that are identical in all versions of a test and analyzing the performance of each group on those common questions, test creators can ensure a level of standardization. If one group scores significantly lower on the common questions, this is interpreted to mean that the lower scoring group is not as strong as the higher scoring group.
If group A scores higher than group B on questions identical to both their tests but then scores the same or lower than group B on the complete test, it would be assumed that the test given to group A was more difficult than that given to group B. Statisticians can develop a mathematical formula that will correct for such a variance in the difficulty of tests.
Such a formula would be applied to the "raw" scores of the test takers in order to obtain "scaled" scores for both groups. These scaled scores could then be compared. A scaled score of 580 on version A means the same thing as a scaled score of 580 on version B, even though the raw scores may be different. In this sense the scores are said to have been "standardized."
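The article does not say which correction formula test makers use. As one possible illustration, the sketch below applies a simple chained linear adjustment: raw scores on version A are first matched to the common-question scores of group A, and those are then matched to the raw scores of group B on version B. All data and function names are hypothetical, and operational testing programs may use quite different methods.

```python
from statistics import fmean, pstdev

def linear_link(from_scores, to_scores):
    """Return f(x) that matches the mean and spread of from_scores to to_scores."""
    m_from, s_from = fmean(from_scores), pstdev(from_scores)
    m_to, s_to = fmean(to_scores), pstdev(to_scores)
    return lambda x: m_to + (s_to / s_from) * (x - m_from)

def chained_equate(total_a, common_a, total_b, common_b):
    """Map a raw score on version A onto the raw-score scale of version B."""
    a_to_common = linear_link(total_a, common_a)  # group A: total -> common questions
    common_to_b = linear_link(common_b, total_b)  # group B: common questions -> total
    return lambda x: common_to_b(a_to_common(x))

# Hypothetical raw data: both groups do equally well on the common questions,
# but group A's totals are 6 points lower, so version A looks harder.
total_a = [38, 42, 45, 50, 55]
common_a = [12, 13, 14, 16, 17]
total_b = [44, 48, 51, 56, 61]
common_b = [12, 13, 14, 16, 17]

to_b_scale = chained_equate(total_a, common_a, total_b, common_b)
print(round(to_b_scale(45), 1))
```

With these made-up numbers, a raw 45 on version A lines up with a raw 51 on version B, so the two raw scores would receive the same scaled score.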
A second meaning of "standardized" is more subtle, more mathematically involved, and not well understood by the general public. This meaning has to do with the bell-shaped normal probability curve mentioned at the beginning of this article. Theoretically, there are an infinite number of normal curves, one for each different set of observations that might be made. Mathematicians would say that there is an entire "family" of normal curves and that the members of the normal curve family share similarities as well as differences.
All bell-shaped curves are high in the middle and slope down to long "tails" to the right and left. Although different types of observations will have different mean values, those mean values will always occur at the middle of the distributions. They may also have different standard deviations as discussed earlier, but the percentage of values lying between plus or minus one of those standard deviations will still be about 68 percent, the percentage of values lying between plus or minus two standard deviations will still be about 95 percent, and so on.
In order to make the analysis of normal distributions simpler, statisticians have agreed upon one particular normal curve that will represent all the rest. This special normal curve has a mean of 0 and a standard deviation of 1 and is called the "standard normal curve." A "standardized" test result, therefore, is one based on the use of a standard normal curve as its reference.
The advantage of having the standard normal curve represent all the other normal curves is that statisticians can then construct a single table of probabilities that can be applied to all normal distributions. This can be done by "mapping" those distributions onto the standard normal curve and making use of its probability table. The term "mapping" in mathematics refers to the transformation of one set of values to a different set of values.
To illustrate, consider the test with a mean of 500 and a standard deviation of 100. The mean of this set of scores lies 500 units to the right of the standard normal distribution's mean of 0. So to "map" the mean of the test scores onto the standard normal mean, 500 is subtracted from all the test scores. Now there is a new distribution with the correct mean but the wrong standard deviation.
To correct this, all of the scores in the new distribution are divided by 100, since $100 / 100 = 1$, which is the standard deviation of the standard normal distribution. The two distributions are now identical. In mathematical terms the test scores have been "mapped" onto the standard normal values.
This mapping is composed of two transformations: a translation of 500 to the left and a scale change of 1/100. This composition can be represented by $\frac{x - 500}{100}$, where $x$ is any test score.
Building on this example, suppose one wants to know the percentage of test takers who scored 650 or above. First, compute $\frac{650 - 500}{100} = 1.5$. Then go to a standard normal table, look up a standard score of 1.5, and see that about 6.68 percent of standard normal scores are at 1.5 or above. This means that about 6.68 percent of the test scores are 650 or higher. This procedure may be used with any normally distributed data set for which the mean and standard deviation are known.
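The table lookup can also be done in software. The sketch below is a minimal illustration using `NormalDist` from Python's standard `statistics` module; the helper names `standard_score` and `share_at_or_above` are introduced here for the example and assume the scores are exactly normal.

```python
from statistics import NormalDist

def standard_score(x, mean, sd):
    """Map a score onto the standard normal scale: subtract the mean, divide by the sd."""
    return (x - mean) / sd

def share_at_or_above(x, mean, sd):
    """Fraction of a normal distribution at or above x."""
    z = standard_score(x, mean, sd)
    return 1 - NormalDist().cdf(z)  # NormalDist() is the standard normal curve

z = standard_score(650, 500, 100)
p = share_at_or_above(650, 500, 100)
print(z, round(100 * p, 2))
```

For a score of 650 this prints a standard score of 1.5 and an upper-tail share of 6.68 percent. Changing the arguments applies the same procedure to any normal distribution whose mean and standard deviation are known.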
see also Central Tendency, Measures of; Mapping, Mathematical; Statistical Analysis; Transformations.
Angoff, William. "Calibrating College Board Scores." In Statistics: A Guide to the Unknown, ed. Judith Tanur, Frederick Mosteller, William H. Kruskal, Richard F. Link, Richard S. Pieters, and Gerald R. Rising. San Francisco: Holden-Day, Inc., 1978.
Blakeslee, David W., and William Chin. Introductory Statistics and Probability. Boston: Houghton Mifflin Company, 1975.
Narins, Brigham, ed. World of Mathematics. Detroit: Gale Group, 2001.
"Standardized Tests." Mathematics. Encyclopedia.com. (October 18, 2017). http://www.encyclopedia.com/education/news-wires-white-papers-and-books/standardized-tests
A test administered to a group of subjects under exactly the same experimental conditions and scored in exactly the same way.
Standardized tests are used in psychology, as well as in everyday life, to measure intelligence, aptitude, achievement, personality, attitudes, and interests. Attempts are made to standardize tests in order to eliminate biases that may result, consciously or unconsciously, from varied administration of the test. Standardized tests are used to produce norms, or statistical standards, that provide a basis for comparisons among individual members of the group of subjects. Tests must be standardized, reliable (give consistent results), and valid (measure what they are intended to measure) before they can be considered useful psychological tools.
Standardized tests are highly controversial both in psychological circles and particularly in education because true standardization is difficult to attain. Certain requirements must be rigidly enforced. For example, subjects must be given exactly the same amount of time to take the test. Directions must be given using precisely the same wording from group to group, with no embellishments, encouragement, or warnings. Scoring must be exact and consistent. Even an unwitting joke spoken by the test administrator that relaxes the subjects or giving a test in a room that is too hot or too cold could be considered violations of standardization specifications. Because of the difficulty of meeting such stringent standards, standardized tests are widely criticized.
Critics of the use of standardized tests for measuring educational achievement or classifying children object for other reasons as well. First, they say the establishment of norms does not give enough specific information about what children know; rather, norms reveal the average level of knowledge. Second, critics contend that such tests encourage educators and the public to focus their attention on groups rather than on individuals. Improving test scores to enhance public image or secure public funding becomes more of a focus than teaching individual children the skills they need to advance. Another criticism is that the tests, by nature, cannot measure knowledge of complex skills such as problem solving and critical thinking. "Teaching to the test" (drilling students in how to answer fill-in-the-blank or multiple-choice questions) takes precedence over instruction in more practical, less objective skills such as writing or logic.
Achievement tests, I.Q. tests, and the Stanford-Binet intelligence scales are examples of widely used standardized tests.
Houts, Paul L., ed. The Myth of Measurability. New York: Hart Publishing Co., 1977.
Wallace, Betty, and William Graves. Poisoned Apple: The Bell-Curve Crisis and How Our Schools Create Mediocrity and Failure. New York: St. Martin's Press, 1995.
Zimbardo, Philip G. Psychology and Life. Glenview, IL: Scott, Foresman, 1988.
"Standardized Test." Gale Encyclopedia of Psychology. Encyclopedia.com. (October 18, 2017). http://www.encyclopedia.com/medicine/encyclopedias-almanacs-transcripts-and-maps/standardized-test