STANDARDIZED TESTS AND EDUCATIONAL POLICY
Johanna V. Crighton
STANDARDIZED TESTS AND HIGH-STAKES ASSESSMENT
Lorrie A. Shepard
STATEWIDE TESTING PROGRAMS
TEST PREPARATION PROGRAMS, IMPACT OF
NATIONAL ACHIEVEMENT TESTS, INTERNATIONAL
Ann E. Flanagan
INTERNATIONAL STANDARDS OF TEST DEVELOPMENT
STANDARDIZED TESTS AND EDUCATIONAL POLICY
The term standardized testing originally referred to a certain type of multiple-choice or true/false test that could be machine-scored and was therefore thought to be "objective." This type of standardization is no longer considered capable of capturing the full range of skills candidates may possess. In the early twenty-first century it is more useful to speak of standards-based or standards-linked assessment, which seeks to determine to what extent a candidate meets certain specified expectations or standards. The format of the test or examination matters less than how well it elicits from the candidate the kind of performance that can supply that information.
The use of standardized testing for admission to higher education is increasing, but is by no means universal. In many countries, school-leaving (exit) examinations are nonstandardized affairs conducted by schools, with or without government guidelines, while university entrance exams are set by each university or each university department, often without any attempt at standardization across or within universities. The functions of certification (completion of secondary school) and selection (for higher or further education) are separate and frequently noncomparable across institutions or over time. In most countries, however, a school-leaving certificate is a necessary but not a sufficient condition for university entrance.
In the United States, states began using high school exit examinations in the late 1970s to ensure that students met minimum state requirements for graduation. In 2001 all states were at some stage of implementing a graduation exam. These are no longer the "minimum competency tests" of the 1970s and 1980s; they are based on curriculum and performance standards developed in all fifty states. All students should be able to demonstrate that they have reached performance standards before they leave secondary school.
Certification examinations based on state standards have long been common in European countries. They may be set centrally by the state and conducted and scored by schools (France); set by the state and conducted and scored by an external or semi-independent agency (the Netherlands, the United Kingdom, Romania, Slovenia); or set, conducted, and scored entirely within the schools themselves (Russian Federation), often in accordance with government guidelines but with no attempt at standardization or comparability. Since the objective is to certify a specified level of learning achieved, these exit examinations are strongly curriculum-based, essentially criterion-referenced, and ideally all candidates should pass. They are thus typically medium- or low-stakes, and failing students have several opportunities to retake the examination. Sometimes weight is given to a student's in-school performance as well as to exam results; in the Netherlands, for example, the weighting is 50-50, giving students scope to show a range of skills not easily measured by examination.
In practice, however, when constructing criteria for a criterion-referenced test, norm-referencing is unavoidable. Hidden behind each criterion lie norm-referenced data: assumptions about how the average child in that particular age group can be expected to perform. Pure criterion-referenced assessment is rare, and it would be better to think of assessment as a hybrid of norm- and criterion-referencing. The same is true of setting standards, especially if they have to be reachable by students of varying ability: one has to know something about the norm before one can set a meaningful standard.
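The distinction between the two approaches can be made concrete with a small illustration (the scores, cutoff, and selection rate here are invented, and Python is used purely for demonstration): a criterion-referenced judgment compares each score against a fixed standard, while a norm-referenced judgment compares each score against the rest of the group.

```python
# Hypothetical test scores for a group of ten candidates (0-100 scale).
scores = [42, 55, 61, 68, 70, 74, 79, 83, 88, 95]

# Criterion-referenced: pass/fail against a fixed standard (cutoff 65),
# regardless of how anyone else performed. In principle, all could pass.
CUTOFF = 65
criterion_results = ["pass" if s >= CUTOFF else "fail" for s in scores]

# Norm-referenced: select "the best" relative to the group, e.g. the
# top 30 percent, however high or low their absolute scores are.
n_selected = round(0.3 * len(scores))
threshold = sorted(scores, reverse=True)[n_selected - 1]
norm_results = ["selected" if s >= threshold else "not selected"
                for s in scores]

print(criterion_results)
print(norm_results)
```

Note that raising every score by twenty points would change the criterion-referenced outcomes but leave the norm-referenced selection untouched, which is precisely why the two purposes (certification versus selection) call for different referencing.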
By contrast, university entrance examinations aim to select some candidates rather than others and are therefore norm-referenced: the objective is not to determine whether all have reached a set standard, but to select "the best." In most cases, higher education expectations are insufficiently linked to K–12 standards.
Entrance exams are typically academic and high-stakes, and opportunities to retake them are limited. Where entrance exams are set by individual university departments rather than by an entire university or group of universities, accountability for selection is limited, and failing students have little or no recourse. University departments are unwilling to relinquish what they see as their autonomy in selecting entrants; moreover, the lack of accountability and the often lucrative system of private tutoring for entrance exams are barriers to a more transparent and equitable process of university selection.
In the United States the noncompulsory SAT, administered by the Educational Testing Service (ETS), is the most familiar example. SAT I consists of quantitative and verbal reasoning tests, and for a number of years ETS insisted that these were curriculum-free and could not be studied for. Indeed they used to be called "Scholastic Aptitude Tests" because they were said to measure candidates' aptitude for higher-level studies. The emphasis on predictive validity has since lessened; test formats include a wider range of question types aimed at eliciting more informative student responses; and the link with curriculum and standards is reflected in SAT II, a set of subject tests in high school subjects such as English and biology. Fewer U.S. universities and colleges require SAT scores as part of their admission procedure in the early twenty-first century, though many still do.
Certification Combined with Selection
A number of countries (e.g., the United Kingdom, the Netherlands, Slovenia, and Lithuania) combine school-leaving examinations with university entrance examinations. Candidates typically take a national, curriculum-based, high school graduation exam in a range of subjects; the exams are set and marked (scored) by or under the control of a government department or a professional agency external to the schools; and candidates offer their results to universities as the main or sole basis for selection. Students take only one set of examinations, but the question papers and scoring methods are based on known standards and are nationally comparable, so that an "A" gained by a student in one part of the country is comparable to an "A" gained elsewhere. Universities are still entitled to set their own entrance requirements such as requiring high grades in biology, chemistry, and physics for students who wish to study medicine, or accepting lower grades in less popular disciplines or for admittance to less prestigious institutions.
Trends in Educational Policy: National Standards and Competence
Two main trends are evident worldwide. The first is a move towards examinations linked to explicit national (or state) standards, often tacitly aligned with international expectations, such as Organisation for Economic Co-operation and Development (OECD) indicators or the results of multinational assessments such as the Third International Mathematics and Science Study (TIMSS) or similar studies in reading literacy and civics. The second trend is towards a more competence-based approach to education in general, and to assessment in particular: less emphasis on what candidates can remember and more on what they understand and can do.
Standards. The term standards here refers to official, written guidelines that define what a country or state expects its state school students to know and be able to do as a result of their schooling.
In the United States all fifty states now have student testing programs, although the details vary widely. Few states, however, have testing programs that explicitly measure student achievement against state standards, despite claims that they do. (Some states ask external assessors to evaluate the alignment of tests to standards.) Standards are also used for school and teacher accountability purposes; about half the states rate schools primarily on the basis of student test scores or test score gains over time, and decisions to finance, close, take over, or otherwise overhaul chronically low-performing schools can be linked to student results. Much debate centers on whether tests designed for one purpose (measuring student learning) can fairly be used to judge teachers, schools, or education systems.
The same debate is heard in England and Wales, where student performance on national curriculum "key stage" testing at ages seven, eleven, fourteen, and sixteen has led to the publication of "league tables" listing schools in order of their students' performance. A painstaking attempt to arrive at a workable "value-added" formula that would take account of a number of social and educational variables ended in failure. The concept of "value added" involves linking a baseline assessment to subsequent performance: the term refers to relative progress of pupils or how well pupils perform compared to other pupils with similar starting points and background variables. A formula developed to measure these complex relationships was scientifically acceptable but judged too laborious for use by schools. Nevertheless, league tables are popular with parents and the media and remain a feature of standards-based testing in England and Wales.
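The value-added idea described above can be sketched in miniature (the data here are invented, and the actual U.K. formula adjusted for many more background variables): regress an outcome score on a baseline score, and treat each pupil's residual, the distance above or below what pupils with similar starting points achieved, as the "value added."

```python
# Value-added sketch: residual from a baseline-to-outcome regression
# measures relative progress. Data are invented for illustration; real
# models also adjust for social and educational background variables.
baseline = [40, 45, 50, 55, 60, 65, 70, 75]
outcome = [48, 50, 60, 58, 70, 72, 74, 85]

n = len(baseline)
mean_b = sum(baseline) / n
mean_o = sum(outcome) / n

# Ordinary least-squares slope and intercept for outcome ~ baseline.
slope = (sum((b - mean_b) * (o - mean_o) for b, o in zip(baseline, outcome))
         / sum((b - mean_b) ** 2 for b in baseline))
intercept = mean_o - slope * mean_b

# A pupil's "value added" is the residual: actual minus expected outcome.
value_added = [o - (intercept + slope * b) for b, o in zip(baseline, outcome)]

for b, o, v in zip(baseline, outcome, value_added):
    print(f"baseline {b}: outcome {o}, value added {v:+.1f}")
```

Even in this toy form, the residuals sum to zero by construction: value added is inherently relative, which is one reason a defensible operational formula proved so laborious.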
Most countries in Central and Eastern Europe are likewise engaged in formulating educational standards, but standards still tend to be expressed in terms of content covered and hours on the timetable ("seat time") for each subject rather than student outcomes. When outcomes are mentioned, it is often in unmeasurable terms: "[Candidates] must be familiar with … the essence, purpose, and meaning of human life [and] … the correlation between truth and error" (State Committee for Higher Education, Russian Federation, p. 35).
Competence. The shift from content and "seat-time" standards to specifying desired student achievement expressed in operational terms ("The student will be able to … ") is reflected in new types of performance-based assessment where students show a range of skills as well as knowledge. Portfolios or coursework may be assessed as well as written tests. It has been argued that deconstructing achievement into a list of specified behaviors that can be measured misses the point: that learning is a subtle process that escapes formulas people seek to impose on it. Nevertheless, the realization that it is necessary to focus on outcomes (and not only on input and process) of education is an important step forward.
Apart from these conceptual shifts, many countries are also engaged in practical examination reform. Seven common policy issues are: (1) changing concepts and techniques of testing (e.g., computer-based interactive testing on demand); (2) a shift to standards- and competence-based tests; (3) changed test formats and question types (e.g., essays rather than multiple-choice); (4) more inclusive target levels of tests; (5) standardization of tests; (6) independent, external administration of tests; and (7) convergence of high school exit exams and university entrance exams.
Achieving Policy Goals
In terms of monitoring the achievement of education policy goals, standards-linked diagnostic and formative (national, whole-population, or sample-based) assessments at set points during a student's schooling are clearly more useful than scores on high-stakes summative examinations at the end. Trends in annual exam results can still be informative to policy makers, but they come too late for students themselves to improve performance. Thus the key-stage approach used in the United Kingdom provides better data for evidence-based policy making and more helpful information to parents than, for example, the simple numerical scores on SAT tests, which in any case are not systematically fed back to schools or education authorities.
However, the U.K. approach is expensive and labor-intensive. The best compromise might be sample-based periodic national assessments of a small number of key subjects (for policy purposes), plus a summative, curriculum- and standards-linked examination at the end of a major school cycle (for certification and selection).
See also: International Assessments; Standards for Student Learning; Standards Movement in American Education; Testing, subentries on International Standards of Test Development, National Achievement Tests, International.
Cresswell, Michael J. 1996. "Defining, Setting and Maintaining Standards in Curriculum-Embedded Examinations: Judgmental and Statistical Approaches." In Assessment: Problems, Developments and Statistical Issues, ed. Harvey Goldstein and Toby Lewis. London: Wiley and Sons.
Dore, Ronald P. 1997. The Diploma Disease: Education, Qualifications and Development. London: George Allen and Unwin.
Eckstein, Max A., and Noah, Harold J., eds. 1992. Examinations: Comparative and International Studies. Oxford: Pergamon Press.
Green, Andy. 1997. Education, Globalization and the Nation State. Basingstoke, Eng.: Macmillan.
Heyneman, Stephen P. 1987. "Uses of Examinations in Developing Countries: Selection, Research and Education Sector Management." International Journal of Education Development 7 (4):251–263.
Heyneman, Stephen P., and Fagerlind, Ingemar, eds. 1988. University Examinations and Standardized Testing. Washington, DC: World Bank.
Heyneman, Stephen P., and Ransom, Angela. 1990. "Using Examinations to Improve the Quality of Education." Educational Policy 4 (3):177–192.
Little, Angela, and Wolf, Alison, eds. 1996. Assessment in Transition: Learning, Monitoring and Selection in International Perspective. Tarrytown, NY: Pergamon.
School Curriculum and Assessment Authority (SCAA). 1997. The Value Added National Project: Final Report. London: SCAA Publications.
State Committee for Higher Education, Russian Federation. 1995. State Educational Standards for Higher Professional Education. Moscow: Ministry of Education.
Tymms, Peter. 2000. Baseline Assessment and Monitoring in Primary Schools: Achievements, Attitudes and Value-added Indicators. London: David Fulton.
University of Cambridge Local Examinations Syndicate (UCLES). 1998. MENO Higher-Level Thinking Skills Test Battery. Cambridge, Eng.: UCLES Research and Development Division.
West, Richard, and Crighton, Johanna. 1999. "Examination Reform in Central and Eastern Europe: Issues and Trends." Assessment in Education: Principles, Policy and Practice 6 (2):71–289.
National Governors Association. 2002. "High School Exit Exams: Setting High Expectations." <www.nga.org/center/divisions/1,1188,C_ISSUE_BRIEF%5ED_1478,00.html>.
National Center for Public Policy and Higher Education. 2002. "Measuring Up 2000." <http://measuringup2000.highereducation.org/>.
Johanna V. Crighton
STANDARDIZED TESTS AND HIGH-STAKES ASSESSMENT
Assessment is the process of collecting data to measure the knowledge or performance of a student or group. Written tests of students' knowledge are a common form of assessment, but data from homework assignments, informal observations of student proficiency, evaluations of projects, oral presentations, or other samples of student work may also be used in assessment. The word assessment carries with it the idea of a broader and more comprehensive evaluation of student performance than a single test.
In an age when testing is controversial, assessment has become the preferred term because of its connotation of breadth and thoroughness. The National Assessment of Educational Progress (NAEP) is an example of a comprehensive assessment worthy of the name. Also known as the "Nation's Report Card," NAEP administers achievement tests to a representative sample of U.S. students in reading, mathematics, science, writing, U.S. history, civics, geography, and the arts. The achievement measures used by NAEP in each subject area are so broad that each participating student takes only a small portion of the total assessment. Not all assessment programs, however, are of such high quality. Some administer much narrower and more limited tests, but still use the word assessment because of its popular appeal.
Standardized tests are tests administered and scored under a consistent set of procedures. Uniform conditions of administration are necessary to make it possible to compare results across individuals or schools. For example, it would be unfair if the performance of students taking a test in February were to be compared to the performance of students tested in May or if one group of students had help from their teacher while another group did not. The most familiar standardized tests of achievement are traditional machine-scorable, multiple-choice tests such as the California Achievement Test (CAT), the Comprehensive Tests of Basic Skills (CTBS), the Iowa Tests of Basic Skills (ITBS), the Metropolitan Achievement Test (MAT), and the Stanford Achievement Test (SAT). Many other assessments, such as open-ended performance assessments, personality and attitude measures, English-language proficiency tests, or Advanced Placement essay tests, may also be standardized so that results can be interpreted on a common scale.
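One common way to put results from uniformly administered tests onto a comparable scale is the standard score, which expresses each raw score relative to the mean and spread of a reference (norming) group. The sketch below uses invented scores and an arbitrary reporting scale; operational testing programs use far more sophisticated scaling and equating procedures.

```python
import statistics

# Raw scores from a hypothetical norming sample.
norm_sample = [31, 35, 38, 40, 42, 44, 47, 51, 54, 58]
mean = statistics.mean(norm_sample)       # 44.0
sd = statistics.pstdev(norm_sample)       # population SD of the sample

def z_score(raw):
    """Standard score: distance from the norm-group mean in SD units."""
    return (raw - mean) / sd

def scaled_score(raw, new_mean=500, new_sd=100):
    """Rescale to a familiar reporting scale (here mean 500, SD 100)."""
    return new_mean + new_sd * z_score(raw)

# Two students tested under the same standardized conditions can now
# be compared on the common scale.
print(round(scaled_score(54)))
print(round(scaled_score(38)))
```

Without uniform administration conditions, of course, the common scale is meaningless: the comparison is only as fair as the standardization behind it.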
High-stakes testing is a term that was first used in the 1980s to describe testing programs that have serious consequences for students or educators. Tests are high-stakes if their outcomes determine such important things as promotion to the next grade, graduation, merit pay for teachers, or school rankings reported in a newspaper. When test results have serious consequences, the requirements for evidence of test validity are correspondingly higher.
Purposes of Assessment
The intended use of an assessment–its purpose–determines every other aspect of how the assessment is conducted. Purpose determines the content of the assessment (What should be measured?); methods of data collection (Should the procedures be standardized? Should data come from all students or from a sample of students?); technical requirements of the assessment (What level of reliability and validity must be established?); and finally, the stakes or consequences of the assessment, which in turn determine the kinds of safeguards necessary to protect against potential harm from fallible assessment-based decisions.
In educational testing today, it is possible to distinguish at least four different purposes for assessment: (1) classroom assessment used to guide and evaluate learning; (2) selection testing used to identify students for special programs or for college admissions; (3) large-scale assessment used to evaluate programs and monitor trends; and (4) high-stakes assessment of achievement used to hold individual students, teachers, and schools accountable. Assessments designed for one of these purposes may not be appropriate or valid if used for another purpose.
In classrooms, assessment is an integral part of the teaching and learning process. Teachers use both formal and informal assessments to plan and guide instruction. For individual students, assessments help to gauge what things students already know and understand, where misconceptions exist, what skills need more practice in context, and what supports are needed to take the next steps in learning. Teachers also use assessment to evaluate their own teaching practices so as to adjust and modify curricula, instructional activities, or assignments that did not help students grasp key ideas. To serve classroom purposes, assessments must be closely aligned with what children are learning, and the timing of assessments must correspond to the specific days and weeks when children are learning specific concepts. While external accountability tests can help teachers examine their instructional program overall, external, once-per-year tests are ill-suited for diagnosis and targeting of individual student learning needs. The technical requirements for the reliability of classroom assessments are less stringent than for other testing purposes because assessment errors on any given day are readily corrected by additional information gathered on subsequent days.
Selection and placement tests may be used to identify students for gifted and talented programs, to provide services for students with disabilities, or for college admissions. Because selection tests are used to evaluate students with a wide variety of prior experiences, they tend to be more generic than standardized achievement tests so as not to presume exposure to a specific curriculum. Nonetheless, performance on selection measures is strongly influenced by past learning opportunities. Unlike IQ tests of the past, it is no longer assumed that any test can measure innate learning ability. Instead, measures of current learning and reasoning abilities are used as practical predictors of future learning; because all tests have some degree of error associated with them, professional standards require that test scores not be the sole determiner of important decisions. For example, college admissions tests are used in conjunction with high school grades and recommendations. School readiness tests are sometimes used as selection tests to decide whether children five years old should start school, but this is an improper use of the tests. None of the existing school readiness measures has sufficient reliability and validity to support such decisions.
Large-scale assessments, such as the National Assessment of Educational Progress (NAEP) or the Third International Mathematics and Science Study (TIMSS), serve a monitoring and comparative function. Assessment data are gathered about groups of students in the aggregate and can be used by policymakers to make decisions about educational programs. Because there is not a single national or international curriculum, assessment content must be comprehensive and inclusive of all of the curricular goals of the many participating states or nations. Obviously, no one student could be expected to master all of the content in a test spanning many curricula, but, by design, individual student scores are not reported in this type of assessment. As a result, the total assessment can include a much broader array of tasks and problem types to better represent the content domain, with each student being asked to complete only a small sample of tasks from the total set. Given that important policy decisions may follow from shifts in achievement levels or international comparisons of achievement, large-scale assessments must meet high standards of technical accuracy.
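The design in which each student completes only a sample of tasks, while the full content domain is still covered in aggregate, is often called matrix sampling. A toy sketch (the subject blocks, assignment rule, and sample sizes are invented for illustration):

```python
import random

# Matrix-sampling sketch: the full assessment is split into blocks,
# and each student is assigned only a couple of blocks at random.
blocks = ["algebra", "geometry", "measurement", "data", "number_sense"]
BLOCKS_PER_STUDENT = 2
N_STUDENTS = 1000

rng = random.Random(42)  # fixed seed so the sketch is reproducible
coverage = {b: 0 for b in blocks}
for _ in range(N_STUDENTS):
    for b in rng.sample(blocks, BLOCKS_PER_STUDENT):
        coverage[b] += 1

# Each student saw only 2 of 5 blocks, yet every block is taken by
# enough students to estimate group-level performance on the domain.
for b in blocks:
    print(b, coverage[b])
```

The trade-off is exactly the one described above: broad domain coverage for group-level reporting, at the cost of individual scores, since no single student's two blocks represent the whole domain.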
High-stakes assessments of achievement that are used to hold individual students, teachers, and schools accountable are similar to large-scale monitoring assessments, but clearly have very different consequences. In addition, these tests, typically administered by states or school districts, must be much more closely aligned with the content standards and curriculum for which participants are being held accountable. As a practical matter, accountability assessments are often more limited in the variety of formats and tasks included, both because each student must take the same test and because states and districts may lack the resources to develop and score more open-ended performance measures. Regardless of practical constraints, high-stakes tests must meet the most stringent technical standards because of the harm to individuals that would be caused by test inaccuracies.
A Short History of High-Stakes Testing
Accountability testing in the United States started in 1965 as part of the same legislation (Title I of the Elementary and Secondary Education Act [ESEA]) that first allocated federal funds to improve the academic achievement of children from low-income families. Federal dollars came with a mandate that programs be evaluated to show their effectiveness. The early accountability movement did not assume, however, that public schools were bad. In fact, the idea behind ESEA was to extend the benefits of an excellent education to poor and minority children.
The public's generally positive view of America's schools changed with the famous SAT test score decline of the early 1970s. Despite the fact that a blue-ribbon panel commissioned by the College Board in 1977 later found that two-thirds to three-fourths of the score decline was attributable to an increase in the number of poor and minority students gaining access to college, and not to a decline in the quality of education, all subsequent accountability efforts were driven by the belief that America's public schools were failing.
The minimum competency testing movement of the 1970s was the first in a series of educational reforms where tests were used not just as measures of the effectiveness of reforms, but also as the primary drivers of reform. Legislators mandated tests of minimum academic skills or survival skills (e.g., balancing a checkbook), intending to "put meaning back into the high school diploma." By 1980, thirty-seven states had taken action to mandate minimum competency standards for grade-to-grade promotion or high school graduation. It was not long, however, before the authors of A Nation at Risk (1983) concluded that minimum competency examinations were part of the problem, not part of the solution, because the "'minimum' [required of students] tends to become the 'maximum,' thus lowering educational standards for all" (p. 20).
Following the publication of A Nation at Risk, the excellence movement sought to ratchet up expectations by reinstating course-based graduation requirements, extending time in the school day and school year, requiring more homework, and, most importantly, requiring more testing. Despite the rhetoric of rigorous academic curricula, the new tests adopted in the mid-1980s were predominantly multiple-choice, basic-skills tests–a step up from minimum competency tests, but not much of one. By the end of the 1980s, evidence began to accrue showing that impressive score gains on these tests might not be a sign of real learning gains. For example, John Cannell's 1987 study, dubbed the "Lake Wobegon Report," showed that all fifty states claimed their test scores were above the national average.
Standards-based reforms, which began in the 1990s and continued at the start of the twenty-first century, were both a rejection and extension of previous reforms. Rejecting traditional curricula and especially rote activities, the standards movement called for the development of much more challenging curricula, focused on reasoning, conceptual understanding, and the ability to apply one's knowledge. At the same time, the standards movement continued to rely heavily on large-scale accountability assessments to leverage changes in instruction. However, standards-based reformers explicitly called for a radical change in the content and format of assessments to forestall the negative effects of "teaching the test." Various terms, such as authentic, direct, and performance-based, were used in standards parlance to convey the idea that assessments themselves had to be reformed to more faithfully reflect important learning goals. The idea was that if tests included extended problems and writing tasks, then it would be impossible for scores to go up on such assessments without there being a genuine improvement in learning.
Effects of High-Stakes Testing
By the end of the 1980s, concerns about dramatic increases in the amount of testing and potential negative effects prompted Congress to commission a comprehensive report on educational testing. This report summarized research documenting the ill effects of high-pressure accountability testing, including the finding that high-stakes testing led to test score inflation, meaning that test scores went up without a corresponding gain in student learning. Controlled studies showed that test score gains on familiar and taught-to tests could not be verified by independent tests covering the same content. High-stakes testing also led to curriculum distortion, which helped to explain how spurious score gains may occur. Interview and survey data showed that many teachers eliminated science and social studies, especially in high-poverty schools, because more time was needed for math and reading. Teaching to the test also involved rote drill in tested subjects, so that students were unable to use their knowledge in any other format.
It should also be noted that established findings from the motivational literature have raised serious questions about test-based incentive systems. Students who are motivated by trying to do well on tests, instead of working to understand and master the material, are consistently disadvantaged in subsequent endeavors. They become less intrinsically motivated, they learn less, and they are less willing to persist with difficult problems.
To what extent do these results, documented in the late 1980s, still hold true for standards-based assessments begun in the 1990s? Recent studies still show the strong influence that high-stakes tests have on what gets taught. To the extent that the content of assessments has improved, there have been corresponding improvements in instruction and curriculum. The most compelling evidence of positive effects is in the area of writing instruction. In extreme cases, writing has been added to the curriculum in classrooms, most often in urban settings, where previously it was entirely absent.
Unfortunately, recent studies on the effects of standards-based reforms also confirm many of the earlier negative effects of high-stakes testing. The trend to eliminate or reduce social studies and science, because state tests focused only on reading, writing, and mathematics, has been so pervasive nationwide that experts speculate it may explain the recent downturn in performance in science on NAEP. In Texas, Linda McNeil and Angela Valenzuela found that a focus on tested content and test-taking skills was especially pronounced in urban districts.
In contrast with previous analysts who used test-score gains themselves as evidence of effectiveness, it is now widely understood by researchers and policymakers that some independent confirmation is needed to establish the validity of achievement gains. For example, two different studies by researchers at the RAND Corporation used NAEP as an independent measure of achievement gains and documented both real and spurious aspects of test-score gains in Texas. A 2000 study by David Grissmer, Ann Flanagan, Jennifer Kawata, and Stephanie Williamson found that Texas students performed better than expected based on family characteristics and socioeconomic factors. However, a study by Stephen Klein and colleagues found that gains on NAEP were nothing like the dramatic gains reported on Texas's own test, the Texas Assessment of Academic Skills (TAAS). Klein et al. also found that the gap in achievement between majority and minority groups had widened for Texas students on NAEP whereas the gap had appeared to be closing on the TAAS. Both of these studies could be accurate, of course. Texas students could be learning more in recent years, but not as much as claimed by the TAAS. Studies such as these illustrate the importance of conducting research to evaluate the validity and credibility of results from high-stakes testing programs.
Professional Standards for High-Stakes Testing
The Standards for Educational and Psychological Testing (1999) is published jointly by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. The Standards establish appropriate procedures for test development, scaling, and scoring, as well as the evidence needed to ensure validity, reliability, and fairness in testing. Drawing from the Standards, the American Educational Research Association issued a position statement in 2000 identifying the twelve conditions that must be met to ensure sound implementation of high-stakes educational testing programs.
1. Individual students should be protected from tests being used as the sole criterion for critically important decisions.
2. Students and teachers should not be sanctioned for failing to meet new standards if sufficient resources and opportunities to learn have not been provided.
3. Test validity must be established for each separate intended use, such as student certification or school evaluation.
4. The testing program must fully disclose any likely negative side effects of testing.
5. The test should be aligned with the curriculum and should not be limited to only the easiest-to-test portion of the curriculum.
6. The validity of passing scores and achievement levels should be analyzed, as well as the validity of the test itself.
7. Students who fail a high-stakes test should be provided with meaningful opportunities for remediation consisting of more than drilling on materials that imitate the test.
8. Special accommodations should be provided for English language learners so that language does not interfere with assessment of content area knowledge.
9. Provision should be made for students with disabilities so that they may demonstrate their proficiency on tested content without being impeded by the format of the test.
10. Explicit rules should be established for excluding English language learners or students with disabilities so that schools, districts, or states cannot improve their scores by excluding some students.
11. Test results should be sufficiently reliable for their intended use.
12. An ongoing program of research should be established to evaluate both the intended and unintended consequences of high-stakes testing programs.
Professional standards provide a useful framework for understanding the limitations and potential benefits of sound assessment methodologies. Used appropriately, tests can greatly enhance educational decision-making. When used in ways that go beyond what they can validly support, however, tests may well do more harm than good.
See also: Assessment; International Assessments; Standards for Student Learning; Standards Movement in American Education.
American Educational Research Association; American Psychological Association; and National Council on Measurement in Education. 1999. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Beatty, Alexandra; Greenwood, M. R. C.; and Linn, Robert L., eds. 1999. Myths and Tradeoffs: The Role of Tests in Undergraduate Admissions. Washington, DC: National Academy Press.
Cannell, John J. 1987. Nationally Normed Elementary Achievement Testing in America's Public Schools: How All 50 States Are Above the National Average. Daniels, WV: Friends for Education.
Cannell, John J. 1989. The Lake Wobegon Report: How Public Educators Cheat on Achievement Tests. Albuquerque, NM: Friends for Education.
College Board. 1977. On Further Examination: Report of the Advisory Panel on the Scholastic Aptitude Test Score Decline. New York: College Board.
Grissmer, David; Flanagan, Ann; Kawata, Jennifer; and Williamson, Stephanie. 2000. Improving Student Achievement: What State NAEP Test Scores Tell Us. Santa Monica, CA: RAND.
Klein, Stephen P., et al. 2000. What Do Test Scores in Texas Tell Us? Santa Monica, CA: RAND.
McNeil, Linda, and Valenzuela, Angela. 2000. The Harmful Impact of the TAAS System of Testing in Texas: Beneath the Accountability Rhetoric. Cambridge, MA: Harvard University Civil Rights Project.
National Commission on Excellence in Education. 1983. A Nation at Risk: The Imperative of Educational Reform. Washington, DC: U.S. Department of Education.
Stipek, Deborah. 1998. Motivation to Learn: From Theory to Practice, 3rd edition. Boston: Allyn and Bacon.
U.S. Congress, Office of Technology Assessment. 1992. Testing in American Schools: Asking the Right Questions. Washington, DC: U.S. Government Printing Office.
American Educational Research Association. 2000. "AERA Position Statement Concerning High-Stakes Testing in Pre-K–12 Education." <www.aera.net/about/policy/stakes.htm>.
National Association for the Education of Young Children. 1995. "Position Statement on School Readiness." <www.naeyc.org/resources/position_statements/psredy98.htm>.
Lorrie A. Shepard
STATEWIDE TESTING PROGRAMS
State testing programs have a long history. New York State administered its first Regents' Examinations as early as 1865. Several other state programs had their beginnings in the 1920s, when new forms of achievement examinations–objective tests–were developed for and introduced in the schools. In 1937 representatives from fifteen state programs and nonprofit testing agencies met, under the leadership of the American Council on Education's Committee on Measurement and Guidance, to discuss common problems. The group continued to meet annually, except during World War II, for decades.
In 1957 President Dwight D. Eisenhower called attention to state testing programs when he indicated the need for nationwide testing of high school students and a system of incentives for qualified students to pursue scientific or professional careers. The subsequent passage of the National Defense Education Act of 1958 (NDEA) not only encouraged but gave financial support to testing, guidance, and scholarship programs.
The growth in number and importance of state testing programs accelerated rapidly in the 1970s and has continued ever since. The growth in the 1970s reflected, at least in part, the enhanced role of states in education policy. In the academic year 1969–1970, the majority–53 percent–of the funding for schools came from local agencies; states contributed 39 percent and the federal government provided 8 percent. A decade later the state share of education funding had increased to 47 percent, making state governments the dominant source of funding for schools.
With that increased responsibility came demands for some form of accountability for results, and that meant state tests to determine if students were learning. At the same time, there was growing concern among public officials and the public about the quality of schools, fueled in part by the revelation that average scores on the SAT declined between 1963 and 1977. In response to these concerns, and the interest in accountability, a majority of states in the 1970s implemented some form of minimum competency test, which students were required to pass in order to graduate from high school. The number of states conducting such tests rose from a handful in 1975 to thirty-three in 1985.
The wave of state education reforms enacted following the publication in 1983 of A Nation at Risk, the report by the National Commission on Excellence in Education, further accelerated the growth in state testing. That report, which warned of a "rising tide of mediocrity" in America's schools, recommended that states adopt achievement tests to measure student performance, and many states responded to the call. By the end of the 1980s, forty-seven states were operating at least one testing program, up from thirty-nine in 1984.
This growth continued throughout the 1990s as well. The dominant role of states in education policy was symbolized near the beginning of that decade, when President George H. W. Bush called the nation's governors to an extraordinary "education summit" in Charlottesville, Virginia. In the wake of that meeting, the President and the governors agreed to a set of national education goals, which included the pledge that all students would be "competent in challenging subject matter" by the year 2000. The goals were enshrined into federal law in 1994; that same year, President Bill Clinton signed the Improving America's Schools Act, which required states to set challenging standards for student performance and implement tests that measure student performance against the standards. In response to the law, nearly all states revamped their existing tests or developed new tests, and as of 2001, all states except Iowa (where local school districts administer tests) had a statewide testing program; by one estimate, the amount spent by states on testing doubled, to $410 million, between 1996 and 2001.
The No Child Left Behind Act, which President George W. Bush signed into law in 2002, requires a significant increase in state testing. Under the law, states must administer annual reading and mathematics tests in grades three through eight, tests in science in at least three grade levels, and tests at the high school level. At the time of enactment, only nine states met the law's requirements for annual tests in reading and mathematics aligned to state standards.
Types of Tests
Although most state tests consist primarily of multiple-choice questions, there is considerable variation among the states. Thirty-four states include some short-answer questions in at least some of their tests, requiring students to write answers rather than select from among answers already provided, and eighteen states include questions requiring extended responses in subjects other than English language arts. (Nearly all states administer writing tests that ask for extended responses.) Two states, Kentucky and Vermont, assess student performance in writing through the use of portfolios, which collect students' classroom work during the course of a school year. The portfolios are scored by teachers, using common criteria.
The Maryland state test, the Maryland School Performance Assessment Program (MSPAP), is unusual in that it consists exclusively of open-ended questions. Students work in groups for part of the assessment, and many of the questions are interdisciplinary, requiring students to apply knowledge from English language arts, mathematics, science, and social studies. In addition, the test is designed so that individual students take only a third of the complete assessment; as a result, scores are reported for schools, school districts, and the state, but not for individual students.
The MSPAP, like many state tests, was custom-made to match the state's standards. In Maryland's case, the test was developed by state teachers; other states contract with commercial publishers to develop tests to match their standards. Such tests indicate the level of performance students have attained, but do not permit comparisons with student performance in other states. The types of reports vary widely. Maryland, for example, specifies three levels of achievement: excellent, indicating outstanding accomplishment; satisfactory, indicating proficiency; and not met, indicating more work is required to attain proficiency. The state's goal is for 70 percent of students to reach the satisfactory level and 25 percent to reach the excellent level.
Other states, meanwhile, use commercially available tests that provide comparative information. The most commonly used tests are the Stanford Achievement Test, 9th Edition (or SAT-9), published by Harcourt Brace Educational Measurement, and the Terra Nova, published by CTB/McGraw-Hill. These tests are administered to a representative sample of students, known as a norm group, and provide information on student performance compared with the norm group. For example, results might indicate that a student performed in the sixty-fifth percentile, meaning that the student performed better than sixty-five percent of the norm group. To provide both information on performance against standards and comparative information, some states employ hybrid systems. Maryland, for example, administers a norm-referenced test in grades in which the MSPAP is not used. Delaware, meanwhile, has embedded an abbreviated version of the SAT-9 within its state test.
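The percentile-rank arithmetic described above can be sketched in a few lines of code; the scores and the small norm group below are invented for illustration, not drawn from any actual norming sample:

```python
from bisect import bisect_left

def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring below the given score."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)  # count of norm scores below this score
    return 100.0 * below / len(ordered)

# Hypothetical norm group of 20 scores
norm = [52, 55, 58, 60, 61, 63, 64, 66, 67, 68,
        70, 71, 73, 74, 76, 78, 80, 83, 87, 92]

# A score of 75 exceeds 14 of the 20 norm scores, so it falls
# at the seventieth percentile relative to this norm group.
print(percentile_rank(75, norm))
```

Operational norm-referenced tests use much larger norming samples and smoothed score-to-percentile conversion tables, but the underlying comparison is the same.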
State tests are used for a variety of purposes. The most common is to provide information to parents and the public about student, school, and school system performance. The expectation is that this information can improve instruction and learning by pointing out areas of weakness that need additional attention.
In addition to providing reports to parents about their children's performance, forty-three states issue "report cards" on schools that indicate school performance; twenty states require that these report cards be sent to parents. The No Child Left Behind Act requires all states to produce school report cards and to disseminate them to parents.
States also attach consequences for students to test results. As of 2001 four states (Delaware, Louisiana, New Mexico, and North Carolina) make promotion in at least certain grades contingent on passing state tests, and another four states plan to do so by 2004. In other states, school districts set policies for grade-to-grade promotion, and many districts use state tests as criteria for determining promotion. A 1997 survey of large school districts, conducted by the American Federation of Teachers, found that nearly 40 percent of the districts surveyed used standardized tests in making promotion decisions at the elementary school level, and 35 percent used tests in making such decisions at the middle school level. Although the survey did not indicate which tests the districts used, the report noted that statewide tests were among them.
More commonly, states use tests as criteria for high school graduation. Seventeen states, as of 2001, make graduation from high school contingent on passing state tests, and another seven are expected to do so by 2008. These numbers are similar to those recorded in the early 1980s, at the height of the minimum-competency era. Yet the graduation requirements first implemented in the late 1990s differ from those of the earlier period because the tests are different. Unlike the previous generation of tests, which measured basic reading and mathematical competencies, the newer tests tend to measure more complex skills along with, in many cases, knowledge and skills in science, social studies, and other subjects.
In most states with graduation test requirements, the tests are administered in the tenth or eleventh grade, and students typically have multiple opportunities to take the tests before graduation. In some states, such as New York, Tennessee, and Virginia, the graduation tests are end-of-course tests, meaning they measure a particular course content (such as algebra or biology) and are administered at the completion of the course.
States also use tests to reward high-performing students. For most of its existence, the New York State Regents' Examination was an optional test; students who took the test and passed earned a special diploma, called a Regents' Diploma. Beginning in 2000, however, the state required all students to take the examinations. Other states, such as Connecticut, continue to use tests to award special diplomas to students who pass them.
Some states, such as Michigan, provide scholarships for students who perform well on state tests. There, the state awards $2,500 scholarships to students attending Michigan colleges and universities who score in the top level of all four high school tests–mathematics, reading, science, and writing (the scholarships are worth $1,000 for students attending out-of-state institutions).
In addition to the consequences for students, states also use statewide tests to determine consequences for schools. As of 2001 thirty states rate schools based on performance, and half of those states use test scores as the sole measure of performance (the others also use indicators such as graduation and attendance rates). In Texas, for example, the state rates each school and school district in one of four categories, based on test performance: exemplary, recognized, acceptable/academically acceptable, and low-performing/academically unacceptable. To earn a rating of exemplary, at least 90 percent of students–and 90 percent of each group of students (white, African American, Hispanic, and economically disadvantaged)–must pass the state tests in each subject area. Schools with at least an 80 percent pass rate, overall and for each group, are rated recognized; those with at least a 55 percent pass rate are rated acceptable.
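As a sketch only, the pass-rate thresholds described above can be expressed as a simple rating rule. The function and the sample pass rates are hypothetical, and the actual Texas accountability system involved additional indicators and rules beyond these thresholds:

```python
def rate_school(pass_rates):
    """Assign a Texas-style rating from a school's pass rates (in percent).

    pass_rates holds the overall rate and the rate for each student group
    on each tested subject; the lowest rate determines the rating.
    """
    lowest = min(pass_rates)
    if lowest >= 90:
        return "exemplary"
    if lowest >= 80:
        return "recognized"
    if lowest >= 55:
        return "acceptable"
    return "low-performing"

# Every rate is at or above 90, so the school is rated exemplary.
print(rate_school([94, 91, 90, 96]))
# The lowest rate here is 79, which clears 55 but not 80: acceptable.
print(rate_school([85, 82, 79, 88]))
```

Taking the minimum across groups is what gives the system its bite: a school cannot earn a high rating on a strong overall average while one student group lags behind.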
In eighteen states high-performing schools can earn rewards. In some cases, such schools earn recognition from the state. In North Carolina, for example, the twenty-five elementary and middle schools and ten high schools that register the highest level of growth in performance on state tests, along with those in which 90 percent of students perform at or above grade level, receive a banner to hang in the school and an invitation to a state banquet in their honor. Schools and teachers in high-performing or rapidly improving schools also receive cash awards in some states. In California, for example, 1,000 teachers in the schools with the largest gains on state tests receive a cash bonus of $25,000 each. Another 3,750 teachers receive $10,000 each, and another 7,500 receive $5,000 each. Schools that demonstrate large gains receive awards of $150 per pupil.
States also use test results to intervene in low-performing schools. As of 2001 twenty-eight states provide assistance to low-performing schools. In most cases, such assistance includes technical assistance in developing improvement plans and financial assistance or priority for state aid. In some cases, such as in North Carolina and Kentucky, the state sends teams of expert educators to work intensively with schools that perform poorly on state tests. These experts help secure additional support and can recommend changes, such as replacing faculty members.
Twenty states also have the authority to levy sanctions on persistently low-performing schools. Such sanctions include withholding funds, allowing students to transfer to other public schools, "reconstitution," which is the replacing of the faculty and administration, and closure. However, despite this authority, few states have actually imposed sanctions. One that did so was Maryland, where in 2000 the state turned over the management of three elementary schools in Baltimore to a private firm.
The No Child Left Behind Act of 2001 contains a number of provisions to strengthen the role of state tests in placing consequences on schools. Under the law, states are required to set a target for proficiency that all students are expected to reach within twelve years, and to set milestones along the way for schools to reach each year. Schools that fail to make adequate progress would be subject to sanctions, such as allowing students to transfer to other public schools, allowing parents to use funds for supplemental tutoring services, or reconstitution.
Effects of Statewide Testing
The rapid growth in the amount and importance of statewide testing since the 1970s has sparked intense scrutiny about the effects of the tests on students and on classroom practices.
Much of the scrutiny has focused on the effects of tests on students from minority groups, who tend to do less well on tests than white students. In several cases, advocates for minority students have challenged tests in court, charging that the testing programs were discriminatory. In one well-known case, African-American high school students in Florida in the 1970s challenged that state's high school graduation test–on which the failure rate for African Americans was ten times the rate for white students–on the grounds that the African-American students had attended segregated schools for years and that denying them diplomas for failing a test preserved the effects of segregation. In its ruling, the U.S. Court of Appeals for the Fifth Circuit upheld the use of the test, but held that the state could not withhold diplomas until students who had attended segregated schools had passed through the system (Debra P. v. Turlington).
The court also held that students have a property right to a diploma, but that Florida could use a test to award diplomas provided that students have adequate notice of the graduation requirement (four years, in that case) and that the test represents a fair measure of what is taught. Although the decision applied only to the states in the Fifth Circuit, the standards the court applied have been cited by other courts and other states since then.
In addition to the legal challenges, state tests have also come under scrutiny for their effects on student behavior–specifically, on the likelihood that students at risk of failing will drop out of school. There is some evidence that dropout rates are higher in states with graduation-test requirements. But it is unclear whether the tests caused the students to decide to drop out of school.
Many testing professionals have also expressed concern that the use of tests to make decisions like promotion or graduation may be inappropriate. Because test scores are not precise measures of a student's knowledge and skills, test professionals warn that any important decision about a student should not rest on a single test score; other relevant information about a student's abilities should be taken into account.
There has been a great deal of research on the effects of state tests on instruction. Since a primary purpose of the state tests is to provide information to improve instruction and learning, this research has been closely watched. The studies have generally found that tests exert a strong influence on classroom practice, but that this influence was not always salutary. On the positive side, the studies found that tests, particularly those with consequences attached to the results, focused the attention of students and educators on academic performance and created incentives for students and teachers to raise test scores. In addition, tests also encouraged teachers to focus on aspects of the curriculum that may have been underrepresented. For example, in states like California that introduced writing tests that assessed students' written prose (as opposed to multiple-choice tests that measured writing abilities indirectly), teachers tended to spend more time asking students to write in class and exposing them to a broader range of writing genres.
On the negative side, studies also found that state tests often encouraged teachers to focus on the material on the test at the expense of other content that may be worthwhile. In states that used exclusively multiple-choice basic skills tests, researchers found that many teachers–particularly those who taught disadvantaged students–spent a great deal of class time on drill and practice of low-level skills, as opposed to instruction on more complex abilities that the tests did not assess. At the same time, researchers found that in some cases teachers devoted a greater proportion of time to tested subjects and less to subjects not tested, like history and the arts, and that teachers spent class time on test-preparation strategies rather than instruction in academic content. The heavy influence of tests on instruction has led some commentators to question whether gains in test scores represent genuine improvements in learning or simply "teaching to the test."
See also: States and Education; Testing, subentry on Standardized Tests and Educational Policy.
Education Week. 2002. "Quality Counts 2002: Building Blocks for Success." Education Week 21(17), January 10.
Elmore, Richard F., and Rothman, Robert, eds. 1999. Testing, Teaching, and Learning: A Guide for States and School Districts. Committee on Title I Testing and Assessment, National Research Council. Washington, DC: National Academy Press.
Heubert, Jay P., and Hauser, Robert M., eds. 1999. High Stakes: Testing for Tracking, Promotion, and Graduation. National Research Council, Committee on Appropriate Test Use. Washington, DC: National Academy Press.
Linn, Robert L., and Herman, Joan L. 1997. A Policymaker's Guide to Standards-Led Assessment. Denver, CO: Education Commission of the States.
Office of Technology Assessment. 1992. Testing in American Schools: Asking the Right Questions. Washington, DC: U.S. Government Printing Office.
Ravitch, Diane. 1995. National Standards in American Education: A Citizen's Guide. Washington, DC: Brookings Institution Press.
Rothman, Robert. 1995. Measuring Up: Standards, Assessment, and School Reform. San Francisco: Jossey-Bass.
TEST PREPARATION PROGRAMS, IMPACT OF
Test preparation programs share two common features: students are prepared to take one specific test, and the preparation they receive is systematic. Test preparation programs have been found to have differing degrees of the following characteristics:
- Instruction that develops the skills and abilities that are to be tested.
- Practice on problems that resemble those on the test.
- Instruction and practice with test-taking strategies.
Short-term programs that include primarily this third characteristic are often classified as coaching.
There are various shades of gray in this definition. First, the difference between short-term and long-term programs is difficult to quantify. Some researchers have used roughly forty to forty-five hours of student contact time as a threshold, but this amount is not set in stone. Second, within those programs classified as short-term or long-term, the intensity of preparation may differ. One might speculate that a program with twenty hours of student contact time spaced over one week is quite different from one with twenty hours spaced over one month. Finally, test preparation programs may include a mix of the three characteristics listed above. It is unclear what proportion of student contact time must be spent on test-taking strategies before a program can be classified as coaching.
There is little research base from which to draw conclusions about the effectiveness of preparatory programs for achievement tests given to students during their primary and secondary education. This may change as these tests are used for increasingly high-stakes purposes, particularly in the United States. There is a much more substantial research base with regard to the effectiveness of test preparation programs on achievement tests taken for the purpose of postsecondary admissions. The remainder of this review will focus on the effectiveness of this class of programs.
The impact of commercial test preparation programs is a very controversial topic. The controversy hinges in large part upon how program impacts are quantified. Students taking commercial programs are usually given some sort of pretest, a period of preparatory training, and then a posttest. The companies supplying these services will typically quantify impact in terms of average score gains for students from pretest to posttest. By contrast, most research on the subject of test preparation will quantify impact as the average gain of students in the program relative to the average gain of a comparable group of students not in the program. Here the impact of a test preparation program is equal to its estimated causal effect. The latter definition of impact is the more valid one when the aim is to evaluate the costs and benefits of a test preparation program. Program benefits should always be expressed relative to the outcome a person could expect had she chosen not to participate in the program. This review takes the approach that program impact can only be assessed by estimating program effects.
Program Effects on College Admissions Tests
Research studies on the effects of admissions test preparation programs have been published periodically since the early 1950s. Most of this research has been concerned with the effect of test preparation on the SAT, the most widely taken test for the purpose of college admission in the United States. A far smaller number of studies have considered the preparation effect on the ACT, another test often required for U.S. college admission. On the main issues, there is a strong consensus in the literature:
- Test preparation programs have a statistically significant effect on the score changes of students who take the SAT or ACT at least twice.
- The magnitude of this effect is relatively small.
How small? The SAT consists of two sections, one verbal and one quantitative, each with a score range from 200 to 800 points. While the section averages of all students taking the SAT each year vary slightly, the standard deviation around these averages is fairly consistently about 100 points. The average effect of test preparation programs on the verbal section of the SAT is probably between 5 and 15 points (.05 and .15 of a standard deviation); the average effect on the quantitative section is probably between 15 and 25 points (.15 and .25 of a standard deviation). The largest effects reported in a published study of commercial SAT preparation were about 30 points per section. Some unpublished studies have found larger effects, but these have involved very small sample sizes or methodologically flawed research designs.
The ACT consists of four sections: science, math, English, and reading. Scores on each section range from 1 to 36 points, and test-takers are also given a composite score for all four sections on the same scale. The standard deviation on the test is usually about 5 points. A smaller body of research, most of it unpublished, has found a test preparation effect (expressed as a fraction of the 5-point standard deviation) of .02 on the composite ACT score, an effect of about .04 to .06 on the math section, an effect of about .08 to .12 on the English section, and a negative effect of .12 to .14 on the reading section.
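The conversion between standardized effects and raw score points in the two paragraphs above is simple arithmetic, and making it explicit shows why the same standardized effect means very different point changes on the two tests. The standard deviations used below are the approximate figures given in the text:

```python
def effect_in_points(effect_sd, sd_points):
    """Convert an effect expressed as a fraction of a standard
    deviation into raw score points on a given test scale."""
    return effect_sd * sd_points

# SAT: standard deviation of roughly 100 points per section,
# so a .15 SD effect corresponds to about 15 points.
print(effect_in_points(0.15, 100))

# ACT: standard deviation of roughly 5 points,
# so a .10 SD effect corresponds to only about half a point.
print(effect_in_points(0.10, 5))
```

This is why researchers report effects in standard-deviation units: a 15-point gain on the SAT and a 0.75-point gain on the ACT are comparable effects, even though the raw numbers look very different.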
The importance of how test preparation program effects are estimated cannot be overstated. In most studies, researchers are presented with a group of students who participate in a preparatory program in order to improve their scores on a test they have already taken once. Estimating the effect of the program is a question of causal inference: How much does exposure to systematic test preparation cause a student's test score to increase above the amount it would have increased without exposure to the preparation? Estimating this causal effect is quite different from calculating the average score gain of all students exposed to the program. A number of commercial test preparation companies have advertised–even guaranteed–score gains based on the average gains calculated from students previously participating in their program. This confuses gains with effects. To determine if the program itself has a real effect on test scores one must contrast the gains of the treatment group (students who have taken the program) to the gains of a comparable control group (students who have not taken the program). This would be an example of a controlled study and is the principle underlying the estimation of effects for test preparation programs.
One vexing problem is how to interpret the score gains of students participating in preparatory programs in the absence of a control group. Some researchers have attempted to do this by comparing the average gain of students participating in a program with the expected gain of the full test-taking population over a given time period. This approach has been criticized primarily on the grounds that the former group is never a randomly drawn sample from the latter. Students participating in test preparation are in fact often systematically different from the full test-taking population along important characteristics that are correlated with test performance–for example, household income, parental education, and student motivation. This is an example of self-selection bias. Due in part to this bias, uncontrolled studies have consistently arrived at estimates of the effect of test preparation programs that are as much as four to five times greater than those found in controlled studies.
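The controlled-study logic described above amounts to a difference in average gains between the treatment and control groups. A minimal sketch, using invented pretest-to-posttest gains rather than data from any actual study:

```python
from statistics import mean

def estimated_effect(treatment_gains, control_gains):
    """Controlled estimate of a program's effect: the mean gain of
    prepared students minus the mean gain of a comparable group of
    unprepared students."""
    return mean(treatment_gains) - mean(control_gains)

# Hypothetical pretest-to-posttest score gains, in points
coached_gains   = [40, 25, 35, 30, 20]  # students who took the program
uncoached_gains = [22, 18, 25, 15, 20]  # comparable students who did not

# The raw average gain of coached students (30 points) overstates the
# program's effect; the controlled estimate subtracts the gain that a
# comparable group achieved without the program.
print(mean(coached_gains))
print(estimated_effect(coached_gains, uncoached_gains))
```

In this toy example the coached students gain 30 points on average, but the controlled estimate of the program's effect is only 10 points; the other 20 points reflect retesting, maturation, and whatever else drove the control group's gains. Quantifying gains without such a comparison is exactly the confusion of gains with effects described above.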
Is Test Preparation More Effective under Certain Programs for Certain People?
In a comprehensive review of studies written between 1950 and 1980 that estimate an effect for test preparation, Samuel Messick and Ann Jungeblut (1981) found evidence of a positive relationship between time spent in a program and the estimated effect on SAT scores. But this relationship was not linear; there were diminishing returns to SAT score changes for time spent in a program beyond 45 hours. Messick and Jungeblut concluded that "the student contact time required to achieve average score increases much greater than 20 to 30 points for both the SAT-V and SAT-M rapidly approaches that of full-time schooling" (p. 215). Since the Messick and Jungeblut review, several reviews of test preparation studies have been written by researchers using the statistical technique known as meta-analysis. Use of this technique allows for the synthesis of effect estimates from a wide range of studies conducted at different points in time. The findings from these reviews suggest that there is little systematic relationship between the observable characteristics of test preparation programs and the estimated effect on test scores. In particular, once a study's quality was taken into consideration, there was at best a very weak association between program duration and test score improvement.
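Meta-analyses of this kind typically pool each study's effect estimate using inverse-variance weights, so that larger, more precise studies count for more in the synthesized estimate. A minimal fixed-effect sketch, with entirely hypothetical (effect, standard error) pairs standing in for individual coaching studies:

```python
# Minimal fixed-effect meta-analysis sketch (synthetic study results):
# each study's effect is weighted by the inverse of its variance, so more
# precise studies dominate the pooled estimate.

# Hypothetical (effect in SAT points, standard error) pairs.
studies = [(10, 4.0), (25, 10.0), (8, 3.0), (30, 12.0)]

weights = [1 / se**2 for _, se in studies]
pooled = sum(w * eff for (eff, _), w in zip(studies, weights)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

print(round(pooled, 1))     # 10.3 -- pulled toward the precise, small-effect studies
print(round(pooled_se, 1))  # 2.3
```

Note how the pooled estimate sits near the small effects reported by the two precise studies, not near the large effects from the noisy ones; this is one reason meta-analytic reviews often report smaller coaching effects than individual uncontrolled studies.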
There is also mixed evidence as to whether test preparation programs are more effective for particular subgroups of test-takers. Many of the studies that demonstrate interactions between the racial/ethnic and socioeconomic characteristics of test-takers and the effects of test preparation suffer from very small and self-selected samples. Because commercial test preparation programs charge a fee, sometimes a substantial one, most students participating in such programs tend to be socioeconomically advantaged. In one of the few studies with a nationally representative sample of test-takers, test preparation for the SAT was found to be most effective for students coming from high socioeconomic backgrounds. A similar association was not found among students who took a preparatory program for the ACT.
The differential impact of test preparation programs on student subgroups is an area that merits further research. In addition, a theory describing why the pedagogical practices used within commercial test preparatory programs in the twenty-first century would be expected to increase test scores has, with few exceptions, not been adequately explicated or studied in a controlled setting at the item level. In any case, after more than five decades of research on the issue, there is little doubt that commercial preparatory programs, by advertising average score gains without reference to a control group, are misleading prospective test-takers about the benefits of their product. The costs of such programs are high, both in terms of money and in terms of opportunity. For consumers of test preparation programs, these benefits and costs should be weighed carefully.
See also: College Admissions Tests.
Becker, Betsy Jane. 1990. "Coaching for the Scholastic Aptitude Test: Further Synthesis and Appraisal." Review of Educational Research 60 (3):373–417.
Bond, Lloyd. 1989. "The Effects of Special Preparation on Measures of Scholastic Ability." In Educational Measurement, ed. Robert L. Linn. New York: American Council on Education and Macmillan.
Briggs, Derek C. 2001. "The Effect of Admissions Test Preparation: Evidence from NELS:88." Chance 14 (1):10–18.
Cole, Nancy. 1982. "The Implications of Coaching for Ability Testing." In Ability Testing: Uses, Consequences, and Controversies. Part II: Documentation Section, ed. Alexandra K. Wigdor and Wendell R. Gardner. Washington, DC: National Academy Press.
DerSimonian, Rebecca, and Laird, Nan M. 1983. "Evaluating the Effect of Coaching on SAT Scores: A Meta-Analysis." Harvard Educational Review 53:1–15.
Evans, Franklin, and Pike, Lewis. 1973. "The Effects of Instruction for Three Mathematics Item Formats." Journal of Educational Measurement 10 (4):257–272.
Jackson, Rex. 1980. "The Scholastic Aptitude Test: A Response to Slack and Porter's 'Critical Appraisal.'" Harvard Educational Review 50 (3):382–391.
Kulik, James A.; Bangert-Drowns, Robert L.; and Kulik, Chen-Lin. 1984. "Effectiveness of Coaching for Aptitude Tests." Psychological Bulletin 95:179–188.
Messick, Samuel, and Jungeblut, Ann. 1981. "Time and Method in Coaching for the SAT." Psychological Bulletin 89:191–216.
Powers, Donald. 1986. "Relations of Test Item Characteristics to Test Preparation/Test Practice Effects: A Quantitative Summary." Psychological Bulletin 100 (1):67–77.
Powers, Donald. 1993. "Coaching for the SAT: A Summary of the Summaries and an Update." Educational Measurement: Issues and Practice (summer): 24–39.
Powers, Donald, and Rock, Don. 1999. "Effects of Coaching on SAT I: Reasoning Test Scores." Journal of Educational Measurement 36 (2):93–118.
Sesnowitz, Michael; Bernhardt, Kenneth L.; and Knain, D. Matthew. 1982. "An Analysis of the Impact of Commercial Test Preparation Courses on SAT Scores." American Educational Research Journal 19 (3):429–441.
Slack, Warner V., and Porter, Douglass. 1980. "The Scholastic Aptitude Test: A Critical Appraisal." Harvard Educational Review 50:154–175.
Derek C. Briggs
NATIONAL ACHIEVEMENT TESTS, INTERNATIONAL
National testing of elementary and secondary students exists in most industrialized countries. Each country's national examinations are based on national curricula and content standards. The weight and consequences of the exams vary tremendously from country to country, as does the use of exams at various levels of education. In France, Germany, Great Britain, Italy, and Japan, examinations at the lower secondary or elementary level are required for admission to academic secondary schools. In the United States, there is a more informal system of tracking students into academic, vocational, or general studies within the same secondary school. In Japan and Korea secondary students take nationally administered examinations that determine their postsecondary placement. Top-scoring students attend the most prestigious public universities. In France the baccalaureate examinations are given to students at academic secondary schools (the lycée) as exit examinations and also to determine university placement. In Germany a similar distinction is made between academic and vocational secondary school, and passing the Abitur (the exit examination from the Gymnasium, or academic secondary school) allows students to continue on to university-level coursework. In Great Britain students study for their A-level examinations for university placement. Finally, in Italy students must pass exit examinations at both the lower- and upper-secondary levels. At the secondary level, the Esami di maturità affects university attendance or employment.
In the United States the SAT (known at various times as the Scholastic Assessment Test and the Scholastic Aptitude Test) and the ACT (the American College Test) are required for entrance into the more prestigious and rigorous colleges and universities. This system closely mirrors that of Japan, Korea, France, and Germany, except that there is no federal governance. However, the SAT and ACT are not meant to monitor student performance over time and are not national tests of student performance, as are the aforementioned tests specific to other countries.
In contrast to the nations described above, the United States has no national system of education and thus no national or federal assessment that affects students at the individual level. The system of public education in the United States is characterized by a high degree of decentralization. Under the Tenth Amendment to the U.S. Constitution, the provision of education is a state-level function rather than a direct responsibility of the federal government. States and local school districts are the entities charged with policy and curriculum decisions. Nonetheless, there has been a trend toward increased measures of national performance over the past three to four decades. Although the United States has no national curriculum, curricula are converging, both within the United States and internationally, as society becomes increasingly global.
Current Trends in Educational Assessment
In 1965 Congress passed the Elementary and Secondary Education Act, which authorized federal support for the education of disadvantaged and handicapped children. The Elementary and Secondary Education Act (ESEA) made clear the federal government's commitment to equal education for all by mandating fairness criteria for disadvantaged and minority students. With ESEA, standardized testing became entrenched in American education, as the act required regular testing in schools receiving federal funding for disadvantaged students.
A major surge toward education reform can be linked to A Nation at Risk, the 1983 report sponsored by the Department of Education. While the National Assessment of Educational Progress (NAEP) long-term assessment showed gains for every age group between 1971 and 1992, declining SAT scores drew national attention, with pronouncements that the education system in the United States was failing its students and society as a whole.
In 1989 President George Bush and the nation's governors established a set of six national education goals to be achieved by all students in the United States by the year 2000. These six goals, plus two more, were enacted into law as part of President Bill Clinton's Goals 2000: Educate America Act (1994). The goals stated that by the year 2000, all students would come to school ready to learn, high school graduation rates would increase, students would demonstrate competency in challenging subject matter and would be prepared for life-long learning, U.S. students would be first in the world in science and mathematics, American adults would be literate and productive citizens, schools would be drug- and violence-free, teachers would be better prepared, and parental involvement in education would rise. The education goals set forth by President George Bush and furthered in strength and number by President Bill Clinton with the Educate America Act increased the demands on accountability systems in education at both the state and national levels.
The National Assessment of Educational Progress had been monitoring the nation's student achievement for many years, but the new reforms and new goals required more of the NAEP than it could provide. In 1994 NAEP responsibilities and breadth were extended with the reauthorization of NAEP through the Improving America's Schools Act of 1994. As the National Research Council states in its 1999 publication High Stakes: Testing for Tracking, Promotion, and Graduation, "recent education summits, national and local reform efforts, the inception of state NAEP and the introduction of performance standards have taken NAEP from a simple monitor of student achievement–free from political influence and notice–into the public spotlight" (p. 25). Linking federal funding to the development of performance-based standards, assessments, and accountability only enhanced the attention and emphasis placed on the NAEP.
In 2001 President George W. Bush proposed tying federal education dollars to specific performance-based initiatives including mandatory participation in annual state NAEP assessments in reading and mathematics and high-stakes accountability at the state level. The emphasis in Bush's reform "blueprint" is on closing the achievement gap for minority and disadvantaged students. To this end, President Bush signed into law Public Law 107-110 (H.R.1), the No Child Left Behind Act of 2001. The Act contains four major provisions for education reform: stronger accountability for results, expanded flexibility and local control, expanded options for parents, and an emphasis on teaching methods that have been proven to work. Direct accountability based on NAEP is not part of the final public law.
As the nation turned toward an increasing reliance on student assessment, so too did the states. States began implementing minimum competency examinations in the 1970s. By 1990 more than forty states required some form of minimum competency examination (MCE) before awarding the high school diploma. For example, Massachusetts tenth graders are required to pass the Massachusetts Comprehensive Assessment System (MCAS) prior to earning a high school diploma. In Kentucky graduating seniors can distinguish themselves not only by fulfilling the requirements for graduation and a high school diploma but also by fulfilling the requirements for the Commonwealth Diploma, which include successful completion of advanced placement credits. Students in Texas must pass secondary-level exit examinations before graduating from high school. In North Carolina students must pass exit examinations as part of the requirements for obtaining a high school diploma. The trend favors the implementation of more examinations, with consequences for states at the national level and for students at the state level.
National Examinations in the United States
In the United States, there are two sets of "national" examinations. The first set includes the SAT, the American College Test (ACT), and the Advanced Placement (AP) examinations, which allow high school seniors to earn college credits for advanced-level classes. The second set is the National Assessment of Educational Progress (NAEP), which has come to be known as "The Nation's Report Card." The NAEP is a federally funded system for assessing the standard of education in the nation. It is the only test given to a nationally representative sample of U.S. students and the only one that measures the same standard of knowledge over time. It provides no measure of individual student, school, or school district performance.
The National Assessment of Educational Progress (NAEP)
The NAEP is a federally funded national examination that regularly tests a national sample of American students. The National Center for Education Statistics (NCES) of the Department of Education has primary responsibility for the NAEP. The Educational Testing Service (ETS) and Westat currently administer the NAEP. Finally, the NAEP is governed by an independent organization, the National Assessment Governing Board (NAGB) charged with setting policies for the NAEP.
The NAEP has three distinct parts: a long-term trend assessment of nine-, thirteen-, and seventeen-year-olds; a main assessment of the nation; and the trial state assessments. The long-term assessment has been administered to nine-, thirteen-, and seventeen-year-old students since 1969, initially every four years. The main assessment and trial state assessments began in 1990, testing fourth, eighth, and twelfth graders in various subjects. NAEP allows trend evaluations as well as comparisons of performance between groups of students.
NAEP long-term assessment. The NAEP long-term assessment is the only test based on a nationally representative sample of students. It is the only test that can be used to track long-term trends in student achievement. The tests have been given to national samples of nine-, thirteen-, and seventeen-year-old students and have maintained a set of questions such that the results in any given year can be compared to any other year. The long-term or trend assessments are designed to track academic performance over time.
The early tests were given in science, mathematics, and reading, and were administered every four years until 1988 and more frequently after 1988. Writing tests that can be compared over years were given in 1984, and geography, history, civics, and the arts have been tested more recently. Until the early 1980s NAEP results were reported by question, indicating the percentage of students answering a particular question correctly over time.
Overall NAEP scores show small to modest gains. From the early 1970s to 1996, gains occurred in math, reading, and science for nine- and thirteen-year-old students, and in math and reading for seventeen-year-old students. The gains in science were small, approximately .10 standard deviations, or three percentile points, for all age groups. Math gains for nine- and thirteen-year-old students were larger, between .15 and .30 standard deviations. Evidence suggests that these overall trends mask differentiated trends by racial and/or ethnic group.
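The conversions between standard deviation units and percentile points used in this discussion can be approximated with the normal distribution. The sketch below assumes normally distributed scores and a student starting at the median; the rounded figures reported in the literature may use slightly different baselines, so treat these as ballpark values.

```python
# Rough conversion from an effect size in standard deviations to percentile
# points, assuming normally distributed scores and a median starting point.
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def percentile_point_gain(effect_sd):
    """Percentile movement for a median student who improves by effect_sd."""
    return 100 * (normal_cdf(effect_sd) - 0.5)

print(round(percentile_point_gain(0.10), 1))  # roughly 4 percentile points
print(round(percentile_point_gain(0.25), 1))  # roughly 10 percentile points
```

Because the normal density is flattest in the tails and densest near the center, the same standard deviation gain moves a median student more percentile points than it would move a very high- or very low-scoring student.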
During the same time period (1970–1996), substantial gains occurred for both Hispanic and black students and for lower-scoring students. For instance, black gains of between .30 and .80 standard deviations occurred for almost all subjects in all age groups.
NAEP main assessment and trial state assessments. Between 1984 and 1987 the NAEP underwent an "overhaul" in order to accommodate the increasing demands being placed on it. First, the main assessment and state assessments were added to the NAEP. Second, new instruments were designed to measure not only what students know but also what students should know. Third, test items in every subject area were based on rigorous, challenging content standards. Fourth, NAEP's independent governing board, the National Assessment Governing Board (NAGB), identified performance standards, or achievement levels, at the advanced, proficient, basic, and below-basic knowledge levels.
Main NAEP, or the main assessment, is administered to a nationally representative sample of students, testing overall student achievement. Unlike the long-term NAEP, main NAEP test items are content-based, reflecting current curriculum and instructional practices. Test items are designed to test student performance in relation to the national education goals and to monitor short-term trends in academic achievement. The main assessment is given in mathematics, reading, writing, and science to fourth, eighth, and twelfth graders at intervals of two years.
The state-level NAEP is administered to representative samples of students within states. The assessment items are the same as those on the main assessment. The trial state assessments and the main assessment have been given in mathematics, reading, writing, and science at intervals of two years or more since 1990. Between 1990 and 2000 there were seven math tests: an eighth grade assessment in 1990, and fourth and eighth grade exams in 1992, 1996, and 2000. Reading tests were administered to fourth graders in 1992, 1994, and 1998. In 1998 an eighth grade assessment in reading was administered for the first time.
Statistically significant short-term gains are being made nationally, as can be seen across the 1990–1992, 1992–1994, and 1992–1996 testing cycles. The largest gains occurred on eighth grade math tests, where composite gains between 1990 and 1996 are about .25 standard deviations, or eight percentile points. Smaller gains of approximately .10 standard deviations, or three percentile points, occurred in fourth grade math from 1992 to 1996. Reading scores show a decline of approximately .10 standard deviations per year between 1992 and 1994.
The estimated score gains in mathematics indicate a trend of .03 standard deviations, or one percentile point, per year between 1990 and 1996 across states. About three-quarters of states show consistent, statistically significant annual gains in mathematics between 1990 and 1996. The rate of change varies dramatically across states, from flat to gains of .06 to .07 standard deviations per year. The sizes of the latter gains are remarkable, far above the historical gains of .01 standard deviations per year on the long-term assessment. These results do not change when the fourth and eighth grade 1998 NAEP Reading Assessment and 2000 Mathematics Assessment are added to the sample. (These trend estimates control for student demographics and home environments. A consensus has been reached in the education research community that school systems–here, states–should be judged on score differences and trends after accounting for the family characteristics of the students they serve.)
International Assessment of Student Performance
International comparisons of student achievement are of growing concern given the trend toward a more global economy. Poor United States performance on international examinations is inextricably linked, in the minds of many Americans, with decreasing economic competitiveness. Part of the debate over voluntary national testing includes the question of linking the NAEP to international tests of achievement as well as to state achievement tests.
Third International Mathematics and Science Study. In 1995 and 1999 the International Association for the Evaluation of Educational Achievement (IEA) administered the Third International Mathematics and Science Study (TIMSS) to students in grades 3–4, 7–8, and the last year of secondary schooling. The TIMSS was an ambitious effort: more than forty countries and half a million students participated. Cross-country comparisons place the United States toward the bottom of the distribution in student performance on the TIMSS. International assessments galvanize support for educational improvement. However, recent evidence suggests that the overall poor performance of the United States hides the above-average performance of several states: Connecticut, Maine, Massachusetts, Indiana, Iowa, North Dakota, Michigan, Minnesota, Vermont, and Wisconsin, all of which have higher than average levels of median family income and parental educational attainment.
Programme for International Student Assessment. In 2000 the Programme for International Student Assessment (PISA) administered examinations of reading, mathematics, and science literacy. Literacy is defined in terms of content knowledge, process ability, and the application of knowledge and skills. Thirty-two countries participated in PISA 2000, with more than 250,000 students representing 17 million students enrolled in secondary education in the participating countries.
The PISA assessments differ from the TIMSS in that
"the assessment materials in TIMSS were constructed on the basis of an analysis of the intended curriculum in each participating country, so as to cover the core material common to the curriculum in the majority of participating countries. The assessment materials in PISA 2000 covered the range of skills and competencies that were, in the respective assessment domains, considered to be crucial to an individual's capacity to fully participate in, and contribute meaningfully to, a successful modern society." (Organisation for Economic Co-operation and Development, p. 27.)
The latter assessment form reflects a growing trend in curricula worldwide: training students for life-long learning.
PISA performance in reading, mathematics, and science literacy is measured on a scale with a mean of 500 and a standard deviation of 100. The United States performed at the OECD mean score in reading, mathematics, and scientific literacy. The performance of the United States was better in reading than in mathematics; reading and scientific literacy were not statistically different. The United States was outperformed in each literacy domain by Japan, Korea, the United Kingdom, Finland, and New Zealand.
Although there are no federal-level examinations in the United States, and there is a distinction between national testing and standards on the one hand and federal testing and standards on the other, the federal government is able to indirectly influence what is taught in public schools. First, the adoption of national goals for education provides states with a centralized framework for what students should know. Second, the federal government awards money to states to develop curriculum and assessments based on these goals. And third, states that align their curriculum and testing to the NAEP achieve higher overall performance measures on the NAEP. All of this makes the National Assessment of Educational Progress truly a "national" examination.
See also: Assessments, subentry on National Assessment of Educational Achievement; International Assessments, subentry on International Association for the Evaluation of Educational Achievement.
Eckstein, Max A., and Noah, Harold J., eds. 1992. Examinations: Comparative and International Studies. New York: Pergamon Press.
Grissmer, David W., and Flanagan, Ann E. 1998. "Exploring Rapid Test Score Gains in Texas and North Carolina." Commissioned paper. Washington, DC: National Education Goals Panel.
Grissmer, David W.; Flanagan, Ann E.; and Williamson, Stephanie. 1998. "Why Did Black Test Scores Rise Rapidly in the 1970s and 1980s?" In The Black-White Test Score Gap, ed. Christopher Jencks and Meredith Phillips. Washington, DC: Brookings Institution Press.
Grissmer, David W., et al. 2000. Improving Student Achievement: What State NAEP Scores Tell Us. Santa Monica, CA: RAND.
Hauser, Robert M. 1998. "Trends in Black-White Test Score Differentials: Uses and Misuses of NAEP/SAT Data." In The Rising Curve: Long-Term Changes in IQ and Related Measures, ed. Ulric Neisser. Washington, DC: American Psychological Association.
Hedges, Larry V., and Nowell, Amy. 1998. "Group Differences in Mental Test Scores: Mean Differences, Variability and Talent." In The Black-White Test Score Gap, ed. Christopher Jencks and Meredith Phillips. Washington, DC: Brookings Institution Press.
International Association for the Evaluation of Educational Achievement, TIMSS International Study Center. 1996. Mathematics Achievement in the Middle School Years: IEA's Third International Mathematics and Science Study TIMSS. Chestnut Hill, MA: Boston College.
National Research Council. 1999. Evaluation of the Voluntary National Tests, Year: Final Report. Washington, DC: National Academy Press.
National Research Council. 1999. Grading the Nation's Report Card: Evaluating NAEP and Transforming the Assessment of Educational Progress. Washington, DC: National Academy Press.
National Research Council. 1999. High Stakes: Testing for Tracking, Promotion, and Graduation. Washington, DC: National Academy Press.
National Research Council. 1999. Uncommon Measures: Equivalence and Linkage Among Educational Tests. Washington, DC: National Academy Press.
Organisation for Economic Co-operation and Development, Programme for International Student Assessment. 2001. Knowledge and Skills for Life: First Results from PISA 2000. Paris, France: Organisation for Economic Cooperation and Development.
Ravitch, Diane. 1995. National Standards in American Education: A Citizens Guide. Washington, DC: Brookings Institution Press.
Rothstein, Richard. 1998. The Way We Were? The Myths and Realities of America's Student Achievement. New York: The Century Foundation Press.
U.S. Department of Education, Office of Educational Research and Improvement, National Center for Education Statistics. 1995. Progress of Education in the United States of America–1990 through 1994. Washington, DC: U.S. Department of Education.
U.S. Department of Education, Office of Educational Research and Improvement, National Center for Education Statistics. 1997. NAEP 1996 Trends in Academic Progress: Achievement of U.S. Students. Washington, DC: U.S. Department of Education.
U.S. Department of Education, Office of Educational Research and Improvement, National Center for Education Statistics. 1998. Linking the National Assessment of Educational Progress (NAEP) and the Third International Mathematics and Science Study (TIMSS): Eighth-Grade Results. Washington, DC: U.S. Department of Education.
Executive Office of the White House. 1998. "Goals 2000: Reforming Education to Improve Student Achievement." <www.ed.gov/pubs/G2KReforming>.
Executive Office of the White House. 2002. "No Child Left Behind Act of 2001." <www.ed.gov/legislation/ESEA02/107-110.pdf>.
Ann E. Flanagan
INTERNATIONAL STANDARDS OF TEST DEVELOPMENT
Standardized tests are used in important ways at all levels of education, and such tests can help educators and policymakers make important decisions about students, teachers, programs, and institutions. It is therefore critical that these tests, and the information that they provide, meet the highest professional and technical standards. Fortunately, the experts who set policies for testing programs, who design and develop tests, and who make use of the scores and other reports adhere to a number of rigorous and publicly available standards, three of which merit a brief summary.
Code of Fair Testing Practices in Education
The Code of Fair Testing Practices in Education (the Code) is one of the most widely distributed and referenced documents in educational testing. It contains standards related to the development, selection, and reporting of results of assessments in education. Written in nontechnical language, it provides test-takers, parents, teachers, and others with clear statements of what they are entitled to receive from those who develop tests, as well as from those who use test scores to help make decisions.
The Code has been endorsed by the leading testing organizations in the United States, including the major nonprofit companies (e.g., the College Board, the Educational Testing Service, ACT Inc.) and the large commercial test publishers (e.g., the California Test Bureau, Harcourt Educational Measurement, Riverside Publishing) who account for a large share of all school district and state-level tests. The Code has also been endorsed by major professional organizations in the field of education, whose members make extensive use of tests, including the American Counseling Association, the American Educational Research Association, the American Psychological Association, the American Speech-Language-Hearing Association, the National Association of School Psychologists, and the National Association of Test Directors.
As a result of the widespread acceptance of the Code, users of standardized educational tests that are developed by major testing companies can be confident that conscientious efforts have been made to produce tests that yield fair and accurate results when used as intended by the test makers.
Standards for Educational and Psychological Testing
The basic reference source for technical standards in educational testing is Standards for Educational and Psychological Testing (the Standards). Since 1950, this document has been prepared in a series of editions by three organizations: the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). It is a resource that is very useful for individuals with training in psychometrics, but is not very readable by those without such specialized training. Any team involved in the development of a testing program needs to include at least one person with the expertise necessary to understand and assure adherence to the Standards.
ATP Guidelines for Computer-Based Testing
As useful as the AERA/APA/NCME Standards are as a resource for technical testing standards, they give relatively little attention to one of the most important trends in testing, the growth of computer-based testing. The Association of Test Publishers (ATP) has addressed this "standards vacuum" by creating the ATP Guidelines for Computer-Based Testing (the Guidelines). The ATP is the industry association for test publishing, with over 100 member companies, an active program of publishing, a highly regarded annual meeting focused on computer-based testing, and a set of productive divisions.
The Guidelines address six general areas related to technology-based test delivery systems:
- Planning and Design
- Test Development
- Test Administration
- Scoring and Score Reporting
- Statistical/Psychometric Analyses
- Communications to Test Takers and Others
The intent of the Guidelines is to define the "best practices" that are desirable for all testing systems, without reference to the particular hardware or operating system employed for testing. The fast growth of computer-based testing in education will make these Guidelines especially valuable to test makers and test users.
See also: International Assessments.
American Educational Research Association; American Psychological Association; and National Council on Measurement in Education. 1999. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
Association of Test Publishers. 2000. Guidelines for Computer-Based Testing. Washington, DC: Association of Test Publishers.
Joint Committee on Testing Practices. 1988. Code of Fair Testing Practices in Education. Washington, DC: Joint Committee on Testing Practices.