National Assessment of Educational Progress
James W. Pellegrino
Edward A. Silver
Joan L. Herman
Stephen A. Zuniga
Classroom assessments are those developed or selected by teachers for use during their day-to-day instruction. They are different from the standardized tests that are conducted annually to gauge student achievement, and are most frequently used to serve formative purposes, that is, to help students learn. However, classroom assessments also can be used summatively to determine a student's report card grade. Standardized tests, on the other hand, tend to be considered summative assessments, as they are used to judge student progress over an extended period of time.
As the research summarized below reveals, assessment used during instruction can have a profound impact on student achievement. But to do so, the assessments must provide accurate information and they must be used in appropriate ways.
Research on Impact
In 1984, Benjamin Bloom published a summary of research on the impact of mastery learning models, comparing standard whole-class instruction (the control condition) with two experimental interventions–a mastery learning environment (where students aspire to achieving specific learning standards) and one-on-one tutoring of individual students. One hallmark of both experimental conditions was extensive use of formative classroom assessment during the learning process. Analysis of summative results revealed unprecedented gains in achievement for students in the experimental treatments–when compared to the control groups. To be sure, the entire effect cannot be attributed to the effective use of classroom assessment. But, according to Bloom, a major portion can.
Based on his 1988 compilation of available research, Terry Crooks concluded that classroom assessment can have a major impact on student learning when it:
- Places great emphasis on understanding, not just recognition or recall of knowledge; as well as on the ability to transfer learning to new situations and other patterns of reasoning
- Is used formatively to help students learn, and not just summatively for the assignment of a grade
- Yields feedback that helps students see their growth or progress while they are learning, thereby maintaining the value of the feedback for students
- Relies on student interaction in ways that enhance the development of self-evaluation skills
- Reflects carefully articulated achievement expectations that are set high, but attainable, so as to maximize students' confidence that they can succeed if they try and to prevent them from giving up in hopelessness
- Consolidates learning by providing regular opportunities for practice with descriptive, not judgmental, feedback
- Relies on a broad range of modes of assessment aligned appropriately with the diversity of achievement expectations valued in most classrooms
- Covers all valued achievement expectations and does not reduce the classroom to focus only on that which is easily assessed
A decade later, Paul Black and Dylan Wiliam examined the measurement research literature worldwide in search of answers to three questions: (1) Is there evidence that improving the quality and effectiveness of use of formative (classroom) assessments raises student achievement as reflected in summative assessments? (2) Is there research evidence that formative assessments are in need of improvement? (3) Is there evidence about the kinds of improvements that are most likely to enhance student achievement? They uncovered forty articles that addressed the first question with sufficiently rigorous research designs to permit an estimation of the effects of improved classroom assessment on subsequent standardized test scores. They also uncovered profoundly large effects, including score gains that, if realized in the international math and science tests of the 1990s, would have raised the United States and England from the middle of the pack in the rank order of forty-two participating nations to the top five. Black and Wiliam go on to reveal that "improved formative assessment helps low achievers more than other students, and so reduces the range of achievement while raising achievement overall" (p. 141). They contend that this result has direct implications for districts having difficulty reducing achievement gaps between minorities and other students. The answer to their second question is equally definitive. Citing a litany of research similar to that referenced above, they describe the almost complete international neglect of assessment training for teachers.
Their answer to the third question, asking what specific improvements in classroom assessment are likely to have the greatest impact, is the most interesting of all. They describe the positive effects on student learning of (a) increasing the accuracy of classroom assessments, (b) providing students with frequent informative feedback, rather than infrequent judgmental feedback, and (c) involving students deeply in the classroom assessment, record keeping, and communication processes. They conclude that "self-assessment by pupils, therefore, far from being a luxury, is in fact an essential component of formative assessment. When anyone is trying to learn, feedback about the effort has three elements: redefinition of the desired goal, evidence about present position, and some understanding of a way to close the gap between the two. All three must be understood to some degree by anyone before he or she can take action to improve learning" (p. 143).
Standards of Quality
To have such positive effects, classroom assessments must be carefully developed to yield dependable evidence of student achievement. If they meet the five standards of quality described below, they will, in all probability, produce accurate results.
These standards can take the form of five questions that the developer can ask about the assessment: (1) Am I clear about what I want to assess? (2) Do I know why I am assessing? (3) Am I sure about how to gather the evidence that I need? (4) Have I gathered enough evidence? (5) Have I eliminated all relevant sources of bias in results? Answers to these questions help judge the quality of classroom assessments. Each is considered in greater detail below.
Standard 1. In any classroom assessment context, one must begin the assessment development process by defining the precise vision of what it means to succeed. Proper assessment methods can be selected only when one knows what kind of achievement needs to be assessed. Are students expected to master subject-matter content–meaning to know and understand? If so, does this mean they must know it outright, or does it mean they must know where and how to find it using reference sources? Are they expected to use their knowledge to reason and solve problems? Should they be able to demonstrate mastery of specific performance skills, where it's the doing that is important, or to use their knowledge, reasoning, and skills to create products that meet standards of quality?
Because there is no single assessment method capable of assessing all these various forms of achievement, one cannot select a proper method without a sharp focus on which of these expectations is to be assessed. The main quality-control challenge is to be sure the target is clear before one begins to devise assessment tasks and scoring procedures to measure it.
Standard 2. The second quality standard is to build each assessment in light of specific information about its intended users. It must be clear what purposes a particular assessment will serve. One cannot design sound assessments without asking who will use the results, and how they will use them. To provide quality information that will meet people's needs, one must analyze their needs. For instance, if students are to use assessment results to make important decisions about their own learning, it is important to conduct the assessment and provide the results in a manner that will meet their needs, which might be distinctly different from the information needs of a teacher, parent, or principal. Thus, the developer of any assessment should be able to provide evidence of having investigated the needs of the intended user of that assessment, and of having conducted that assessment in a manner consistent with that purpose. Otherwise the assessment is without purpose. The quality-control challenge is to develop and administer an assessment only after it has been determined precisely who will use its results, and how they will use them.
Within this standard of quality, the impact research cited above suggests that special emphasis be given to one particular assessment user, the student. While there has been a tendency to think of the student as the subject (or victim) of the assessment, the fact is that the decisions students make that are based on teacher assessments of their success drive their ultimate success in school. Thus, it is essential that they remain in touch with and feel in control of their own improvement over time.
Standard 3. Since there are several different kinds of achievement to assess, and since no single assessment method can reflect them all, educators must rely on a variety of methods. The options available to the classroom teacher include selected response (multiple choice, true/false, matching, and fill-in), essays, performance assessments (based on observation and judgment), and direct personal communication with the student. The assessment task is to match a method with an intended target, as depicted in Table 1. The quality-control challenge is to be sure that everyone concerned with quality assessment knows and understands how the various pieces of this puzzle fit together.
Standard 4. All assessments rely on a relatively small number of exercises to permit the user to draw inferences about a student's mastery of larger domains of achievement. A sound assessment offers a representative sample of all those possibilities that is large enough to yield dependable inferences about how the respondent would perform if given all possible exercises. Each assessment context places its own special constraints on sampling procedures, and the quality-control challenge is to know how to adjust the sampling strategies to produce results of maximum quality at minimum cost in time and effort.
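The trade-off described in Standard 4 between the dependability of an inference and the number of exercises can be illustrated with a minimal sketch. It rests on a simplifying assumption that is not part of the entry itself: each exercise is treated as an independent trial that the student answers correctly with the same probability, so the observed proportion correct has a binomial standard error. The function name and the example values are hypothetical.

```python
import math


def standard_error(p_true: float, n_items: int) -> float:
    """Standard error of an observed proportion-correct score.

    Assumes (for illustration only) that the n_items exercises are
    independent and each is answered correctly with probability p_true.
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return math.sqrt(p_true * (1.0 - p_true) / n_items)


# Hypothetical student who has mastered 70 percent of the domain.
for n in (5, 10, 20, 40, 80):
    print(f"{n:3d} exercises: standard error of about {standard_error(0.7, n):.3f}")
```

Under these assumptions the standard error shrinks only with the square root of the number of exercises, so quadrupling the sample halves the uncertainty. This is one way to see why each assessment context calls for its own balance between the quality of results and the cost in time and effort.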
Standard 5. Even if one devises clear achievement targets, transforms them into proper assessment methods, and samples student performance appropriately, there are still factors that can cause a student's score on a test to misrepresent his or her real achievement. Problems can arise from the test, from the student, or from the environment where the test is administered.
For example, tests can consist of poorly worded questions; they can place reading or writing demands on respondents that are confounded with mastery of the material being tested; or they can have more than one correct response, be incorrectly scored, or contain racial or ethnic bias. The student can experience extreme evaluation anxiety or interpret test items differently from the author's intent, and students may cheat, guess, or lack motivation. In addition, the assessment environment could be uncomfortable, poorly lighted, noisy, or otherwise distracting. Any of these factors could give rise to inaccurate assessment results. Part of the quality-control challenge is to be aware of the potential sources of bias and to know how to devise assessments, prepare students, and plan assessment environments to deflect these problems before they ever have an impact on results.
See also: Assessment, subentries on Dynamic Assessment, National Assessment of Educational Progress, Performance Assessment; Assessment Tools, subentries on Psychometric and Statistical, Technology Based; Standards for Student Learning; Testing, subentry on Standardized Tests and High-Stakes Assessment.
Black, Paul, and Wiliam, Dylan. 1998. "Assessment and Classroom Learning." Assessment in Education 5 (1):7–74.
Black, Paul, and Wiliam, Dylan. 1998. "Inside the Black Box: Raising Standards through Classroom Assessment." Phi Delta Kappan 80 (2):139–148.
Bloom, Benjamin. 1984. "The Search for Methods of Group Instruction as Effective as One-to-One Tutoring." Educational Leadership 41:4–17.
Crooks, Terry J. 1988. "The Impact of Classroom Evaluation on Students." Review of Educational Research 58 (4):438–481.
Stiggins, Richard J. 2001. Student-Involved Classroom Assessment, 3rd edition. Columbus, OH: Merrill.
The term dynamic assessment (DA) refers to an assessment, by an active teaching process, of a child's perception, learning, thinking, and problem solving. The process is aimed at modifying an individual's cognitive functioning and observing subsequent changes in learning and problem-solving patterns within the testing situation. The goals of DA are to: (a) assess the capacity of the child to grasp the principle underlying an initial problem and to solve it, (b) assess the nature and amount of investment (teaching) that is required to teach a child a given rule or principle, and (c) identify the specific deficient cognitive functions (e.g., systematic exploratory behavior) and non-intellective factors (e.g., need for mastery) that are responsible for failure in performance and how modifiable they are as a result of teaching. In contrast, the term static test (ST) generally refers to a standardized testing procedure in which an examiner presents items to an examinee without any attempt to intervene to change, guide, or improve the child's performance. A static test usually has graduated levels of difficulty, with the tester merely recording and scoring the responses.
DA is usually administered to children who demonstrate some learning disability, low scores on standardized tests, or some emotional or personality disturbance. Very frequently it is given to children coming from a low socioeconomic or culturally different background. The differences between the ST and DA approaches derive from different philosophical perspectives: ST is related to passive acceptance (acceptance of a child's disability and accommodation of the environment to fit these disabilities), while DA is based on active modification (active efforts to modify the child's disabilities by intensive mediation and the establishment of relatively high cognitive goals).
DA development has been motivated by the inadequacy of standardized tests. The inadequacy can be summarized in the following points: (1) Static tests do not provide crucial information about learning processes, deficient cognitive functions that are responsible for learning difficulties, and mediational strategies that facilitate learning. (2) The manifested low performance level of many children, as revealed in ST, very frequently falls short of revealing their learning potential, especially of those identified as coming from disadvantaged social backgrounds, or as having some sort of learning difficulty. Many children fail in static tests because of lack of opportunities for learning experiences, cultural differences, specific learning difficulties, or traumatic life experiences. (3) In many static tests children are described in general terms, mostly in relation to their position within their peer group, but the tests do not provide clear descriptions of the processes involved in learning or recommendations for prescriptive teaching and remedial learning strategies. (4) Static tests do not relate to non-intellective factors that can influence individuals' cognitive performance, sometimes more than the "pure" cognitive factors. Non-intellective factors (e.g., intrinsic motivation, need for mastery, locus of control, anxiety, frustration tolerance, self-confidence, and accessibility to mediation) are no less important in determining children's intellectual achievements than are the "pure" cognitive factors. This is especially true with individuals whose emotional or motivational problems interfere with their cognitive performance.
In comparison with ST, DA is designed to provide accurate information about: (a) an individual's current learning ability and learning processes; (b) specific cognitive factors (e.g., impulsivity, planning behavior) responsible for problem-solving ability and academic success or failure; (c) efficient teaching strategies for the child being studied; and (d) motivational, emotional, and personality factors that affect cognitive processes.
Lev Vygotsky's concept of a zone of proximal development (ZPD) and Reuven Feuerstein's theory of mediated learning experience (MLE) served as the main conceptual bases for most of the DA elaboration. The ZPD is defined as the difference between a child's "actual developmental level as determined by independent problem solving" and the higher level of "potential development as determined through problem solving under adult guidance or in collaboration with more capable peers" (Vygotsky, p. 86). MLE interactions are defined as a process in which parents or experienced adults interpose themselves between a set of stimuli and a child and modify the stimuli for the developing child. In a DA context, the examiner mediates the rules and strategies for solving specific problems on an individual basis, and assesses the level of internalization (i.e., deep understanding) of these rules and strategies as well as their transfer value to other problems of increased level of complexity, novelty, and abstraction.
The Nature of Dynamic Assessment
DA is meant to be a complement to standardized testing, not a substitute for it. It is presented as a broad approach, not as a particular test. Different criteria of change are used in DA: pre- to post-teaching gains, amount and type of teaching required, and the degree of transfer of learning. The choice to use change criteria to predict future cognitive performance (as well as predicted outcome of intervention programs) is based on the belief that measures of change are more closely related to teaching processes (by which the child is taught how to process information) than they are to conventional measures of intelligence. The major differences between DA and conventional tests in regard to goals, testing processes, types of instruments, test situations, and interpretation of results are presented in Table 1.
Using DA. Clinical experience has shown that it is most useful to use DA when standardized tests yield low scores; when standardized tests hover around margins of adequacy in cognitive functioning; when there are serious discrepancies between a child's test scores and academic performance; when a child comes from a low socioeconomic or culturally or linguistically different background; or when a child shows some emotional disturbance, personality disorder, or learning disability.
Reliability of DA. One of the objectives of DA is to change an individual's cognitive functioning within the testing context so as to produce unreliability among test items (i.e., lack of consistency between repeated responses). DA reliability is therefore usually assessed by inter-rater agreement (two or more observers rate the child's behavior) regarding the child's cognitive performance, mediation (teaching) strategies required to change the child's functioning, cognitive functions (e.g., level of impulsivity, planning behavior) that affect performance, and motivational-emotional factors. Such test reliability has been demonstrated with learning disabled and educable mentally retarded (EMR) children. Overall inter-rater agreement for the type of intervention (mediation) required to change a child's performance for deficient cognitive functions, such as impulsivity, lack of planning, and lack of systematic behavior, has been shown to be about 89 percent. For different cognitive tasks, different profiles of deficient cognitive functions have been observed and different types of teaching can be applied.
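One simple index of inter-rater agreement is the percentage of observations on which two raters assign the same category. The sketch below shows that calculation only; it is not taken from the studies cited in this entry, which may have used a different index, and the function name and ratings are invented for illustration.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of observations on which two raters chose the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of observations")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


# Hypothetical ratings: the deficient cognitive function each rater
# judged to be most prominent on nine observed problems.
rater_a = ["impulsivity", "planning", "impulsivity", "systematic", "planning",
           "impulsivity", "systematic", "planning", "impulsivity"]
rater_b = ["impulsivity", "planning", "impulsivity", "systematic", "planning",
           "impulsivity", "planning", "planning", "impulsivity"]

print(f"Agreement: {percent_agreement(rater_a, rater_b):.1f} percent")
```

Simple percent agreement does not correct for matches that would occur by chance, so it is usually read alongside the number of categories available to the raters.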
Educational perspectives. Previous research has shown that standardized IQ tests underestimate the cognitive ability of children from low socioeconomic settings, from minority groups, and children having learning difficulties. Criteria of change (i.e., pre- to post-teaching gains on a test), as measured by DA, have been found to be more powerful in predicting academic performance, more accurate in prescribing individualized educational plans and specific cognitive interventions, and better able to distinguish between different clinical groups than ST scores. David Tzuriel and Pnina Klein, using the Children's Analogical Thinking Modifiability (CATM) test, showed that the highest pre- to post-teaching gains were found among children identified either as disadvantaged or advantaged, compared with children with special education needs and mentally retarded children. Higher levels of functioning, for all groups, were found on the CATM than on the Raven's Colored Progressive Matrices (RCPM)–when the latter was given as a standardized test–especially when comparing performance on analogy items of the RCPM versus problems on the CATM. The advantaged and disadvantaged children scored 69 percent and 64 percent, respectively, on the CATM, compared with 39 percent and 44 percent on the RCPM. The effects of teaching were more articulated in difficult tasks than in easy ones.
Findings with the Children's Inferential Thinking Modifiability (CITM) and the Children's Seriational Thinking Modifiability (CSTM) tests indicate that children from minority groups or disadvantaged backgrounds have an initial lower level of functioning than children from mainstream groups or an advantaged background. After a teaching phase, however, they showed higher levels of gain and narrowed the gap. The gap between the two groups was also narrower in a transfer phase consisting of more difficult problems. The degree of improvement was higher in high-complexity problems than in low-complexity problems.
In several studies DA was found to verify the distinction between cultural deprivation and cultural difference. Tzuriel, following Feuerstein, differentiated between those who function poorly as a result of cultural differences and those who have experienced cultural deprivation. The DA approach, in this respect, offers a solution not only for its differential diagnostic value, but also for its potential prescriptive remediation of deficiencies and its enhancement of learning processes.
For certain DA measures, significant positive correlations have been found between the level of difficulty of an item and the level of improvement on that item, and DA post-teaching scores have been shown to be better predictors of academic achievement than static scores. In addition, a higher prediction value was found among children with high learning potential than among children with average learning potential. Findings of many studies raise serious doubts, especially with low functioning groups, about the ability of ST scores to represent accurately an individual's ability and to serve as indicators for future intervention and change.
Evaluation of Cognitive Education Programs
Dynamic assessment has also been used to evaluate cognitive education programs designed to develop learning and thinking skills. Given that one of the major goals of these programs is advancing learning-to-learn skills, it is essential that the change criteria in DA be assessed. Several studies have shown that experimental groups who received any one of a number of cognitive education programs (e.g., Bright Start, Instrumental Enrichment, Peer-Mediation with Young Children) attained higher pre- to post-teaching gains on DA tests than did control groups. The DA scores depicted the effects of the intervention better than ST scores did.
Developmental research using DA has focused on predicting learning ability by assessing the quality of parent–child interactions, specifically mother–child mediated learning experience (MLE). MLE interactions are defined as an interactional process in which parents, or substitute adults, interpose themselves between a set of stimuli and the child and modify the stimuli for the developing child. The mediator modifies the stimuli by focusing the child on their characteristics, by arousing curiosity, vigilance, and perceptual acuity in the child, and by trying to improve and/or create in the child the cognitive functions required for temporal, spatial, and cause-effect relationships. Major findings have been that children's post-teaching scores are more accurately predicted by MLE mother–child interactions than by ST scores and that mediation for transcendence (expanding an experience by making a rule or principle, generalizing an event beyond the concrete experience) has emerged as the most powerful predictor of children's ability to change following teaching. These findings support the hypothesis that mother–child mediation strategies are internalized and used later in other learning contexts. Children whose mothers used a high level of mediation for transcendence internalized the mechanism and used it in other learning contexts where they needed this type of mediation. Findings of several studies confirm the hypothesis that MLE interactions, conceptualized as the proximal factor of cognitive development (i.e., directly explaining cognitive functioning), predicted children's cognitive change, whereas distal factors (i.e., SES, mothers' IQ, child's personality orientation, mother's emotional attitudes toward the child) did not predict cognitive change in children.
In spite of the efficacy of DA, some problems exist. First, DA takes more time to administer and requires more skill, better training, more experience, and greater effort than ST. A cost-effectiveness issue is raised by psychologists, educators, and policymakers who are not convinced that the information derived from DA is worth the investment required to get it, and that the information acquired will then be used efficiently to enhance specific learning strategies and academic achievements.
Second, the extent to which cognitive modifiability is generalized across domains needs further investigation. This issue has practical implications for the designing of tests and mediational procedures. Third, validation of DA is much more complex than validation of ST because it has a broader scope of goals (assessing initial performance, deficient cognitive functions, type and amount of mediation, nonintellective factors, and certain parameters of change). In validating DA one needs to develop criteria variables that measure changes that are due to a cognitive intervention.
Finally, the literature is replete with evidence showing a strong relation between IQ (an ST measure) and school achievement (r = .71). Because the proportion of variance explained is the square of the correlation, this means that about 50 percent of the variance in learning outcomes for students can be explained by differences in psychometric IQ. However, three extremely important questions remain: (1) What causes the other 50 percent of achievement variance? (2) When IQ predicts low achievement, what is necessary to defeat that prediction? and (3) What factors influencing the unexplained variance can help to defeat the prediction in the explained variance?
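The step from a correlation of .71 to roughly half of the variance follows from squaring the correlation coefficient. A minimal sketch of that arithmetic is given below; the variable names are illustrative, and the calculation assumes a simple linear relation between the two measures.

```python
r = 0.71                      # correlation between IQ and school achievement, as reported above
explained = r ** 2            # proportion of achievement variance associated with IQ
unexplained = 1.0 - explained  # proportion left to other factors

print(f"Explained variance:   {explained:.1%}")
print(f"Unexplained variance: {unexplained:.1%}")
```

The remaining half of the variance is the portion to which the three questions above are addressed.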
See also: Assessment, subentry on Classroom Assessment; Testing, subentry on Standardized Tests and High-Stakes Testing.
Feuerstein, Reuven; Rand, Ya'cov; and Hoffman, Mildred B. 1979. The Dynamic Assessment of Retarded Performers: The Learning Potential Assessment Device: Theory, Instruments, and Techniques. Baltimore: University Park Press.
Haywood, H. Carl. 1997. "Interactive Assessment." In Assessment of Individuals with Mental Retardation, ed. Ronald L. Taylor. San Diego, CA: Singular.
Haywood, H. Carl, and Tzuriel, David, eds. 1992. Interactive Assessment. Berlin: Springer-Verlag.
Lidz, Carol S., ed. 1987. Dynamic Assessment. New York: Guilford.
Tzuriel, David. 1999. "Parent-Child Mediated Learning Transactions as Determinants of Cognitive Modifiability: Recent Research and Future Directions." Genetic, Social, and General Psychology Monographs 125:109–156.
Tzuriel, David. 2001. Dynamic Assessment of Young Children. New York: Kluwer Academic/Plenum.
Tzuriel, David, and Haywood, H. Carl. 1992. "The Development of Interactive-Dynamic Approaches for Assessment of Learning Potential." In Interactive Assessment, ed. H. Carl Haywood and David Tzuriel. New York: Springer-Verlag.
Tzuriel, David, and Klein, Pnina S. 1985. "Analogical Thinking Modifiability in Disadvantaged, Regular, Special Education, and Mentally Retarded Children." Journal of Abnormal Child Psychology 13:539–552.
Tzuriel, David, and Samuels, Marilyn T. 2000. "Dynamic Assessment of Learning Potential: Inter-Rater Reliability of Deficient Cognitive Functions, Type of Mediation, and Non-Intellective Factors." Journal of Cognitive Education and Psychology 1:41–64.
Vygotsky, Lev S. 1978. Mind in Society. Cambridge, MA: Harvard University Press.
NATIONAL ASSESSMENT OF EDUCATIONAL PROGRESS
The primary means to monitor the status and development of American education, the National Assessment of Educational Progress (NAEP), was conceived in 1963 when Francis Keppel, U.S. Commissioner of Education, appointed a committee to explore options for assessing the condition of education in the United States. The committee, chaired by Ralph Tyler, recommended that an information system be developed based on a battery of psychometric tests.
NAEP's Original Purpose and Design
A number of key features were recommended in the original design of NAEP, several of which were intended to make it substantially different from typical standardized tests of academic achievement. Many, but not all, of these features were incorporated into the first assessments and have persisted throughout NAEP's history. Others have changed in response to policy needs.
With respect to matters of content, each assessment cycle was supposed to target one or more broadly defined subject areas that corresponded to familiar components of school curricula, such as mathematics. For each subject area, panels of citizens would be asked to form consensus groups about appropriate learning objectives at each target age. Test questions or items were to be developed bearing a one-to-one correspondence to particular learning objectives. Thus, from NAEP's beginning, there have been heavy demands for content validity as part of the assessment development process.
Several interesting technical design features were proposed for the assessment program. Of special note was the use of matrix-sampling, a design that distributes large numbers of items broadly across school buildings, districts, and states but limits the number of items given to individual examinees. In essence, the assessment was designed to glean information from hundreds of items, several related to each of many testing objectives, while restricting the amount of time that any student has to spend responding to the assessment. The target period proposed was approximately fifty minutes per examinee. All test items were to be presented by trained personnel rather than by local school personnel in order to maintain uniformly high standards of administration.
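The idea behind matrix sampling can be sketched in a few lines. The example below is a deliberately simplified illustration, not a description of NAEP's operational design: the pool size, booklet length, identifiers, and function names are all hypothetical, and booklets here do not overlap. A large item pool is split into short booklets, and booklets are rotated across a shuffled list of examinees so that every item is answered by some students while no student sees more than one booklet.

```python
import random


def build_booklets(item_ids, items_per_booklet):
    """Split the item pool into consecutive, non-overlapping booklets."""
    return [item_ids[i:i + items_per_booklet]
            for i in range(0, len(item_ids), items_per_booklet)]


def assign_booklets(examinee_ids, booklets, seed=0):
    """Give each examinee exactly one booklet, rotating through the booklets."""
    rng = random.Random(seed)   # fixed seed so the sketch is reproducible
    order = list(examinee_ids)
    rng.shuffle(order)          # randomize who receives which booklet
    return {examinee: booklets[i % len(booklets)]
            for i, examinee in enumerate(order)}


# Hypothetical pool of 200 items and sample of 1,000 examinees.
items = [f"item{n:03d}" for n in range(1, 201)]
students = [f"student{n:04d}" for n in range(1, 1001)]

booklets = build_booklets(items, items_per_booklet=20)
assignment = assign_booklets(students, booklets)

covered = {item for booklet in assignment.values() for item in booklet}
print(f"{len(booklets)} booklets, {len(covered)} of {len(items)} items administered")
```

In this sketch each student answers 20 items, yet group-level results can be tabulated for all 200, which is the property that lets an assessment cover many objectives within a fixed testing period per examinee.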
The populations of interest for NAEP were to be all U.S. residents at ages 9, 13, and 17, as well as young adults. This would require the selection of private and public schools into the testing sample, as well as selection of examinees at each target age who were not in school. Results would be tabulated and presented by age and by demographic groups within age–but never by state, state subunit, school district, school, or individual. Assessment results would be reported to show the estimated percentage of the population or subpopulation that answered each item and task correctly. And finally, only a subset of the items would be released with each NAEP report. The unreleased items would remain secure, to be administered at a later testing for determining performance changes over time, thereby providing the basis for determining trends in achievement.
The agenda and design laid out for NAEP in the mid-1960s reflected the political and social realities of the time. Prominent among these was the resistance of state and local policymakers to a national curriculum; state and local leaders feared federal erosion of their autonomy and voiced concern about pressure for accountability. Several of NAEP's features thwarted perceptions of the program as a federal testing initiative addressing a nationally prescribed curriculum. Indeed, NAEP's design provided nationally and regionally representative data on the educational condition of American schools, while avoiding any implicit federal standards or state, district, and school comparisons. NAEP was dubbed the "nation's educational barometer." It became operational in 1969 and 1970, and the first assessments were in science, citizenship, and writing.
Pressures for Redesign of NAEP
As federal initiatives during the 1960s and 1970s expanded educational opportunities, they fostered an administrative imperative for assessment data to help gauge the effect on the nation's education system. NAEP's original design could not accommodate the increasing demands for data about educationally important populations and issues. Age-level (rather than grade-level) testing made it difficult to link NAEP results to state and local education policies and school practices. Furthermore, its reporting scheme allowed for measurement of change on individual items, but not on the broad subject areas; monitoring the educational experiences of students in varied racial and ethnic, language, and economic groups was difficult without summary scores. Increasingly, NAEP was asked to provide more information so that government and education officials would have a stronger basis for making judgments about the adequacy of education services; NAEP's constituents were seeking information that, in many respects, conflicted with the basic design of the program.
The first major redesign of NAEP took place in 1984, when responsibility for its development and administration was moved from the Education Commission of the States to the Educational Testing Service. The design for NAEP's second generation changed the procedures for sampling, objective setting, item development, data collection, and analysis. Tests were administered by age and grade groupings. Summary scores were provided for each subject area; scale scores were introduced for reporting purposes. These and other changes afforded the program much greater flexibility in responding to policy demands as they evolved.
Almost concurrently, however, the report A Nation at Risk was issued in 1983. It warned that America's schools and students were performing poorly and spawned a wave of state-level education reforms. As states invested more and more in their education systems, they sought information about the effectiveness of their efforts. State-level policymakers looked to NAEP for guidance on the effectiveness of alternative practices. The National Governors' Association issued a call for state-comparable achievement data, and a new report, The Nation's Report Card, recommended that NAEP be expanded to provide state-level results.
As the program retooled to accommodate this change, participants in a 1989 education summit in Charlottesville, Virginia, set out to expand NAEP even further. President George Bush and the nation's governors challenged the prevailing assumptions about national expectations for achievement in American schools. They established six national goals for education and specified the subjects and grades in which progress should be measured with respect to national and international frames of reference. By design, these subjects and grades paralleled NAEP's structure. The governors called on educators to hold students to "world-class" standards of knowledge and skill. The governors' commitment to high academic standards included a call for the reporting of NAEP results in relation to rigorous performance standards. They challenged NAEP to describe not only what students currently know and can do, but also what young people should know and be able to do as participants in an education system that holds its students to high standards.
NAEP in the Early Twenty-First Century
The program that took shape during the 1990s is the large and complex NAEP that exists in the early twenty-first century. The NAEP program continues to evolve in response to both policy challenges and results from federally mandated external evaluations. NAEP includes two distinct assessment programs with different instrumentation, sampling, administration, and reporting practices. The two assessments are referred to as trend NAEP and main NAEP.
Trend NAEP is a collection of test items in reading, writing, mathematics, and science that have been administered many times since the 1970s. As the name implies, trend NAEP is designed to document changes in academic performance over time. During the 1990s, trend NAEP was administered in 1990, 1992, 1994, 1996, and 1999. Trend NAEP is administered to nationally representative samples of students aged 9, 13, and 17 following the original NAEP design.
Main NAEP consists of test items that reflect current thinking about what students should know and be able to do in the NAEP subject areas. They are based on contemporary content and skill outlines developed by consensus panels for reading, writing, mathematics, science, U.S. history, world history, geography, civics, the arts, and foreign languages. These content frameworks are periodically reviewed and revised.
Main NAEP is further complicated by having two components, national NAEP and state NAEP. The former assesses nationally representative samples of students in grades 4, 8, and 12. In most but not all subjects, national NAEP is supposed to be administered two, three, or four times during a twelve-year period, to make it possible to examine short-term trends in performance over a decade. State NAEP assessments are administered to state-representative samples of students in states that voluntarily elect to participate in the program. State NAEP uses the same large-scale assessment materials as those used in national NAEP, but is only administered in grades 4 and 8 in reading, writing, mathematics, and science. In contrast to national NAEP, the tests are administered by local school personnel rather than an independent contractor.
One of the most substantial changes in the main NAEP program is the reporting of results relative to performance standards. In each content area, performance standards are defined for three levels of achievement: basic, proficient, and advanced. The percentage of students at a given grade level whose performance is at or above an achievement level standard is reported, as are trends in the percentages over successive administrations of NAEP in a content area. Achievement level reporting is done for both national NAEP and state NAEP and has become one of the most controversial aspects of the NAEP program.
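The "at or above" reporting described above amounts to a cumulative percentage for each level. The following is a minimal sketch under stated assumptions: the cut scores and the ten scale scores are invented for illustration, the function name `percent_at_or_above` is hypothetical, and the sample weighting and other statistical machinery used in actual NAEP reporting are omitted.

```python
def percent_at_or_above(scores, cut_scores):
    """Return, for each achievement level, the percentage of scores
    that meet or exceed that level's cut score."""
    if not scores:
        raise ValueError("scores must not be empty")
    total = len(scores)
    return {level: 100.0 * sum(score >= cut for score in scores) / total
            for level, cut in cut_scores.items()}

# Hypothetical cut scores and scale scores, not real NAEP values.
cuts = {"basic": 200, "proficient": 240, "advanced": 280}
scores = [180, 205, 210, 238, 240, 255, 262, 281, 290, 199]

result = percent_at_or_above(scores, cuts)
# result == {"basic": 80.0, "proficient": 50.0, "advanced": 20.0}
```

Because the percentages are cumulative, a student counted at or above advanced is also counted at or above proficient and basic, so the three figures never sum to a meaningful total.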
NAEP's complex design is mirrored by a complex governance structure. The program is governed by the National Assessment Governing Board (NAGB), appointed by the secretary of education but independent of the department. The board, authorized to set policy for NAEP, is designed to be broadly representative of NAEP's varied audiences. It selects the subject areas to be assessed and ensures that the content and skill frameworks that specify goals for assessment are produced through a national consensus process. In addition, NAGB establishes performance standards for each subject and grade tested, in consultation with its contractor for this task. NAGB also develops guidelines for NAEP reporting. The commissioner of education statistics, who leads the National Center for Education Statistics (NCES) in the U.S. Department of Education, retains responsibility for NAEP operations and technical quality control. NCES procures test development and administration services from cooperating private companies.
Evaluations of NAEP
As part of the process of transforming and expanding NAEP during the 1990s, Congress mandated periodic, independent evaluations of the NAEP program. Two such multiyear evaluations were conducted, the first by the National Academy of Education and the second by the National Academy of Sciences. Both evaluations examined several features of the NAEP program design, including development of the assessment frameworks, the technical quality of the assessments, the validity of the achievement level reporting, and initiation of the state NAEP assessments. The evaluations concluded that there are many laudable aspects of NAEP supporting its label as the "gold standard" for assessment of academic achievement. Among the positives are NAEP's attempts to develop broad, consensus-based content area frameworks, incorporate constructed-response tasks and item formats that tap more complex forms of knowledge, use matrix sampling to cover a wide range of curriculum content area topics, and employ powerful statistical methods to analyze the results and develop summary scores. These evaluations also concluded that state NAEP, which had developmental status at the start of the 1990s, served a valuable purpose and should become a regular part of the NAEP program, which it did.
The two evaluations also saw considerable room for improvement in NAEP, in many of the areas mentioned above where strength already existed. Two areas of concern were of particular note. The first was the need to broaden the range of knowledge and cognitive skills that should be incorporated into NAEP's assessment frameworks and included as part of the assessment design. Both evaluations argued that NAEP was not fully taking advantage of advances in the cognitive sciences regarding the nature of knowledge and expertise and that future assessments needed to measure aspects of knowledge that were now deemed to be critical parts of the definition of academic competence and achievement. Suggestions were made for how NAEP might do this by developing a portfolio of assessment methods and approaches.
The second major area of concern was the validity of the achievement level analysis and reporting process. Both evaluations, as well as others that preceded them, were extremely critical of both the process that NAEP was using to determine achievement levels and the outcomes that were reported. It was judged that the entire achievement level approach lacked validity and needed a major conceptual and operational overhaul. As might be expected, this critique met with less than resounding approval by the National Assessment Governing Board, which is responsible for the achievement level–setting process.
Many of the concerns raised in the two major evaluations of NAEP, along with many other reviews of various aspects of the NAEP program, have served as stimuli in an ongoing process of refining, improving, and transforming NAEP. One of NAEP's hallmarks as an assessment program is its capacity to evolve, engage in cutting-edge assessment development work, and provide results of value to many constituencies. It continues to serve its role as "The Nation's Report Card."
See also: Assessment Tools, subentries on Psychometric and Statistical, Technology Based.
Alexander, Lamar. 1991. America 2000. Washington, DC: U.S. Department of Education.
Alexander, Lamar, and James, H. Thomas. 1987. The Nation's Report Card: Improving the Assessment of Student Achievement. Stanford, CA: National Academy of Education.
Glaser, Robert; Linn, Robert; and Bohrnstedt, George. 1992. Assessing Student Achievement in the States. Stanford, CA: National Academy of Education.
Glaser, Robert; Linn, Robert; and Bohrnstedt, George. 1993. The Trial State Assessment: Prospects and Realities. Stanford, CA: National Academy of Education.
Glaser, Robert; Linn, Robert; and Bohrnstedt, George. 1996. Quality and Utility: The 1994 Trial State Assessment in Reading. Stanford, CA: National Academy of Education.
Glaser, Robert; Linn, Robert; and Bohrnstedt, George. 1997. Assessment in Transition: Monitoring the Nation's Educational Progress. Stanford, CA: National Academy of Education.
Jones, Lyle V. 1996. "A History of the National Assessment of Educational Progress and Some Questions about Its Future." Educational Researcher 25 (6):1–8.
Messick, Samuel; Beaton, Albert; and Lord, Frederick. 1983. National Assessment of Educational Progress Reconsidered: A New Design for a New Era. Princeton, NJ: Educational Testing Service.
National Center for Education Statistics. 1974. NAEP General Information Yearbook. Washington, DC: U.S. Department of Education.
National Commission on Excellence in Education. 1983. A Nation at Risk: The Imperative for Educational Reform. Washington, DC: U.S. Government Printing Office.
Office of Technology Assessment. 1992. Testing in America's Schools: Asking the Right Questions. Washington, DC: U.S. Government Printing Office.
Pellegrino, James W.; Jones, Lee R.; and Mitchell, Karen J. 1999. Grading the Nation's Report Card: Evaluating NAEP and Transforming the Assessment of Educational Progress. Washington, DC: National Academy Press.
James W. Pellegrino
The term performance assessment (PA) is typically used to refer to a class of assessments that is based on observation and judgment. That is, in PA an assessor usually observes a performance or the product of a performance and judges its quality. For example, to judge one's competence to operate an automobile, it is normally required that one pass a road test, during which actual driving is observed and evaluated. Similarly, Olympic athletes are judged on the basis of observed performances. PA has long been used to judge proficiency in industrial, military, and artistic settings, and interest in its application to educational settings has grown at the start of the twenty-first century.
Educators' interest in PA can be attributed to several factors. It has been argued that performance measures offer a potential advantage of increased validity over other forms of testing that rely on indirect indicators of a desired competence or proficiency. That is, to assess ability to spell one might prefer to have direct evidence that a person can spell words correctly rather than inferring the ability from tasks that involve identifying misspelled words in a list. Proponents of performance assessment have identified many possible benefits, such as allowing a broad range of learning outcomes to be assessed and preserving the complex nature of disciplinary knowledge and inquiry, including conceptual understanding, problem-solving skills, and the application of knowledge and understanding to unique situations. Of particular interest is the potential of PA to capture aspects of higher-order thinking and reasoning, which are difficult to test in other ways.
Moreover, because some research has reported that teachers tend to adapt their instructional practice to reflect the form and content of external assessments, and because performance assessments tend to be better than conventional forms of testing at capturing more complex instructional goals and intentions, it has been argued that "teaching to the test" might be a positive consequence if PA were used to evaluate student achievement. Finally, some proponents have argued that PA could be more equitable than other forms of assessment because PA can engage students in "authentic," contextualized performance, closely related to important instructional goals, thus avoiding the sources of bias associated with testing rapid recall of decontextualized information.
Educational Uses of Performance Assessment
Although performance assessment has been employed in many educational settings, including the assessment of teachers, a primary use in education has been to assess student learning outcomes. PA has long been used in classrooms by teachers to determine what has been learned and by whom. PA may be applied in the classroom in informal ways (as when a teacher observes a student as she solves a problem during seat work) or in more formal ways (as when a teacher collects and scores students' written essays). Within the classroom PA can serve as a means of assigning course grades, communicating expectations, providing feedback to students, and guiding instructional decisions. When PA is used for internal classroom assessment, both the form and content of the assessment can be closely aligned with a teacher's instructional goals. Therefore, the use of performance assessment in the classroom has been seen by some as a promising means of accomplishing a long-standing, elusive goal–namely, the integration of instruction and assessment.
Performance assessment has also been employed in the external assessment of student learning outcomes. PA received significant attention from educators and assessment specialists during the latter part of the 1980s and throughout the 1990s. This increased interest in PA occurred as subject matter standards were established and corresponding changes in instructional practice were envisioned. A growing dissatisfaction with selected-response testing (e.g., true/false questions and multiple-choice items) and an awareness of advances in research in cognition and instruction also spawned interest in PA. Constructed-response tasks (e.g., tasks calling for brief or extended explanations or justifications) became increasingly popular as a means of capturing much of what is valued instructionally in a form that could be included in an external assessment of student achievement. In addition, for subjects such as science and mathematics, tasks that involve hands-on use of materials and tools have been developed. The net result of approximately fifteen years of research and development effort is the inclusion of written essays and constructed-response tasks in tests intended to assess achievement in various subject areas, including writing, history, mathematics, and science. A survey of state assessment practices in the mid-1990s found that thirty-four states required writing samples, and ten states incorporated constructed-response tasks into their assessments.
Performance Assessment: Challenges and Opportunities
A variety of technical and feasibility issues have plagued attempts to employ PA on a large scale. Among the technical issues that await satisfactory resolution are concerns about ensuring generalizability and comparability of performance across tasks and concerns about the scoring of complex tasks and the appropriate interpretation of performances. Efforts to use PA have also been limited due to concerns about the relatively high costs of development, administration, and scoring, when compared to more conventional testing. Finally, despite the hopes of advocates of PA regarding the likely benefits of its widespread adoption, some analyses have raised concerns about equity issues and the limited positive impact on classroom teaching of using PA in external testing.
Despite the problems that have prevented widespread adoption of performance assessment, many educators and assessment experts remain enthusiastic about the potential of PA to address many limitations of other forms of assessment. In particular, advances in the cognitive sciences and technology, along with the increasing availability of sophisticated technological tools in educational settings, may provide new opportunities to resolve many of these issues. For example, the costs of development, administration, and scoring may be decreased through the use of new technologies. And generalizability across tasks may be increased through the use of intelligent systems that offer ongoing assessment well integrated with instruction and sensitive to changes in students' understanding and performance, with performance data collected over a long period of time as opposed to one-time, on-demand testing.
See also: Assessment, subentry on Classroom Assessment; Standards for Student Learning.
Airasian, Peter W. 1991. Classroom Assessment. New York: McGraw-Hill.
Baxter, Gail P., and Glaser, Robert. 1998. "Investigating the Cognitive Complexity of Science Assessments." Educational Measurement: Issues and Practice 17 (3):37–45.
Bennett, Randy E., and Ward, William C., eds. 1993. Construction Versus Choice in Cognitive Measurement. Hillsdale, NJ: Lawrence Erlbaum Associates.
Bond, Lloyd A. 1995. "Unintended Consequences of Performance Assessment: Issues of Bias and Fairness." Educational Measurement: Issues and Practice 14 (4):21–24.
Bond, Lloyd A.; Braskamp, David; and Roeber, Edward. 1996. The Status Report of the Assessment Programs in the United States. Oak Brook, IL: North Central Regional Educational Laboratory.
Brennan, Robert L., and Johnson, Eugene G. 1995. "Generalizability of Performance Assessments." Educational Measurement: Issues and Practice 14 (4):9–12, 27.
Cole, Nancy S. 1988. "A Realist's Appraisal of the Prospects for Unifying Instruction and Assessment." Assessment in the Service of Learning: Proceedings of the 1987 ETS Invitational Conference. Princeton, NJ: Educational Testing Service.
Darling-Hammond, Linda. 1995. "Equity Issues in Performance-Based Assessment." In Equity and Excellence in Educational Testing and Assessment, ed. Michael T. Nettles and Arie L. Nettles. Boston: Kluwer.
Frederiksen, John R., and Collins, Allan. 1989. "A Systems Approach to Educational Testing." Educational Researcher 18 (9):27–32.
Gao, X. James; Shavelson, Richard J.; and Baxter, Gail P. 1994. "Generalizability of Large-Scale Performance Assessments in Science: Promises and Problems." Applied Measurement in Education 7:323–342.
Glaser, Robert, and Silver, Edward A. 1994. "Assessment, Testing, and Instruction: Retrospect and Prospect." In Review of Research in Education, Vol. 20, ed. Linda Darling-Hammond. Washington, DC: American Educational Research Association.
Green, Bert F. 1995. "Comparability of Scores from Performance Assessments." Educational Measurement: Issues and Practice 14 (4):13–15, 24.
Heubert, Jay, and Hauser, Robert. 1999. High Stakes: Testing for Tracking, Promotion, and Graduation. Washington, DC: National Academy Press.
Messick, Samuel. 1994. "The Interplay of Evidence and Consequences in the Validation of Performance Assessments." Educational Researcher 23 (1):13–23.
Messick, Samuel, ed. 1995. "Special Issue: Values and Standards in Performance Assessment: Issues, Findings, and Viewpoints." Educational Measurement: Issues and Practice 14 (4).
Pellegrino, James; Chudowsky, Naomi; and Glaser, Robert. 2001. Knowing What Students Know: The Science and Design of Educational Assessment. Washington, DC: National Academy Press.
Reckase, Mark, ed. 1993. "Special Issue: Performance Assessment." Journal of Educational Measurement 30 (3).
Resnick, Lauren B., and Resnick, Daniel P. 1992. "Assessing the Thinking Curriculum: New Tools for Educational Reform." In Changing Assessments: Alternative Views of Aptitude, Achievement, and Instruction, ed. Bernard R. Gifford and Mary C. O'Connor. Boston: Kluwer.
Shavelson, Richard J.; Baxter, Gail P.; and Gao, X. James. 1993. "Sampling Variability of Performance Assessments." Journal of Educational Measurement 30:215–232.
Shavelson, Richard J.; Baxter, Gail P.; and Pine, Jerry. 1992. "Performance Assessments: Political Rhetoric and Measurement Reality." Educational Researcher 21 (4):22–27.
Silver, Edward A.; Alacaci, Cengiz; and Stylianou, Despina. 2000. "Students' Performance on Extended Constructed-Response Tasks." In Results from the Seventh Mathematics Assessment of the National Assessment of Educational Progress, ed. Edward A. Silver and Patricia A. Kenney. Reston, VA: National Council of Teachers of Mathematics.
Smith, Mary L. 1991. "Put to the Test: The Effects of External Testing on Teachers." Educational Researcher 20 (5):8–11.
Wiggins, Grant. 1989a. "Teaching to the (Authentic) Test." Educational Leadership 46 (7):41–47.
Wiggins, Grant. 1989b. "A True Test: Toward More Authentic and Equitable Assessment." Phi Delta Kappan 70:703–713.
Wiggins, Grant. 1992. "Creating Tests Worth Taking." Educational Leadership 49 (8):26–33.
Wolf, Dennie; Bixby, Janet; Glenn, John, III; and Gardner, Howard. 1991. "To Use Their Minds Well: Investigating New Forms of Student Assessment." In Review of Research in Education, Vol. 17, ed. Gerald Grant. Washington, DC: American Educational Research Association.
Edward A. Silver
Portfolio assessment is a term with many meanings, and it is a process that can serve a variety of purposes. A portfolio is a collection of student work that can exhibit a student's efforts, progress, and achievements in various areas of the curriculum. A portfolio assessment can be an examination of student-selected samples of work experiences and documents related to outcomes being assessed, and it can address and support progress toward achieving academic goals, including student efficacy. Portfolio assessments have been used for large-scale assessment and accountability purposes (e.g., the Vermont and Kentucky statewide assessment systems), for purposes of school-to-work transitions, and for purposes of certification. For example, portfolio assessments are used as part of the National Board for Professional Teaching Standards assessment of expert teachers.
The Development of Portfolio Assessment
Portfolio assessments grew in popularity in the United States in the 1990s as part of a widespread interest in alternative assessment. Because of high-stakes accountability, the 1980s saw an increase in norm-referenced, multiple-choice tests designed to measure academic achievement. By the end of the decade, however, there was increased criticism of the reliance on these tests, which opponents believed assessed only a very limited range of knowledge and encouraged a "drill and kill" multiple-choice curriculum. Advocates of alternative assessment argued that teachers and schools modeled their curriculum to match the limited norm-referenced tests to try to ensure that their students did well, "teaching to the test" rather than teaching content relevant to the subject matter. Therefore, it was important that assessments be worth teaching to and that they model the kinds of significant teaching and learning activities that are worthwhile educational experiences and prepare students for future, real-world success.
Involving a wide variety of learning products and artifacts, such assessments would also enable teachers and researchers to examine the wide array of complex thinking and problem-solving skills required for subject-matter accomplishment. More likely than traditional assessments to be multidimensional, these assessments also could reveal various aspects of the learning process, including the development of cognitive skills, strategies, and decision-making processes. By providing feedback to schools and districts about the strengths and weaknesses of their performance, and influencing what and how teachers teach, it was thought portfolio assessment could support the goals of school reform. By engaging students more deeply in the instructional and assessment process, furthermore, portfolios could also benefit student learning.
Types of Portfolios
While portfolios have broad potential and can be useful for the assessment of students' performance for a variety of purposes in core curriculum areas, the contents and criteria used to assess portfolios must be designed to serve those purposes. For example, showcase portfolios exhibit the best of student performance, while working portfolios may contain drafts that students and teachers use to reflect on process. Progress portfolios contain multiple examples of the same type of work done over time and are used to assess progress. If cognitive processes are intended for assessment, content and rubrics must be designed to capture those processes.
Portfolio assessments can provide both formative and summative opportunities for monitoring progress toward reaching identified outcomes. By setting criteria for content and outcomes, portfolios can communicate concrete information about what is expected of students in terms of the content and quality of performance in specific curriculum areas, while also providing a way of assessing their progress along the way. Depending on content and criteria, portfolios can provide teachers and researchers with information relevant to the cognitive processes that students use to achieve academic outcomes.
Uses of Portfolios
Much of the literature on portfolio assessment has focused on portfolios as a way to integrate assessment and instruction and to promote meaningful classroom learning. Many advocates of this function believe that a successful portfolio assessment program requires the ongoing involvement of students in the creation and assessment process. Portfolio design should provide students with the opportunities to become more reflective about their own work, while demonstrating their abilities to learn and achieve in academics.
For example, some feel it is important for teachers and students to work together to prioritize the criteria that will be used as a basis for assessing and evaluating student progress. During the instructional process, students and teachers work together to identify significant pieces of work and the processes required for the portfolio. As students develop their portfolio, they are able to receive feedback from peers and teachers about their work. Because of the greater amount of time required for portfolio projects, there is a greater opportunity for introspection and collaborative reflection. This allows students to reflect and report about their own thinking processes as they monitor their own comprehension and observe their emerging understanding of subjects and skills. The portfolio process is dynamic and is affected by the interaction between students and teachers.
Portfolio assessments can also serve summative assessment purposes in the classroom, serving as the basis for letter grades. Student conferences at key points during the year can also be part of the summative process. Such conferences involve the student and teacher (and perhaps the parent) in joint review of the completion of the portfolio components, in querying the cognitive processes related to artifact selection, and in dealing with other relevant issues, such as students' perceptions of individual progress in reaching academic outcomes.
The use of portfolios for large-scale assessment and accountability purposes poses vexing measurement challenges. Portfolios typically require complex production and writing, tasks that can be costly to score and for which reliability problems have occurred. Generalizability and comparability can also be issues in portfolio assessment, as portfolio tasks are unique and can vary in topic and difficulty from one classroom to the next. For example, Maryl Gearhart and Joan Herman have raised the question of comparability of scores because of differences in the help students may receive from their teachers, parents, and peers within and across classrooms. To the extent student choice is involved, contents may even be different from one student to the next. Conditions of, and opportunities for, performance thus vary from one student to another.
These measurement issues take portfolio assessment outside of the domain of conventional psychometrics. The qualities of the most useful portfolios for instructional purposes–deeply embedded in instruction, involving student choice, and unique to each classroom and student–seem to contradict the requirements of sound psychometrics. However, this does not mean that psychometric methodology should be ignored, but rather that new ways should be created to further develop measurement theory to address reliability, validity, and generalizability.
See also: Assessment, subentries on Classroom Assessment, Dynamic Assessment.
Camp, Roberta. 1993. "The Place of Portfolios in Our Changing Views." In Construction versus Choice in Cognitive Measurement: Issues in Constructed Response, Performance Testing, and Portfolio Assessment, ed. Randy E. Bennett and William C. Ward. Hillsdale, NJ: Erlbaum.
Chen, Yih-Fen, and Martin, Michael A. 2000. "Using Performance Assessment and Portfolio Assessment Together in the Elementary Classroom." Reading Improvement 37 (1):32–37.
Cole, Donna H.; Ryan, Charles W.; and Kick, Fran. 1995. Portfolios Across the Curriculum and Beyond. Thousand Oaks, CA: Corwin.
Gearhart, Maryl, and Herman, Joan L. 1995. "Portfolio Assessment: Whose Work Is It? Issues in the Use of Classroom Assignments for Accountability." Evaluation Comment. Los Angeles: University of California, Center for the Study of Evaluation.
Graves, Donald H. 1992. "Portfolios: Keep a Good Idea Growing." In Portfolio Portraits, ed. Donald H. Graves and Bonnie S. Sunstein. Portsmouth, NH: Heinemann Educational Books.
Herman, Joan L.; Gearhart, Maryl; and Aschbacher, Pamela. 1996. "Portfolios for Classroom Assessment: Design and Implementation Issues." In Writing Portfolios in the Classroom, ed. Robert Calfee and Pamela Perfumo. Mahwah, NJ: Erlbaum.
Hewitt, Geof. 2001. "The Writing Portfolio: Assessment Starts with A." Clearing House 74 (4):187.
Lockledge, Ann. 1997. "Portfolio Assessment in Middle-School and High-School Social Studies Classrooms." Social Studies 88 (2):65–70.
Meadows, Robert B., and Dyal, Allen B. 1999. "Implementing Portfolio Assessment in the Development of School Administrators: Improving Preparation for Educational Leadership." Education 120 (2):304.
Murphy, Sandra M. 1997. "Who Should Taste the Soup and When? Designing Portfolio Assessment Programs to Enhance Learning." Clearing House 71 (2):81–85.
Stecher, Brian, and Herman, Joan L. 1997. "Using Portfolios for Large Scale Assessment." In Handbook of Classroom Assessment, ed. Gary Phye. San Diego, CA: Academic Press.
Wenzlaff, Terri L. 1998. "Dispositions and Portfolio Development: Is There a Connection?" Education 118 (4):564–573.
Wolf, Dennie P. 1989. "Portfolio Assessment: Sampling Student Work." Educational Leadership 46:35–39.
Joan L. Herman
Stephen A. Zuniga
Assessment is a process of gathering and documenting information about the achievement, skills, abilities, and personality variables of an individual.
Assessment is used in both educational and psychological settings by teachers, psychologists, and counselors to accomplish a range of objectives. These include the following:
- to learn more about the competencies and deficiencies of the individual being tested
- to identify specific problem areas and/or needs
- to evaluate the individual's performance in relation to others
- to evaluate the individual's performance in relation to a set of standards or goals
- to provide teachers with feedback on effectiveness of instruction
- to evaluate the impact of psychological or neurological abnormalities on learning and behavior
- to predict an individual's aptitudes or future capabilities
In the early 2000s, standardized tests were increasingly used to evaluate performance in U.S. schools. Faced with declining test scores by American students when compared to others around the world, state governments and the federal government have sought ways to measure the performance of schools and bring a measurable accountability to the educational process. Thus, states and the federal government have adopted standardized tests for evaluating knowledge and skills on the assumption that testing is an effective way to measure outcomes of education. One prominent program has been the No Child Left Behind Act, which requires schools to meet certain performance standards annually, for their students as a group and also for individual ethnic and racial subgroups. The use of this type of standardized test is controversial. Many educators feel that it limits the creativity and effectiveness of the classroom teacher and produces an environment of "teaching to the test."
The choice of an assessment tool depends on the purpose or goal of the assessment. Assessments might be made to establish rankings among individual students, to determine the amount of information students have retained, to provide feedback to students on their levels of achievement, to motivate students by recognizing and rewarding good performances, to assess the need for remedial education, and to evaluate students for class placement or ability grouping. The goal of the assessment should be understood by all stakeholders in the process: students, parents, teachers, counselors, and outside experts. An assessment tool that is appropriate for one goal is often inappropriate for another, leading to misuse of data.
Assessment tools fall broadly into two groups. Traditional assessments rely on specific, structured procedures and instructions given to all test-takers by the test administrator (or to be read by the test-takers themselves). These tests are either norm-referenced or criterion-referenced tests. Standardized tests allow researchers to compare data from large numbers of students or subgroups of students. Alternative assessments are often handled on an individual basis and offer students the opportunity to be more closely involved with the recognition of their progress and to discover what steps they can take to improve.
NORM-REFERENCED ASSESSMENTS In norm-referenced assessments, one person's performance is interpreted in relation to the performance of others. A norm-referenced test is designed to discriminate among individuals in the area being measured and to give each individual a rank or relative measure regarding how he or she performs compared to others of the same age, grade, or other subgroup. Often the mean, or average score, is the reference point, and individuals are scored on how much above or below the average they fall. These tests are usually timed. Norm-referenced tests are often used to tell how a school or school district is doing in comparison to others in the state or nation.
CRITERION-REFERENCED ASSESSMENTS A criterion-referenced assessment allows interpretation of a test-taker's score in relation to a specific standard or criterion. Criterion-referenced tests are designed to help evaluate whether a child has met a specific level of performance. The individual's score is based not on how he or she does in comparison to how others perform, but on how the individual does in relation to absolute expectations about what he or she is supposed to know. An example of a criterion-referenced test is a timed arithmetic test that is scored for the number of problems answered correctly. Criterion-referenced tests measure what information an individual has retained, and they give teachers feedback on the effectiveness of their teaching of particular concepts.
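The difference between the two interpretations can be made concrete with a small sketch. The function names, the sample scores, and the cut score of 16 below are hypothetical and chosen only for illustration; real testing programs use far larger norm groups and more elaborate scaling. The same raw score is read once against a comparison group (norm-referenced) and once against a fixed standard (criterion-referenced).

```python
from statistics import mean, pstdev


def norm_referenced(score, norm_group):
    """Interpret a score relative to the scores of a comparison group.

    Returns the distance from the group mean in standard-deviation
    units (a z-score) and the percentage of the group scoring lower.
    """
    mu = mean(norm_group)
    sigma = pstdev(norm_group)
    z = (score - mu) / sigma if sigma else 0.0
    below = sum(1 for s in norm_group if s < score)
    percent_below = 100.0 * below / len(norm_group)
    return z, percent_below


def criterion_referenced(score, cut_score):
    """Interpret a score against a fixed standard, ignoring other test-takers."""
    return score >= cut_score


# Hypothetical data: problems answered correctly on a timed arithmetic test.
class_scores = [12, 15, 15, 18, 20]

z, percent_below = norm_referenced(18, class_scores)
met_standard = criterion_referenced(18, cut_score=16)

print(f"z-score: {z:.2f}, percent of group scoring lower: {percent_below:.0f}")
print(f"met the standard: {met_standard}")
```

Note that the norm-referenced result changes whenever the comparison group changes, while the criterion-referenced result depends only on the student's own score and the chosen standard.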
PERFORMANCE ASSESSMENT Performance assessment can be used to evaluate any learning that is skill-based or behavioral. Performance assessment requires the test-taker to carry out a complex task, such as producing a particular product or demonstrating a specific skill. Performance assessments can be either individual or group-oriented and may involve application of real-life or workplace skills (for example, making a piece of furniture in wood shop).
AUTHENTIC ASSESSMENT Authentic assessment derives its name from the idea that it tests students in skills and knowledge needed to succeed in the real world. Authentic assessment focuses on student task performance and is often used to improve learning in practical areas. An advantage of authentic assessment is that students may be able to see how they would perform in a practical, non-educational setting and thus may be motivated to work to improve.
PORTFOLIO ASSESSMENT Portfolio assessment uses a collection of examples of the student's actual work. It is designed to advance through each grade of school with the student, providing a way for teachers and others to evaluate progress. One of the hallmarks of portfolio assessment is that the student is responsible for selecting examples of his or her own work to be placed in the portfolio. The portfolio may be used by an individual classroom teacher as a repository for work in progress or for accomplishments. Portfolios allow the teacher to evaluate each student in relation to his or her own abilities and learning style. The student controls the assessment samples, helping to reinforce the idea that he or she is responsible for learning and should have a role in choosing the data upon which he or she is judged. Portfolios are often shared by the student and teacher with parents during parent-teacher conferences.
INTERVIEW ASSESSMENT The assessment interview involves a one-on-one or small group discussion between the teacher and student, who may be joined by parents or other teachers. Standardized tests reveal little about the test-taker's thought process during testing. An interview allows the teacher or other administrator to gain an understanding of how the test-taker reached his or her answer. Individual interviews require a much greater time commitment on the part of the teacher than the administration of a standardized test to the entire class at one time. Thus, interviews are most effective when used to evaluate the achievements and needs of specific students. To be successful, interviews require both the teacher and the student to be motivated, open to discussion, and focused on the purpose of the assessment.
JOURNALS Journals have been used as part of the English curriculum since at least the 1980s. In assessment, the journal allows the student to share his or her thoughts on the learning process. A journal may substitute for or supplement a portfolio in providing a student-directed assessment of achievement and goals.
ATTITUDE INVENTORY Attitude is one component of academic success that is rarely measured objectively. An attitude inventory is designed to reveal both positive and negative (or productive and unproductive) aspects of a student's outlook toward school and learning. However, this type of assessment may be of limited use if the student's negative attitude makes him or her unwilling to actively participate in the assessment. By demonstrating a sincere interest in addressing student concerns that affect attitude, a school can improve the effectiveness of attitude inventory assessments.
COMPUTER-AIDED ASSESSMENT Computer-aided assessment is increasingly employed as a supplement to other forms of assessment. A key advantage in the use of computers is the capability of an interactive assessment to provide immediate feedback on responses. Students must be comfortable with computers and reading on a computer screen for these assessments to be successful.
Psychological assessment of children is used for a variety of purposes, including diagnosing learning disabilities and behavioral and attention problems. Psychologists can obtain information about a child in three general ways: observation, verbal questioning or written questionnaires, and assignment of tasks. The child's pediatrician, parents, or teacher may ask for psychological assessment to gain a greater understanding of the child's development and needs. There are many different psychological tests, and the psychologist must choose the ones that will provide the most relevant and reliable information in each situation. Often multiple tests are performed. However, most psychological assessments fall into one of three categories: observational methods, personality inventories, or projective techniques.
OBSERVATIONAL ASSESSMENT Observations are made by a trained professional either in a familiar setting (such as a classroom or playroom), an experimental setting, or during an office interview. Toys, dolls, or other items are often included in the setting to provide stimuli. The child may be influenced by the presence of an observer. However, researchers report that younger children often become engrossed in their activities and thus are relatively unaffected by the presence of an observer. Sometimes, for example, if attention deficit is suspected, several people are asked to observe the child under different circumstances: the teacher at school, the parent at home, and the psychologist in an office setting. Observational assessments are usually combined with other types of educational or psychological assessments when learning needs and behavioral problems are being evaluated.
PERSONALITY INVENTORIES A personality inventory is a questionnaire used with older children and adults that contains questions related to the subject's feelings or reactions to certain scenarios. One of the best-known personality inventories for people over age 16 is the Minnesota Multiphasic Personality Inventory (MMPI), a series of over 500 questions used to assess personality traits and psychological disturbances. Interviews or verbal questionnaires for personality assessment may be structured with a specific series of questions or be unstructured, allowing the subject to direct the discussion. Interviewers often use rating scales to record information during interviews.
PROJECTIVE TESTS A projective test asks the test-taker to interpret ambiguous situations. It requires a skilled, trained examiner to administer and interpret a projective test. The reliability of these tests with children is difficult to establish due to their subjective nature, with results varying widely among different examiners. One well-known projective test is the Rorschach Psycho-diagnostic Test, or inkblot test, first devised by the Swiss psychologist Hermann Rorschach in the 1920s. Another widely used projective test for people ages 14 to 40 is the Thematic Apperception Test (TAT), developed at Harvard University in the 1930s. In this test, the subject is shown a series of pictures, each of which can be interpreted in a variety of ways, and asked to construct a story based on each one. An adaptation administered to children aged three to ten is the Children's Apperception Test (CAT). Apperception tests are administered to children individually by a trained psychologist to assess personality, maturity, and psychological health.
ASSIGNMENT OF TASK ASSESSMENT Assignment of tasks is an assessment method involving the performance of a specific task or function. These tests are designed to inform the test administrator about attributes such as the test-taker's abilities, perceptions, and motor coordination. They can be especially helpful in assessing if there is a physical or neurological component that needs to be addressed medically or with occupational, speech, or physical therapy.
Assessment of children is challenging given the rapid changes in growth they experience during childhood. In childhood, it is difficult to ensure that the test-taker's responses will be stable for even a short time. Thus, psychologists, educators, and other test administrators are careful to take the stage of childhood into account when interpreting a child's test scores.
Traditional standardized tests rely on specific, structured procedures, which present some problems with young children. Young children (preschool and early elementary years) do not have past experience and familiarity with tests and have limited understanding of the expectations of testing procedures. With young test-takers, the test administrator represents a significant factor that influences success. The child must feel comfortable with the test administrator and feel motivated to complete the test exercise. The administrator helps support the test-taker's attention to the test requirements. The testing environment affects all test-takers but may represent a more significant variable for the youngest test-takers.
One shortcoming of standardized testing is that it assumes that the same instrument can evaluate all students. Because most standardized tests are norm-referenced and measure a student's test performance against the performance of other test-takers, students and educators focus their efforts on the test scores, and schools develop curricula to prepare students to take the test. Other criticisms of standardized tests are that they are culturally insensitive and that they may not accurately represent the abilities of children in the United States for whom English is not their first language or who are not a part of mainstream American culture. Finally, in middle and high school settings, disgruntled students may inconspicuously sabotage their tests since these scores do not affect the students' own grades but reflect rather upon the competency of the teacher and the school administration.
Alternative assessments are subject to other concerns. Observer biases and inconsistencies have been identified through study of the assessment procedures. In the halo effect, the observer evaluates the child's behavior in a way that confirms his or her previous general impression of the child. For example, suppose the observer believes a particular child is happy and loving. If, when the observer assesses that child, the child lays a doll face down on the table, the observer interprets this act as parenting behavior. On the other hand, if the observer believes the child is angry and hostile, when this child is observed laying the doll face down on the table, the observer may interpret the action as aggression. The expectations of the observer, conveyed directly or through body language and other subtle cues, may also influence how the child performs and how the observer records and interprets his or her observations. This observer bias can influence the outcome of an assessment.
Parents are justifiably concerned that their child be evaluated fairly and appropriately. They have the right to understand the purpose of the assessment, how it will be performed, how the information will be used, who will see the assessment results, and how the privacy of their child will be protected. Any professional performing an educational or psychological assessment should be willing to discuss these concerns and to share the results of the assessment and their implications with the parent. Parents should be willing to share with examiners any information that might alter interpretation of the assessment results (for example, medical problems or cultural concerns).
When to ask for an assessment
Parents should request an assessment from the teacher whenever necessary to understand their child's progress, both in relation to grade-level expectations and in relation to other children in the class. Most schools and teachers offer parents many opportunities to discuss the assessment of their child. When teacher assessment indicates that a child has special needs or problems, the parent should request an evaluation by the school's child study team or an outside expert. Parents may also want to discuss appropriate assessments with their child's pediatrician and ask for a referral to a child psychologist or psychiatrist.
Authentic task assessment —Evaluation of a task performed by a student that is similar to tasks performed in the outside world.
Criterion-referenced test —An assessment that measures the achievement of specific information or skills against a standard as opposed to being measured against how others perform.
Halo effect —An observer bias in which the observer interprets a child's actions in a way that confirms the observer's preconceived ideas about the child.
Norm-referenced test —A test that measures the performance of a student against the performance of a group of other individuals.
Portfolio —A student-controlled collection of student work products that indicates progress over time.
Standardized test —A test that follows a regimented structure and allows each individual's scores to be compared with those of groups of people. In the case of the Cognistat, a test-taker's scores can be compared to groups of young adults, middle-aged adults, geriatric adults, and people who have undergone neurosurgery.
Task —A goal-directed activity used in assessment.
See also California Achievement Tests (CAT); Children's Apperception Test (CAT); Development tests.
Carter, Phillip, and Ken Russell. Psychometric Testing: 1000 Ways to Assess Your Personality, Creativity, Intelligence, and Lateral Thinking. New York: Wiley & Sons, 2001.
Joint Committee on Standards for Educational Evaluation. The Student Evaluation Standards. Thousand Oaks, CA: Corwin Press, 2003.
Evaluation Center. 4405 Ellsworth Hall, Western Michigan University, Kalamazoo, MI 49008–5237. Web site: <www.wmich.edu/evalctr/jc>.
National Association for the Education of Young Children. 1509 16th Street, NW Washington, DC 20036. Web site: <www.naeyc.org>.
Tish Davidson, A.M.
Comprehensive Geriatric Assessment (CGA) is the term most commonly used to refer to the specialized process by which the health of some elderly people is assessed. CGA has four characteristics:
- It is multi-factorial, encompassing items traditionally regarded both as "medical" and "social."
- Its emphasis is on the functional ability of the person being assessed.
- It includes an inventory of both assets and deficits.
- It is action oriented, that is, it provides the basis for the subsequent management plan for the patient who is being assessed.
To consider this process in more detail, we can examine each of the items identified in the opening sentence: some people are assessed by a specialized process. CGA is not meant for all elderly people, only some. Two features identify those individuals who might benefit from CGA: the person should have compromised function, and the person should have more than one thing wrong. Compromised function is key: people who are engaged in all activities in which they would like to be engaged, at a level that is fully satisfying for them, normally do not require CGA, even if they might have one or more medical illnesses, such as high blood pressure or osteoarthritis. But when elderly people find that they can no longer perform certain activities necessary for them to remain independent, including things like looking after their household or getting dressed, then they become potential candidates for CGA. The other criterion for a person to become a candidate for CGA is to have more than one active medical problem that in some way gives rise to, or appears to give rise to, the problem with function. To say that an elderly person has compromised function and multiple medical problems is another way of saying that that person is frail.
People with multiple problems require assessment of those problems. This assessment is in contrast to the usual medical approach, which begins with a diagnosis of the medical problem. Diagnosis is the process whereby clues from talking to (called taking the history) and examining the patient yield a pattern that is recognizable as having a single cause. Although more than one problem can be active at once, the traditional emphasis in medical diagnosis is on distilling many symptoms (what the patient tells the physician) and signs (what the physician finds on the examination) into a single cause, called the diagnosis.
The first practitioners of geriatric medicine recognized that this approach, while essential in sorting out the medical problems of frail elderly people, was inadequate in meeting their health needs. For example, many frail elderly people who are medically ill also are deconditioned (that is, they are weaker, especially in the shoulders and hips, more prone to fall, and more prone to abnormalities of fluid balance), but deconditioning is not a traditional medical diagnosis. Knowing how intensively to rehabilitate someone who is deconditioned in a hospital requires some understanding of their home circumstances: Will they have to climb stairs at home? Is there someone readily available to help? Is that person able and willing to help? Such practical considerations fall outside the traditional domains of medical diagnoses, and their systematic inventory is what underlies the "assessment" process. Many authors believe that the term "assessment" has too narrow a focus, and that a proper assessment should not only give rise to a plan for addressing the problems thus identified, but should also include the management of the problems themselves, at least until they are stabilized. As a consequence, the term geriatric evaluation and management is sometimes preferred to describe what traditionally has been known as CGA.
Methods of CGA
The specialized nature of CGA lies in the systematic approach to a patient's problems. Although variation exists among practices, most methods of CGA include, in addition to an evaluation of the patient's medical diagnosis, an assessment of the following domains:
- Cognitive function. Problems that give rise to impairment of thinking, language, memory, and other aspects of cognition include syndromes such as dementia, delirium, and depression. Typically, cognition is screened using a brief instrument such as the Mini-Mental State Examination (MMSE). The MMSE tests several aspects of cognition, including memory, attention, concentration, orientation, language, and visual-spatial function. If this screening test detects an abnormality, then a more detailed evaluation is required.
- Emotion. The domain of emotion includes a screening of mood, to look for signs of depression, as well as an evaluation of common problems such as anxiety, or disorders of the mental state such as delusions or hallucinations. In addition, health attitudes are assessed, including the level of motivation, which is particularly important for patients who are being screened for participation in a rehabilitation program.
- Communication. Communication assessment typically includes a screening of vision, hearing, speech, and language.
- Mobility. The assessment of mobility (that is, the ability to move about in bed, transfer in and out of bed, and walk) is particularly important, as it is necessary for independence. In addition, because so many older people have atypical presentations of their illness, careful evaluation of their mobility as it first declines and then gets better allows clinicians to readily determine whether their patients are improving or getting worse. Given that many frail elderly people do not demonstrate the usual signs of sickness as they become ill (for example, they may not show an elevated temperature or white cell count when they have an infection), having a ready means to track illness progression and recovery is of great practical benefit, and careful assessment of mobility and balance allows this to be done.
- Balance. The assessment of balance is distinct from the assessment of mobility. Again, its importance lies both in its intrinsic value in relation to independence and in its value as an indicator of improving or worsening health in the setting of acute illness.
- Bowel function. Bowel function is typically assessed by inquiring about the patient's bowel habit and by physical assessment, which should include a rectal examination.
- Bladder function. It is important to understand whether an older person is having difficulty with urination. In men, this often reflects disease of the prostate. In either sex, the presence of urinary incontinence is of particular importance. As with problems in mobility and balance, the significance lies not just in the incontinence per se, but in incontinence as a sign of illness, within the genitourinary system and elsewhere.
- Nutrition. Interestingly, nutrition is often neglected in the traditional medical examination. It is important to assess the patient's weight and to note the presence of weight loss, and the time over which this weight loss has occurred. Routine laboratory investigations also offer some insight into an elderly person's nutritional status.
- Daily activities. In some ways this is at the heart of the assessment. It is extremely important to know whether older people are capable of fully caring for themselves in their particular setting. These activities traditionally are divided into "instrumental" activities of daily living, such as using a telephone, shopping, managing finances, and administering medications, and "personal" activities of daily living, such as bathing, dressing, or eating. Understanding where problems exist and how they presently are dealt with is essential to knowing how an illness affects an older person.
- Social situation. The part of the assessment concerning social situation, which includes inquiring about the usual living circumstances and whether there is a caregiver, is the most distinct from the traditional medical examination. While it is clear that the patient enjoys primacy in the physician-patient relationship, it is also the case that the needs of the caregiver cannot be ignored. Indeed, where an older person is dependent in essential activities of daily living, the caregiver becomes the most important asset to the maintenance of independence. It is therefore essential to understand how caregivers feel about their caring role, and whether, and under what circumstances, they can see themselves continuing in it.
The efficacy of CGA has been formally tested in a number of randomized, controlled trials, so that it now forms part of evidence-based medicine. These trials have shown that, compared with usual care, elderly people—especially those who are frail—achieve many important health outcomes when provided with CGA-based care. For example, they are more likely to be discharged from the hospital without delay, more likely to be functional when discharged and up to a year later, less likely to go to a nursing home, and less likely to die within two years of follow-up.
A thorough CGA, including the standard history and physical examination, typically takes between an hour and an hour and a half to complete, and it can take even longer. This is more than twice the length of many initial consultations with a clinician, and so a CGA requires special effort and commitment on everyone's part. Nevertheless, it represents a reasonable way to come to grips with the needs of frail older people in particular, and in consequence to set appropriate and achievable goals to maintain independence, or to otherwise intervene for the benefit of the patient.
See also Balance and Mobility; Day Hospitals; Frailty; Functional Ability; Geriatric Medicine; Multidisciplinary Team; Surgery in Elderly People.
Philip, I., ed. Assessing Elderly People in Hospital and Community Care. London: Farrand Press, 1994.
Rockwood, K.; Silvius, J.; and Fox, R. "Comprehensive Geriatric Assessment: Helping Your Elderly Patients Maintain Functional Well-being." Postgraduate Medicine 103 (1998): 247–264.
Rockwood, K.; Stadnyk, K.; Carver, D.; MacPherson, K.; Beanlands, H. E.; Powell, C.; Stolee, P.; Thomas, V. S.; and Tonks, R. S. "A Clinimetric Evaluation of Specialized Geriatric Care for Frail Elderly People." Journal of the American Geriatrics Society 48, no. 9 (2000): 1080–1085.
The process by which the financial worth of property is determined. The amount at which an item is valued. A demand by the board of directors of a corporation for the payment of any money that is still owed on the purchase of capital stock. The determination of the amount of damages to be awarded to a plaintiff who has been successful in a lawsuit. The ascertainment of the pro rata share of taxes to be paid by members of a group of taxpayers who have directly benefited from a particular common goal or project according to the benefit conferred upon the individual or his or her property. This is known as a special assessment. The listing and valuation of property for purposes of fixing a tax upon it for which its owner will be liable. The procedure by which the Internal Revenue Service, or other government department of taxation, declares that a taxpayer owes additional tax because, for example, the individual has understated personal gross income or has taken deductions to which he or she is not entitled. This process is also known as a deficiency assessment.
1. the first stage of the nursing process, in which data about the patient's health status is collected and from which a nursing care plan may be devised.
2. an examination set by an examining body to test a candidate's theoretical and practical nursing skills.