The purpose of achievement testing is to measure some aspect of the intellectual competence of human beings: what a person has learned to know or to do. Teachers use achievement tests to measure the attainments of their students. Employers use achievement tests to measure the competence of prospective employees. Professional associations use achievement tests to exclude unqualified applicants from the practice of the profession. In any circumstances where it is necessary or useful to distinguish persons of higher from those of lower competence or attainments, achievement testing is likely to occur.
The varieties of intellectual competence that may be developed by formal education, self-study, or other types of experience are numerous and diverse. There is a corresponding number and diversity of types of tests used to measure achievement. In this article attention will be directed mainly toward the measurement of cognitive achievements by means of paper and pencil tests. The justifications for this limitation are (1) that cognitive achievements are of central importance to effective human behavior, (2) that the use of paper and pencil tests to measure these achievements is a comparatively well-developed and effective technique, and (3) that other aspects of intellectual competence will be discussed in other articles, such as those on motivation, learning, attitudes, leadership, aesthetics, and personality.
Measurability of achievement. Despite the complexity, intangibility, and delayed fruition of many educational achievements and despite the relative imprecision of many of the techniques of educational measurement, there are logical grounds for believing that all important educational achievements can be measured. To be important, an educational achievement must lead to a difference in behavior. The person who has achieved more must in some circumstances behave differently from the person who has achieved less. If such a difference cannot be observed and verified no grounds exist for believing that the achievement is important.
Measurement, in its most fundamental form, requires nothing more than the verifiable observation of such a difference. If person A exhibits to any qualified observer more of a particular trait than person B, then that trait is measurable. By definition, then, any important achievement is potentially measurable.
Many important educational achievements can be measured quite satisfactorily by means of paper and pencil tests. But in some cases the achievement is so complex, variable, and conditional that the measurements obtained are only rough approximations. In other cases the difficulty lies in the attempt to measure something that has been alleged to exist but that has never been defined specifically. Thus, to say that all important achievements are potentially measurable is not to say that all those achievements have been clearly identified or that satisfactory techniques for measuring all of them have been developed.
Achievement, aptitude, and intelligence tests. Achievement tests are often distinguished from aptitude tests that purport to predict what a person is able to learn or from intelligence tests intended to measure his capacity for learning. But the distinction between aptitude and achievement is more apparent than real, more a difference in the use made of the measurements, than in what is being measured. In a very real sense, tests of aptitude and intelligence are also tests of achievement.
The tasks used to measure a child’s mental age may differ from those used to measure his knowledge of the facts of addition. The tasks used to assess a youth’s aptitude for the study of a foreign language may differ from those used to assess his knowledge of English literature. But all of these tasks test achievement; they measure what a person has learned to know or to do. All learning except the very earliest builds on prior learning. Thus, what is regarded as achievement in retrospect is regarded as aptitude when looking to the future.
There may well be differences in genetically determined biological equipment for learning among normal human beings. But no method has yet been discovered for measuring these differences directly. Only if one is willing to assume that several persons have had identical opportunities, incentives, and other favorable circumstances for learning (and that is quite an assumption) is it reasonable to use present differences in achievements as a basis for dependable estimates of corresponding differences in native ability to learn.
Types of tests. Although some achievement testing is done orally, with examinee and examiner face to face, most of it makes use of written tests. Of these written tests there are two main types: essay and objective. If the test consists of a relatively small number of questions or directions in response to which the examinee writes a sentence, a paragraph, or a longer essay of his own composition, the test is usually referred to as an essay test. Alternatively, if the test consists of a relatively large number of questions or incomplete statements in response to which the examinee chooses one of several suggested answers, the test is ordinarily referred to as an objective test.
Objective tests can be scored by clerks or scoring machines. Essay tests must be scored by judges who have special qualifications and who sometimes are specially trained for the particular scoring process. The scores obtained from objective tests tend to be more reliable than those obtained from essay tests. That is, independent scorings of the same answers, or of the same person’s answers to equivalent sets of questions, tend to agree more closely in the case of objective tests than in the case of essay tests.
There are four major steps in achievement testing: (1) the preparation or selection of the test, (2) the administration of the test to the examinees, (3) the scoring of the answers given, and (4) the interpretation of the resulting scores.
Test development. In the United States, and to a lesser extent in other countries, achievement tests have been developed and are offered for sale by commercial test publishers. Buros (1961) has provided a list of tests in print and has indicated where they may be obtained. Recent catalogs of tests are available from most of the publishers listed in that volume.
Most achievement tests, however, are prepared for limited, specific use by teachers, professors, test committees, or test specialists. These test constructors usually start with some fairly well-defined notions of the reasons for testing. Their purposes, and their acquaintance with the theory and principles of achievement testing, lead them to select certain abilities and areas of knowledge to test, certain types of test items, and certain procedures for test administration and scoring.
Mental abilities. Too little is known about the mind and how it works to permit clear identification of distinct and relatively independent mental abilities or mental processes. Thus, while it is easy to say that a good achievement test should sample in due proportions all the mental abilities related to that achievement, it is much more difficult to speak clearly about the nature and unique characteristics of these supposedly distinct mental abilities. Terms like recognition, recall, problem solving, critical thinking, and evaluative judgment have been used in referring to such abilities or mental processes. But although these terms obviously refer to somewhat different kinds of tasks, the evidence that they involve independent and distinct mental processes is practically nonexistent. Hence, the test constructor may be well advised to avoid claims or speculations about the mental abilities he is testing.
Among specialists in test construction there is general preference for test questions that require more than the recall of factual details: for test questions that require thought. This has led many test constructors to avoid simple true-false or completion types of test items in favor of more complex test situations that call for problem solving, critical interpretation, or evaluative judgment. Sometimes the test constructor supplies as background information much of the factual information the examinee is likely to need to answer the questions successfully. Then, presumably, a student’s score depends more on his ability to use knowledge than on his ability to recall it.
It is easier to specify areas of knowledge than mental abilities, but test constructors often face problems in determining just which areas a particular achievement test ought to sample, and with what relative weights. Sometimes textbooks, courses of study, and other sources are studied for guidance as to which areas of content deserve inclusion or emphasis. But in the last analysis, problems of this sort must be resolved on the basis of the judgments of the test constructor or test construction committee. Since different test constructors may not agree on these judgments, two achievement tests bearing the same titles may cover somewhat different areas of content.
Types of questions. A wide variety of different types of questions may be used in essay tests of achievement. Here are some examples.
List the similarities and differences of the eye and the camera.
What is the relation between the boiling point of water and the atmospheric pressure on the surface of the water?
Explain the electrolysis of water.
What is meant by the term “cultural lag”?
What internal weaknesses and external forces led to the fall of the Roman Empire?
Objective test items also differ widely. Two of the more common types are illustrated here.
True-False. Weather systems affecting the State of Illinois usually approach from the southeast rather than from the northwest. (False)
Multiple-choice. Why is chlorine sometimes added to water in city water supply systems?
1. To clarify the water (dissolve sediment).
* 2. To kill bacteria.
3. To protect the pipes from corrosion.
4. To remove objectionable odors.
Other types of objective test items are illustrated and discussed in various books on achievement test construction. Some of these are listed at the end of this article.
The choice between essay or objective tests, or among different types of objective test items, is often made on the assumption that each of these types is particularly well adapted, or poorly adapted, to the measurement of a particular ability or mental process. But available evidence does more to question this assumption than to support it. It is true that certain questions or tasks used in achievement testing fall more naturally into one type of test or test item than into another. But on the whole, types of test items appear to be general rather than specific in function. Whatever educational achievement that can be measured well using one type of test item can probably also be measured quite well using some other types. How well the achievement is measured seems to depend less on the type of item chosen than on the skill with which it is used.
Test administration. Most tests of educational achievement are given to groups rather than individuals. In either case, effective administration requires (1) that examinees be motivated to do as well as they can, (2) that they understand clearly what the test requires them to do, (3) that the environment in which they work allows and encourages their best efforts, and (4) that each examinee has an equal chance to demonstrate his achievement.
Examinees are usually motivated to do their best on an achievement test because of the present rewards and future opportunities that depend on the quality of their performance. It is possible for an examinee to be so highly motivated that his anxiety actually interferes with his best performance on a test. Some examinees report that they never do themselves justice on a written examination because of the emotional upset they suffer or because of some deficiency in test-taking skills. But the evidence suggests that these problems afflict persons of low achievement far more often than those of high achievement.
Cheating. Cheating is a perennial problem in achievement testing. The problem could be alleviated, as some have suggested, by reducing emphasis on, and rewards for, achievement; but there are obvious disadvantages in this solution. A better solution, in general, is to provide sufficient supervision during the test administration to discourage attempts to cheat and to deal with those who do cheat firmly enough to make cheating quite unattractive. In some cases cheating on school and college achievement tests has been discouraged effectively by cultivation of honor systems, in which the students themselves take responsibility for honest examination behavior and for reporting any instances of cheating.
Time limits. The current trend in achievement testing is to avoid “speed tests.” Examinees differ widely in their rates of work, so that the slowest may require twice as long as the fastest to complete a test to his satisfaction. There is usually a positive correlation between rate of work and correctness of response; that is, examinees who know the most answers tend to give them most quickly. But the correlation is not high enough to allow rate-of-work scores to add appreciable valid information to correct-answer scores as measures of achievement. Even among the most capable examinees there may be wide differences in rate of work. Hence, the most accurate predictions of subsequent achievement can usually be made when tests have time limits generous enough to allow most examinees to finish.
Scoring. Each type of test presents unique problems of scoring.
Essay test scoring. Essay test scoring calls for higher degrees of competence, and ordinarily takes considerably more time, than the scoring of objective tests. In addition to this, essay test scoring presents two special problems. The first is that of providing a basis for judgment that is sufficiently definite, and of sufficiently general validity, to give the scores assigned by a particular reader some objective meaning. To be useful, his scores should not represent purely subjective opinions and personal biases that equally competent readers might or might not share. The second problem is that of discounting irrelevant factors, such as quality of handwriting, verbal fluency, or gamesmanship in appealing to the scorer’s interests and biases. The reader’s scores should reflect unbiased estimates of the essential achievements of the examinee.
One means of improving objectivity and relevancy in scoring essay tests is to prepare an ideal answer to each essay question and to base the scoring on relations between examinee answers and the ideal answer. Another is to defer assignment of scores until the examinee answers have been sorted and resorted into three to nine sets at different levels of quality. Scoring the test question by question through the entire set of papers, rather than paper by paper (marking all questions on one paper before considering the next) improves the accuracy of scoring. If several scorers will be marking the same questions in a set of papers, it is usually helpful to plan a training and practice session in which the scorers mark the same papers, compare their marks and strive to reach a common basis for marking.
Objective test scoring. Answers to true–false, multiple-choice, and other objective-item types can be marked directly on the test copy. But scoring is facilitated if the answers are indicated by marking positions on a separate answer sheet. For example, the examinee may be directed to indicate his choice of the first, second, third, fourth, or fifth alternative to a multiple-choice test item by blackening the first, second, third, fourth, or fifth position following the item number on his answer sheet.
Answers so marked can be scored by clerks with the aid of a stencil key on which the correct answer positions have been punched. To get the number of correct answers, the clerk simply counts the number of marks appearing through the holes on the stencil key. Or the answers can be scored, usually much more quickly and accurately, by electrical scoring machines. Some of these machines, which “count” correct answers by cumulating the current flowing through correctly placed pencil marks, require the examinee to use special graphite pencils; others, which use photoelectric cells to scan the answer sheet, require only marks black enough to contrast sharply with the lightly printed guide lines. High-speed photoelectric test scoring machines usually incorporate, or are connected to, electronic data processing and print-out equipment.
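The stencil or machine count described above amounts to tallying the positions where an examinee's marks coincide with the punched key. A minimal Python sketch (the item count and all answer values are illustrative):

```python
# Stencil-style scoring: count answers that match the key positions.
key     = [2, 4, 1, 3, 2]   # correct choice number for each of 5 items
answers = [2, 4, 1, 1, 2]   # one examinee's marked choices

# Each coincidence of mark and key position adds one to the raw score.
raw_score = sum(1 for a, k in zip(answers, key) if a == k)
print(raw_score)  # 4
```

The clerk with a punched stencil and the photoelectric machine both perform exactly this count; they differ only in speed and in how the marks are sensed.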
Correction for guessing. One question that often arises is whether or not objective test scores should be corrected for guessing. Differences of opinion on this question are much greater and more easily observable than differences in the accuracy of the scores produced by the two methods of scoring. If well-motivated examinees take a test that is appropriate to their abilities, little blind guessing is likely to occur. There may be many considered guesses, if every answer given with less than complete certainty is called a guess. But the examinee’s success in guessing right after thoughtful consideration is usually a good measure of his achievement.
Since the meaning of most achievement test scores is relative, not absolute—the scores serve only to indicate how the achievement of a particular examinee compares with that of other examinees—the argument that scores uncorrected for guessing will be too high carries little weight. Indeed, one method of correcting for guessing results in scores higher than the uncorrected scores.
The logical objective of most guessing correction procedures is to eliminate the expected advantage of the examinee who guesses blindly in preference to omitting an item. This can be done by subtracting a fraction of the number of wrong answers from the number of right answers, using the formula S = R – W/(k – 1) where S is the score corrected for guessing, R is the number of right answers, W is the number of wrong answers, and k is the number of choices available to the examinee in each item. An alternative formula is S = R + O/k where O is the number of items omitted, and the other symbols have the same meaning as before. Both formulas rank any set of examinee answer sheets in exactly the same relative positions, although the second formula yields a higher score for the same answers than does the first.
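The two formulas just given can be sketched directly in Python. The item counts below are illustrative; note that the second formula yields the higher score for the same answers, as the text states, while preserving the same ranking:

```python
def corrected_rights(R, W, k):
    """S = R - W/(k - 1): removes the expected gain from blind guessing."""
    return R - W / (k - 1)

def rights_plus_omits(R, O, k):
    """S = R + O/k: credits each omitted item with the chance score."""
    return R + O / k

# A 60-item, four-choice test: 40 right, 12 wrong, 8 omitted.
R, W, O, k = 40, 12, 8, 4
s1 = corrected_rights(R, W, k)    # 40 - 12/3 = 36.0
s2 = rights_plus_omits(R, O, k)   # 40 + 8/4  = 42.0
```

Since R + W + O equals the number of items n, the two scores are linearly related (s2 = ((k − 1)/k)·s1 + n/k), which is why any set of answer sheets is ranked identically by both.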
Logical arguments for and against correction for guessing on objective tests are complex and elaborate. But both these arguments and the experimental data point to one general conclusion. In most circumstances a correction for guessing is not likely to yield scores that are appreciably more or less accurate than the uncorrected scores.
Score interpretation. It is possible to prepare an achievement test on which the scores have absolute meaning. For example, scores on a test of ability to add pairs of single-digit numbers can be used to estimate how many of the 100 addition facts an examinee knows. Or, if a test of word meanings is built by systematically sampling the words in a particular dictionary and systematically mixing words and definitions to produce test items, the test can be used to estimate what portion of the words in that dictionary an examinee understands, in one sense of that term.
But most achievement tests are neither constructed so systematically nor based on such clearly defined universes of knowledge. Scores on most achievement tests, therefore, are interpreted in relative terms. Whether an examinee’s score on such a test is regarded as good or poor depends on whether most of his presumed peers scored lower or higher on the same test.
Several statistical techniques may be used to aid in score interpretation on a relative basis. One of these is the frequency distribution of a set of test scores. Each score in the set is tallied on a scale extending from the highest to the lowest scores. One can then tell by visual inspection whether a particular score is high, medium, or low relative to other scores in this distribution.
Percentile ranks. The information contained in a frequency distribution of scores can be quantified by calculating a corresponding percentile rank for each possible score in the total range of scores. The percentile rank of a particular score indicates what percentage of the scores in the given set (or in a hypothetical population of which the given set is a sample) are lower than the particular score. Percentile ranks can range from 0 to 100. They are easy to interpret, but they do not preserve all of the information on relative achievements available in the original set of scores, nor do they reflect these relative achievements with perfect fidelity.
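Using the definition above (the percentage of scores in the set that fall below a given score), a percentile rank can be computed as follows; the score set is illustrative:

```python
def percentile_rank(score, scores):
    """Per cent of scores in the given set that are lower than `score`."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)

scores = [55, 60, 60, 70, 75, 80, 85, 90, 90, 95]
percentile_rank(70, scores)  # 30.0: three of the ten scores are lower
```

Variant definitions (for example, counting half of the tied scores as below) are also in use; this sketch follows the definition given in the text.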
Standard score systems. Measures of the average score value and of score dispersion are often used as aids to score interpretation. The measure of average value most commonly used is the arithmetic mean, defined as the sum of all scores divided by the number of scores. The measure of dispersion most commonly used is the standard deviation, the square root of the arithmetic mean of the squared deviations of the scores from their own mean. In the set of scores 1, 2, 3, 4, and 5, the mean score is 3 and the standard deviation is 1.414.
The mean and standard deviation can be used to transform the scores in any set into standard scores having a predetermined mean and standard deviation. One type of standard score is the z score. If the mean of a set of scores is subtracted from a particular score, and if the resulting difference is divided by the standard deviation, a z score is obtained. When z scores are obtained for an entire set of scores, the new z distribution has a mean of 0, a standard deviation of 1, and most of the scores fall within the range −3 to +3. The z scores corresponding to the scores 1, 2, 3, 4, and 5 are −1.4, −0.7, 0, +0.7, and +1.4.
To avoid negative scores and decimals, z scores may be multiplied by 10 and added to 50. This set of operations provides another type of standard score whose mean is 50 and whose standard deviation is 10. Single-digit standard scores, ranging from 1 to 9, with a mean of 5 and a standard deviation of 2 are called stanines (standard nines). Various other types of standard scores are in use. In stanines and some other standard score systems, the distribution of raw scores is not only converted to a standard scale but is also transformed into a normal distribution.
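The transformations described above can be verified on the article's own example set 1, 2, 3, 4, 5; the second transform (multiply z by 10, add 50) is the 50/10 standard score mentioned in the text:

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def stdev(xs):
    # Population standard deviation, as defined in the text.
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

raw = [1, 2, 3, 4, 5]
m, sd = mean(raw), stdev(raw)        # 3, 1.414...
z = [(x - m) / sd for x in raw]      # approx. -1.4, -0.7, 0, +0.7, +1.4
t = [50 + 10 * zi for zi in z]       # mean 50, standard deviation 10
```

The same two-step recipe (standardize, then rescale) underlies stanines as well, with the added step of normalizing the distribution and rounding to the nine-point scale.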
The special value of standard scores of the types just discussed is that each of them has a clearly defined relative meaning. Standard scores of a particular type for different tests are comparable when based on scores from the same group of examinees. That is, a particular standard score value indicates the same degree of relative excellence or deficiency in the group of examinees, regardless of the test to which it applies.
Reliability. Proper interpretation of an achievement test score requires, in addition to knowledge of its absolute and relative meanings, some perception of its precision and of its relations to other significant measurements. Achievement test specialists use coefficients of reliability as measures of precision.
A reliability coefficient is a coefficient of correlation between two sets of test scores. Often this is obtained when a particular group of examinees provides scores on two equivalent tests. If equivalent tests are not available, or cannot be administered conveniently, reliability may be estimated by readministering the same test after an interval of time. Alternatively, and preferably in most circumstances, a test may be split into two or more parts that are more or less equivalent. The correlations obtained between scores on the parts may be used as a basis for calculating the reliability coefficient. Reliability coefficients obtained from equivalent forms of a test are sometimes referred to as coefficients of equivalence. Those obtained by splitting a single test are known as coefficients of internal consistency. Equivalence or internal consistency in tests is often referred to as “homogeneity.” Correlations obtained by readministering the same test are called coefficients of stability.
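A sketch of the split-half procedure mentioned above, in Python. The correlation between half-test scores understates the reliability of the full-length test, so the half-test correlation is stepped up; the step-up used here is the Spearman–Brown formula, which the article does not name but which is the usual basis for such a calculation. The score data are illustrative:

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def split_half_reliability(half_a, half_b):
    """Correlate the two half-test scores, then step up to full length."""
    r_half = pearson_r(half_a, half_b)
    return 2 * r_half / (1 + r_half)  # Spearman-Brown for doubled length

# Four examinees' scores on the odd items and on the even items:
odd_scores  = [1, 2, 3, 4]
even_scores = [1, 3, 2, 4]
split_half_reliability(odd_scores, even_scores)  # r_half = 0.8, stepped up to about 0.89
```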
In most situations a good achievement test will have a reliability coefficient of .90 or higher. The reliability coefficient of a test depends on a number of factors. Reliability tends to be high if (1) the range of achievements in the group tested is broad, (2) the area of achievement covered by the test is narrow, (3) the discriminating power of the individual items is high, and (4) the number of items included in the test is large. Only the last two of these factors are ordinarily subject to control by the test constructor.
Discriminating power. The discriminating power of a test item can be measured by the difference between scores on that item for examinees of high and low achievement. To obtain the clearest contrast between these two levels of achievement, examinees whose test scores place them among the top 27 per cent are placed in the high group and those whose scores fall in the bottom 27 per cent are placed in the low group. Extreme groups of upper and lower quarters, or upper and lower thirds, are almost equally satisfactory. The difference between the two groups’ total scores on the item, divided by the maximum possible value of that difference, yields an index of discrimination.
Good achievement test items have indexes of discrimination of .40 or higher. Items having indexes of .20 or lower are of questionable value. If the discrimination index is near zero, or even negative, as it sometimes may be, the test can be improved by omitting the item, even though this means shortening the test. Sometimes it is possible to revise items that are low in discrimination to remove errors or ambiguities or to make the level of difficulty more appropriate. Items that nearly everyone answers correctly, or nearly everyone misses, are certain to be low in discrimination. Discrimination indexes based on small groups of examinees are likely to be quite unreliable, but even unreliable data provide some basis for test improvement.
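For a one-point item, the index defined above reduces to the difference between the numbers answering correctly in the upper and lower groups, divided by the group size. A minimal sketch with illustrative counts:

```python
def discrimination_index(upper_correct, lower_correct, group_size):
    """Difference between the groups' item totals divided by the
    maximum possible difference (the group size, for a 1-point item)."""
    return (upper_correct - lower_correct) / group_size

# Of 27 examinees in each extreme group, 24 high scorers and
# 12 low scorers answered the item correctly:
discrimination_index(24, 12, 27)  # about 0.44: a well-discriminating item
```

By the standards quoted in the text, an index of about .44 marks a good item, while an item answered correctly by, say, 26 of each group would have an index near zero and contribute little.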
Standard error of measurement. Another measure of precision or accuracy in test scores is the standard error of measurement. The standard error of measurement depends on the standard deviation of the test scores and on their reliability. It may be calculated from the formula

σ_meas = σ_t √(1 − r_tt)

in which σ_meas indicates the standard error of measurement, σ_t indicates the standard deviation of the test scores, and r_tt represents the reliability coefficient of the test scores. About two-thirds of the scores in a given set differ from the ideal true score values by less than one standard error of measurement. The other one-third, of course, differ from the corresponding true scores by more than one standard error of measurement. A true score is defined as the mean of the scores that would be obtained in an infinite number of samples of tests equivalent to the given test.
The test having the smallest standard error of measurement is not necessarily the best test, since good tests yield large score standard deviations and this, in turn, tends to be associated with large standard errors of measurement. Hence, it is better to use the standard error of measurement as an indication of the degree of accuracy of a particular test score, rather than as a measure of the ability of the test to differentiate among various levels of achievement.
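The formula σ_meas = σ_t √(1 − r_tt) is a one-line computation; the standard deviation and reliability figures below are illustrative:

```python
import math

def standard_error_of_measurement(sd, r_tt):
    """sigma_meas = sigma_t * sqrt(1 - r_tt)."""
    return sd * math.sqrt(1 - r_tt)

# A test with a score standard deviation of 10 points and a
# reliability coefficient of .91:
standard_error_of_measurement(10, 0.91)  # about 3.0 score points
```

On such a test, about two-thirds of obtained scores would lie within 3 points of the corresponding true scores, which makes concrete the interpretive caution given above.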
Validity. The reliability coefficient of a test shows how precisely it measures whatever it does measure. In contrast, the validity coefficient is sometimes said to show how precisely it measures what it ought to measure, or what it purports to measure. But since good criterion scores, i.e., actual measures of that which the test ought to measure, are seldom available to the constructor of an achievement test, the practical value of this concept of predictive validity is limited.
Knowledge of what other measures the test scores are related to, that is, what they correlate with, adds to the test constructor’s knowledge of what the test is measuring. In this sense these correlations contribute to understanding of the concurrent validity of the test. But for most achievement tests, validity is primarily a matter of operational definition, or content or face validity, and only secondarily, if at all, a matter of empirical demonstration. Validity must be built into most achievement tests. The content to be covered by an achievement test and the tasks to be used to indicate achievement are best determined by a consensus of experts. Experience and experiments shed light on some of the issues that these experts may debate, but there is no good substitute for their expertness, their values, and their experience as bases for valid achievement test construction.
Importance and limitations. Achievement tests play important roles in education, in government, in business and industry, and in the professions. If they were constructed more carefully and more expertly, and used more consistently and more wisely, they could do even more to improve the effectiveness of these enterprises.
But achievement tests also have limitations beyond those attributable to hasty, inexpert construction or improper use. In the first place, they are limited to measuring a person’s command of the knowledge that can be expressed in verbal or symbolic terms. This is a very large area of knowledge, and command of it constitutes a very important human achievement; but it does not include all knowledge, and it does not represent the whole of human achievement. There is, for example, the unverbalized knowledge obtained by direct perceptions of objects, events, feelings, relationships, etc. There are also physical skills and behavioral skills, such as leadership and friendship, that are not highly dependent on command of verbal knowledge. A paper and pencil test of achievement can measure what a person knows about these achievements but not necessarily how effectively he uses them in practice.
In the second place, while command of knowledge may be a necessary condition for success in modern human activities, it is by no means a sufficient condition. Energy, persistence, and plain good fortune, among other things, combine to determine how successfully a person uses the knowledge he possesses. A person with high achievement scores is a better bet to succeed than one with low achievement scores, but high scores cannot guarantee success.
Robert L. Ebel
Adkins, Dorothy C. 1948 Construction and Analysis of Achievement Tests: The Development of Written and Performance Tests of Achievement for Predicting Job Performance of Public Personnel. Washington: Government Printing Office.
American Educational Research Association, Committee on Test Standards 1955 Technical Recommendations for Achievement Tests. New York: The Association.
American Psychological Association 1954 Technical Recommendations for Psychological Tests and Diagnostic Techniques. Washington: The Association.
Buros, Oscar K. (editor) 1961 Tests in Print: A Comprehensive Bibliography of Tests for Use in Education, Psychology, and Industry. Highland Park, N.J.: Gryphon.
Ebel, Robert L. 1965 Measuring Educational Achievement. Englewood Cliffs, N.J.: Prentice-Hall.
Furst, Edward J. 1958 Constructing Evaluation Instruments. New York: Longmans.
Gerberich, Joseph R. 1956 Specimen Objective Test Items: A Guide to Achievement Test Construction. New York: Longmans.
Lindquist, Everet F. (editor) 1951 Educational Measurement. Washington: American Council on Education.
Travers, Robert M. W. 1950 How to Make Achievement Tests. New York: Odyssey.
Wood, Dorothy A. 1960 Test Construction: Development and Interpretation of Achievement Tests. Columbus, Ohio: Merrill.