Aptitude tests constitute one of the most widely used types of psychological tests. The term “aptitude” is often used interchangeably with the term “ability.”
The concept of ability. An ability refers to a general trait of an individual that may facilitate the learning of a variety of specific skills. For example, the level of performance that a man attains in operating a turret lathe may depend on the level of his abilities of manual dexterity and motor coordination, but these abilities may be important to proficiency in other tasks as well. Thus, manual dexterity also is needed in assembling electrical components, and motor coordination is needed to fly an airplane. In our culture, verbal abilities are important in a very wide variety of tasks. The individual who has a great many highly developed abilities can become proficient at a great number of different tasks. The concept of “intelligence” really refers to a combination of certain abilities that contribute to achievement in a wide range of specific activities. The trend in aptitude testing is to provide measures of separate abilities. The identification of these separate abilities has been one of the main areas of psychological research, and it is this research that provides the basis of many aptitude tests.
Psychological tests are essentially standardized measures of a sample of an individual’s behavior. Any one test samples only a limited aspect of behavior. By analogy, the chemist, by testing only a few cubic centimeters of a liquid, can infer the characteristics of the compound; the quality control engineer does not test every finished product but only a sample of them. Similarly, the psychologist may diagnose an individual’s “vocabulary” from a measure based on a small number of words to which he responds, or he may infer the level of a person’s “multilimb coordination” by having him make certain movements. The most important feature of this sample of behavior is that it is taken under certain controlled conditions. Performance on just any sample of words, for example, is not diagnostic of “vocabulary.” For a behavior sample to qualify as a psychological test, its adequacy must be demonstrated quantitatively. (Some typical indexes for doing this will be described below.)
How abilities are identified. Some individuals who perform well on verbal tasks (for example, those tasks requiring a large vocabulary) may do poorly on tasks requiring spatial orientation (for example, flying an airplane). Or an individual who performs well on verbal items may do poorly on numerical items. Consequently, it is obvious that there are a number of different abilities that distinguish people. But how is this great variety of abilities identified? How does the psychologist know which abilities are usefully considered separate from one another? The basic research technique that has been used is called factor analysis. A large number of tests, selected with certain hypotheses in mind, are administered to a large number of experimental subjects. Correlation coefficients among all these test performances are then computed. From these correlations, inferences are made about the common abilities needed to perform the tests. The assumption is that tests that correlate with each other measure the same ability factor, and tests that are uncorrelated measure different factors. The problem of extracting and naming these factors is somewhat complex. Examples of separate abilities that have been identified are verbal comprehension, spatial orientation, perceptual speed, and manual dexterity. Of course, this basic research also allows assessment of the kinds of tests that provide the best measures of the different ability factors.
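The first computational step of this procedure, correlating every test with every other test, can be sketched as follows. This is only an illustration of the correlation step, not a factor analysis: the four test names and the scores of the six examinees are invented, and the grouping of tests is read off the correlations by eye rather than extracted by a factoring procedure.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores of six examinees on four tests (invented data).
scores = {
    "vocabulary":     [12, 18, 25, 30, 34, 40],
    "reading":        [10, 20, 22, 31, 33, 41],
    "pegboard":       [30, 22, 35, 25, 33, 27],
    "screw_assembly": [28, 24, 36, 23, 34, 26],
}

names = list(scores)
# Correlation of every test with every other test.
corr = {(a, b): pearson(scores[a], scores[b]) for a in names for b in names}

# Print the correlation matrix, one row per test.
for a in names:
    print(a.ljust(15), " ".join(f"{corr[(a, b)]:6.2f}" for b in names))
```

In these invented data the two verbal tests correlate highly with each other, as do the two manipulative tests, while correlations across the two pairs are near zero. That is the pattern from which two separate ability factors would be inferred.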
Aptitudes and abilities. Ability tests are usually given with the objective of making some prediction about a person’s future success in some occupational activity or group of activities. The term aptitude, used in place of the term ability, has more of a predictive connotation. We could, of course, use such tests solely to attain a picture of a person’s strong and weak ability traits, with no specific predictive objective. We could use such measures as variables in psychological research, for example, studies of psychological development or the relation of ability to learning. Or we may be interested in the discovery of the relation between the ability of spatial relations and the speed of learning a perceptual-motor skill. But most often these tests are used in personnel selection, vocational guidance, or for some other applied predictive purpose such as using a spatial relations test to select turret lathe operators.
Sometimes aptitude tests designed to predict success in some specific job or occupation, as would be true of a test of “clerical aptitude,” actually measure combinations of different abilities (e.g., perceptual speed, numerical facility) found to be important in clerical jobs.
Achievement tests. Aptitude tests are distinguished from achievement (or proficiency) tests, which are designed to measure degree of mastery of an area of knowledge, of a specific skill, or of a job. Thus, a final examination in a course is an achievement test used to assess student status in the course. If used to predict future performance in graduate work or in some other area, it would be called an aptitude test. The distinction between aptitude and achievement tests is often in terms of their use.
Ways of describing aptitude tests. Tests may be classified in terms of the mode in which they are presented, whether they are group or individual tests, whether they are speeded, and in terms of their content. Any complete description of an aptitude test should include reference to each of these characteristics.
Mode of presenting tests. Most tests are of the paper and pencil variety, in which the stimulus materials are presented on a printed page and the responses are made by marking a paper with a pencil. The administrative advantages of such a medium are obvious, in that many individuals can be tested at once, fewer examiners are needed, and scoring of the tests is relatively straightforward. Nonprinted tests, such as those involving apparatus, often present problems of maintenance and calibration. However, it may not be possible to assess the desired behavior by means of purely printed media. Tests of manual dexterity or multiple-limb coordination are examples of aptitude tests requiring apparatus, varying from a simple pegboard to mechanical-electronic devices. Tests for children and for illiterates frequently employ blocks and other objects, which are manipulated by the examinee.
Auditory and motion picture media have also been used in aptitude testing. For example, tests of musical aptitude are auditory, as are certain tests designed to select radiotelegraphers. The test material is presented by means of a phonograph or tape recorder. One motion picture test was designed to measure how well individuals could estimate the relative velocity of moving objects. It is evident that this function could not have been measured by a purely printed test. However, in both these auditory and motion picture tests, the responses are, nonetheless, recorded by pencil on paper.
Group versus individual tests. Some tests can be administered to examinees in a group; others can be administered to only one person at a time. The individual test is naturally more expensive to use in a testing program. Tests for very young children or tests requiring oral responses must be individual tests. Such tests are also used when an individual’s performance must be timed accurately. Devices used to test motor abilities constitute additional examples of individual tests, although sometimes it is possible to give these in small groups.
Speeded versus nonspeeded tests. Tests differ in the emphasis placed on speed. In many functions, such as vocabulary, there is little interest in speed. Such tests are called power tests and have no time limits. For other functions, such as perceptual speed or finger dexterity, speed becomes an important factor in the measured behavior. Speeded tests may be administered by allowing all examinees a specific length of time to finish (time-limit tests), in which case the score is represented by the number of items correctly completed. Alternatively, a speeded test may require the examinee to finish a task as rapidly as possible (work-limit tests), and his score may then be expressed as the time taken to complete the test. For example, a finger dexterity test may be scored in terms of the number of seconds taken to complete a series of small screw-washer-nut assemblies.
What the tests measure. Most frequently, aptitude tests are classified in terms of what they attempt to measure. Thus, there are vocabulary tests, motor ability tests, etc. Figure 1 provides some examples of test items.
Tests containing items such as those illustrated are often grouped into standard “multiple aptitude test batteries,” which provide profiles of certain separate ability test scores. Examples are the Differential Aptitude Tests (DAT), published by the Psychological Corporation, the General Aptitude Test Battery (GATB) of the U.S. Employment Service, and the Aircrew Classification Battery of the U.S. Air Force.
Characteristics of useful tests. Now that we have looked briefly at the different forms of tests, let us examine some of the basic concepts of testing. How can the usefulness of a test be evaluated?
Test construction. The process of constructing aptitude tests involves a rather technical sequence combining the ingenuity of the psychologist, experimentation and data collection with suitable samples of individuals, the calculation of quantitative indexes for items and total test scores, and the application of appropriate statistical tests at various stages of test development. Some of the indexes applied in the construction phase are difficulty levels, the proportion of responses actually made to the various alternatives provided in multiple-choice tests, and the correlation of item scores with total test scores or with an independent criterion. A well-developed aptitude test goes through several cycles of these evaluations before it is even tried out as a test. The more evidence there is in the test manual for such rigorous procedure, the more confidence we can have in the tests.
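Two of the item indexes mentioned here, difficulty level and the correlation of item scores with total test scores, can be sketched as follows. The response matrix is invented. Taking difficulty as the proportion of examinees answering correctly, and leaving the item itself in the total score, are simplifying assumptions of this sketch.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical right (1) / wrong (0) responses.
# Rows are examinees, columns are items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]

totals = [sum(row) for row in responses]  # total score per examinee
n_items = len(responses[0])

difficulty = []    # proportion of examinees answering each item correctly
item_total_r = []  # correlation of each item with the total score

for j in range(n_items):
    item = [row[j] for row in responses]
    difficulty.append(sum(item) / len(item))
    # Uncorrected index: the total still includes the item being evaluated.
    item_total_r.append(pearson(item, totals))

for j in range(n_items):
    print(f"item {j + 1}: difficulty={difficulty[j]:.2f}  "
          f"item-total r={item_total_r[j]:.2f}")
```

An item that nearly everyone passes or fails, or one that correlates poorly with the total, would be a candidate for revision in the next cycle of evaluation.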
There are other problems that generally must be considered in evaluating test scores. Before a test is actually used, a number of conditions have to be met. There is a period of “testing the tests” to determine their applicability in particular situations. A test manual should be devised to provide information on this. Furthermore, there is the question of interpreting a test score.
Standardization. The concept of standardization refers to the establishment of uniform conditions under which the test is administered, ensuring that the particular ability of the examinee is the sole variable being measured. A great deal of care is taken to ensure proper standardization of testing conditions. Thus, the examiner’s manual for a particular test specifies the uniform directions to be read to everyone, the exact demonstration, the practice examples to be used, and so on. The examiner tries to keep motivation high and to minimize fatigue and distractions. If motivation is high for one group of job applicants and not for another, the test scores may reflect motivational differences in addition to the ability differences that it is desired to measure.
Norms. A test score has no meaning by itself. The fact that Joe answered 35 words correctly on a vocabulary test or that he was able to place 40 pegs in a pegboard in two minutes gives very little information about Joe’s verbal ability or finger dexterity. These scores are known as raw scores. In order to interpret Joe’s raw score it is necessary to compare it with a distribution of scores made by a large number of other individuals, of known categories, who have taken the same test. Such distributions are called norms. There may be several sets of norms for a particular test, applicable to different groups of examinees. Thus, getting 75 per cent of the vocabulary items correct may turn out to be excellent when compared to norms based on high school students, but only average when compared to norms based on college graduates. If one is using a test to select engine mechanics, it is best to compare an applicant’s score with norms obtained from previous applicants for this job, as well as with norms of actual mechanics.
The mental age norm is one in which an individual’s score on an intelligence test is compared to the average score obtained by people of different ages. This, of course, is applicable mainly to children. For adults, the percentile norm is most frequently used. A large number of people (at least several hundred) are tested, the scores ranked, and the percentage of people falling below each score is determined. Let us suppose that an individual who gets a raw score of 35 on a test turns out to be at the 65th percentile. This tells us immediately that the person scored better than 65 per cent of the individuals in the group for which test norms were determined. A score at the 50th percentile is, by definition, the median of the distribution. The scores made by future applicants for a job may subsequently be evaluated by comparing them with the percentiles of the norm group.
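The percentile computation just described can be sketched as follows. The norm group here is invented and far smaller than the several hundred cases the text calls for. It was chosen so that a raw score of 35 falls at the 65th percentile, as in the example above.

```python
def percentile_rank(raw_score, norm_scores):
    """Percentage of the norm group scoring below raw_score."""
    below = sum(1 for s in norm_scores if s < raw_score)
    return 100.0 * below / len(norm_scores)

# Hypothetical norm group of 20 raw scores (invented data).
norm_group = [12, 15, 18, 20, 22, 24, 25, 27, 28, 30,
              31, 32, 33, 36, 37, 38, 40, 41, 43, 45]

print(percentile_rank(35, norm_group))  # 65.0
```

A later applicant's raw score can be passed to the same function to locate it against the norm group.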
Another type of norm is the standard score. Each individual’s score can be expressed as a discrepancy from the average score of the entire group. When we divide this deviation by the standard deviation (SD) of the scores of the entire group, we have a standard score, or a score expressed in SD units. Typically, a test manual will include these standard-score equivalents as well as percentile equivalents for each raw score.
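In symbols, the standard score is the raw score minus the group mean, divided by the group SD. A minimal sketch follows. The norm group is invented, and the use of the population form of the SD (dividing by N rather than N - 1) is an assumption of this example.

```python
from statistics import mean, pstdev

def standard_score(raw_score, norm_scores):
    """Deviation of raw_score from the group mean, expressed in SD units."""
    return (raw_score - mean(norm_scores)) / pstdev(norm_scores)

# Hypothetical norm group (invented data); its mean is 30.
norm_group = [20, 25, 30, 35, 40]

for raw in (20, 30, 40):
    print(raw, round(standard_score(raw, norm_group), 2))
```

A score equal to the group mean has a standard score of zero; scores above the mean are positive and scores below it are negative.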
From this discussion, it is evident that a psychological test usually has no arbitrary pass-fail score.
Reliability. One of the most important characteristics of a test is its reliability. This refers to the degree to which the test measures something consistently. If a test yielded a score of 135 for an individual one day and 85 the next, we would term the test unreliable. Before psychological tests are used they are first evaluated for reliability. This is often done by the test-retest method, which involves giving the same test to the same individuals at two different times in an attempt to find out whether the test generally ranks individuals in about the same way each time. The statistical correlation technique is used, and the resulting correlation is called the reliability coefficient. Test designers try to achieve test reliabilities above .90, but often reliabilities of .80 or .70 are useful for predicting job success. Sometimes two equivalent forms of a test are developed; both are then given to the same individuals and the correlation determined. Sometimes a split-half method is used; scores on half the items are correlated with scores on the remaining half. Tests that are short often are unreliable, as are many tests that do not use objectively determined scores.
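Two of these reliability estimates can be sketched as follows. All scores are invented. Splitting the test into odd-numbered and even-numbered items is one common way of forming the two halves, and is an assumption of this example rather than something the text prescribes.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Test-retest: the same six examinees tested on two occasions (invented data).
first_testing = [35, 28, 42, 31, 39, 25]
second_testing = [33, 30, 41, 29, 40, 27]
retest_r = pearson(first_testing, second_testing)

# Split-half: right (1) / wrong (0) responses of six examinees to eight items.
item_responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0, 0, 0],
]
odd_half = [sum(row[0::2]) for row in item_responses]   # items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in item_responses]  # items 2, 4, 6, 8
split_half_r = pearson(odd_half, even_half)

print(f"test-retest r = {retest_r:.2f}")
print(f"split-half r  = {split_half_r:.2f}")
```

Note that the split-half figure is a correlation between two half-length tests, which is consistent with the observation that short tests tend to be less reliable.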
Validity. An essential characteristic of aptitude tests is their validity. Whereas reliability refers to consistency of measurement, validity generally means the degree to which the test measures what it was designed to measure. A test may be highly reliable but still not valid. A thermometer, for example, may give consistent readings but it is certainly not a valid instrument for measuring specific gravity. Similarly, a test designed to select supervisors may be found to be highly reliable; but it will not be a valid test if scores made by new supervisors do not correlate with their later proficiency on the job.
When used for personnel selection purposes, the validity of aptitude tests is evaluated by finding the degree to which they correlate with some measure of performance on the job. The question to be answered is, Does the test given to a job applicant predict some aspect of his later job performance? The correlation obtained in such a determination is known as the validity coefficient. This is found by administering the test to unselected job applicants and later obtaining some independent measure of their performance on the job. If the validity coefficient is a substantial one, the test may be used to predict the job success of new applicants, just as it has demonstrated it can do with the original group. If the validity coefficient is low, the test is discarded as a selection instrument for this job, since it has failed to make the desired prediction of job performance.
Validity coefficients need not be very high in absolute value to make useful predictions in matching men to job requirements. A test was given to 1,000 applicants for pilot training in the Air Force. These applicants were allowed to go through training; six months later their proficiency was evaluated. It was found that scores on this ten-minute test correlated .45 with the performance of these individuals as pilots six months later. Very few of those scoring high on the test subsequently failed training, while over half of those scoring low on the test eventually failed.
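The two kinds of evidence in this example, a validity coefficient and the failure rates of high and low scorers, can be sketched as follows. The twelve applicants and their outcomes are invented and are not the Air Force data. The score bands used for "low" and "high" are likewise arbitrary choices for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: one aptitude score per applicant, and whether that
# applicant later passed training (1) or failed (0).
test_scores = [2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9]
passed = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

# Validity coefficient: correlation of test score with the later criterion.
validity = pearson(test_scores, passed)

def failure_rate(low, high):
    """Share of applicants scoring from low to high (inclusive) who failed."""
    outcomes = [p for s, p in zip(test_scores, passed) if low <= s <= high]
    return 1 - sum(outcomes) / len(outcomes)

print(f"validity coefficient = {validity:.2f}")
print(f"failure rate, scores 2-4 = {failure_rate(2, 4):.2f}")
print(f"failure rate, scores 7-9 = {failure_rate(7, 9):.2f}")
```

Even with a correlation well short of 1.00, the failure rates at the two ends of the score range differ sharply, which is what makes the test useful for selection.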
Why are some tests valid and others not? The reason must be that valid tests are those that measure the kinds of abilities and skills actually needed on the job. It should be noted that tests often do not directly resemble tasks of the job, even when they are highly valid. For example, the Rotary Pursuit Test was found to have considerable validity in predicting success in pilot and bombardier training for the Army Air Force during World War II. This test requires the examinee to keep a metal stylus in contact with a target spot set toward the edge of a rotating disc. Often the examinees may have thought, “Where does the pilot (or bombardier) do anything like this?” But this test is valid not because it resembles any task of these jobs, but because it samples control precision ability, which facilitates the learning of the jobs. (This ability factor was identified by factor analysis research.) Sometimes, in contrast, tests that appear superficially to resemble actual tasks of the job turn out to be of low validity because they fail to sample relevant abilities.
Predictive validity of the kind described above is not the only kind of validity. We may also be interested in the extent to which the test actually measures the trait we assume it measures, which is a somewhat different question from how well it predicts a criterion. This is called construct validity. Thus, a test assumed to be a spatial test may turn out to tap mainly the ability to understand the verbal instructions. Construct validity can be determined only experimentally, through correlation with other measures.
The selection ratio. Another important factor affecting the success of aptitude tests in personnel selection procedures is the selection ratio. This is the ratio of those selected to those available for placement. If there are only a few openings and many applicants, the selection ratio is low; and this is the condition under which a selection program works best. For example, if only a few pilots are needed relative to the number of applicants available, one can establish a high qualifying score on the aptitude test, and there will be very few subsequent failures among those accepted. On the other hand, if practically all applicants have to be accepted to fill the vacancies, the test is not useful, regardless of its validity, since this amounts to virtual abandonment of the selection principle. If the selection ratio is kept low, validity coefficients even as low as .20 can still identify useful tests. If the selection ratio is high, higher validity is necessary.
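The effect of the selection ratio can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reconstruction of any published table: test scores and later performance are drawn as normally distributed variables correlating at the chosen validity, and "success" is arbitrarily defined as performance above the median of the whole applicant pool. The function and parameter names are invented for this example.

```python
import random
from math import sqrt

def success_rate_among_selected(validity, selection_ratio, n=20000, seed=0):
    """Simulate n applicants whose test score and later job performance
    correlate at `validity`; select the top `selection_ratio` share by test
    score and return the share of those selected whose performance is above
    the applicant-pool median."""
    rng = random.Random(seed)
    applicants = []
    for _ in range(n):
        test = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        performance = validity * test + sqrt(1 - validity ** 2) * noise
        applicants.append((test, performance))
    median_perf = sorted(p for _, p in applicants)[n // 2]
    applicants.sort(key=lambda a: a[0], reverse=True)  # best test scores first
    selected = applicants[: max(1, int(n * selection_ratio))]
    return sum(1 for _, p in selected if p > median_perf) / len(selected)

# Same modest validity (.20), three different selection ratios.
rates = {ratio: success_rate_among_selected(0.20, ratio)
         for ratio in (0.1, 0.5, 0.9)}

for ratio, rate in rates.items():
    print(f"selection ratio {ratio:.1f}: success rate {rate:.3f}")
```

With the validity held at .20, the success rate among those selected rises as the selection ratio falls, and it approaches the base rate of one half when nearly every applicant must be accepted.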
Combining tests into a battery. Aptitude tests given in combination as multiple aptitude batteries would seem most appropriate where decisions have to be made regarding assignment of applicants to one out of several possible jobs. This kind of classification requires maximum utilization of an available manpower pool, where the same battery of tests, weighted in different combinations, provides predictive indexes for each applicant for each of several jobs. Since the validity of these tests has been separately determined for each job, it may be found, for example, that tests A, D, and E predict success in job Y, while tests B, C, and D predict success in job X. By the appropriate combinations of test scores, it is then possible to find each applicant’s aptitude index for job X as well as for job Y. The most efficient batteries are those in which the tests have a low correlation with each other (hence, there is less duplication of abilities measured) and where the individual tests have high validity for some jobs but not for others. Thus, if a test score predicts success on job Y but not job X, a high score on this test would point to an assignment on job Y. A test that is valid for all jobs is not very useful in helping us decide the particular job for which an individual is best suited.
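The idea of one battery yielding a different aptitude index for each job can be sketched as follows. The weights and the applicant's standard scores are invented. Assigning the applicant simply to the job with the highest index is a simplification that ignores how many openings each job has.

```python
# Hypothetical weights for each job. Job X draws on tests B, C, and D;
# job Y draws on tests A, D, and E, as in the example in the text.
job_weights = {
    "X": {"B": 0.5, "C": 0.3, "D": 0.2},
    "Y": {"A": 0.4, "D": 0.2, "E": 0.4},
}

def aptitude_indexes(standard_scores):
    """Weighted sum of the applicant's standard scores, one index per job."""
    return {
        job: sum(weight * standard_scores[test]
                 for test, weight in weights.items())
        for job, weights in job_weights.items()
    }

# One applicant's standard scores on the five tests (invented data).
applicant = {"A": 1.2, "B": -0.3, "C": 0.1, "D": 0.5, "E": 0.9}

indexes = aptitude_indexes(applicant)
best_job = max(indexes, key=indexes.get)

print(indexes)
print("suggested assignment:", best_job)
```

Test D enters both indexes, so it contributes little to the choice between the two jobs; the tests that are weighted for only one job are the ones that differentiate the assignment.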
There are two main methods of combining scores from a test battery to make predictions of later job performance. One method is called the successive hurdle or multiple-cutoff method. With this approach, applicants are accepted or rejected on the basis of one test score at a time. In order to be selected, an applicant must score above a critical score on each test; he is disqualified by a low score on any one test.
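The multiple-cutoff rule can be sketched in a few lines. The test names, critical scores, and applicant scores are all invented for illustration.

```python
# Hypothetical critical scores for a three-test battery.
cutoffs = {"perceptual_speed": 40, "numerical_facility": 35, "verbal": 30}

def first_failed_hurdle(scores, cutoffs):
    """Check the tests one at a time, in order. Return the name of the first
    test on which the applicant does not score above the critical score,
    or None if every hurdle is cleared."""
    for test, critical in cutoffs.items():
        if scores[test] <= critical:
            return test
    return None

def is_selected(scores, cutoffs):
    """An applicant is selected only if no hurdle is failed."""
    return first_failed_hurdle(scores, cutoffs) is None

applicant_a = {"perceptual_speed": 52, "numerical_facility": 41, "verbal": 33}
applicant_b = {"perceptual_speed": 60, "numerical_facility": 30, "verbal": 45}

print(is_selected(applicant_a, cutoffs))          # True
print(first_failed_hurdle(applicant_b, cutoffs))  # numerical_facility
```

Applicant B is rejected despite high scores on two tests, since under this method strength on one test cannot compensate for a low score on another.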
The second approach uses multiple correlation. From the validity of the tests and their correlations with each other, a determination can be made of a proper weight for each test score. Using these weights as multipliers for test scores, a value of a total aptitude index can be computed for each individual. This method, then, produces a combined weighted score, which reflects the individual’s performance on all the tests in a battery. The particular method chosen for combining scores depends on a number of factors in the selection situation, but both methods, which are based on aptitude information from a number of different tests, accomplish the purpose of making predictions of job success.
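For a battery of only two tests, the weights can be written in closed form. The sketch below uses the standard least-squares solution for two standardized predictors. The text does not specify a computational procedure, so this formula is offered as one conventional way of obtaining the weights, and the three correlations are invented.

```python
from math import sqrt

def two_test_weights(r1y, r2y, r12):
    """Standardized regression weights for two tests predicting a criterion.

    r1y, r2y: validity of each test (correlation with the criterion)
    r12:      correlation between the two tests
    """
    denom = 1 - r12 ** 2
    b1 = (r1y - r12 * r2y) / denom
    b2 = (r2y - r12 * r1y) / denom
    return b1, b2

# Hypothetical validities and intercorrelation.
r1y, r2y, r12 = 0.40, 0.30, 0.20
b1, b2 = two_test_weights(r1y, r2y, r12)

# Multiple correlation of the weighted composite with the criterion.
multiple_r = sqrt(b1 * r1y + b2 * r2y)

def aptitude_index(z1, z2):
    """Combined weighted score from an applicant's two standard scores."""
    return b1 * z1 + b2 * z2

print(f"weights: {b1:.3f}, {b2:.3f}")
print(f"multiple R: {multiple_r:.3f}")
print(f"index for z1=1.0, z2=-0.5: {aptitude_index(1.0, -0.5):.3f}")
```

Because the two tests in this example correlate only .20 with each other, the composite predicts the criterion better than either test alone. Unlike the multiple-cutoff method, a high score on one test can here offset a low score on the other.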
Edwin A. Fleishman