There are many standards that can be used to evaluate the status of a science, and one of the most important is how well variables are measured. The idea of measurement is relatively simple. It is associating numbers with aspects of objects, events, or other entities according to rules, and so measurement has existed for as long as there have been numbers, counting, and concepts of magnitude. In daily living, measurement is encountered in myriad ways. For example, measurement is used in considering time, temperature, distance, and weight. It happens that these concepts and quite a few others are basic to many sciences. The notion of measurement as expressing magnitudes is fundamental, and the observation that if something exists, it must exist in some quantity is probably too old to attribute to the proper authority. This notion of quantification is associated with a common dictionary definition of measurement: "The extent, capacity, or amount ascertained by measuring."
A concept such as distance may be considered to explore the meaning of measurement. To measure distance, one may turn to a simple example of a straight line drawn between two points on a sheet of paper. There is an origin or beginning point and an end point, and an infinite number of points between the beginning and the end. To measure in a standard way, a unit of distance has to be arbitrarily defined, such as an inch. Then the distance of any straight line can be observed in inches or fractions of inches. For convenience, arbitrary rules can be established for designating number of inches, such as feet, yards, and miles. If another standard is used—say, meters—the relationship between inches and meters is one in which no information is lost in going from one to the other. So, in summary, in the concept of measurement as considered thus far, there are several properties, two of which should be noted particularly: the use of arbitrary standardized units and the assumption of continuous possible points between any two given points. The case would be similar if time, temperature, or weight were used as an example.
There is another property mentioned above, a beginning point, and the notion of the beginning point has to be examined more carefully. In distance, if one measures 1 inch from a beginning point on a straight line, and then measures to a second point 2 inches from the beginning, one may say that the second point is twice as far from the beginning point as the first. With temperature, however, there is a problem. If one measures temperature from the point of freezing using the Celsius scale, which sets 0 degrees at the freezing point of water under specified conditions, then one can observe temperatures of 10 degrees and of 20 degrees. It is now proper to say that the second measurement is twice as many degrees from the origin as the first measure, but one cannot say that it is twice the temperature. The reason for this is that the origin that has been chosen is not the origin that is required to make that kind of mathematical statement. For temperature, the required origin is a value known as absolute zero, the absence of any heat, a value that is known only theoretically but has been approximated.
This problem is usually understood easily, but it can be made simpler to understand by illustrating how it operates in measuring distance. Suppose a surveyor is measuring distance along a road from A to B to C to D. A is a long distance from B. Arriving at B, the surveyor measures the distance from B to C and finds it is 10 miles, and then the distance from B to D is found to be 20 miles. The surveyor can say that the distance is twice as many miles from B to D as from B to C, but the surveyor cannot say that the distance from A to D is twice the distance from A to C, which is the error one would make if one used the Celsius temperature scale improperly. Measuring from the absolute origin for the purpose of carrying out mathematical operations has become known as ratio level measurement.
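The two cases above can be worked through numerically. The following is a minimal sketch; the 100-mile distance from A to B is a hypothetical figure chosen only for illustration, since the text says only that A is a long distance from B.

```python
# Ratio statements are only meaningful when measured from the true origin.
ABSOLUTE_ZERO_C = -273.15  # absolute zero expressed in degrees Celsius


def celsius_to_kelvin(celsius):
    """Re-express a Celsius reading as a distance from absolute zero."""
    return celsius - ABSOLUTE_ZERO_C


t1, t2 = 10.0, 20.0
ratio_from_freezing = t2 / t1  # 2.0: twice as many degrees above freezing
ratio_from_absolute_zero = celsius_to_kelvin(t2) / celsius_to_kelvin(t1)

# Surveyor example: B is an arbitrary origin, A is the "true" origin.
a_to_b = 100.0  # hypothetical value, not given in the text
b_to_c, b_to_d = 10.0, 20.0
ratio_from_b = b_to_d / b_to_c                        # 2.0
ratio_from_a = (a_to_b + b_to_d) / (a_to_b + b_to_c)  # 120 / 110

print(f"20 C vs 10 C from freezing:      {ratio_from_freezing:.3f}")
print(f"20 C vs 10 C from absolute zero: {ratio_from_absolute_zero:.3f}")
print(f"D vs C measured from B:          {ratio_from_b:.3f}")
print(f"D vs C measured from A:          {ratio_from_a:.3f}")
```

Measured from absolute zero, 20 degrees Celsius is only about 3.5 percent warmer than 10 degrees Celsius, just as D is only about 9 percent farther from A than C is under the assumed distance.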
The idea of "levels of measurement" has been popularized following the formulation by the psychologist S. S. Stevens (1966). Stevens first identifies scales of measurement much as measurement is defined above, and then notes that the type of scale achieved depends upon the basic empirical operations performed. The operations performed are limited by the concrete procedures and by the "peculiarities of the thing being scaled." This leads to the types of scales—nominal, ordinal, interval, and ratio—which are characterized "by the kinds of transformations that leave the 'structure' of the scale undistorted." This "sets limits to the kinds of statistical manipulation that can legitimately be applied to the scaled data."
Nominal scales can be of a type like numbering individuals for identification, which creates a class for each individual. Or there can be classes for placement on the basis of equality within each class with regard to some characteristic of the object. Ordinal scales arise from the operation of rank ordering. Stevens expressed the opinion that most of the scales used by psychologists are ordinal, which means that there is a determination of whether objects are greater than or less than each other on characteristics of the object, and thus there is an ordering from smallest to largest. This is a crucial point that is examined below. Interval scales (equal-interval scales) are of the type discussed above, like temperature, and these are subject to linear transformation with invariance. There are some limitations on the mathematical operations that can be carried out, but in general these limitations do not impede use of most statistical and other operations carried out in science. As noted, when the equal-interval scales have an absolute zero, they are called ratio scales. A lucid presentation of the issue of invariance of transformations and the limitations of use of mathematical operations (such as addition, subtraction, multiplication, and division) on interval scales is readily available in Nunnally (1978).
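What invariance under linear transformation means for an interval scale can be illustrated with the Celsius and Fahrenheit scales, which are linear transformations of each other. The sketch below uses three arbitrary illustrative readings; it shows that ratios of differences survive the transformation, while ratios of the values themselves do not.

```python
def celsius_to_fahrenheit(celsius):
    """A linear transformation: new = slope * old + intercept."""
    return 1.8 * celsius + 32.0


# Three arbitrary readings on the Celsius scale.
a, b, c = 10.0, 20.0, 40.0
fa, fb, fc = (celsius_to_fahrenheit(v) for v in (a, b, c))

# Ratios of differences (intervals) are invariant.
interval_ratio_celsius = (c - b) / (b - a)
interval_ratio_fahrenheit = (fc - fb) / (fb - fa)

# Ratios of the values themselves are not.
value_ratio_celsius = b / a
value_ratio_fahrenheit = fb / fa

print(interval_ratio_celsius, interval_ratio_fahrenheit)
print(value_ratio_celsius, value_ratio_fahrenheit)
```

The interval ratio is 2 on both scales, whereas the value ratio is 2 in Celsius but 68/50 in Fahrenheit, which is why statements such as "twice the temperature" require a ratio scale.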
What is important to emphasize is that how the scales are constructed, as well as how the scales are used, determines the level of measurement. With regard to ordinal scales, Nunnally makes a concise and precise statement that should be read carefully: "With ordinal scales, none of the fundamental operations of algebra may be applied. In the use of descriptive statistics, it makes no sense to add, subtract, divide, or multiply ranks. Since an ordinal scale is defined entirely in terms of inequalities, only the algebra of inequalities can be used to analyze measures made on such scales" (1978, p. 22). What this means is that if one carries out a set of operations that are described as making an ordinal scale, the moment one adds, subtracts, divides, or multiplies the ranks, one has treated the scale as a particular type of interval scale. Most commonly, the type of scale that is de facto created when ordinal data are subject to ordinary procedures like addition, subtraction, division, and/or multiplication is through the assumption that the differences between ranks are equal, leading to sets like 1, 2, 3, 4, 5, 6, and thus to the treatment for ties such as 1, 2, 4, 4, 4, 6. This is sometimes called a flat distribution with an interval of one unit between each pair of ordered cases. Effectively, this is the same kind of distribution in principle as the use of ordered categories, such as quartiles, deciles, or percentiles, but it is a more restrictively defined distribution of one case per category. To repeat, for emphasis: The use of addition, subtraction, division, and/or multiplication with ordinal data automatically requires assumptions of intervals, and one is no longer at the level of ordinal analysis. Thus, virtually all statistical procedures based on collected ordered or rank data actually assume a special form of interval data.
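The treatment of ties mentioned above can be made concrete. The following sketch assigns each tied case the mean of the rank positions it occupies; the scores are hypothetical and chosen so that the third, fourth, and fifth cases tie. Averaging the tied positions is itself an arithmetic operation on ranks, which is exactly the equal-interval assumption the paragraph describes.

```python
def average_ranks(values):
    """Rank values from smallest to largest, giving tied values the mean
    of the rank positions they jointly occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = ((start + 1) + (end + 1)) / 2  # positions are 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


scores = [3.1, 4.7, 5.2, 5.2, 5.2, 8.0]  # hypothetical scores with a three-way tie
ranks = average_ranks(scores)
mean_rank = sum(ranks) / len(ranks)  # adding and dividing ranks assumes equal intervals

print(ranks)      # [1.0, 2.0, 4.0, 4.0, 4.0, 6.0]
print(mean_rank)  # 3.5
```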
ISSUES ON LEVEL OF MEASUREMENT
A number of issues are associated with the notion of levels of measurement. For example, are all types of measurement included in the concepts of nominal, ordinal, interval, and ratio? What is the impact of using particular statistical procedures when data are not in the form of well-measured interval scales? What kind of measurement appears (epistemologically) appropriate for the social and behavioral sciences?
The last question should probably be examined first. For example, are measures made about attributes of persons nominal, ordinal, or interval? In general, we cannot think of meaningful variables unless they at least imply order, but is order all that one thinks of when one thinks about characteristics of persons? For example, if one thinks of heights, say of all males of a given age, such as 25, does measurement imply ordering them on an interval scale? We know that height is a measure of distance, so we assume the way one should measure this is by using a standard. For purposes of the example here, an interval scale is proposed, and the construction is as follows. The shortest 25-year-old male (the category defined as 25 years and 0 days to 25 years and 365 days of age) and the tallest are identified. The two persons are placed back to back, front to front, and every other possible way, and the distance between the height of the shortest and the tallest is estimated on a stick. Many estimates are made on the stick, until the spots where agreement begins to show discretely are evident; and so, with whatever error occurs in the process, the locations of beginning and end are indicated on the stick. Now the distance between the beginning and the end is divided into equal intervals, and thus an interval scale has been created. On this scale it is possible to measure every other male who is 25 years old, and the measure can be stated in terms of the number of intervals taller than the shortest person. Note that all possible values can be anticipated, and this is a continuous distribution.
Now if a million persons were so measured, how would they be distributed? Here the answer is on the basis of naive experience, as follows. First, there would be very few people who would be nearly as short as the shortest or as tall as the tallest. Where would one expect to find most persons? In the middle of the distance, or at some place not too far from it. Where would the next greatest number of persons be found? Close to where the greatest number is found. With questions of this sort one ends up describing a well-distributed curve, possibly a normal curve or something near it.
It is proper now to make a small diversion before going on with answering the questions about the issues associated with level of measurement. In particular, it should be noted that there are many sources of error in the measurement that has just been described. First, of course, the age variable is specified with limited accuracy. At the limits, it may be difficult to determine exact age because of the way data are recorded. There are differences implied by the fact that where one is born makes a difference in time, and so on. This may seem facetious, but it illustrates how easily sources of error are bypassed without examination. Then it was noted that there were different estimates of the right location for the point of the shortest and the tallest person as marked on the stick. This is an error of observation and recording, and clearly the points selected are taken as mean values. Who are the persons doing the measuring? Does it make a difference if the person measuring is short or tall? These kinds of errors will exist for all persons measured. Further, it was not specified under what conditions the measurements were taken or were to be taken. Are the persons barefoot? How are they asked to stand? Are they asked to relax to a normal position or to try to stretch upward? What time of day is used, since the amount of time after getting up from sleep may have an influence? Is the measurement before or after a meal? And so forth.
The point is that there are many sources of error in taking measures, even direct measures of this sort, and one must be alert to the consequences of these errors on what one does with the data collected. Errors of observation are common, and one aspect of this is the limit of the discriminations an observer can make. One type of error that is usually built into the measurement is rounding error, which is based on the estimated need for accuracy. So, for example, heights are rarely measured more accurately than to the half-inch or centimeter, depending on the standard used. There is still the error of classification up or down, by whatever rule is used for rounding, at the decision point between the intervals used for rounding. Rounding usually follows a consistent arbitrary rule, such as "half adjusting," which means keeping the digit value if the next value in the number is 0 to 4 (e.g., 24.456 = 24) or increasing the value of a digit by 1 if the next value is 5 to 9 (e.g., 24.789 = 25). Another common rounding rule is simply to drop numbers (e.g., 24.456 = 24 and 24.789 = 24). It is important to be aware of which rounding rule is being used and what impact it may have on conclusions drawn when the data collected are used.
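The two rounding rules described above can be written out directly. This is a minimal sketch for non-negative values such as heights; the function names are illustrative. Note that Python's built-in round() follows a third rule (exact halves go to the nearest even digit), so it is not used here.

```python
import math


def half_adjust(value):
    """Round to the nearest whole unit, with a following digit of 5 to 9 rounding up.
    Intended for non-negative measurements."""
    return math.floor(value + 0.5)


def drop_digits(value):
    """Simply drop everything after the decimal point."""
    return math.trunc(value)


for height in (24.456, 24.789):
    print(height, half_adjust(height), drop_digits(height))
# 24.456 -> 24 under both rules
# 24.789 -> 25 when half adjusting, 24 when digits are dropped

# The rule matters at the decision point between intervals:
print(half_adjust(24.5))  # 25
print(round(24.5))        # 24 in Python 3, because exact halves go to the even digit
```

Dropping digits always classifies downward, so on average it understates values by about half a unit, whereas half adjusting errs in both directions.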
The use of distribution-free statistics (often called nonparametric statistics) was popularized beginning in the mid-1950s, and quickly came to be erroneously associated with the notion that most of the measurement in the social and behavioral sciences is of an ordinal nature. Actually, the use of the distribution-free statistics was given impetus because some tests, such as the sign test, did not require use of all the information available to do a statistical test of significance of differences. Thus, instead of using a test of differences of means, one could quickly convert the data to plus and minus scores, using some arbitrary rule, and do a "quick-and-dirty" sign test. Then, if one found significant differences, the more refined test could be carried out at one's leisure. Some of the orientation was related to computing time available, which meant time at a mechanical calculator. Similarly, it was well known that if one used a Spearman rank correlation with larger samples, and if one were interested in measuring statistical significance, one would have to make the same assumptions as for the Pearson product moment correlation, but with less efficiency.
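The "quick-and-dirty" sign test described above can be sketched as follows. The paired differences are hypothetical, and the rule used here for converting them to plus and minus scores (the sign of each difference, with zero differences dropped) is one common convention assumed for illustration, not a rule specified in the text.

```python
from math import comb


def sign_test_p_value(differences):
    """Two-sided exact sign test: under the null hypothesis, plus and minus
    signs are equally likely, so the count of the rarer sign is binomial(n, 1/2)."""
    plus = sum(1 for d in differences if d > 0)
    minus = sum(1 for d in differences if d < 0)
    n = plus + minus  # zero differences carry no sign and are dropped (assumption)
    rarer = min(plus, minus)
    one_tail = sum(comb(n, i) for i in range(rarer + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)


# Hypothetical paired differences: nine positive, one negative.
differences = [1.2, 0.4, 2.1, -0.3, 0.8, 1.5, 0.9, 0.2, 1.1, 0.6]
p_value = sign_test_p_value(differences)
print(p_value)  # 2 * (1 + 10) / 1024
```

Only the signs enter the computation; the magnitudes of the differences, which a test of means would use, are discarded.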
However, this early observation about distribution-free statistics suggests that measures can be thought of in another way. Namely, one can think of measures in terms of how much they are degraded (or imperfect) interval measures. This leads to two questions that are proper to consider. First, what kind of measure is implied as appropriate by the concept? And second, how much error is there in how the measure is constructed if one wants to use procedures that imply interval measurement, including addition, subtraction, multiplication, and division?
What kind of measure is implied by the concept? One way of answering this is to go through the following procedure. As an example, consider a personal attribute, such as aggressiveness. Is it possible to conceive of the existence of a least aggressive person and a most aggressive person? Obviously, whether or not such persons can be located, they can be conceived of. Then, is there any reason to think that persons cannot have any and all possible quantities of aggressiveness between the least and the most aggressive persons? Of course not. Thus, what has been described is a continuous distribution, and with the application of a standard unit, it is appropriately an interval scale. It is improper to think of this variable as intrinsically one that is ordinal because it is continuous. In fact, it is difficult to think of even plausible examples of variables that are intrinsically ordinal. As Kendall puts it, "the essence of ranking is that the objects shall be orderable, and the totality of values of a continuous variate cannot be ordered in this sense. They can be regarded as constituting a range of values, but between any two different values there is always another value, so that we cannot number them as would be required for ranking purposes" (1948, p. 105). While a few comments were published that attempted to clarify these issues of measurement in the 1960s (Borgatta 1968), most methodologists accepted the mystique of ordinal measurement uncritically.
Often measures of a concept tend to be simple questions with ordered response categories. These do not correspond to ordinal measures in the sense of ordering persons or objects into ranks, but the responses to such questions have been asserted to be ordinal level measurement because of the lack of information about the intervals. So, for example, suppose one is attempting to measure aggressiveness using a question such as "When you are in a group, how much of the time do you try to get your way about what kinds of activities the group should do next?" Answer categories are "never," "rarely," "sometimes," "often," "very often," and "always." Why don't these categories form an interval scale? The incorrect answer usually given is "because if one assumes an interval scale, one doesn't know where the answer categories intersect the interval scale." However, this does not create an ordinal scale. It creates an interval scale with unknown error with regard to the spacing of the intervals created by the categories.
Thus, attention is now focused on the second question: How much error is involved in creating interval scales? This question can be answered in several ways. A positive way of answering is by asking how much difference it makes to distort an interval scale. For example, if normally distributed variables (which are assumed as the basis for statistical inference) are transformed to flat distributions, such as percentiles, how much impact does this have on statistical operations that are carried out? The answer is "very little." This property of not affecting results of statistical operations has been called robustness. Suppose a more gross set of transformations is carried out, such as deciles. How much impact does this have on statistical operations? The answer is "not much." However, when the transformations are to even grosser categories, such as quintiles, quartiles, thirds, or halves, the answer is that because one is throwing away even more information by grouping into fewer categories, the impact is progressively greater. What has been suggested in this example has involved two aspects: the transformation of the shape of the distribution, and the loss of discrimination (or information) by use of progressively fewer categories.
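One way to see the effect of both aspects is to simulate it. The sketch below is an illustration under stated assumptions, not the procedure behind any result reported in this article: it draws a normally distributed variable, replaces each value by the index of its equal-count group (100 groups for percentiles, 10 for deciles, and so on), and reports the correlation between the original and the transformed variable as a rough index of how much information is retained. The sample size and seed are arbitrary.

```python
import math
import random


def pearson(x, y):
    """Product moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def to_quantile_groups(values, k):
    """Replace each value by the index (0 to k - 1) of its equal-count group,
    which flattens the distribution and coarsens it to k categories."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for position, index in enumerate(order):
        groups[index] = position * k // len(values)
    return groups


rng = random.Random(12345)  # arbitrary seed, for reproducibility only
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]

retained = {k: pearson(x, to_quantile_groups(x, k)) for k in (100, 10, 5, 4, 3, 2)}
for k, r in retained.items():
    print(f"{k:3d} categories: r with original = {r:.3f}")
```

In runs of this kind the correlation stays high for percentiles and deciles and drops progressively for quintiles, quartiles, thirds, and halves, in line with the argument above.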
CONSEQUENCES OF USING LESS THAN NORMALLY DISTRIBUTED VARIABLES
If one has normally distributed variables, the distribution can be divided into categories. The interval units usually of interest with normally distributed variables are technically identified as standard deviation units, but other units can be used. When normally distributed variables are reduced to a small number of (gross) categories, substantial loss of discrimination or information occurs. This can be illustrated by doing a systematic exercise, the results of which are reported in Table 1. The data that are used for the exercise are generated from theoretical distributions of random normal variables with a mean of 0 and a standard deviation of 1. In the exercise, one aspect is examining the relationship among normally distributed variables, but the major part of the exercise involves the data in ordered categorical form, much as it is encountered in "real" data. The exercise permits reviewing several aspects associated with knowledge about one's measurement.
As noted, the exercise is based on unit normal variables (mean = 0, standard deviation = 1) that are sampled and thus are subject to the errors of random sampling that one encounters with "real" data. The theoretical underlying model is one that is recommended in practice, specified as follows:
- A criterion variable is to be predicted, that is, the relationships of some independent (predictor) variables are to be assessed with regard to the criterion variable.
- For each sample in which the relationship of the independent variables is assessed, the theoretical underlying relationship of the independent variables is specified, and thus for the purposes of the exercise is known. For the exercise, four levels of underlying theoretical relationship are product moment correlation coefficients of .8, .6, .4, and .2, between the predictor variables and the criterion variable. This represents different levels of relationship corresponding to magnitudes commonly encountered with "real" social and behavioral data.

Table 1. Median correlation of scores with criterion variable, and range over 9 samples (n = 150) of unit normal deviates; additive scores based on four items with theoretical correlations of .8, .6, .4, and .2 with the criterion variable in the population.

Score      XO    XOR       XOA   XOAR      XOB   XOBR      XO²   XOA²  XOB²
SumX.8     .94   .93–.96   .75   .69–.78   .66   .52–.73   .88   .56   .44
SumA.8     .92   .90–.94   .76   .74–.80   .62   .56–.67   .85   .58   .38
SumB.8     .89   .87–.92   .68   .63–.74   .62   .58–.68   .78   .46   .38
SumC.8     .85   .81–.86   .78   .69–.84   .51   .46–.59   .72   .61   .36
SumDL.8    .72   .64–.74   .54   .48–.60   .72   .64–.77   .52   .29   .52
SumDR.8    .71   .64–.76   .55   .49–.58   .26   .23–.30   .50   .30   .07
SumX.6     .84   .78–.88   .65   .56–.75   .57   .49–.71   .71   .42   .32
SumA.6     .82   .77–.85   .66   .60–.71   .58   .48–.62   .67   .44   .34
SumB.6     .79   .70–.81   .62   .55–.65   .51   .43–.60   .62   .38   .26
SumC.6     .74   .70–.79   .65   .56–.70   .48   .40–.54   .55   .42   .23
SumDL.6    .62   .56–.69   .53   .45–.58   .56   .45–.67   .38   .28   .31
SumDR.6    .62   .57–.67   .52   .45–.56   .32   .23–.36   .38   .27   .10
SumX.4     .65   .59–.71   .53   .44–.57   .48   .31–.54   .42   .28   .23
SumA.4     .66   .51–.72   .52   .47–.60   .42   .27–.55   .44   .27   .18
SumB.4     .61   .48–.67   .47   .37–.55   .41   .33–.46   .37   .22   .17
SumC.4     .59   .49–.66   .48   .39–.58   .38   .27–.43   .35   .23   .14
SumDL.4    .45   .32–.52   .39   .30–.47   .33   .15–.51   .20   .15   .11
SumDR.4    .50   .45–.59   .43   .38–.49   .26   .22–.27   .25   .18   .07
SumX.2     .39   .25–.51   .31   .20–.41   .25   .12–.45   .15   .10   .06
SumA.2     .31   .26–.42   .23   .16–.34   .27   .22–.31   .10   .05   .07
SumB.2     .32   .20–.49   .29   .12–.29   .23   .10–.29   .10   .08   .05
SumC.2     .28   .23–.42   .22   .15–.40   .17   .10–.24   .08   .05   .03
SumDL.2    .29   .17–.35   .16   .12–.24   .22   .16–.35   .08   .03   .05
SumDR.2    .24   .19–.37   .20   .04–.31   .17   .07–.30   .06   .04   .03
- For each sample that is used in the exercise, totally independent distributions are drawn from the theoretical distributions. That is, each sample in the exercise is created independently of all other samples, and in each case the criterion and independent predictor variables are drawn independently within the definition of the theoretical distributions.
- Corresponding to a common model of prediction, assume that the independent variables are of a type that can be used to create scores. Here we will follow a common rule of thumb and use four independent variables to create a simple additive score; that is, the values of the four independent variables are simply added together. There are many ways to create scores, but this procedure has many virtues, including simplicity, and it permits examining some of the consequences of using scores to measure a concept.
- The independent variables are modified to correspond to grouped or categorical data. This is done before adding the variables together to create scores, as this would be the way variables encountered in research with "real" data are usually defined and used. The grouped or categorical independent variables are modified in the following ways:
A. Four groups or categories are created by using three dividing points in the theoretical distribution, −1 standard deviation, the mean of 0, and +1 standard deviation. The values for four variables, now grouped or categorical data, are added to give the score used in the correlation with the criterion variable. Thus, looking at Table 1, the row SumA.8 involves samples in which four independent variables based on data having four categories as defined above, and having an underlying theoretical correlation coefficient for each independent variable to the criterion variable of .8, have been used to create a score. The variables defined by these cutting points correspond to a notion of variables with response categories, say, such as "agree," "probably agree," "probably disagree," and "disagree," and as it happens the responses theoretically are well distributed, roughly 16 percent, 34 percent, 34 percent, and 16 percent.
B. Three groups or categories are created by using two dividing points in the theoretical distribution, −1 and +1 standard deviation, corresponding to a notion of variables with response categories, say, such as "agree," "don't know or neutral," and "disagree," and these are again theoretically well distributed, but in this case the center category is large (about 68 percent of responses). The scores for these samples are identified as SumB.8 for the case of the underlying theoretical distribution of the independent variables having a correlation coefficient with the criterion variable of .8.
C. Two groups or categories are created by using one dividing point at the mean of the theoretical distribution, and this corresponds to variables with response categories such as "agree" and "disagree," or "yes" and "no." Division of a variable at the midpoint of the distribution is usually considered statistically to be the most efficient for dichotomous response categories.
DL. Two groups or categories may be created that are not symmetrical, unlike the case of C above. Here we create two sets of samples, one identified by SumDL with the cutting point at −1 standard deviation, or the left side of the distribution as it is usually represented. Presumably, when variables are not well distributed, more information is lost in the data collection if the skew is due to the way the question is formulated.
DR. Two groups or categories are created with the skew on the right side of the distribution at +1 standard deviation. SumDR scores are created in parallel to the SumDL scores.
Finally, the rows of SumX are those where scores are computed and no conversion of the variables has been carried out, so the independent variables are drawn from the theoretically formulated unit normal distributions at the level of correlation coefficient between the independent variables and the criterion variable.
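The proportion of cases expected in each category under the five groupings can be computed from the unit normal distribution. The following sketch uses the cutting points given in A through DR above; the dictionary and function names are illustrative only.

```python
from statistics import NormalDist


def category_proportions(cut_points):
    """Expected share of a unit normal variable falling in each category
    defined by the given cutting points (in standard deviation units)."""
    cdf = NormalDist().cdf  # unit normal: mean 0, standard deviation 1
    edges = [0.0] + [cdf(cut) for cut in cut_points] + [1.0]
    return [upper - lower for lower, upper in zip(edges, edges[1:])]


# Cutting points for the five groupings described above.
SCHEMES = {
    "A": [-1, 0, 1],
    "B": [-1, 1],
    "C": [0],
    "DL": [-1],
    "DR": [1],
}

proportions = {name: category_proportions(cuts) for name, cuts in SCHEMES.items()}
for name, shares in proportions.items():
    print(name, [f"{100 * share:.1f}%" for share in shares])
```

For A this gives roughly 16, 34, 34, and 16 percent, and for B a center category of about 68 percent, matching the figures cited above; DL and DR split the cases roughly 16 to 84 and 84 to 16.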
In summary, for each sample the predictor score is an additive score based on four variables, each related to the criterion variable in a theoretical distribution at a given level. In the exercise, the levels chosen are product moment correlation coefficients of .8, .6, .4, and .2.
The rationale for the choice of four variables in each score is drawn arbitrarily from the theoretical distribution of reliability coefficients, which, for variables of equal reliability, is a curve of diminishing return with each increment of the number of variables. At four variables a score is usually considered to be at an efficient point, balancing such things as time and effort needed to collect data, independence of variable definitions, and the amount of improvement of reliability expected. Using more variables in the "real" world usually involves using progressively less "good" items from the point of view of content, as well as the diminishing efficiency even if they were as good as the first four.
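A standard formula with the diminishing-return shape described above is the Spearman-Brown formula for the reliability of a score built from k equally reliable items. The text does not name the curve it has in mind, so using this formula here is an assumption, and the single-item reliability of .5 is a hypothetical value chosen for illustration.

```python
def spearman_brown(item_reliability, k):
    """Reliability of an additive score of k parallel items, each with the
    given reliability (Spearman-Brown formula; its use here is an assumption)."""
    return k * item_reliability / (1 + (k - 1) * item_reliability)


item_reliability = 0.5  # hypothetical single-item reliability
previous = 0.0
for k in (1, 2, 4, 8, 16):
    reliability = spearman_brown(item_reliability, k)
    print(f"k = {k:2d}: reliability = {reliability:.3f} (gain {reliability - previous:+.3f})")
    previous = reliability
```

Under these assumptions, going from one to four items raises reliability from .50 to .80, while doubling again to eight items adds less than .09.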
With regard to the criterion variable, the dependent variable to be predicted, we have generated three as follows: First, the variable is drawn from a theoretical unit normal distribution without modification, and this involves correlation coefficients reported in column XO. Second, the dependent variable is drawn from the theoretical distribution dichotomized at the mean of 0, creating a symmetric dichotomous variable (two categories), and this involves correlation coefficients reported in column XOA. Third, the variable is drawn from the theoretical distribution dichotomized at -1 standard deviation, creating an asymmetrical dichotomous variable and this involves correlation coefficients reported in column XOB.
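The design of the exercise can be sketched in code. This is an illustration under several assumptions that are not stated in the text: each predictor is generated as rho times the criterion plus independent normal noise (so predictors are related to one another only through the criterion), a single large sample of 5,000 cases is used for all rows and columns instead of independently drawn samples of 150, and only the .8 level is run. All names in the code are illustrative.

```python
import math
import random


def pearson(x, y):
    """Product moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def categorize(value, cut_points):
    """Category index: the number of cutting points the value exceeds."""
    return sum(1 for cut in cut_points if value > cut)


# Cutting points (standard deviation units) for each score; None = no grouping.
CUTS = {"SumX": None, "SumA": [-1, 0, 1], "SumB": [-1, 1],
        "SumC": [0], "SumDL": [-1], "SumDR": [1]}


def simulate(rho, n, rng):
    """Correlations of each four-item additive score with each criterion form."""
    noise_sd = math.sqrt(1 - rho ** 2)
    criterion, items = [], []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)
        criterion.append(y)
        # Assumed generating model: unit variance, correlation rho with y.
        items.append([rho * y + noise_sd * rng.gauss(0.0, 1.0) for _ in range(4)])
    criteria = {
        "XO": criterion,                                 # unmodified
        "XOA": [categorize(y, [0]) for y in criterion],  # dichotomized at the mean
        "XOB": [categorize(y, [-1]) for y in criterion], # dichotomized at -1 SD
    }
    results = {}
    for name, cuts in CUTS.items():
        if cuts is None:
            score = [sum(row) for row in items]
        else:
            score = [sum(categorize(v, cuts) for v in row) for row in items]
        results[name] = {col: pearson(score, crit) for col, crit in criteria.items()}
    return results


rng = random.Random(2024)  # arbitrary seed
table = simulate(0.8, 5000, rng)
for name, row in table.items():
    print(name, {col: round(r, 2) for col, r in row.items()})
```

Because the generating model is assumed, agreement with any published values should be read as suggestive only; the sketch is meant to show how the score and criterion definitions fit together.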
As noted above, we are dealing with generated data from theoretical distributions, so all that is done is subject to sampling error, which permits variation of results to be shown in a limited way. This is useful in illustrating the variation that occurs simply by random processes in the study of variables. For the exercise there were 648 samples of 150 cases each. The number of 150 cases was selected for several reasons, convenience in the data set generated being one, but also because 150 cases was, as a rule of thumb, a reasonable number of cases proposed for doing research at an earlier time in the discipline when a Guttman scale was anticipated, and too much "shrinkage" in the subsequent use of the Guttman scale was to be avoided. Shrinkage referred to the experience of finding out that the scale in a subsequent use did not have the same characteristics as in the first (generating) use. The samples are grouped in the six procedures, as noted above, of the full normal data and of five created sets of categories, each with nine repetitions (i.e., samples) carried out to permit examination of distribution of results for each theoretical level of correlation coefficient and for each of the three criterion variable definitions.
A sense of the stability of the relationships involved is provided by using many samples, and the values that are in columns XO, XOA, and XOB are the median product moment correlation coefficients between a Sum score and a criterion variable. In the table there are two other sets of columns. XOR, XOAR, and XOBR are the actual range of values for the product moment correlation coefficients between a Sum score and the criterion variable for nine independently drawn samples of 150 cases each. The additional (last three) columns at the right of the table are the squared values of XO, XOA, and XOB, and these values represent the amount of variance that is accounted for by the correlation coefficient.
The table is somewhat complex because it includes a great deal of information. Here we will only review a number of relatively obvious points, but with emphasis. First, note the relationship between SumX.8 and XO, which is .94. Recall that the theoretical correlation coefficient between each independent predictor variable and the criterion variable was defined as .8 in the exercise. Thus, we illustrate rather dramatically the importance of building scores rather than depending on a single predictor variable. The actual improvement in prediction is from a theoretical value of 64 percent of the variance being accounted for by a single predictor variable to the median value of 88 percent of the variance in the median sample of the exercise, an extremely impressive improvement! Examining the correlation coefficients of SumX.6, SumX.4, and SumX.2 with XO, it is seen that the advantage of using scores rather than single items is actually even more dramatic when the relationship between the criterion and the predictor variables is weaker.
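The size of this improvement can be approximated with a short calculation. The sketch below assumes that the four predictors are related to one another only through the criterion, so that any two predictors correlate at rho squared; this assumption is not stated in the text, although the values it yields are close to the SumX medians in column XO.

```python
import math


def sum_score_correlation(rho, k=4):
    """Expected correlation between the criterion and the sum of k unit-variance
    predictors, each correlating rho with the criterion. Assumes predictors
    correlate rho ** 2 with one another (an assumption, not from the text)."""
    inter_item = rho ** 2
    covariance = k * rho                        # Cov(sum, criterion)
    variance = k + k * (k - 1) * inter_item     # Var(sum)
    return covariance / math.sqrt(variance)


for rho in (0.8, 0.6, 0.4, 0.2):
    r = sum_score_correlation(rho)
    single, score = rho ** 2, r ** 2
    print(f"rho = {rho}: r(sum) = {r:.3f}, variance {single:.2f} -> {score:.2f} "
          f"({score / single:.1f} times)")
```

Under this assumption the share of variance accounted for rises by a factor of about 1.4 at .8 but about 3.6 at .2, which is the sense in which the advantage of scores is greater for weaker predictors.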
Now, examine the relationships between all the Sum variables and XO. It is noted that having the SumA.8 as a well distributed categorical variable lowers the correlation to XO somewhat (.92) compared to the full normal SumX.8 (.94), and, indeed, with SumB.8 and SumC.8 the correlation coefficients are still lower. This illustrates that the more information that is available in the categorical data, the more variance can be predicted for a criterion variable. Note further, that for SumDL.8 and SumDR.8, which are based on dichotomous but skewed variables, the correlation coefficients are still lower. Note that the results for the weaker variables SumX.6 to SumDR.6, for SumX.4 to SumDR.4, and for SumX.2 to SumDR.2 are in general roughly parallel to the results noted, but some irregularity is noted when the relationship between the criterion and the predictor variables is weaker.
The results we have been examining are the median values of nine samples in each case. It is appropriate to look at column XOR and to note that there is quite a bit of variation in the correlation coefficients based on the nine samples of size 150 cases each.
One additional finding needs to be pointed out and emphasized. Compare the correlation coefficients of SumDT.8 and XOB, and of SumDR.8 and XOB. In the former case, the criterion variable and predictor variables are both asymmetric but matched (at the same cutting point), and the median correlation coefficient value is .72. In the latter case they are asymmetric but unmatched, and the median correlation coefficient is .26. This indicates the need to know how cutting points (difficulty levels) of variables can affect the values of correlation coefficients that are encountered.
The reader is urged to examine this table in detail, and it may be useful to keep the table as a guideline for interpreting results. How? If the researcher finds as a result a correlation coefficient of a given size, where does it fit on this table, taking into consideration the characteristics of the criterion variable and the scores based on the predictor variables? Doing this type of examination should help the researcher understand the meaning of a finding. However, this is but one approach to the issues of measurement. The articles in this encyclopedia that are vital for researchers to consider are those on Reliability, Validity, and Quasi-Experimental Research Design. Additionally, measurement issues are discussed in a variety of types of texts, ranging from specialized texts to books on research methods to general statistics books, such as the following: Agresti and Finlay 1997; Babbie 1995; DeVellis 1991; Knoke and Bohrnstedt 1994; Lewis-Beck 1994; Neuman 1997; and Traub 1994.
The consideration of measurement thus far has concentrated on interval measurement, with some emphasis on how it can be degraded. There are many other issues that are appropriately considered, including the fact that the concepts advanced by Stevens do not include all types of measures. As the discussion proceeds to some of these, attention should also be given to the notion of nominal scales. Nominal scales can be constructed in many ways, and only a few will be noted here. By way of example, it is possible to create a classification of something being present for objects that is then given a label A. Sometimes a second category is not defined, and then the second category is the default, the thing not being present. Or two categories can be defined, such as male and female. Note that the latter example can also be defined as male and not male, or as female and not female. More complex classifications are illustrated by geographical regions, such as north, west, east, and south, which are arbitrary and follow the pattern of compass directions. Such classifications, to be more meaningful, are quickly refined to reflect more homogeneity in the categories, and sets of categories develop such as Northeast, Middle Atlantic, South, North Central, Southwest, Mountain, Northwest, and Pacific, presumably with the intention of being inclusive (exhaustive) of the total area. These are complex categories that differ with regard to many variables, and so they are not easily ordered.
However, each such set of categories for a nominal scale can be reduced to dichotomies, such as South versus "not South"; these variables are commonly called "dummy variables." This permits analysis of the dummy variable as though it represented a well-distributed variable. In this case, for example, one could think of the arbitrary underlying variable as being "southernness," or whatever underlies the conceptualization of the "South" as being different from the rest of the regions. Similarly, returning to the male versus female variable, the researcher has to consider interpretatively what the variable is supposed to represent. Is it distinctly supposed to be a measure of the two biological categories, or is it supposed to represent the social and cultural distinction that underlies them?
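The reduction of a nominal scale to dichotomies can be sketched in a few lines. The helper names and the sample of cases below are hypothetical; only the region labels come from the text.

```python
REGIONS = ["Northeast", "Middle Atlantic", "South", "North Central",
           "Southwest", "Mountain", "Northwest", "Pacific"]

def dummy(values, category):
    """Code each case 1 if it falls in the category and 0 otherwise."""
    return [1 if v == category else 0 for v in values]

def dummy_set(values, categories):
    """Build one 0/1 variable per category of the nominal scale."""
    return {c: dummy(values, c) for c in categories}

# Hypothetical cases, each classified into exactly one region.
sample = ["South", "Pacific", "South", "Mountain"]

south = dummy(sample, "South")   # South versus "not South"
print(south)

# Because the categories are exhaustive and mutually exclusive,
# every case scores 1 on exactly one dummy variable.
columns = dummy_set(sample, REGIONS)
print([sum(col[i] for col in columns.values()) for i in range(len(sample))])
```

Each resulting variable can then enter an analysis as an ordinary numeric variable, which is the sense in which South versus "not South" stands in for an underlying "southernness."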
Many, if not most, of the variables that are of interest to social and behavioral science are drawn from the common language, and when these are used analytically, many problems or ambiguities become evident. For example, the use of counts is common in demography, and many of the measures that are familiar are accepted with ease. However, as common a concept as city size is not without problems. A city is a legal definition, and so what is a city in one case may be quite different from a city in another case. For example, some cities are only central locations surrounded by many satellite urban centers and suburbs that are also defined as cities, while other cities may be made up of a major central location and many other satellite urban centers and suburbs. To clarify such circumstances, the demographers may develop other concepts, like Standard Metropolitan Areas (SMAs), but this does not solve the problem completely; some SMAs may be isolated, and others may be contiguous. And when is a city a city? Is the definition one that begins with the smallest city with a population of 2,500 (not even a geographical characteristic), or 10,000 population, or 25,000 population? Is city size really a concept that is to be measured by population numbers or area, or by some concept of degree of urban centralization? Is New York City really one city or several? Or is New York City only part of one city that includes the urban complexes around it? The point that is critical is that definitions have to be fixed in arbitrary ways when concepts are drawn practically from the common language, and social concepts are not necessarily parsimoniously defined by some ideal rules of formulating scientific theory. Pragmatic considerations frequently intervene in how data are collected and what data become available for use. 
An additional point is appropriate here: in demography and in other substantive areas, important measures include counts of discrete entities, and these types of measures do not easily fit the Stevens classification of levels of measurement. Several technical proposals for more exhaustive classifications of types of measures are discussed by Duncan (1984).
There are obviously many ways that measures can be constructed. Some have been formalized and diffused, such as Louis Guttman's cumulative scale analysis, which became so popular that it has come to be known universally as Guttman scaling. It was a methodological contribution associated with a sociologist, and it had an appeal for many. An early comprehensive coverage of Guttman scaling can be found in Riley and colleagues (1954). The essence of Guttman scaling is that if a series of dichotomous items is assumed to be drawn from the same universe of content, and if they differ in difficulty, then they can be ordered so that they define scale types. For example, if one examines height, questions could be the following: (1) Are you at least five feet tall? (2) Are you at least five and a half feet tall? (3) Are you at least six feet tall? (4) Are you at least six and a half feet tall? Responses to these would logically fall into the following types, with a + indicating a yes and a - indicating a no to questions (1) through (4), in that order:
+ + + +
+ + + -
+ + - -
+ - - -
- - - -
The types represent "perfect" types, that is, responses made without a logical error. The assumption is that within types, people are equivalent. In the actual application of the procedure, some problems are evident, possibly the most obvious being that there are errors because in applying the procedure in studies, content is not as well specified as being in a "universe" as is the example of height; thus there are errors, and therefore error types. The error types were considered of two kinds: unambiguous, such as - + + +, which in the example above would simply be illogical, a mistake, and could logically be classed as + + + + with "minimum error." The second kind is ambiguous, such as + + - +, which with one (minimum) error could be placed in either type + + - - or type + + + +.
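The distinction between perfect types, unambiguous errors, and ambiguous errors can be sketched as follows. Response patterns are written as strings such as "++-+" without spaces, and the function names are illustrative rather than part of any standard scaling software.

```python
def perfect_types(n_items):
    """Perfect scale types for n items ordered from easiest to hardest."""
    return ["+" * k + "-" * (n_items - k) for k in range(n_items, -1, -1)]

def errors(pattern, scale_type):
    """Number of responses that differ between a pattern and a scale type."""
    return sum(a != b for a, b in zip(pattern, scale_type))

def minimum_error_types(pattern):
    """All perfect types reachable from the pattern with the fewest changes."""
    types = perfect_types(len(pattern))
    distance = {t: errors(pattern, t) for t in types}
    best = min(distance.values())
    return [t for t in types if distance[t] == best]

def classify(pattern):
    """Label a pattern as perfect, an unambiguous error, or an ambiguous error."""
    nearest = minimum_error_types(pattern)
    if nearest == [pattern]:
        return "perfect"
    return "unambiguous error" if len(nearest) == 1 else "ambiguous error"

for p in ("++--", "-+++", "++-+"):
    print(p, classify(p), minimum_error_types(p))
```

For the two error patterns discussed above, the sketch places "-+++" with "++++" alone, and finds that "++-+" is one change away from both "++++" and "++--".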
Experience with Guttman scaling revealed a number of problems. First, few scales that appeared "good" could be constructed with more than four or five items because the amount of error with more items would be large. Second, the error would tend to be concentrated in the ambiguous error type. Third, scales constructed on a particular study, especially with common sample sizes of about one hundred cases, would not be as "good" in other studies. There was "shrinkage," or more error, particularly for the more extreme items. The issue of what to do with the placement of ambiguous items was suggested by an alternative analysis (Borgatta and Hays 1952): that the type + + - + was not best placed by minimum error, but should be included with the type + + + - between the two minimum error locations. The reason for this may be grasped intuitively by noting that when two items are close to each other in proportion of positive responses, they are effectively interchangeable, and they are involved in the creation of the ambiguous error type. The common error is for respondents who are at the threshold of decision as to whether to answer positively or negatively, and they are most likely to make errors that create the ambiguous error types.
These observations about Guttman scaling lead to some obvious conclusions. First, the scaling model is actually contrary to common experience, as people are not classed in ordered types in general but presumably are infinitely differentiated even within a type. Second, the model is not productive of highly discriminating classes. Third, and this is possibly the pragmatic reason for doubting the utility of Guttman scaling, if the most appropriate place for locating nonscale or error types of the common ambiguous type is not by minimum error but between the two minimum error types, this is effectively the same as adding the number of positive responses, essentially reducing the procedure to a simple additive score. The remaining virtue of Guttman scaling in the logical placement of unambiguous errors must be balanced against other limitations, such as the requirement that items must be dichotomous, when much more information can be gotten with more detailed categories of response, usually with trivial additional cost in data collection time.
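The third conclusion can be checked directly: counting positive responses places the ambiguous pattern in the type that lies between its two minimum error types. Patterns are written as strings such as "++-+", and the function names are illustrative.

```python
def additive_score(pattern):
    """Simple additive score: the number of positive responses."""
    return pattern.count("+")

def type_for_score(score, n_items):
    """The perfect scale type that has the given number of positive responses."""
    return "+" * score + "-" * (n_items - score)

ambiguous = "++-+"
score = additive_score(ambiguous)
print(score, type_for_score(score, len(ambiguous)))

# For every perfect type, the additive score recovers the type itself.
for k in range(5):
    t = type_for_score(k, 4)
    assert type_for_score(additive_score(t), 4) == t
```

The pattern "++-+" has three positive responses and so lands in "+++-", between "++--" and "++++". The additive score differs from minimum error placement only for unambiguous errors: "-+++" also scores three, whereas minimum error assigns it to "++++".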
In contrast with Guttman scaling, simple addition of items into sum scores, carried out with an understanding of what is required for good measurement, is probably the most defensible and useful tool. For example, if something is to be measured, and there appear to be a number of relatively independent questions that can be used to ascertain the content, then those questions should be used to develop reliable measures. Reliability is measured in many ways, but consistency is the meaning usually intended, particularly internal consistency of the component items, that is, high intercorrelation among the items in a measure. Items can ask for dichotomous answers, but people can make more refined discriminations than simple yeses and nos, so use of multiple (ordered) categories of response increases the efficiency of items.
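Internal consistency can be illustrated with two summary figures computed from item responses. The mean inter-item correlation follows the description above directly. Cronbach's alpha is a widely used coefficient that the text does not name, and it is included here only as a familiar point of comparison. The response data are hypothetical.

```python
from itertools import combinations
from statistics import mean, variance

def pearson(x, y):
    """Product moment correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def mean_inter_item_r(items):
    """Average correlation over all pairs of items."""
    return mean([pearson(a, b) for a, b in combinations(items, 2)])

def cronbach_alpha(items):
    """Cronbach's alpha from item variances and the variance of the sum score."""
    k = len(items)
    totals = [sum(case) for case in zip(*items)]   # sum score per respondent
    item_variance = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))

# Hypothetical data: three items (rows), five respondents (columns),
# each answered on five ordered categories.
items = [[1, 2, 3, 4, 5],
         [2, 2, 3, 5, 5],
         [1, 3, 3, 4, 4]]

print(round(mean_inter_item_r(items), 2), round(cronbach_alpha(items), 2))
```

Both figures rise as the items become more highly intercorrelated, which is the sense of consistency intended here.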
The question of whether the language as used has sufficient consistency to make refined quantitative discriminations does not appear to have been studied extensively, so a small data collection was carried out to provide the following example. People were asked to evaluate a set of categories with the question "How often does this happen?" The instructions stated: "Put a vertical intersection where you think each category fits on the continuum, and then place the number under it. Categories 1 and 11 are fixed at the extremes for this example. If two categories have the same place, put the numbers one on top of the other. If the categories are out of order, put them in the order you think correct." A continuum was then provided with sixty-six spaces and the external positions of the first and the last indicated as the positions of (1) always and (11) never. The respondents were asked to locate the following remaining nine categories: (2) almost always; (3) very often; (4) often; (5) somewhat often; (6) sometimes; (7) seldom; (8) very seldom; (9) hardly ever; and (10) almost never. It is not surprising that average responses on the continuum are well distributed, with percent locations respectively as 9, 15, 27, 36, 48, 65, 75, 86, and 93; the largest standard deviation for placement location is about 11 percent. Exercises with alternative quantitatively oriented questions and use of a series of six categories from "definitely agree" to "definitely disagree" provide similar evidence of consistency of meaning. In research, fewer than the eleven categories illustrated here are usually used, making the task of discrimination easier and faster for respondents. The point of emphasis is that questions can be designed to efficiently provide more information than simple dichotomous answers and thus facilitate construction of reliable scores.
MEASUREMENT IN THE REAL WORLD
Many variations exist on how to collect information in order to build effective measurement instruments. Similarly, there are alternatives on how to build the measuring instruments. Often practical considerations must be taken into account, such as the amount of time available for interviews, restraints placed on what kinds of content can be requested, lack of privacy when collecting the information, and other circumstances.
With the progressive technology of computers and word processors, the reduced dependence of researchers on assistants, clerks, and secretaries has greatly facilitated research data handling and analysis. Some changes, like Computer Assisted Telephone Interviewing (CATI), may be seen as assisting data collection, but in general the data collection aspects of research are still those that require the most careful attention and supervision. The design of research, however, is still often an ad hoc procedure with regard to the definition of variables. Variables are often created under the primitive assumption that all one needs to do is say, "The way I am going to measure XXX is by responses to the following question." This is a procedure of dubious worth, since knowledge about the measurement characteristics of the variables to be used should be built in advance of the research and is essential to the interpretation of findings.
A comment that is commonly encountered is that attempting to be "scientific" and developing a strict design for research with well-developed measures forecloses the possibility of getting a broad picture of what is being observed. The argument is then advanced that attempting to observe and accumulate data in systematic research is not as revealing as observing more informally (qualitatively) and "getting a feel" for what is going on. Further, when systematic research is carried out, so goes the argument, only limited variables can be assessed instead of "getting the complete picture." This, of course, is the ultimate self-delusion and can be answered directly. If positive findings for the theory do not result from more rigorous, well-designed research, then the speculative generalizations of more casual observation are never going to be any more than that, and giving them worth may be equivalent to creating fictions to substitute for reality. The fact that attempted systematic empirical research has not produced useful findings does not mean that more intuitive or qualitative approaches are more appropriate. What it means is that the theory may not be appropriate, or the design of the research may be less than adequate.
Further, this does not mean that there is anything wrong with informal or qualitative research. What it does mean is that there is a priority order in the accumulation of knowledge that says that the informal and qualitative stages may be appropriate to produce theory, which is defined as speculation, about what is being observed, and this may then be tested in more rigorous research. This is the common order of things in the accumulation of scientific knowledge.
If a sociological theory has developed, then it must be stated with a clear specification of the variables involved. One cannot produce a volume on the concept of anomie, for example, and then use the word "anomie" to mean twenty different things. The concept on which one focuses must be stated with a clear specification of one meaning, and there are two elements that go into such a definition. The first is to indicate how the concept is to be measured. The second is more commonly neglected, and that is to specify how the concept is differentiated from other concepts, particularly those that are closely related to it in meaning.
The development of well-measured variables in sociology and the social sciences is essential to the advancement of knowledge. Knowledge about how good measurement can be carried out has advanced, particularly in the post–World War II period, but it has not diffused and become sufficiently commonplace in the social science disciplines.
It is difficult to comprehend how substituting no measurement or poor measurement for the best measurement that sociologists can devise can produce better or more accurate knowledge. Examples of the untenable position have possibly decreased over time, but they still occur. Note for example: "Focus on quantitative methods rewards reliable (i.e., repeatable) methods. Reliability is a valuable asset, but it is only one facet of the value of the study. In most studies, reliability is purchased at the price of lessened attention to theory, validity, relevance, etc." (Scheff 1991). Quite the contrary, concern with measurement and quantification is concern with theory, validity, and relevance!
Finally, it is worth emphasizing two rules of thumb for sociologists concerned with research, whether they are at the point of designing research or interpreting the findings of research that has been reported. First, check on how the variables are specified and ask whether they are measured well. This requires that specific questions be answered: Are the variables reliable? How does one know they are reliable? Second, are the variables valid? That is, do they measure what they are supposed to measure? How does one know they do? If these questions are not answered satisfactorily, then one is dealing with research and knowledge of dubious value.
Agresti, Alan, and Barbara Finlay 1997 Statistical Methods for the Social Sciences. Englewood Cliffs, N.J.: Prentice Hall.
Babbie, Earl 1995 The Practice of Social Research. Belmont, Calif.: Wadsworth.
Borgatta, Edgar F. 1968 "My Student, the Purist: A Lament." Sociological Quarterly 9:29–34.
——, and David G. Hays 1952 "Some Limitations on the Arbitrary Classifications of Non-Scale Response Patterns in a Guttman Scale." Public Opinion Quarterly 16:273–291.
DeVellis, Robert F. 1991 Scale Development. Newbury Park, Calif.: Sage.
Duncan, Otis Dudley 1984 Notes on Social Measurement. New York: Russell Sage Foundation.
Herzog, Thomas 1997 Research Methods and Data Analysis in the Social Sciences. Englewood Cliffs, N.J.: Prentice Hall.
Kendall, Maurice G. 1948 Rank Correlation Methods. London: Griffin.
Knoke, David, and George W. Bohrnstedt 1994 Statistics for Social Data Analysis. Itasca, Ill.: Peacock.
Lewis-Beck, Michael, ed. 1994 Basic Measurement. Beverly Hills, Calif.: Sage.
Neuman, W. Lawrence 1997 Social Research Methods: Qualitative and Quantitative Approaches. Boston: Allyn and Bacon.
Nunnally, Jum C. 1978 Psychometric Theory. New York: McGraw-Hill.
Riley, Matilda White, John W. Riley, Jr., and Jackson Toby 1954 Sociological Studies in Scale Analysis. New Brunswick, N.J.: Rutgers University Press.
Scheff, Thomas J. 1991 "Is There a Bias in ASR Article Selection." Footnotes 19(2, February):5.
Stevens, S. S., ed. 1966 Handbook of Experimental Psychology. New York: John Wiley.
Traub, Ross E. 1994 Reliability for the Social Sciences. Beverly Hills, Calif.: Sage.
Edgar F. Borgatta
Measurement seems like a simple subject, on the surface at least; indeed, all measurements can be reduced to just two components: number and unit. Yet one might easily ask, "What numbers, and what units?"—a question that helps bring into focus the complexities involved in designating measurements. As it turns out, some forms of numbers are more useful for rendering values than others; hence the importance of significant figures and scientific notation in measurements. The same goes for units. First, one has to determine what is being measured: mass, length, or some other property (such as volume) that is ultimately derived from mass and length. Indeed, the process of learning how to measure reveals not only a fundamental component of chemistry, but an underlying—if arbitrary and manmade—order in the quantifiable world.
HOW IT WORKS
In modern life, people take for granted the existence of the base-10, or decimal, numeration system, a name derived from the Latin word decem, meaning "ten." Yet there is nothing obvious about this system, which has its roots in the ten fingers used for basic counting. At other times in history, societies have adopted the two hands or arms of a person as their numerical frame of reference, and from this developed a base-2 system. There have also been base-5 systems relating to the fingers on one hand, and base-20 systems that took as their reference point the combined number of fingers and toes.
Obviously, there is an arbitrary quality underlying the modern numerical system, yet it works extremely well. The use of decimal fractions (for example, 0.01 or 0.235) is particularly helpful for rendering figures other than whole numbers. Yet decimal fractions are a relatively recent innovation in Western mathematics, dating only to the sixteenth century. In order to be workable, decimal fractions rely on an even more fundamental concept that was not always part of Western mathematics: place-value.
Place-Value and Notation Systems
Place-value is the location of a number relative to others in a sequence, a location that makes it possible to determine the number's value. For instance, in the number 347, the 3 is in the hundreds place, which immediately establishes a value for the number in units of 100. Similarly, a person can tell at a glance that there are 4 units of 10, and 7 units of 1.
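The idea can be made concrete with a short sketch that splits a whole number into its digits and the unit each digit counts. The function name is illustrative, and the sketch assumes a non-negative whole number.

```python
def place_values(number: int):
    """Pair each digit of a non-negative whole number with its place unit."""
    digits = str(number)
    n = len(digits)
    return [(int(d), 10 ** (n - i - 1)) for i, d in enumerate(digits)]

parts = place_values(347)
print(parts)

# Multiplying each digit by its unit and adding restores the number.
print(sum(digit * unit for digit, unit in parts))
```

For 347 the pairs are 3 units of 100, 4 units of 10, and 7 units of 1, and their sum is the original number.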
Of course, today this information appears to be self-evident—so much so that an explanation of it seems tedious and perfunctory—to almost anyone who has completed elementary-school arithmetic. In fact, however, as with almost everything about numbers and units, there is nothing obvious at all about place-value; otherwise, it would not have taken Western mathematicians thousands of years to adopt a place-value numerical system. And though they did eventually make use of such a system, Westerners did not develop it themselves, as we shall see.
Numeration systems of various kinds have existed since at least 3000 b.c., but the most important number system in the history of Western civilization prior to the late Middle Ages was the one used by the Romans. Rome ruled much of the known world in the period from about 200 b.c. to about a.d. 200, and continued to have an influence on Europe long after the fall of the Western Roman Empire in a.d. 476—an influence felt even today. Though the Roman Empire is long gone and Latin a dead language, the impact of Rome continues: thus, for instance, Latin terms are used to designate species in biology. It is therefore easy to understand how Europeans continued to use the Roman numeral system up until the thirteenth century a.d.—despite the fact that Roman numerals were enormously cumbersome.
The Roman notation system has no means of representing place-value: thus a relatively large number such as 3,000 is shown as MMM, whereas a much smaller number might use many more "places": 438, for instance, is rendered as CDXXXVIII. Performing any sort of calculations with these numbers is a nightmare. Imagine, for instance, trying to multiply these two. With the number system in use today, it is not difficult to multiply 3,000 by 438 in one's head. The problem can be reduced to a few simple steps: multiply 3 by 400, 3 by 30, and 3 by 8; add these products together; then multiply the total by 1,000—a step that requires the placement of three zeroes at the end of the number obtained in the earlier steps.
But try doing this with Roman numerals: it is essentially impossible to perform this calculation without resorting to the much more practical place-value system to which we're accustomed. No wonder, then, that Roman numerals have been relegated to the sidelines, used in modern life for very specific purposes: in outlines, for instance; in ordinal titles (for example, Henry VIII); or in designating the year of a motion picture's release.
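The contrast between the two notations can be sketched as follows. The converter handles ordinary Roman numerals such as those in the example; it is a minimal illustration and does not validate malformed input.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral to a whole number.

    A symbol that stands before a larger one is subtracted (the C in CD);
    every other symbol is added.
    """
    total = 0
    for i, symbol in enumerate(numeral):
        value = ROMAN[symbol]
        if i + 1 < len(numeral) and value < ROMAN[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total

a = roman_to_int("MMM")         # 3,000
b = roman_to_int("CDXXXVIII")   # 438

# The mental steps described in the text: 3 x 400, 3 x 30, 3 x 8,
# add the products, then append three zeroes.
partial = 3 * 400 + 3 * 30 + 3 * 8
result = partial * 1000

print(a, b, result, result == a * b)
```

The multiplication itself is carried out entirely on place-value numbers; the Roman numerals have to be converted before any of the steps can be applied.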
The system of counting used throughout much of the world—1, 2, 3, and so on—is the Hindu-Arabic notation system. Sometimes mistakenly referred to as "Arabic numerals," these are most accurately designated as Hindu or Indian numerals. They came from India, but because Europeans discovered them in the Near East during the Crusades (1095-1291), they assumed the Arabs had invented the notation system, and hence began referring to them as Arabic numerals.
Developed in India during the first millennium b.c., Hindu notation represented a vast improvement over any method in use up to or indeed since that time. Of particular importance was a number invented by Indian mathematicians: zero. Until then, no one had considered zero worth representing since it was, after all, nothing. But clearly the zeroes in a number such as 2,000,002 stand for something. They perform a place-holding function: otherwise, it would be impossible to differentiate between 2,000,002 and 22.
Uses of Numbers in Science
Chemists and other scientists often deal in very large or very small numbers, and if they had to write out these numbers every time they discussed them, their work would soon be encumbered by lengthy numerical expressions. For this purpose, they use scientific notation, a method for writing extremely large or small numbers by representing them as a number between 1 and 10 multiplied by a power of 10.
Instead of writing 75,120,000, for instance, the preferred scientific notation is 7.512 · 10⁷. To interpret the value of large multiples of 10, it is helpful to remember that the value of 10 raised to any power n is the same as 1 followed by that number of zeroes. Hence 10²⁵, for instance, is simply 1 followed by 25 zeroes.
Scientific notation is just as useful, to chemists in particular, for rendering very small numbers. Suppose a sample of a chemical compound weighed 0.0007713 grams. The preferred scientific notation, then, is 7.713 · 10⁻⁴. Note that for numbers less than 1, the power of 10 is a negative number: 10⁻¹ is 0.1, 10⁻² is 0.01, and so on.
Again, there is an easy rule of thumb for quickly assessing the number of decimal places where scientific notation is used for numbers less than 1. Where 10 is raised to any power −n, the decimal point is followed by n places. If 10 is raised to the power of −8, for instance, we know at a glance that the decimal is followed by 7 zeroes and a 1.
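The conversion to scientific notation can be sketched with Python's decimal module, which keeps the digits exact. The function name is illustrative, and values are passed as text so that no binary rounding intervenes.

```python
from decimal import Decimal

def to_scientific(text: str):
    """Split a decimal number into a coefficient between 1 and 10
    and a power of 10."""
    value = Decimal(text)
    exponent = value.adjusted()            # position of the leading digit
    coefficient = value.scaleb(-exponent)  # shift the decimal point
    return coefficient.normalize(), exponent

for text in ("75120000", "0.0007713"):
    coefficient, exponent = to_scientific(text)
    print(f"{text} = {coefficient} x 10^{exponent}")

# 10 raised to the power -8, written out in full.
print(format(Decimal(10) ** -8, "f"))
```

The two examples from the text come out as 7.512 with exponent 7 and 7.713 with exponent -4, and 10 to the power -8 prints as a decimal point followed by seven zeroes and a 1.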
In making measurements, there will always be a degree of uncertainty. Of course, when the standards of calibration (discussed below) are very high, and the measuring instrument has been properly calibrated, the degree of uncertainty will be very small. Yet there is bound to be uncertainty to some degree, and for this reason, scientists use significant figures—numbers included in a measurement, using all certain numbers along with the first uncertain number.
Suppose the mass of a chemical sample is measured on a scale known to be accurate to 10⁻⁵ kg. This is equal to 1/100,000 of a kilo, or 1/100 of a gram; or, to put it in terms of place-value, the scale is accurate to the fifth place in a decimal fraction. Suppose, then, that an item is placed on the scale, and a reading of 2.13283697 kg is obtained. All the numbers prior to the 6 are certain, because the scale is accurate to that place. The 6 is the first uncertain number, so it is the last significant figure; the 9 and the 7 that follow are not significant figures because the scale is not known to be accurate beyond 10⁻⁵ kg.
Thus the measure above should be rendered with 7 significant figures: the whole number 2, and the first 6 decimal places. But if the value is given as 2.132836, this might lead to inaccuracies at some point when the measurement is factored into other equations. The 6, in fact, should be "rounded off" to a 7. Simple rules apply to the rounding off of significant figures: if the digit following the first uncertain number is less than 5, there is no need to round off. Thus, if the measurement had been 2.13283627 kg (note that the 9 was changed to a 2), there is no need to round off, and in this case, the figure of 2.132836 is correct. But since the number following the 6 is in fact a 9, the correct significant figure is 7; thus the total would be 2.132837.
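The rounding rule can be sketched with Python's decimal module. The function name is illustrative, and the sketch assumes the usual convention that a following digit of 5 or more rounds the last kept digit up.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_significant(text: str, figures: int) -> Decimal:
    """Round a decimal number, given as text, to a number of significant figures."""
    value = Decimal(text)
    # Decimal place of the last significant figure, as a power of 10.
    last_place = value.adjusted() - figures + 1
    return value.quantize(Decimal(1).scaleb(last_place), rounding=ROUND_HALF_UP)

print(round_significant("2.13283697", 7))   # the 6 is followed by a 9
print(round_significant("2.13283627", 7))   # the 6 is followed by a 2
```

The first reading rounds to 2.132837 and the second stays at 2.132836, matching the two cases worked through above.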
Fundamental Standards of Measure
So much for numbers; now to the subject of units. But before addressing systems of measurement, what are the properties being measured? All forms of scientific measurement, in fact, can be reduced to expressions of four fundamental properties: length, mass, time, and electric current. Everything can be expressed in terms of these properties: even the speed of an electron spinning around the nucleus of an atom can be shown as "length" (though in this case, the measurement of space is in the form of a circle or even more complex shapes) divided by time.
Of particular interest to the chemist are length and mass: length is a component of volume, and both length and mass are elements of density. For this reason, a separate essay in this book is devoted to the subject of Mass, Density, and Volume. Note that "length," as used in this most basic sense, can refer to distance along any plane, or in any of the three dimensions—commonly known as length, width, and height—of the observable world. (Time is the fourth dimension.) In addition, as noted above, "length" measurements can be circular, in which case the formula for measuring space requires use of the coefficient π, roughly equal to 3.14.
Standardized Units of Measure: Who Needs Them?
People use units of measure so frequently in daily life that they hardly think about what they are doing. A motorist goes to the gas station and pumps 13 gallons (a measure of volume) into an automobile. To pay for the gas, the motorist uses dollars—another unit of measure, economic rather than scientific—in the form of paper money, a debit card, or a credit card.
This is simple enough. But what if the motorist did not know how much gas was in a gallon, or if the motorist had some idea of a gallon that differed from what the gas station management determined it to be? And what if the value of a dollar were not established, such that the motorist and the gas station attendant had to haggle over the cost of the gasoline just purchased? The result would be a horribly confused situation: the motorist might run out of gas, or money, or both, and if such confusion were multiplied by millions of motorists and millions of gas stations, society would be on the verge of breakdown.
THE VALUE OF STANDARDIZATION TO A SOCIETY.
Actually, there have been times when the value of currency was highly unstable, and the result was near anarchy. In Germany during the early 1920s, for instance, rampant inflation had so badly depleted the value of the mark, Germany's currency, that employees demanded to be paid every day so that they could cash their paychecks before the value went down even further. People made jokes about the situation: it was said, for instance, that when a woman went into a store and left a basket containing several million marks out front, thieves ran by and stole the basket—but left the money. Yet there was nothing funny about this situation, and it paved the way for the nightmarish dictatorship of Adolf Hitler and the Nazi Party.
It is understandable, then, that standardization of weights and measures has always been an important function of government. When Ch'in Shih-huang-ti (259-210 B.C.) united China for the first time, becoming its first emperor, he set about standardizing units of measure as a means of providing greater unity to the country, thus making it easier to rule. On the other hand, the Russian Empire of the late nineteenth century failed to adopt standardized systems that would have tied it more closely to the industrialized nations of Western Europe. The width of railroad tracks in Russia was different from that in Western Europe, and Russia used the old Julian calendar, as opposed to the Gregorian calendar adopted throughout much of Western Europe after 1582. These and other factors made economic exchanges between Russia and Western Europe extremely difficult, and the Russian Empire remained cut off from the rapid progress of the West. Like Germany a few decades later, it became ripe for the establishment of a dictatorship, in this case under the Communists led by V. I. Lenin.
Aware of the important role that standardization of weights and measures plays in the governing of a society, the U.S. Congress in 1901 established the Bureau of Standards. Today it is known as the National Institute of Standards and Technology (NIST), a nonregulatory agency within the Commerce Department. As will be discussed at the conclusion of this essay, the NIST maintains a wide variety of standard definitions regarding mass, length, temperature and so forth, against which other devices can be calibrated.
THE VALUE OF STANDARDIZATION TO SCIENCE.
What if a nurse, rather than carefully measuring a quantity of medicine before administering it to a patient, simply gave the patient an amount that "looked right"? Or what if a pilot, instead of calculating fuel, distance, and other factors carefully before taking off from the runway, merely used a "best estimate"? Obviously, in either case, disastrous results would be likely to follow. Though neither nurses nor pilots are considered scientists, both use science in their professions, and those disastrous results serve to highlight the crucial matter of using standardized measurements in science.
Standardized measurements are necessary to a chemist or any scientist because, in order for an experiment to be useful, it must be possible to duplicate the experiment. If the chemist does not know exactly how much of a certain element he or she mixed with another to form a given compound, the results of the experiment are useless. In order to share information and communicate the results of experiments, then, scientists need a standardized "vocabulary" of measures.
This "vocabulary" is the International System of Units, known as SI for its French name, Système International d'Unités. By international agreement, the worldwide scientific community adopted what came to be known as SI at the 9th General Conference on Weights and Measures in 1948. The system was refined at the 11th General Conference in 1960, and given its present name; but in fact most components of SI belong to a much older system of weights and measures developed in France during the late eighteenth century.
SI vs. the English System
The United States, as almost everyone knows, is the wealthiest and most powerful nation on Earth. On the other hand, Brunei, a tiny nation-state on the island of Borneo, enjoys considerable oil wealth but is hardly what anyone would describe as a superpower. Yemen, though it is located on the Arabian peninsula, does not even possess significant oil wealth, and is a poor, economically developing nation. Finally, Burma in Southeast Asia can hardly be described even as a "developing" nation: ruled by an extremely repressive military regime, it is one of the poorest nations in the world.
So what do these four have in common? They are the only nations on the planet that have failed to adopt the metric system of weights and measures. The system used in the United States is called the English system, though it should more properly be called the American system, since England itself has joined the rest of the world in "going metric." Meanwhile, Americans continue to think in terms of gallons, miles, and pounds; yet American scientists use the much more convenient metric units that are part of SI.
HOW THE ENGLISH SYSTEM WORKS (OR DOES NOT WORK).
Like methods of counting described above, most systems of measurement in premodern times were modeled on parts of the human body. The foot is an obvious example of this, while the inch originated from the measure of a king's first thumb joint. At one point, the yard was defined as the distance from the nose of England's King Henry I to the tip of his outstretched middle finger.
Obviously, these are capricious, downright absurd standards on which to base a system of measure. They involve things that change, depending for instance on whose foot is being used as a standard. Yet the English system developed in this willy-nilly fashion over the centuries; today, there are literally hundreds of units—including three types of miles, four kinds of ounces, and five kinds of tons, each with a different value.
What makes the English system particularly cumbersome, however, is its lack of convenient conversion factors. For length, there are 12 inches in a foot, but 3 feet in a yard, and 1,760 yards in a mile. Where weight is concerned, there are 16 ounces in a pound (assuming one is talking about an avoirdupois ounce), but 2,000 pounds in a ton. And, to further complicate matters, there are all sorts of other units of measure developed to address a particular property: horsepower, for instance, or the British thermal unit (Btu).
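The irregularity of these factors shows up as soon as a conversion is written out. The following Python sketch is only an illustration: it uses the length factors quoted above (12, 3, and 1,760), and the function name `miles_to_inches` is a hypothetical helper rather than part of any standard library.

```python
# Conversion factors for English units of length, as quoted in the text.
INCHES_PER_FOOT = 12
FEET_PER_YARD = 3
YARDS_PER_MILE = 1760


def miles_to_inches(miles):
    """Convert miles to inches by chaining three unrelated factors."""
    yards = miles * YARDS_PER_MILE
    feet = yards * FEET_PER_YARD
    return feet * INCHES_PER_FOOT


# One mile is 1,760 x 3 x 12 inches.
print(miles_to_inches(1))  # 63360
```

Each step needs its own memorized factor, which is the inconvenience the paragraph describes.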
THE CONVENIENCE OF THE METRIC SYSTEM.
Great Britain, though it has long since adopted the metric system, in 1824 established the British Imperial System, aspects of which are reflected in the system still used in America. This is ironic, given the desire of early Americans to distance themselves psychologically from the empire to which their nation had once belonged. In any case, England's great worldwide influence during the nineteenth century brought about widespread adoption of the English or British system in colonies such as Australia and Canada. This acceptance had everything to do with British power and tradition, and nothing to do with convenience. A much more usable standard had actually been embraced 25 years before in a land that was then among England's greatest enemies: France.
During the period leading up to and following the French Revolution of 1789, French intellectuals believed that every aspect of existence could and should be treated in highly rational, scientific terms. Out of these ideas arose much folly, particularly during the Reign of Terror in 1793, but one of the more positive outcomes was the metric system. This system is decimal—that is, based entirely on the number 10 and powers of 10, making it easy to relate one figure to another. For instance, there are 100 centimeters in a meter and 1,000 meters in a kilometer.
PREFIXES FOR SIZES IN THE METRIC SYSTEM.
For designating larger or smaller values of a given measure, the metric system uses principles much simpler than those of the English system, with its irregular divisions of (for instance) gallons, quarts, pints, and cups. In the metric system, one need only use a simple Greek or Latin prefix to designate that the value is multiplied by a given power of 10. In general, the prefixes for values greater than 1 are Greek, while Latin is used for those less than 1. These prefixes, along with their abbreviations and respective values, are as follows. (The symbol μ for "micro" is the Greek letter mu.)
The Most Commonly Used Prefixes in the Metric System
- giga (G) = 10^9 (1,000,000,000)
- mega (M) = 10^6 (1,000,000)
- kilo (k) = 10^3 (1,000)
- deci (d) = 10^-1 (0.1)
- centi (c) = 10^-2 (0.01)
- milli (m) = 10^-3 (0.001)
- micro (μ) = 10^-6 (0.000001)
- nano (n) = 10^-9 (0.000000001)
The use of these prefixes can be illustrated by reference to the basic metric unit of length, the meter. For long distances, a kilometer (1,000 m) is used; on the other hand, very short distances may require a centimeter (0.01 m) or a millimeter (0.001 m) and so on, down to a nanometer (0.000000001 m). Measurements of length also provide a good example of why SI includes units that are not part of the metric system, though they are convertible to metric units. Hard as it may be to believe, scientists often measure lengths even smaller than a nanometer—the width of an atom, for instance, or the wavelength of a light ray. For this purpose, they use the angstrom (Å or A), equal to 0.1 nanometers.
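Because every prefix is a power of 10, converting between prefixed units amounts to adding or subtracting exponents. The Python sketch below illustrates this with the prefixes listed above; the names `PREFIX_EXPONENT`, `to_base_units`, and `convert` are hypothetical and chosen only for this example.

```python
# Powers of ten for the prefixes listed above.
PREFIX_EXPONENT = {
    "giga": 9,
    "mega": 6,
    "kilo": 3,
    "deci": -1,
    "centi": -2,
    "milli": -3,
    "micro": -6,
    "nano": -9,
}


def to_base_units(value, prefix):
    """Express a prefixed quantity in the unprefixed base unit."""
    return value * 10 ** PREFIX_EXPONENT[prefix]


def convert(value, from_prefix, to_prefix):
    """Convert between two prefixed forms of the same unit."""
    shift = PREFIX_EXPONENT[from_prefix] - PREFIX_EXPONENT[to_prefix]
    return value * 10 ** shift


print(to_base_units(5, "kilo"))     # 5000 (5 km expressed in meters)
print(convert(2, "kilo", "centi"))  # 200000 (2 km expressed in centimeters)
```

No table of special factors is needed: the only arithmetic is a shift of the decimal point.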
Calibration and SI Units
THE SEVEN BASIC SI UNITS.
The SI uses seven basic units, representing length, mass, time, temperature, amount of substance, electric current, and luminous intensity. The first four parameters are a part of everyday life, whereas the last three are of importance only to scientists. "Amount of substance" is the number of elementary particles in matter. This is measured by the mole, a unit discussed in the essay on Mass, Density, and Volume. Luminous intensity, or the brightness of a light source, is measured in candelas, while the SI unit of electric current is the ampere.
The other four basic units are the meter for length, the kilogram for mass, the second for time, and the kelvin for temperature (a kelvin is the same size as a degree Celsius, but the Kelvin scale begins at absolute zero). The last of these is discussed in the essay on Temperature; as for meters, kilograms, and seconds, they will be examined below in terms of the means used to define each.
Calibration is the process of checking and correcting the performance of a measuring instrument or device against the accepted standard. America's preeminent standard for the exact time of day, for instance, is the United States Naval Observatory in Washington, D.C. Thanks to the Internet, people all over the country can easily check the exact time, and calibrate their clocks accordingly—though, of course, the resulting accuracy is subject to factors such as the speed of the Internet connection.
There are independent scientific laboratories responsible for the calibration of certain instruments ranging from clocks to torque wrenches, and from thermometers to laser-beam power analyzers. In the United States, instruments or devices with high-precision applications—that is, those used in scientific studies, or by high-tech industries—are calibrated according to standards established by the NIST.
The NIST keeps on hand definitions, as opposed to using a meter stick or other physical model. This is in accordance with the methods of calibration accepted today by scientists: rather than use a standard that might vary—for instance, the meter stick could be bent imperceptibly—unvarying standards, based on specific behaviors in nature, are used.
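The idea of checking and correcting an instrument against a standard can be made concrete with a small calculation. The Python sketch below assumes a simple two-point linear correction, which is only one of many calibration procedures; the thermometer readings are hypothetical, and the function name `two_point_calibration` is invented for this illustration.

```python
def two_point_calibration(raw_low, raw_high, ref_low, ref_high):
    """Return a function that maps raw instrument readings to corrected values.

    Assumes the instrument's error varies linearly between the two
    reference points at which it was compared with the standard.
    """
    def correct(raw):
        span = (ref_high - ref_low) / 1.0
        return ref_low + (raw - raw_low) * span / (raw_high - raw_low)

    return correct


# Hypothetical thermometer: it reads 1.0 in an ice bath (standard value 0.0)
# and 99.0 in boiling water at standard pressure (standard value 100.0).
correct = two_point_calibration(1.0, 99.0, 0.0, 100.0)
print(correct(50.0))  # 50.0
```

Once the two reference comparisons are made, every later reading from the instrument can be passed through `correct` before it is recorded.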
METERS AND KILOGRAMS.
A meter, equal to 3.281 feet, was at one time defined in terms of Earth's size. Using an imaginary line drawn from the Equator to the North Pole through Paris, this distance was divided into 10 million meters. Later, however, scientists came to the realization that Earth is subject to geological changes, and hence any measurement calibrated to the planet's size could not ultimately be reliable. Today the length of a meter is calibrated according to the amount of time it takes light to travel through that distance in a vacuum (an area of space devoid of air or other matter). The official definition of a meter, then, is the distance traveled by light in the interval of 1/299,792,458 of a second.
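The definition can be checked with exact arithmetic. In the Python sketch below, the constant is the speed of light in meters per second, which is the reciprocal of the interval quoted above; the function name `light_travel_distance` is hypothetical.

```python
from fractions import Fraction

# Speed of light in a vacuum, in meters per second (exact by definition).
SPEED_OF_LIGHT = 299_792_458


def light_travel_distance(seconds):
    """Distance in meters that light covers in a vacuum in the given time."""
    return SPEED_OF_LIGHT * seconds


# The defining interval: 1/299,792,458 of a second.
interval = Fraction(1, 299_792_458)
print(light_travel_distance(interval))  # 1
```

Using `Fraction` rather than floating-point numbers keeps the result exactly 1 instead of a rounded approximation.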
One kilogram is, on Earth at least, equal to about 2.20 pounds; but whereas the kilogram is a unit of mass, the pound is a unit of weight, so the correspondence between the units varies depending on the gravitational field in which a pound is measured. Yet the kilogram, though it represents a much more fundamental property of the physical world than a pound, is still a somewhat arbitrary form of measure in comparison to the meter as it is defined today.
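The distinction between mass and weight can be illustrated numerically. The Python sketch below assumes rounded values for the gravitational acceleration at the surface of Earth and of the Moon, and expresses weight in newtons; the function name `weight_newtons` is hypothetical.

```python
# Approximate surface gravitational accelerations in m/s^2 (rounded values).
G_EARTH = 9.81
G_MOON = 1.62


def weight_newtons(mass_kg, g):
    """Weight is the force of gravity on a mass: W = m * g."""
    return mass_kg * g


mass = 10.0  # kilograms; the mass is the same in both places
print(weight_newtons(mass, G_EARTH))  # about 98.1 N on Earth
print(weight_newtons(mass, G_MOON))   # about 16.2 N on the Moon
```

The mass passed to the function never changes; only the gravitational field does, and the weight changes with it.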
Given the desire for an unvarying standard against which to calibrate measurements, it would be helpful to find some usable but unchanging standard of mass; unfortunately, scientists have yet to locate such a standard. Therefore, the value of a kilogram is calibrated much as it was two centuries ago. The standard is a cylinder of platinum-iridium alloy, known as the International Prototype Kilogram, housed near Sèvres in France.
A second, of course, is a unit of time as familiar to non-scientifically trained Americans as it is to scientists and people schooled in the metric system. In fact, the second did not originate with the metric system, although SI adopts it as one of its base units. The means of measuring time on Earth are not "metric": Earth revolves around the Sun approximately every 365.25 days, and there is no way to turn this into a multiple of 10 without creating a situation even more cumbersome than the English units of measure.
The week and the month are units based on cycles of the Moon, though they are no longer related to lunar cycles because a lunar year would soon become out of phase with a year based on Earth's revolution around the Sun. The continuing use of weeks and months as units of time is based on tradition, as well as on the essential need of a society to divide up a year in some way.
A day, of course, is based on Earth's rotation, but the units into which the day is divided—hours, minutes, and seconds—are purely arbitrary, and likewise based on traditions of long standing. Yet scientists must have some unit of time to use as a standard, and, for this purpose, the second was chosen as the most practical. The SI definition of a second, however, is not simply one-sixtieth of a minute or anything else so strongly influenced by the variation of Earth's movement.
Instead, the scientific community chose as its standard the atomic vibration of a particular isotope of the metal cesium, cesium-133. The vibration of this atom is presumed to be unvarying, because the properties of elements, unlike the size of Earth or its movement, do not change. Today, a second is defined as the amount of time it takes for a cesium-133 atom to vibrate 9,192,631,770 times. Expressed in scientific notation, with significant figures, this is 9.19263177 × 10^9.
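The conversion of this figure to scientific notation can be reproduced mechanically. The Python sketch below relies on the language's built-in "e" format, which writes a number as a coefficient times a power of 10; the helper name `to_scientific` is hypothetical.

```python
def to_scientific(value, decimals):
    """Format a number in scientific notation with the given decimal places."""
    return f"{value:.{decimals}e}"


cycles = 9_192_631_770

# Eight digits after the decimal point keep all the nonzero digits.
print(to_scientific(cycles, 8))  # 9.19263177e+09
```

Here `e+09` is the computer's way of writing "times 10 to the ninth power."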
Calibration: The process of checking and correcting the performance of a measuring instrument or device against a commonly accepted standard.
Scientific notation: A method used by scientists for writing extremely large or small numbers by representing them as a number between 1 and 10 multiplied by a power of 10. Instead of writing 0.0007713, the preferred scientific notation is 7.713 × 10^-4.
SI: An abbreviation of the French term Système International d'Unités, or International System of Units. Based on the metric system, SI is the system of measurement units in use by scientists worldwide.
Significant figures: Numbers included in a measurement, using all certain numbers along with the first uncertain number.
The process or technique of correlating numbers with things that are not patently numbered in the order of nature; also, the relation that arises from such a process. Measurement is usually effected by comparing observable phenomena with a suitable metric, although sometimes it is the result of a mathematical calculation based on data that are not directly accessible to experience. As employed in the physical sciences, the process of measurement is itself an interaction between a measuring instrument and the thing measured, and on this account is dependent for its objective validity on corrections (sometimes involving theoretical interpretations) to account for the perturbing effect of the instrument.
This article first presents a philosophical analysis of measurement in general, then considers specific problems associated with measurement in psychology, and concludes with a discussion of mathematical aspects of the measuring process.
MEASUREMENT IN GENERAL
Measurement, according to St. Thomas Aquinas, is the process by which the quantity of a thing is made known (In 1 sent. 8.4.2 ad 3). It is applied directly to physical bodies (1) when their discrete quantity is ascertained, e.g., by counting the number of objects in a room, or (2) when their continuous quantity is measured, e.g., by using a scale to determine individual lengths. In current practice the term measurement is sometimes applied to counting, but is more usually reserved for determinations of dimensive or continuous quantity.
Quantitative Measurement. The elements involved in direct measurement can be explained in terms of the requirements for a quantitative measurement such as the determination of length. Such measurement first presupposes a unit; the unit may be one that occurs naturally, such as the foot, or it may be one fixed by convention. The choice of a conventional unit is not completely arbitrary, but is dictated by the unit's suitability as a minimum dimension into which lengths can be divided.
Secondly, the unit used must be homogeneous with the thing measured (In 1 anal. post. 36.11). For example, if length is to be determined, the unit must be a length. Similarly, the thing measured must be uniformly structured and continuous to permit the application of the same unit to each of its parts.
A third requirement is that the unit of measurement and the object being measured must be invariant throughout the measuring process (Summa theologiae 1a2ae, 91.3 ad 3; 97.1 ad 2). This ideal is never completely realized for any physical object, since all bodies continually undergo change. Because of such variation, as well as the infinite variety of contingent circumstances that accompany any measuring process, every measurement is at best an approximation. Yet a practical invariance is not only detectable, but is more or less guaranteed by the nature of both the object measured and the standard used. For example, a person's body temperature, although varying over a small range, is held constant by natural causes. Similarly, the unit of time is determined by the rotation of the earth and the gram by the weight of one cubic centimeter of water, both of which are maintained constant through the regularity of nature's operation.
A fourth requirement is that measurement involves a judgment of comparison between the object measured and the measuring unit (Summa theologiae 1a, 79.9 ad 4). Such a judgment is an intellectual operation, although it presupposes a physical process. The program associated with operationalism to reduce every measurement to the manipulation of instruments alone thus disregards an essential feature of the measuring process. Instruments cannot measure. Ultimately they require mind, which, because of its reflexive character as a "self-reading instrument," can effect the judgment of comparison and make the measurement.
These requirements for the direct measurement of quantity or of bodily extension are applicable to spatio-temporal measurements (see time). They can be applied also to other entities, such as certain types of quality, but not without some adaptations, as will now be explained.
Qualitative Measurement. Physical qualities, because present in quantified bodies and intimately associated with the quantity of such bodies, can themselves be said to be quantified. Their quantity can be measured in two different ways, giving rise to the two measurements that are usually associated with physical quality, viz., extensive and intensive. They receive extensive quantification from the extension of the body in which they are present; thus there is a greater amount of heat in a large body than in a small body, assuming both to be at the same temperature (cf. De virt. in comm. 11 ad 10). They receive intensive quantification, on the other hand, from the degree of intensity of a particular quality in the body (ibid.; Summa theologiae 1a, 42.1 ad 1). If two bodies are at different temperatures, for example, there is a more intense heat in the body at the higher temperature, or it is the hotter, and this regardless of the size of either.
Measurement of the extensive aspect of physical qualities, being effectively the same as the measurement of length, area, and volume, has the same requirements as that for quantitative measurement. Measurement of the intensive aspect, on the other hand, is more difficult and requires slightly different techniques.
Two possibilities suggest themselves for the measurement of a quality's intensive aspect. The simplest is to arrange objects with a given quality in the order of increasing intensity, and then number them consecutively. For example, if bodies be arranged according to increasing hotness as discernible by touch, and these bodies be numbered, the higher number indicates the greater degree of heat. This is the closest one can come to a direct intensive measurement of quality. Such a measure offers difficulties, however, because of the subjectivity of sensation and the arbitrariness of assigning numbers depending on the number of objects that happen to be compared.
The other possibility is that of determining the intensity of a quality (1) from an effect, i.e., the change the quality produces in a body other than that in which it is subjected, or (2) from a cause, i.e., the agent that produces the quality's intensity in the subject.
Effect. If the quality is an active one, i.e., if it produces alterations in other bodies, it can be measured by the effect it produces in such bodies. This is usually done through special types of bodies known as instruments. Thus heat intensity is measured by a thermometer containing a substance that expands noticeably when contacting a hot object. Similarly, the intensity of sound is measured by vibrations produced in a microphone, and light intensity by electric current generated in a photocell. In each case, the intensity of an active quality in one subject is measured by the quantity of the effect it produces in a receiving subject, which is known on this account as the measuring instrument.
Active qualities, it may be noted, can sometimes be measured independently of external alterations of the type just mentioned. If they induce pronounced quantitative changes in the subject in which they are present, they can be measured directly through measurement of the subject body. In this way the temperature of mercury in an immersion thermometer is measured simply by reading the length of its own expansion. Similarly, the wavelength of sound in a resonating chamber of variable length is measured directly, using a standing wave technique to ascertain the length of the vibrating column. Such a method of concomitant variation, however, while of theoretical interest, is of limited applicability, since it is restricted to bodies that are quantitatively sensitive to the presence of the qualities being discussed.
Cause. If a quality is not particularly active, i.e., if it does not produce pronounced effects in itself or in another body, its intensity can alternatively be measured through some type of causality required to produce it in the subject body. In this way, one measures the intensity of light on a reflecting surface by the number of footcandles emitted by the source illuminating the surface. A variation on this technique is that of using an instrumental cause to measure some modality of the principal cause that actively produces the quality. An example would be using a prism or ruled grating selectively to refract and measure the wavelength of colored light incident on an opaque surface, and in this way indirectly to measure the ability of the surface to reflect light of a particular color.
All these methods are indirect ways of measuring qualitative intensity through a cause-effect relationship. All involve techniques whereby a precise quantity is assigned to the quality being measured, and on this account are considerably more accurate and objective than direct ordinal measures of qualitative intensities. As a consequence, these constitute the type of qualitative measurement most widely used in the physical sciences.
Accuracy. As employed in physical science, a measurement cannot be made to an infinite degree of accuracy. There are two reasons why this is so. The first is that all such measurements reduce to a measurement of continuous quantity, and the only way in which number can be assigned to such quantity is in terms of a conventional unit. For infinite accuracy, this unit would have to approach zero as a limiting case. Attaining the limit would itself involve a contradiction in terms, since a number cannot be assigned to a unit of zero, or nonexistent, magnitude. The second limitation arises from specifying the conditions that attend a particular measuring process. Since these involve details that are themselves infinitely variable, they can be only approximately specified. For all practical purposes, however, it is possible to specify the range of magnitudes between which a given measurement is accurate, depending upon the unit involved and the circumstances of measurement. Some Thomistic philosophers regard such accuracy as sufficient to permit a demonstration with the certitude that is proper to physical science, although not with that proper to mathematics, while others see this as sufficient reason for questioning the strictly demonstrative character of any conclusion of modern science that is based upon a measuring process that has the above limitations.
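The practical statement that a measurement is accurate within a range set by its unit can be sketched as a small calculation. The Python example below assumes the simplest case, in which the only source of uncertainty is that the reading is taken to the nearest whole unit of the scale; it ignores the second limitation discussed above, and the function name `measurement_range` is hypothetical.

```python
def measurement_range(reading, unit):
    """Bounds on the true value when a reading is taken to the nearest unit.

    Assumes the reading was rounded to the closest multiple of `unit`,
    so the true value lies within half a unit on either side.
    """
    half_unit = unit / 2
    return (reading - half_unit, reading + half_unit)


print(measurement_range(42, 1))      # (41.5, 42.5)
print(measurement_range(42.0, 0.5))  # (41.75, 42.25)
```

Shrinking the unit narrows the range, but the range reaches zero width only if the unit itself is zero, which is the contradiction noted in the text.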
[W. A. Wallace]
MEASUREMENT IN PSYCHOLOGY
In psychology, the term measurement means the assigning of numbers to quantitative variations in a distinguishable attribute of behavior, or of behaviorally related objects, with the expectation that something true or predictable may be derived from their relationship with other variables. The logic of measurement is concerned primarily with the construction of a scale or measuring device, and secondly with the application of that scale to a particular behavior or object, such as occurs in psychological testing.
Requirements for Measurement. Quantitative indexes of behavior, such as number of errors, perception time, and number of words recalled are often employed in laboratories of experimental psychology, but in a fairly simple way—e.g., as a convenience for the experimenter in distinguishing and recording various performances, where the only assumption involved is that the behaviors can be properly and meaningfully ranked. The numbers themselves refer to physical units and are not commonly scaled psychologically. Psychological scaling involves the following special features:
Isolation and Identification of a Dimension. What is measured is not an object, strictly speaking, but rather some property or dimension associated with an object, either directly, at the level of sensory observation, or indirectly, through the type of indicant specified by an operational definition. Such a property must first be qualitatively distinguished from other properties and seen as capable of quantification. Not all psychological properties are measurable, for measurability depends upon whether or not the property can be conceived in a quantitative way.
Human Significance. The numbers employed in a psychological scale must represent a value indicative of the experience or performance of the human subject, as opposed to a value inherent in the physical nature of the stimulus, situation, or condition evoking the behavior. Physics, for example, using the human subject as mediating observer, employs a physical system—e.g., a thermometer, based on a law relating liquid expansion to heat—to eliminate or reduce subjective variations within the sensing organism. The psychologist, however, is directly concerned with variations in sensation and with human performance; thus the numbers he uses must reflect, or be isomorphic with, variations in psychological meaningfulness. The basic law employed must include behavior as one of its terms.
Rule for Assigning Numbers. The derivation of a psychological scale involves perceiving or performing subjects on the one hand, and stimuli (physical objects, situations, words, tasks, or problems) on the other, and a search for some functional law relating the two. Systematic variations in the responses of the subjects, when quantified in terms of the human variation itself, become the key to establishing the function relating stimuli and responses and to assigning one, and only one, number to each object.
Usable Properties of Numbers. Not all arithmetical properties of numbers are usable in psychological scaling. The number four, for example, is greater than two (order); it is also greater by a definite amount (distance or interval); and it is counted off from zero (origin—in this case, an absolute zero) and contains two by addition or division (composition or extension). Corresponding to these distinguishable properties, scales are commonly identified as ordinal, interval, or ratio scales. The difficulty of establishing an absolute zero in psychological matters restricts the use of ratio scales, and such matters offer little opportunity for using the extensive or composite properties of numbers.
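The dependence of ratio statements on an absolute zero can be seen in a familiar physical case. The Python sketch below uses temperature, which is an assumption made here for illustration and is not a psychological scale: Celsius values have an arbitrary zero, while Kelvin values are counted from absolute zero. The function names are hypothetical.

```python
def celsius_to_kelvin(celsius):
    """Shift from the arbitrary Celsius zero to the absolute zero of the Kelvin scale."""
    return celsius + 273.15


def describe(a, b):
    """Return the order, interval, and ratio relations between two scale values."""
    return {"greater": b > a, "interval": b - a, "ratio": b / a}


celsius = describe(20.0, 40.0)
kelvin = describe(celsius_to_kelvin(20.0), celsius_to_kelvin(40.0))

# Order and interval survive the change of zero point; the ratio does not.
print(celsius["greater"], kelvin["greater"])                        # True True
print(round(celsius["interval"], 6), round(kelvin["interval"], 6))  # 20.0 20.0
print(round(celsius["ratio"], 3), round(kelvin["ratio"], 3))        # 2.0 1.068
```

Saying that 40 degrees Celsius is "twice as hot" as 20 degrees is therefore not meaningful, which is the same restriction that applies to psychological scales lacking an absolute zero.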
Examples of Psychological Scales. Commonly employed psychological scales include psychophysical scales, attitude scales, product and mental test scales, and multidimensional and other types of scaling procedures.
Psychophysical Scales. An instance of a psychophysical scale is the measurement of the loudness of a sound based on the Weber-Fechner law, which states that the intensity of a sensation increases as the logarithm of the stimulus. A unit difference between the logarithms of two physical sound intensities is divided into ten equal steps, called decibels. The zero is set at the point of the absolute threshold, the weakest sound that can be heard, and the scale extends to cover a range of about 140 decibels.
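The logarithmic construction of the decibel scale can be sketched in a few lines of Python. The example assumes the usual definition of the decibel as ten times the base-10 logarithm of a ratio of sound intensities (powers); when the ratio is taken between sound pressures instead, the multiplier is 20. The function name `decibels` and the sample values are hypothetical.

```python
import math


def decibels(intensity, reference):
    """Level in decibels of a sound intensity relative to a reference intensity."""
    return 10 * math.log10(intensity / reference)


# A sound 1,000 times as intense as the threshold reference.
print(round(decibels(1000.0, 1.0), 6))  # 30.0

# A range of 140 decibels spans an intensity ratio of 10 to the 14th power.
print(round(decibels(1e14, 1.0), 6))    # 140.0
```

Each tenfold increase in physical intensity adds the same ten steps to the scale, which is how a logarithmic law turns multiplication of the stimulus into addition on the sensation scale.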
Attitude Scales. These are commonly derived from a large number of statements of opinion, favorable and unfavorable, about some commonly known subject such as communism or about a debatable social custom or institution. Agreement among a group of judges as to how favorable or unfavorable each opinion-statement may be is transformed into a scale value, using the law of comparative judgment. This law states that the psychological difference between items is a function of the relative frequency with which the difference is perceived. The zero is placed arbitrarily low.
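One way to turn judgment frequencies into scale values can be sketched in Python. The text does not specify the function involved; the sketch assumes a common simplification of the law of comparative judgment in which each proportion is converted to a standard normal deviate and the deviates are averaged. The data and the function name `scale_values` are hypothetical.

```python
from statistics import NormalDist


def scale_values(proportions):
    """Estimate interval-scale values from pairwise judgment proportions.

    proportions[i][j] is the fraction of judges who rated item j as more
    favorable than item i. Each proportion is converted to a normal deviate
    and the deviates are averaged column by column.
    """
    inverse_cdf = NormalDist().inv_cdf
    n = len(proportions)
    values = []
    for j in range(n):
        deviates = [inverse_cdf(proportions[i][j]) for i in range(n)]
        values.append(sum(deviates) / n)
    # Shift so that the lowest item sits at an arbitrary zero.
    lowest = min(values)
    return [v - lowest for v in values]


# Hypothetical data for three opinion statements (diagonal set to 0.5).
p = [
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.8],
    [0.1, 0.2, 0.5],
]
print([round(v, 2) for v in scale_values(p)])
```

The final shift reflects the arbitrary placement of the zero: only the distances between items carry meaning, so the result is an interval scale rather than a ratio scale.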
Product and Mental Test Scales. Product scales, used for rating specimens of handwriting, soldering joints (in trade tests), art work, or other kinds of cultural product, are similarly based on the law of comparative judgment and have an arbitrary zero point. Mental test scales, such as those used in psychological testing, are based on a statistical analysis of the performance of a group of homogeneous subjects on each item.
Other Types. Multidimensional scaling is employed to discover the number of dimensions involved in a particular phenomenon and to rate each object on the various dimensions. An example of this type of scaling is the "semantic differential," which is used to measure the connotative meaning of common words. Other scales are used for the associative value of common words or of nonsense syllables, the frequency rating of associated responses to a standard word list, the frequency rating of words in common use, and ratings of abnormal or psychotic behavior in terms of basic trait content.
Role of Measurement in Psychology. Psychological measurement is based on empiriological properties that lend themselves readily to conceptual quantification and identification in terms of observed indicants. It thus serves to supplement logical definitions and to extend these into an area of finer objective discrimination. Among its other contributions, the following may be enumerated: (1) increased precision in identifying instances of the occurrence of a property; (2) better contexts of meaningfulness, to the extent that the assignment of numbers is based on behavioral laws; (3) evaluations of the influence of empirically established relationships on tentative definitions of objects or properties; and (4) more reliable inferences of causal relationships as these are discernible through the application of the principle of concomitant variation.
Bibliography: W. S. Torgerson, Theory and Methods of Scaling (New York 1958). Conference on the History of Quantification in the Sciences, Quantification: A History of the Meaning of Measurement in the Natural and Social Sciences, ed. H. Woolf (Indianapolis 1961). C. W. Churchman and P. Ratoosh, eds., Measurement: Definitions and Theories (New York 1959).
[W. D. Commins]
From the viewpoint of the mathematician, measurement is the determination of the value of a measure function of a given attribute of an object. A measure function is a rule that correlates a set of attributes with a set of elements (usually numbers) of an algebra. For example, the measure function called "length," which one may denote by L, is a rule or set of procedures that associates the set of attributes of "extended objects" with the set of real numbers for which addition and multiplication are already defined. The length of the edge of a table may be denoted by L(t), in which case, for example, L(t) = 3 feet 6 inches. The rule denoted by L is the set of procedures for measuring the length of the table, which results in the correlation "length of table is 3'6"."
The element correlated with "length of table," or more generally "length of x," is not merely a real number; it is a dimension number, i.e., a number plus a dimension, as, for example, "3 feet." The term "feet" refers to a unit of measurement previously established. "To measure" thus signifies that one knows a set of procedures and a unit of measurement (a dimension) such that, by applying the set of procedures, he can associate a unique number of units of measurement with a given measurable.
Requirements for Measure Functions. A minimal set of conditions imposed on a measure function is: (1) if m(x1) and m(x2) are measurements, then one and only one of the following holds: m(x1) is equal to m(x2), m(x1) is less than m(x2), or m(x1) is greater than m(x2); (2) x1 is equal to x2 if and only if m(x1) is equal to m(x2), where x1 and x2 denote measurables, e.g., the length of this table and the length of that table; and (3) if m(x1) is less than m(x2) and m(x2) is less than m(x3), then m(x1) is less than m(x3). For a measure function using the set (or any subset of the set) of real numbers as dimension numbers, these conditions are easily satisfied.
In defining the measure function, one must also define the rule, i.e., the method of assigning a dimension number to each instance of the attribute. An illustration of this method can be seen in fixing the age of some person P. Evidently "age of P" is measured from the point "birth of P." Moreover, to measure age one must define a dimension, i.e., a unit of measurement. In this case "calendar year," in its ordinary meaning, can serve as the unit of measurement, which is also called the metric. One correlates the birth of P with a point on a given calendar year. The age of P at birth is defined as zero years, whereas the age of P at present is the number of calendar years from zero to the present date. This process defines a mapping of two points in the set of calendar years to two events in the life of P, and consequently defines the age of P as the "distance" between the two points in the set of calendar years.
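The age example can be sketched as a small measure function. The sketch below assumes that "number of calendar years" means completed years between the two dates; the function name and the sample dates are hypothetical.

```python
from datetime import date

def age_in_years(birth: date, today: date) -> int:
    """Count the completed calendar years between birth and today."""
    if today < birth:
        raise ValueError("today must not precede birth")
    # Subtract one year if this year's birthday has not yet arrived.
    not_yet = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - int(not_yet)

birth_of_p = date(2000, 6, 15)
age_at_birth = age_in_years(birth_of_p, birth_of_p)
age_before_birthday = age_in_years(birth_of_p, date(2024, 6, 14))
age_on_birthday = age_in_years(birth_of_p, date(2024, 6, 15))
```

The two dates play the role of the two points in the set of calendar years, and the returned integer is the "distance" between them in the chosen unit.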
Mapping. The problems associated with mapping may be seen in the following example. Suppose that one wishes to measure food preferences of a set of adults aged 20 to 30. He then must define a measure function that maps a set of foods into the set of elements of some algebra. The most obvious algebraic set to select is the set of nonnegative integers. The unit of measurement is the preferential attitude of an adult toward a food. The measure function m(p) is definable as follows: m(p)(F) = order of F in the ranking by person p. So m(p)(cheese) = 3 means that a given individual ranks cheese third on his list.
There are difficulties involved in this procedure, however. For example, it is not clear how to define m(p)(F) = zero. The metric is neither precise nor unambiguous. The measure function gives different results at different times and for different persons. The fact that, for two individuals, the measure function for the same food gives different integers does not enable one to make any significant comparison between the two measures. Finally, the fact that the measure function for a given individual yields m(p)(F1) = a and m(p)(F2) = b does not provide any significant conclusion concerning the relation of the two preferences.
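A minimal sketch of the ranking measure function makes these difficulties concrete. The two rankings are invented for illustration, and the two-argument function `m(person, food)` stands in for the notation used in the text.

```python
# Hypothetical rankings for two people; position 1 is the most preferred food.
rankings = {
    "person_a": ["bread", "apples", "cheese", "fish"],
    "person_b": ["cheese", "fish", "bread", "apples"],
}

def m(person: str, food: str) -> int:
    """Return the 1-based rank of a food in the given person's list."""
    return rankings[person].index(food) + 1

cheese_for_a = m("person_a", "cheese")
cheese_for_b = m("person_b", "cheese")
# Ranks are ordinal only: a gap of 2 between ranks 1 and 3 does not say how
# much more one food is preferred than another, and no food maps to zero.
gap = m("person_a", "cheese") - m("person_a", "bread")
```

The same food receives different integers for different persons, and nothing in the numbers indicates whether a rank of 3 for one person expresses a stronger or weaker liking than a rank of 1 for another.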
Whether or not a measure function can be defined for a particular attribute is an empirical problem. But when a measure function can be found that correlates the measurables with some algebra, particularly that of the real number system, the whole apparatus of mathematics becomes available for inferences. Assuming translation from mathematical equations to attributes, the mechanical derivation of mathematical consequences then suggests phenomena that may be related to the measurables.
In the empirical sciences, concrete models of the metric are often constructed, e.g., clocks, meters, and scales. Some concrete entity or objective phenomenon is then used to determine the number of units to be mapped to a particular appearance of the attribute being measured. Such an instrument can be considered to define the method of determining the measure.
Error in Measurement. There are two major sources of error in measurement: (1) the definition of the measure function, and (2) the construction and use of the measuring instrument. Since the use of the measuring instrument involves the recognition of the coincidence of "points," the degree of accuracy with which the points can be seen to coincide influences the accuracy of the measurement. The size of the unit used also affects the accuracy. For example, a length measured by a measuring instrument graduated in inches is accurate only to within ½"; thus a length of 12" is really a length between 11½" and 12½". If the length is measured with an instrument graduated to ½", a length of 12" means a length between 11¾" and 12¼". Further errors arise if, unknown to the observer, the conditions under which the measurements are taken cause the measuring instrument to be altered, or the attribute measured to be affected.
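The effect of unit size on accuracy can be worked through in a short sketch. It assumes, as in the example above, that a reading is good to within half of the smallest graduation; the function name is illustrative.

```python
from fractions import Fraction

def reading_interval(reading, graduation):
    """Return (low, high) bounds: the reading plus or minus half a graduation."""
    reading = Fraction(reading)
    half_step = Fraction(graduation) / 2
    return reading - half_step, reading + half_step

inch_ruler = reading_interval(12, 1)                    # graduated in inches
half_inch_ruler = reading_interval(12, Fraction(1, 2))  # graduated to 1/2 inch
```

Exact fractions are used so that the bounds match the text: 11½ to 12½ inches for the first instrument and 11¾ to 12¼ inches for the second.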
Bibliography: C. G. Hempel, Fundamentals of Concept Formation in Empirical Science (Chicago 1952). H. Margenau, The Nature of Physical Reality: A Philosophy of Modern Physics (New York 1959). J. G. Kemeny, A Philosopher Looks at Science (Princeton 1959). S. S. Stevens, "On the Theory of Scales of Measurement," Science 103 (1946) 677–80.
[L. O. Kattsoff]
British mathematician and physicist William Thomson (1824–1907), otherwise known as Lord Kelvin, indicated the importance of measurement to science:
When you can measure what you are speaking about and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of science, whatever the matter may be.
Possibly the most striking application of Kelvin's words is to the explanation of combustion by the French chemist Antoine Lavoisier (1743–1794). Combustion was confusing to scientists of the time because some materials, such as wood, seemed to decrease in mass on burning: Ashes weigh less than wood. In contrast, others, including iron, increased in mass: Rust weighs more than iron. Lavoisier was able to explain that combustion results when oxygen in the air unites with the material being burned, after careful measurement of the masses of the reactants—air and the material to be burned—and those of the products. Because Lavoisier was careful to capture all products of combustion, it was clear that the reason wood seemed to lose mass on burning was because one of its combustion products is a gas, carbon dioxide, which had been allowed to escape.
Lavoisier's experiments and his explanations of them and of the experiments of others are often regarded as the beginning of modern chemistry. It is not an exaggeration to say that modern chemistry is the result of careful measurement.
Most people think of measurement as a simple process. One simply finds a measuring device, uses it on the object to be measured, and records the result. Careful scientific measurement is more involved than this and must be thought of as consisting of four steps, each one of which is discussed here: choosing a measuring device, selecting a sample to be measured, making a measurement, and interpreting the results.
Choosing a Measuring Device
The measuring device one chooses may be determined by the devices available and by the object to be measured. For example, if it were necessary to determine the mass of a coin, obviously inappropriate measuring devices would include a truck scale (reading in units of 20 pounds, with a 10-ton capacity), bathroom scale (in units of 1 pound, with a 300-pound capacity), and baby scale (in units of 0.1 ounce, with a 30-pound capacity). None of these is capable of determining the mass of so small an object. Possibly useful devices include a centigram balance (reading in units of 0.01 gram, with a 500-gram capacity), milligram balance (in units of 0.001 gram, with a 300-gram capacity), and analytical balance (in units of 0.00001 gram, with a 100-gram capacity). Even within this limited group of six instruments, those that are suitable differ if the object to be measured is an approximately one-kilogram book instead of a coin. Then only the bathroom scale and baby scale will suffice.
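The choice among the six instruments can be sketched as a simple filter on capacity and reading unit. The selection rule is an assumption made for illustration: a device is treated as suitable when its capacity covers the object and its smallest unit is no more than half the object's mass. The 10-ton capacity is taken to mean 20,000 pounds.

```python
from dataclasses import dataclass

GRAMS_PER_POUND = 453.59237
GRAMS_PER_OUNCE = GRAMS_PER_POUND / 16

@dataclass(frozen=True)
class Device:
    name: str
    resolution_g: float  # smallest unit the device reads, in grams
    capacity_g: float    # largest load the device accepts, in grams

DEVICES = [
    # 10 tons is assumed here to mean short tons (20,000 pounds).
    Device("truck scale", 20 * GRAMS_PER_POUND, 20_000 * GRAMS_PER_POUND),
    Device("bathroom scale", 1 * GRAMS_PER_POUND, 300 * GRAMS_PER_POUND),
    Device("baby scale", 0.1 * GRAMS_PER_OUNCE, 30 * GRAMS_PER_POUND),
    Device("centigram balance", 0.01, 500.0),
    Device("milligram balance", 0.001, 300.0),
    Device("analytical balance", 0.00001, 100.0),
]

def suitable_devices(mass_g, max_resolution_fraction=0.5):
    """Keep devices that can hold the object and resolve it (assumed rule)."""
    return [
        d for d in DEVICES
        if d.capacity_g >= mass_g
        and d.resolution_g <= max_resolution_fraction * mass_g
    ]

coin_choices = [d.name for d in suitable_devices(3.0)]     # roughly a coin
book_choices = [d.name for d in suitable_devices(1000.0)]  # roughly a book
```

With this threshold the filter reproduces the groupings given in the text; a stricter fraction would be needed in practice if more than a rough value were wanted.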
In addition, it is essential that the measuring device provide reproducible results. A milligram balance that yields successive measurements of 3.012, 1.246, 8.937, and 6.008 grams for the mass of the same coin is clearly faulty. One can check the reliability of a measuring device by measuring a standard object, in part to make sure that measurements are reproducible. A common measuring practice is to intersperse samples of known value within a group of many samples to be measured. When the final results are tallied, incorrect values for the known samples indicate some fault, which may be that of the measuring device, or that of the experimenter. In the example of measuring the masses of different coins, one would include several "standard" coins, the mass of each being very well known.
Selecting a Sample
There may be no choice of sample because the task at hand may be simply that of measuring one object, such as determining the mass of a specific coin. If the goal is to determine the mass of a specific kind of coin, such as a U.S. penny, there are several questions to be addressed, including the following. Are uncirculated or worn coins to be measured? Worn coins may have less mass because copper has worn off, or more mass because copper oxide weighs more than copper and dirt also adds mass. Are the coins of just one year to be measured? Coin mass may differ from year to year. How many coins should be measured to obtain a representative sample? It is likely that there is a slight variation in mass among coins and a large enough number of coins should be measured to encompass that variation. How many sources (banks or stores) should be visited to obtain samples? Different batches of new coins may be sent to different banks; circulated coins may be used mostly in vending machines and show more wear as a result.
The questions asked depend on the type of sample to be measured. If the calorie content of breakfast cereal is to be determined, the sampling questions include how many factories to visit for samples, whether to sample unopened or opened boxes of cereal, and the date when the cereal sample was manufactured, asked for much the same reason that similar questions were advanced about coins. In addition, other questions come to mind. How many samples should be taken from each box? From where in the box should samples be taken? May samples of small flakes have a different calorie content than samples of large flakes?
These sampling questions are often the most difficult to formulate but they are also the most important to consider in making a measurement. The purpose of asking them is to obtain a sample that is as representative as possible of the object being measured, without repeating the measurement unnecessarily. Obviously, a very exact average mass of the U.S. penny can be obtained by measuring every penny in circulation. This procedure would be so time-consuming that it is impractical, in addition to being expensive.
Making a Measurement
As mentioned above, making a measurement includes verifying that the measuring device yields reproducible results, typically by measuring standard samples. Another reason for measuring standard samples is to calibrate the measuring instrument. For example, a common method to determine the viscosity of a liquid—its resistance to flow—requires knowing the density of that liquid and the time that it takes for a definite volume of liquid to flow through a thin tube, within a device called a viscometer. It is very difficult to construct duplicate viscometers that have exactly the same length and diameter of that tube. To overcome this natural variation, a viscometer is calibrated by timing the flow of a pure liquid whose viscosity is known—such as water—through it. Careful calibration involves timing the flow of a standard volume of more than one pure liquid.
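Calibration against a reference liquid can be sketched as follows. The relation used, in which viscosity is proportional to density times flow time for a given viscometer, is the usual capillary-viscometer approximation and is an assumption here; the source does not state a formula. All numerical values are hypothetical.

```python
def viscosity_from_calibration(eta_ref, rho_ref, t_ref, rho, t):
    """Scale the reference viscosity by the ratio of (density * flow time).

    eta_ref, rho_ref, t_ref: viscosity, density, and flow time of the
    calibration liquid; rho, t: density and flow time of the sample,
    measured in the same viscometer under the same conditions.
    """
    if min(eta_ref, rho_ref, t_ref, rho, t) <= 0:
        raise ValueError("all quantities must be positive")
    return eta_ref * (rho * t) / (rho_ref * t_ref)

# Hypothetical values: a reference liquid taking 100 s, a sample taking 150 s.
sample_viscosity = viscosity_from_calibration(
    eta_ref=1.0, rho_ref=1.0, t_ref=100.0, rho=0.8, t=150.0
)
```

Because only ratios enter the calculation, the exact length and diameter of the tube never need to be known, which is the point of calibrating.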
Calibration not only accounts for variations in the dimensions of the viscometer. It also compensates for small variations in the composition of the glass of which the viscometer is made, small differences in temperatures, and even differences in the gravitational acceleration due to different positions on Earth. Finally, calibration can compensate for small variations in technique from one experimenter to another.
These variations between experimenters are of special concern. Different experimenters can obtain very different values when measuring the same sample. The careful experimenter takes care to prevent bias or difference in technique from being reflected in the final result. Methods of prevention include attempting to measure different samples without knowing the identity of each sample. For instance, if the viscosities of two colorless liquids are to be measured, several different aliquots of each liquid will be prepared, the aliquots will be shuffled, and each aliquot will be measured in order. As much of the measurement as possible will be made mechanically. Rather than timing flow with a stopwatch, it is timed with an electronic device that starts and stops as liquid passes definite points.
Finally, the experimenter makes certain to observe the measurement the same way for each trial. When a length is measured with a meter stick or a volume is measured with a graduated cylinder, the eye of the experimenter is in line with or at the same level as the object being measured to avoid parallax. When using a graduated device, such as a thermometer, meter stick, or graduated cylinder, the measurement is estimated one digit more finely than the finest graduation. For instance, if a thermometer is graduated in degrees, 25.4°C (77.7°F) would be a reasonable measurement made with it, with the ".4" estimated by the experimenter.
Each measurement is recorded as it is made. It is important to not trust one's memory. In addition, it is important to write down the measurements made, not the results from them. For instance, if the mass of a sample of sodium chloride is determined on a balance, one will first obtain the mass of a container, such as 24.789 grams, and then the mass of the container with the sodium chloride present, such as 32.012 grams. It is important to record both of these masses and not just their difference, the mass of sodium chloride, 7.223 grams.
Typically, the results of a measurement involve many values, the observations of many trials. It is tempting to discard values that seem quite different from the others. This is an acceptable course of action if there is good reason to believe that the errant value was improperly measured. If the experimenter kept good records while measuring, notations made during one or more trials may indicate that an individual value was poorly obtained—for instance, by not zeroing or leveling a balance, neglecting to read the starting volume in a buret before titration, or failing to cool a dried sample before obtaining its mass.
Simply discarding a value based on its deviation from other values, without sound experimental reasons for doing so, is unjustified and may lead to misleading results. Consider the masses of several pennies determined with a milligram balance to be: 3.107, 3.078, 3.112, 2.911, 3.012, 3.091, 3.055, and 2.508 grams. Discarding the last mass because of its deviation would obscure the fact that post-1982 pennies have a zinc core with copper cladding (representing a total of about 2.5 percent copper), whereas pre-1982 pennies are composed of an alloy that is 95 percent copper. There are statistical tests that help in deciding whether to reject a specific value or not.
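One such test is Dixon's Q test, chosen here only as an example since the text does not name a particular test. The sketch below applies it to the penny masses listed above. The critical value of 0.468 for eight observations at the 90 percent confidence level is quoted from commonly published tables and should be checked against a current reference before use.

```python
def dixon_q(values):
    """Return (suspect, Q) for the most extreme value in a small sample."""
    ordered = sorted(values)
    spread = ordered[-1] - ordered[0]
    if spread == 0:
        raise ValueError("all values are identical")
    gap_low = ordered[1] - ordered[0]     # gap below the rest of the data
    gap_high = ordered[-1] - ordered[-2]  # gap above the rest of the data
    if gap_low >= gap_high:
        return ordered[0], gap_low / spread
    return ordered[-1], gap_high / spread

penny_masses = [3.107, 3.078, 3.112, 2.911, 3.012, 3.091, 3.055, 2.508]
suspect, q = dixon_q(penny_masses)

Q_CRITICAL_90_N8 = 0.468  # assumed table value for n = 8, 90 percent level
flagged = q > Q_CRITICAL_90_N8
```

A flag from such a test is a prompt to consult the laboratory record, not a license to discard. Here the flagged 2.508-gram value reflects a real difference in composition rather than a faulty measurement.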
It is cumbersome, however, to report all the values that have been measured. Reporting solely the average or mean value gives no indication of how carefully the measurement has been made or how reproducible the repeated measurements are. Care in measurement is implied by the number of significant figures reported; this corresponds to the number of digits to which one can read the measuring devices, with one digit beyond the finest graduation, as indicated earlier.
The reproducibility of measurements is a manifestation of their precision. Precision is easily expressed by citing the range of the results; a narrow range indicates high precision. Other methods of expressing precision include relative average deviation and standard deviation. Again, a small value of either deviation indicates high precision; repeated measurements are apt to replicate the values of previous ones.
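The three expressions of precision named above can be computed with a short sketch. The five trial values are hypothetical, and the standard deviation shown is the sample form (dividing by n − 1), which is an assumption about the convention intended.

```python
import statistics

def precision_summary(values):
    """Return the mean, range, relative average deviation, and sample std dev."""
    mean = statistics.mean(values)
    average_deviation = statistics.mean([abs(v - mean) for v in values])
    return {
        "mean": mean,
        "range": max(values) - min(values),
        "relative_average_deviation": average_deviation / mean,
        "standard_deviation": statistics.stdev(values),
    }

# Hypothetical repeated mass measurements of one coin, in grams.
trials = [3.107, 3.109, 3.105, 3.108, 3.106]
summary = precision_summary(trials)
```

For these values the range is about 0.004 gram and the standard deviation about 0.0016 gram, both small relative to the mean, which is what high precision looks like numerically.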
When several different quantities are combined to obtain a final value—such as combining flow time and liquid density to determine viscosity—standard propagation-of-error techniques are employed to calculate the deviation in the final value from the deviations in the different quantities.
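As a minimal sketch of one common propagation rule, suppose the final value is a product of the measured quantities, as when viscosity is taken to be proportional to flow time times density, and that the deviations are independent and random. Under those assumptions, which the text does not spell out, relative deviations combine as the square root of the sum of their squares. The function name and the example figures are illustrative.

```python
import math

def relative_deviation_of_product(*relative_deviations):
    """Combine relative deviations of multiplied quantities in quadrature."""
    return math.sqrt(sum(r * r for r in relative_deviations))

# Hypothetical case: 3 percent deviation in flow time, 4 percent in density.
combined = relative_deviation_of_product(0.03, 0.04)
```

The combined relative deviation is about 5 percent, less than the simple sum of 7 percent, because independent deviations partly offset one another.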
Both errors and deviations combine in the same way when several quantities are combined, even though error and deviation are quite different concepts. As mentioned above, deviation indicates how reproducible successive measurements are. Error is a measure of how close an individual value—or an average—is to an accepted value of a quantity. A measurement with small error is said to be accurate. Often, an experimenter will believe that high precision indicates low error. This frequently is true, but very precise measurements may have a uniform error, known as a systematic error. An example would be a balance that is not zeroed, resulting in masses that are uniformly high or low.
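The distinction between deviation and error can be seen in a small numerical sketch. The readings are hypothetical values for a standard of accepted mass 10.000 grams weighed on a balance that was not zeroed.

```python
import statistics

def error_and_deviation(values, accepted_value):
    """Return (error of the mean, sample standard deviation)."""
    mean = statistics.mean(values)
    return mean - accepted_value, statistics.stdev(values)

# Hypothetical readings in grams; the accepted value is 10.000 g.
readings = [10.251, 10.249, 10.250, 10.252, 10.248]
error, deviation = error_and_deviation(readings, 10.000)
```

The deviation is under 0.002 gram, so the measurements are precise, yet every reading is about 0.25 gram high. Repeating the measurement would not reveal this systematic error; only comparison with a standard does.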
The goal of careful measurement ultimately is to determine an accepted value. Careful measurement technique—including choosing the correct measuring device, selecting a sample to be measured, making a measurement, and interpreting the results—helps to realize that goal.
see also International System of Units; Lavoisier, Antoine.
Robert K. Wismer
Youden, W. J. (1991). Experimentation and Measurement. NIST Special Publication 672. Washington, DC: National Institute of Standards and Technology.
Measurement is the evaluation or estimation of degree, extent, dimension, or capacity in relation to certain standards (i.e., units of measurement). As one of the most important inventions in human history, the process of measuring involves every aspect of our lives, such as time, mass, length, and space. The Greeks first developed the “foot” as their fundamental unit of length during the fourth millennium BCE. The ancient peoples of Mesopotamia, Egypt, and the Indus Valley seem to have all created systems of weight around the same period. Zero, the crucial number in the history of measurement, was first treated as a number in its own right by Indian mathematicians, notably Brahmagupta in the seventh century CE.
The ancient Egyptians first developed a systematic method of measuring objects, which they used in the construction of pyramids. During the long period during which their civilization thrived in northeastern Africa, they also cultivated the earliest thoughts on earth measurement: geometry. Euclid of Alexandria (c. 330–c. 275 BCE), a Greek mathematician who lived in Egypt and who is regarded as the “father of geometry,” provided the proofs of geometric rules that Egyptians had devised in building their monuments. His most famous work, Elements, covers the basic definitions, postulates, propositions, and proofs of mathematical and geometric theorems. Euclid’s Elements has proven instrumental in the development of modern science and measurement.
Another great mathematician who contributed to modern measurement was Karl Friedrich Gauss. He was born on April 30, 1777, in Brunswick, Germany. At age twenty-four, Gauss published a brilliant work, Disquisitiones Arithmeticae, in which he established basic concepts and methods of number theory. In 1801, Gauss developed the method of least squares while calculating the orbits of celestial bodies with high accuracy. Since that time the method of least squares has been one of the most widely used methods in science for estimating quantities in the presence of measurement error. He also showed that the method follows naturally when errors are assumed to follow a bell-shaped normal curve, and that least-squares estimates have the smallest variance among linear unbiased estimators (Gauss-Markov theorem).
Among all science and social science disciplines, there are three broad measurement theories (Michell 1986, 1990). The first and most commonly used is the classical theory of measurement. Measurement is defined by the magnitudes of the quantity and expressed as real numbers. An object’s quantitative properties are estimated in relation to one another, and ratios of quantities can be determined by the unit of measurement. The classical concept of measurement can be traced back to early theorists and mathematicians, including Isaac Newton and Euclid. The classical approach assumes that the underlying reality exists, but only quantitative attributes are measurable, and the meaningfulness of scientific theories can, and only can, be supported by the empirical relationships of various measurements.
The second theory, the representational approach, defines measurement as “the correlation of numbers and entities that are not numbers” (Nagel 1960, p. 121). For example, IQ scores can be used to measure intelligence, and the Likert scale measures personal attitudes based on a set of statements. The representational approach assumes that a reality exists and can be measured, and the goal of science is to understand this reality. However, the representational approach does not insist that only quantitative properties are measurable. Instead, measurements can be used to reflect differences at multiple levels.
Unlike the classical and representational approaches, the third approach, called the operational approach, avoids the assumption of objective reality. Instead, it emphasizes only the precisely specified operational process, such as the measurement of reliability and validity. The main concern of scientific theories is only the relationships indicated by the measurements rather than the distance between the reality and measures.
According to the different properties and relationships of the numbers, there are four different levels of measurement: nominal, ordinal, interval, and ratio. In nominal (also called categorical) measurement, names or symbols are assigned to objects, and this assignment is determined by the similarity of the to-be-measured values or attributes. The categories of assignment in most situations are defined arbitrarily, for instance, numbers assigned to individual marital status: single = 1, married = 2, separated = 3, divorced = 4, and so on, or to religious preference: Christian = 1, Jewish = 2, Muslim = 3, Buddhist = 4, and so on.
In ordinal measurement, the numbers assigned to the objects based on their attributes reflect an order relation among them. Examples include grades for academic performance (A, B, C …), the results of sporting events and the awarding of gold, silver, and bronze medals, and many measurements in psychology and other social science disciplines.
Interval measurements have all the features of ordinal measurements, but in addition the difference between the numbers reflects the equivalent interval of the attributes being measured. This property makes comparison among different measures of an attribute or characteristic meaningful and operations such as addition and subtraction possible. Temperature in Fahrenheit or Celsius degrees, calendar dates, and standardized intelligence tests (IQ) are a few examples of interval measurements.
In ratio measurement, objects are assigned numbers that have all the features of interval measurements, but in addition there are meaningful ratios between the numbers. In other words, the zero value is a meaningful point on the measurement scale, and operations of multiplication and division are therefore also meaningful. Examples include income in dollars, length or distance in meters or feet, age, and duration in seconds or hours.
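The four levels and the comparisons each one supports can be summarized in a brief sketch. The operation names and the lookup table are illustrative labels, not terms from the source; the temperature example uses the standard offset of 273.15 between the Celsius and Kelvin scales.

```python
from enum import IntEnum

class Level(IntEnum):
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4

# Lowest level at which each kind of comparison becomes meaningful.
REQUIRED_LEVEL = {
    "same_category": Level.NOMINAL,
    "order": Level.ORDINAL,
    "difference": Level.INTERVAL,
    "ratio": Level.RATIO,
}

def is_meaningful(operation: str, level: Level) -> bool:
    """Report whether an operation is supported at the given level."""
    return level >= REQUIRED_LEVEL[operation]

# Celsius is an interval scale: the apparent ratio depends on the zero point.
ratio_celsius = 20.0 / 10.0
ratio_kelvin = (20.0 + 273.15) / (10.0 + 273.15)
```

The two ratios differ (2.0 against roughly 1.035), which shows why division is not meaningful on a scale whose zero is arbitrary, while differences of 10 degrees remain the same on both scales.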
Because measurement can be arbitrarily defined by the government, researchers, or cultural norms, it is socially constructed. The social construction of measurement is frequently encountered in social science disciplines. For instance, the U.S. Census Bureau has redefined the measure of race several times. Before the 1980 census, census forms contained questions about racial categories, but the categories included only white, black, American Indian, and specified Asian categories. The census was based on the Office of Management and Budget’s (OMB) 1977 Statistical Policy Directive Number 15, Race and Ethnic Standards for Federal Statistics and Administrative Reporting, defining four mutually exclusive single-race categories: white, black, American Indian or Alaska Native, and Asian or Pacific Islander. In addition, the standards also provided two ethnicity categories: Hispanic origin and Not of Hispanic origin. The 1980 and 1990 censuses were collected according to these standards.
By 1997, OMB modified the race/ethnicity measurement again by splitting the Asian or Pacific Islander category into two groups, creating five race categories: white, African American, American Indian or Alaska Native, Asian, and Native Hawaiian or Other Pacific Islander. In addition, the 2000 census allowed people to identify themselves as belonging to two or more races. It also created six single races and fifty-seven multiple race categories. The ethnicity measure for Hispanic doubled the total number of the race/ethnicity categories to 126. However, such an extensive number of measures causes even more problems. Many Hispanics consider their ethnic origin as a racial category and therefore choose “some other race” on the census form, leading to over 40 percent of the Texas population reported as “some other race.” The misconstruction of the categories of race and ethnicity in the U.S. census illustrates the fluid and subjective nature of measurement.
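The category counts quoted above can be checked by enumeration. The sketch assumes that the sixth single-race category is "Some other race," the option mentioned later in the paragraph, alongside the five OMB categories.

```python
from itertools import combinations

SINGLE_RACES = [
    "White",
    "African American",
    "American Indian or Alaska Native",
    "Asian",
    "Native Hawaiian or Other Pacific Islander",
    "Some other race",  # assumed sixth category
]

single = len(SINGLE_RACES)
# Every way of choosing two or more of the six categories.
multiple = sum(
    1
    for size in range(2, len(SINGLE_RACES) + 1)
    for _ in combinations(SINGLE_RACES, size)
)
race_categories = single + multiple
race_by_ethnicity = race_categories * 2  # Hispanic and not Hispanic
```

The enumeration gives 6 single and 57 multiple categories, 63 in all, and crossing these with the two ethnicity categories yields the 126 cited in the text.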
SEE ALSO Econometrics; Ethnicity; Gender; Likert Scale; Mathematics in the Social Sciences; Measurement Error; Methods, Quantitative; Racial Classification; Regression Analysis; Sampling; Scales; Survey
Michell, J. 1986. Measurement Scales and Statistics: A Clash of Paradigms. Psychological Bulletin 100: 398–407.
Michell, J. 1990. An Introduction to the Logic of Psychological Measurement. Hillsdale, NJ: Erlbaum.
Nagel, E. 1960. Measurement. In Philosophy of Science, ed. A. Danto and S. Morgenbesser, 121–140. New York: Meridian.
See also 226. INSTRUMENTS.
- the measurement of the relative amount of acetic acid in a given substance. —acetimetrical, adj.
- Chemistry. the determination of the amount of free acid in a liquid. —acidimeter, n. —acidimetrical, adj.
- measurement of pain by means of an algometer.
- the measurement of evaporation in the air. —atmidometer, n.
- 1. the measurement of oneself.
- 2. the measurement of a part of a figure as a fraction of the total figure’s height. —autometric, adj.
- the measurement of distance or lines by means of a stave or staff.
- the science of land surveying.
- accurate measurement of short intervals of time by means of a chronoscope. —chronoscopic, adj.
- the science of measuring the universe.
- the measurement of extremely low temperatures, by means of a cryometer.
- the measurement of circles.
- the measurement by a dosimeter of the dosage of radiation a person has received. See also 130. DRUGS. —dosimetrist, n. —dosimetric, dosimetrical, adj.
- measurement of the red blood cells in the blood, by use of an erythrocytometer.
- the science of measuring and analyzing gases by means of a eudiometer.
- the measurement of fluorescence, or visible radiation, by means of a fluorometer. —fluorometric, adj.
- the measurement of the strength of electric currents, by means of a galvanometer. —galvanometric, galvanometrical, adj.
- the measurement of the amounts of the gases in a mixture. —gasometer, n. —gasometric, gasometrical, adj.
- the practice or theory of measuring angles, especially by means of a goniometer.
- the measurement of the dimensions and angles of the planes of salt crystals. —halometer, n.
- the practice of measuring the angular distance between stars by means of a heliometer. —heliometric, heliometrical, adj.
- the art or science of measuring time. —horometrical, adj.
- the measurement of altitude and heights, especially with reference to sea level. —hypsometric, hypsometrical, adj.
- the practice and art of determining the strength and coloring power of an indigo solution.
- equality of measure. —isometric, isometrical, adj.
- the measurement of impurities in the air by means of a konimeter. —konimetric, adj.
- kymography, cymography
- 1. the measuring and recording of variations in fluid pressure, as blood pressure.
- 2. the measuring and recording of the angular oscillations of an aircraft in flight, with respect to an axis or axes flxed in space. —kymograph , n. —kymographic , adj.
- Rare. an instrument for measuring large objects. See also 178. GEOGRAPHY .
- 1. the act, process, or science of measurement.
- 2. the branch of geometry dealing with measurement of length, area, or volume. —mensurate, mensurational, adj.
- the study and science of measures and weights. —metrologist, n. —metrological, adj.
- the measurement of osmotic pressure, or the force a dissolved substance exerts on a semipermeable membrane through which it cannot pass when separated by it from a pure solvent. —osmometric, adj.
- the measurement of bones.
- the determination or estimation of the quantity of oxide formed on a substance. —oxidimetric, adj.
- Obsolete. the realm of geometrical measurements, taken as a whole. —pantometer, n. —pantometric, pantometrical, adj.
- the measurement of pressure or compressibility, as with a piezometer. —piezometric, adj.
- the measurement of the plasticity of materials, as with a plastometer. —plastometric, adj.
- the measurement of the capacity of the lungs. —pulmometer, n.
- the measurement of temperatures greater than 1500 degrees Celsius. —pyrometer, n. —pyrometric, pyrometrical, adj.
- the measurement of radiant energy by means of a radiometer. —radiometric, adj.
- the measurement of electric current, usually with a galvanometer. —rheometric, adj.
- a means of surveying in which distances are measured by reading intervals on a graduated rod intercepted by two parallel cross hairs in the telescope of a surveying instrument. —stadia, adj.
- 1. the process of determining the volume and dimensions of a solid.
- 2. the process of determining the specific gravity of a liquid. —stereometric, adj.
- the measurement of distance, height, elevation, etc., with a tachymeter.
- the science or use of the telemeter; long-distance measurement.
- the measurement of the turbidity of water or other fluids, as with a turbidimeter. —turbidimetric, adj.
- measurement of the specific gravity of urine, by means of an urinometer.
- the measurement of the volume of a solid body by means of a volumenometer.
- the measurement of the volume of solids, gases, or liquids; volumetric analysis. —volumetric, volumetrical, adj.
- the measurement and comparison of the sizes of animals and their parts. —zoometric, adj.
Measurement itself is concerned with the exact relationship between the ‘Empirical Relational System’ and the ‘Formal (or Numerical) Relational System’ chosen to represent it. Thus, a strict status relationship between individuals or positions can be shown to have the same properties as the operators ‘>, <’ (greater than, less than) in the set of numbers, and may be thus represented. Most social and psychological attributes do not strictly have numerical properties, and so are often termed ‘qualitative’ or ‘non-metric’ variables, whereas properties such as wealth or (arguably) measured intelligence or cardinal utility are termed quantitative or metric.
For any given domain of interest, a measurement representation or model states how empirical data are to be interpreted formally; for example, a judgement that x is preferred to y may be interpreted as saying that x is less distant from my ideal point than is y. Once represented numerically, uniqueness issues arise. S. S. Stevens (among others) postulates a hierarchy of levels of measurement of increasing complexity, defined in terms of what transformations can be made to the original measurement numbers, whilst keeping the properties they represent (see, for example, his essay ‘On the Theory of Scales of Measurement’, Science, 1946). The simplest version distinguishes four such levels. At the nominal level, things are categorized and labelled (or numbered), so that each belongs to one and only one category (for example, male = 0, female = 1). Any one-to-one re-assignment of numbers preserves information about a categorization. At the level of ordinal measurement, the categories also have a (strict) order (such as a perfect Guttman scale), and any order-preserving transformation is legitimate. Interval-level measurement requires that equal differences between the objects correspond to equal intervals on the scale (as in temperature); any positive linear transformation keeps equal differences equal and is therefore legitimate. At the ratio level, the ratio of one measurement to another is also preserved, so that only a change of unit is legitimate, as in moving from (say) miles to kilometres.
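The idea that each level is defined by the transformations it tolerates can be illustrated with a short Python sketch. The function names and the sample numbers below are illustrative assumptions, not part of Stevens's essay; each function simply checks whether a re-assignment of numbers keeps the property that the corresponding level is meant to represent.

```python
from itertools import combinations
from math import isclose

def _pairs(xs):
    # All unordered pairs of positions in the data.
    return list(combinations(range(len(xs)), 2))

def same_categories(xs, ys):
    """Nominal: two items share a label in xs exactly when they do in ys."""
    return all((xs[i] == xs[j]) == (ys[i] == ys[j]) for i, j in _pairs(xs))

def same_order(xs, ys):
    """Ordinal: categories and every 'less than' relation are kept."""
    return same_categories(xs, ys) and all(
        (xs[i] < xs[j]) == (ys[i] < ys[j]) for i, j in _pairs(xs))

def same_difference_ratios(xs, ys):
    """Interval: differences keep their relative sizes (assumes xs[0] != xs[1])."""
    if not same_order(xs, ys):
        return False
    dx, dy = xs[1] - xs[0], ys[1] - ys[0]
    return all(isclose((x - xs[0]) / dx, (y - ys[0]) / dy, abs_tol=1e-9)
               for x, y in zip(xs, ys))

def same_ratios(xs, ys):
    """Ratio: ratios of the values themselves are kept (assumes xs[0] != 0)."""
    if not same_difference_ratios(xs, ys) or ys[0] == 0:
        return False
    return all(isclose(x / xs[0], y / ys[0], abs_tol=1e-9)
               for x, y in zip(xs, ys))

# Made-up illustrative data.
labels = [0, 1, 1, 0]
relabelled = [7, 3, 3, 7]                       # one-to-one relabelling
celsius = [10.0, 20.0, 25.0, 40.0]
squared = [c ** 2 for c in celsius]             # order-preserving only
fahrenheit = [c * 9 / 5 + 32 for c in celsius]  # positive linear transformation
miles = [1.0, 2.0, 5.0, 10.0]
km = [m * 1.609344 for m in miles]              # change of unit

print(same_categories(labels, relabelled), same_order(labels, relabelled))
print(same_order(celsius, squared), same_difference_ratios(celsius, squared))
print(same_difference_ratios(celsius, fahrenheit), same_ratios(celsius, fahrenheit))
print(same_ratios(miles, km))
```

In this sketch, relabelling keeps the categories but not the order, squaring keeps the order but not the differences, converting Celsius to Fahrenheit keeps the differences but not the ratios, and converting miles to kilometres keeps the ratios as well.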
Clyde H. Coombs (‘A Theory of Psychological Scaling’, Engineering Research Institute Bulletin, 1946) has shown that there are many other such scales (such as partial orderings) which are useful in the social sciences, and urges keeping to lower levels, rather than quantifying by fiat. Procedures for transforming data into higher levels of measurement are known as ‘scaling’ or ‘quantification’. If the representation can be made on a straight line it is unidimensional scaling (as in Guttman and Likert scales), but if it needs two or more dimensions it is multidimensional scaling.
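As a rough illustration of unidimensional scaling, the following Python sketch checks whether a set of yes/no answers forms a perfect Guttman scale, that is, whether the items can be ordered so that anyone who endorses a harder item also endorses every easier one. The function name and the response data are hypothetical, and the check is a simplified reading of the Guttman idea rather than a full scaling procedure.

```python
def is_perfect_guttman(responses):
    """Return True if 0/1 responses (rows = respondents, columns = items)
    fit a single cumulative ordering of the items."""
    n_items = len(responses[0])
    # Order items from most to least endorsed (easiest to hardest).
    totals = [sum(row[j] for row in responses) for j in range(n_items)]
    order = sorted(range(n_items), key=lambda j: -totals[j])
    for row in responses:
        pattern = [row[j] for j in order]
        score = sum(pattern)
        # In a perfect scale, a respondent with score k endorses exactly
        # the k easiest items and none of the harder ones.
        if pattern != [1] * score + [0] * (n_items - score):
            return False
    return True

# Hypothetical answers to three items.
cumulative = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
mixed = [[1, 0, 1], [1, 1, 0], [1, 0, 0]]

print(is_perfect_guttman(cumulative))
print(is_perfect_guttman(mixed))
```

When the check succeeds, each respondent's total score locates them on a single line; when it fails, as with the second data set, a one-dimensional representation loses information and more dimensions may be needed.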
Most textbooks on survey research explain the different levels of measurement, with examples, and describe the statistics and techniques that are appropriate to the different levels (see, for example, D. A. de Vaus, Surveys in Social Research, 1985, 1991).
The assessment of a trait or feature against a standard scale.
Psychologists rely heavily on measurements for very different purposes, ranging from making clinical diagnoses based on test scores to assessing the effects of an independent variable on a dependent variable in an experiment. Several different issues arise when considering measurement. One consideration is whether the measurement shows reliability and validity. Reliability refers to consistency: if the results of a test or measurement are reliable, a person should receive a similar score if tested on different occasions. Validity refers to whether the measurement will be useful for the purposes for which it is intended.
The Scholastic Assessment Test (SAT) is reasonably reliable, for example, because many students obtain nearly the same score if they take the test more than once. If the test score is valid, it should be useful for predicting how well a student will perform in college. Research suggests that the SAT is an adequate but imperfect predictor of how well students will perform in their first year in college; thus, it shows some validity. However, a test can be reliable without being valid. If a person wanted to make a prediction about an individual's personality based on an SAT score, they would not succeed very well because the SAT is not a valid test for that purpose, even though it would still be reliable.
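One common way to put numbers on these ideas, sketched here in Python, is to use a correlation coefficient: the correlation between two sittings of the same test as an index of reliability, and the correlation between the test and the outcome it is meant to predict as an index of validity. All the scores below are invented for illustration, and the variable names are assumptions rather than real SAT data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Invented scores for six hypothetical students.
first_sitting = [1000, 1100, 1200, 1300, 1400, 1500]
second_sitting = [1010, 1090, 1220, 1290, 1410, 1490]
first_year_gpa = [2.8, 2.6, 3.2, 3.0, 3.6, 3.3]
extraversion = [30, 50, 40, 45, 35, 42]   # hypothetical personality score

reliability = pearson(first_sitting, second_sitting)   # test-retest
validity = pearson(first_sitting, first_year_gpa)      # intended purpose
off_label = pearson(first_sitting, extraversion)       # unintended purpose

print(f"test-retest reliability: {reliability:.2f}")
print(f"validity for first-year grades: {validity:.2f}")
print(f"validity for personality: {off_label:.2f}")
```

With these made-up numbers the two sittings agree almost perfectly, the test tracks first-year grades moderately well, and it says little about the personality score, which mirrors the point that a reliable test need not be valid for every purpose.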
Another dimension of measurement involves what is called the scale of measurement. There are four different scales of measurement: nominal, ordinal, interval, and ratio. Nominal scales involve simple categorization but do not make use of comparisons such as larger, bigger, or better. Ordinal scales involve ranking different elements on some dimension. Interval scales are used to assess by how much two measurements differ, and ratio scales, which have a true zero point, can also determine how many times larger one measurement is than another. One advantage of more complex scales of measurement is that they can be applied to more sophisticated research. More complex scales also lend themselves to more useful statistical tests that give researchers more confidence in the results of their work.
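The link between the scale of measurement and the statistics it supports can be sketched in Python. The mapping below follows one common textbook convention (mode for nominal data, median from the ordinal level upward, mean from the interval level upward); the table, the function name, and the sample data are illustrative assumptions rather than a fixed rule.

```python
from statistics import mean, median_low, mode

# Measures of central tendency conventionally treated as meaningful per scale.
PERMITTED = {
    "nominal": ["mode"],
    "ordinal": ["mode", "median"],
    "interval": ["mode", "median", "mean"],
    "ratio": ["mode", "median", "mean"],
}

# median_low always returns an actual data value, which suits ranked codes.
STATISTICS = {"mode": mode, "median": median_low, "mean": mean}

def summarize(values, scale):
    """Compute only the summaries regarded as meaningful for the given scale."""
    return {name: STATISTICS[name](values) for name in PERMITTED[scale]}

# Made-up data.
eye_colour = ["brown", "blue", "brown", "green"]   # nominal
agreement = [1, 2, 2, 3, 5]                        # ordinal codes, 1 = lowest
temperature_c = [10.0, 20.0, 30.0]                 # interval

print(summarize(eye_colour, "nominal"))
print(summarize(agreement, "ordinal"))
print(summarize(temperature_c, "interval"))
```

The design choice here is simply to refuse to compute a statistic that the scale does not support, so that, for example, no mean is reported for category labels or rank codes.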
meas·ure·ment /ˈmezhərmənt/ • n. the action of measuring something: accurate measurement is essential | a telescope with which precise measurements can be made. ∎ the size, length, or amount of something, as established by measuring: his inseam measurement. ∎ a unit or system of measuring: a hand is a measurement used for measuring horses.