A battery of psychological tests provides a profile of test scores for each subject. Can the subjects be grouped sensibly into a moderate number of classes or clusters so that the subjects in each class have similar profiles? Can the clustering process be made automatic and feasible while producing subjectively meaningful results?
These types of data and questions arise throughout the social and natural sciences. The “subjects” or units may be such things as census tracts, nations, tribes, legislators, plants, fossils, micro-organisms, documents, languages, or corporations. The data consist of a vector or profile of measurements or observations for each unit, one measurement on each of a number of variables or characters. Both quantitative variables and binary characters, scored only “present” or “absent,” are frequent and important.
Often, the profile of measurements on a unit may consist of some measure of interaction with each other unit. For example, the units might be importing nations and the variables the relative value of imports from each other (exporting) nation. Special attention must be given to any use of the interaction of the unit with itself. Social choices between individuals give other examples, and special techniques for analyzing such data have been developed [seeSociometry].
The role of units and variables may be interchanged. In the preceding example, if clustering on the basis of export pattern were desired, the old “variables” would become new “units.” (The measurements would change, since relative value would be computed with respect to a different base.) Or, again, from the voting record of legislators on each of many issues, one might want to cluster legislators or to cluster issues.
Clustering methods form a loosely organized body of techniques for the analysis of such data. As with most methods of data analysis, the aim is to find, to describe, and, hopefully, to lead to an explanation of some simple structure in a complex mass of data. Clustering methods are distinguished by the type of structure that is sought.
Examples in two dimensions. Geometric notions motivate the terminology and methods of clustering, although only for one or two variables can the geometric representation be used directly.
For example, suppose the units are essays, putatively written by a single author, and that the variables are two quantitative indicators of style, say mean sentence length and the rate of occurrence of the word of in the essay.
In Figure 1, each essay is represented by a point whose coordinates are the two measurements on the essay. Units are similar if their corresponding points are close together; clusters of units correspond to geometric clusters of points. In Figure 1, three clusters are visually evident and so clear-cut that a search for a theoretical explanation is natural. Might the essays have been written by three different authors?
Figure 2 illustrates data in which the existence of two overlapping clusters is suggested, but the identification is incomplete. Further variables might permit better definition of clusters, but even the imperfect clustering may aid the description and understanding of the data.
Extensions to more than two dimensions. The aim of most clustering methods is to imitate and automate the process that can be done well visually in two dimensions and to extend it to any number of dimensions. The gain is the extension to many dimensions; that the process is in some ways inferior to visual inspection in two dimensions should be neither surprising nor disturbing.
If it is possible to group units into a moderate number of classes, within which the units have similar profiles, then a reduction of the data has been achieved that is easily described, that facilitates the analysis of further data, and that may be suggestive of theoretical structures underlying the data.
Each class can be described by a typical or average profile and, secondarily, by some measure of variability (shape and magnitude) within the class. Once the classes are well determined, future units can be placed in one of the classes through techniques of discriminant analysis [seeMultivariate analysis, article onclassification and discrimination].
In some instances the clusters may correspond to underlying types explained by theory. In others they may represent only a convenient empirical reduction of the data. Interest may lie exclusively with the units included in the analysis or may extend to additional units.
The clustering process may be carried out on a finer scale, dividing a class into subclasses. More generally, an entire hierarchical classification or taxonomy can be obtained. It is in this latter form that clustering methods have been developed and applied in biology.
The technology of clustering is in an early and rapidly changing state. With a few exceptions, the development of clustering methods began only in the late 1950s, as large, high-speed computers became widely available. Many methods have been proposed and some have been tried and proved useful, but none is thoroughly understood or firmly established. The next few years should bring dramatic advances in the formulation of new methods, in techniques of computer execution, and in theoretical foundations. Most important, understanding of how and why procedures work or fail will be gained through experience and comparative experimentation.
Methods of clustering can usually be broken down into these steps: selection of units and of variables for inclusion in the data, determination of measures of similarity between pairs of units, grouping of units into clusters, interpretation and description of the results of grouping, possible iterations, and further analysis. Not all methods fit this decomposition, but each step has some counterpart in most methods.
Selection of units and variables
Specification of the data requires selection both of units and of variables and a choice of scale for each variable. The importance of selection of units is not yet fully understood. One effect is clear: including many units of one type will ensure that that type shows up as a cluster. Overrepresentation of a type has no strong effect on the clustering of the rest. Selection may be easy. In a study of voting behavior of a legislature, all legislators would be used. In a socioeconomic study of counties in the United States, a population of units is evident and a random or stratified sample might be included. Mostly, though, there is no clear universe of units, and those easily available are not a probability sample from any population. Understanding must await more experience with the results of clustering, especially in applications that go beyond a description of the original data.
Selection of variables and characters is more critical than selection of units and determines what aspects of behavior of units will be represented or emphasized in the clustering. Overrepresentation of variables in some area can strongly affect the entire outcome. Not only is there no natural universe of variables, but the notion of what constitutes an individual variable is not well defined. Variables can be refined, subdivided, and combined. Two variables that are functionally dependent or that are highly associated statistically can usually be recognized and avoided, but variables will inevitably have some statistical dependence over various groups of units. The choice of variables is a highly subjective, crucial element of the clustering process.
Scaling of variables. The data in any clustering process may be arranged in a rectangular array whose rows correspond to the units being clustered and whose columns correspond to the observed variables or characters. The entries in a row form the vector or profile of measurements of a unit on the respective variables. Most work has used variables that are measured on a numerical scale or else binary characters that can be so represented by denoting presence by a 1 and absence by a 0. Although the latter may be treated as a quantitative variable, there are substantial gains, primarily in computer technique, to be achieved by taking advantage of the 0-1 nature of binary characters. Much development has been restricted to binary variables. While any variable may be represented, at least approximately, by several binary variables, the representation introduces new arbitrariness and other difficulties.
With general quantitative variables, the problem of scaling or weighting is critical. If, for example, the first variable in Figure 2 were the median years of schooling (in a county, say) and the second were the median family income measured in thousands of dollars, to change to income measured in hundreds of dollars would expand the vertical scale by a factor of 10 and greatly distort the picture. While the eye may adjust for a bad choice, the numerical measures used in two and more dimensions are severely affected by a bad choice. Imposing an arbitrary, standard scaling by requiring each variable to have a standard deviation of 1 over the units included is a frequent choice. It is less subjective but no less arbitrary and often poorer than imposing a choice based on an external standard or on subjective judgment.
Superficially, there is no scaling problem for binary variables that all take the same values, 0 and 1; but the difference between two variables, one about equally often 0 or 1 and another that takes the value 1 for about 2 per cent of the units, may indicate a need for some scaling of binary variables also.
Measures of similarity
The geometric representation of figures 1 and 2 can be extended conceptually to more than two variables. Generally, if the measurements of the first unit on k variables are denoted by (x1l, …, x1k), then the unit can be represented as a point in k-dimensional space with the measurements as coordinates. If the second unit has measurements (x2l, …, x2k), similarity of the two units corresponds to “closeness of the points” representing them. One natural measure of dissimilarity between points 1 and 2 is Euclidean squared distance,
Unless the variables have been carefully scaled, a weighted distance,
w1(x11—x21)2 + (x12—x22)2+…+wk(x1k—x2k)2,
is needed to make sense of the analysis. In order to allow for patterns of statistical dependence among the variables, a more complex weighting is required, such as
There is no uniquely correct choice of weights, but a careful subjective choice based on external knowledge of the variables, observed pattern of variability, and computational feasibility should be workable and will be preferable to an arbitrary but objective choice such as using the equally weighted Euclidean distance. New theory and methods— likely requiring lengthy and iterative computation —to make more effective use of internal patterns of variability to guide the choice and adjustment of weights will gradually be developed (see Ihm 1965).
For quantitative variables, the measures of similarity commonly used are equivalent to one of the weighted or unweighted squared distances. For binary variables, a greater variety of measures are in use. The equally weighted squared distance, when applied to variables taking only the values 0 and 1, yields the number of variables for which the items fail to match, a measure equivalent to the simple matching coefficient—the proportion of characters for which the two units have the same state. If a 1 represents possession of an attribute, a positive match, 1-1, may be more indicative of similarity than a negative match, 0-0, and numerous ways of taking account of the difference have been proposed (see Sokal & Sneath 1963, section 6.2).
Computation in grouping into clusters
Clustering methods require vast amounts of computation. Two examples illustrate some magnitudes and their relevance to methodology.
Suppose 100 units are to be clustered, so that there are 4,950 pairs of units. To compute un-weighted distances for all pairs of units on the basis of, say, 50 variables requires sums of about 250,000 products. This is easily handled by currently available computers, although attention to computational technique, especially in the storage of data, will be critical as the number of units increases.
Consider next an operation that in principle is natural and attractive: Divide the 100 units into two groups in all possible ways, evaluate a criterion measuring the homogeneity of the groups for each partition, and choose the best one. With 100 units, there are (2100 — 2) @ 6.3 1029 groupings. Even if successive groupings could be generated and the criterion evaluated in a spectacularly efficient ten machine cycles, and using the fastest currently available machine cycle of 10-6 second, the process would require about 1017 years. Division into three and more groups is worse. Hence, many conceptually useful processes can never be realized in practice. Some modification or restriction is required to reduce drastically the possibilities to be considered, and even then results that only approximate conceptually optimum procedures must be accepted.
As an example of a possible extreme simplification, suppose that the units were dated and that time was an important variable. If the units in any cluster were required to be adjacent in the time sequence, then the number of partitions of 100 units into two groups is reduced to 99, into three groups to 4,851 (Fisher 1958).
A second possibility, based on the preceding one, is to require the grouping to be ordered on some variable, not prespecified but chosen to make the grouping best. When all variables are dichotomous, this procedure becomes remarkably simple and has been highly developed and used in plant classification with large numbers of units and variables. Lance and Williams (1965) survey these methods.
Hierarchical methods. A third possibility is to reverse directions, building up clusters by combining one or more units at a time. Most of the procedures now used fall in this category. One commonly used version is built on repeated application of the following rule: Find the closest pair of clusters and combine them into a new, larger cluster and compute the distance between the new cluster and each of the remaining clusters. Initially, the clusters are the N units, treated as single-unit clusters. The first new cluster will be a two-unit cluster; the next may be a second two-unit cluster or it may be a three-unit cluster.
Since the process starts with N clusters and each stage reduces the number by one, the process terminates after exactly N — 1 steps, with all N items combined in one huge cluster. Of chief interest are the intermediate results, especially the last few, where the units are sorted into a moderate number of clusters. An interpretative phase is needed.
The output can be well presented in a linkage tree or dendrogram, as illustrated in Figure 3. The initial units are placed on a line in an order that is easily determined from the process. Vertical lines are dropped from each unit until it joins with another. If the distance measure is used as the vertical scale, then the vertical location of each joint shows how dissimilar the two groups joined were. Thus clusters formed near the top represent homogeneous groups, clusters formed farther down represent groups in which some units differ in larger ways, and clusters formed far down represent clusters formed only because the process goes to its logical end of one big group. The linkage tree generated by the process is a salient feature and potential advantage. A hierarchical nested structure is obtained upon choice of several horizontal slices of the tree. Slices may be made at standard vertical positions, or to give desired numbers of groups, or at natural breaks in the particular tree— where there are long vertical stretches without joints.
Another salient feature of this version is how little must be specified as part of the process. Once a distance measure is specified— between two clusters of any size, not just between two units— the process is completely determined. All judgment of how many clusters or how many levels of clustering is postponed to the next phase.
A major variation that gives up both salient features begins in the same way but proceeds by building up the first cluster until adding more units would make it too inhomogeneous. Then another cluster is constructed, etc. The process requires a rule for judging when to stop adding to a cluster (as well as the distance measure to determine which unit to add). As described, the process gives no hierarchy of clusters, but it could be reapplied to the clusters of the first round.
How should a measure of similarity between units be extended to a measure of similarity between clusters? Should it be based on the most similar units in the clusters? the least similar? an average? Should all units in a cluster be weighted equally, or should two joining stems be weighted equally? Methods based on these and many other possibilities have been tried. The choices matter—some favor large clusters, others smaller, equalsized clusters—but no adequate basis for choice yet exists.
Where to stop clustering or how to cut the tree are open questions. The objectives of the analysis are essential to the decision, but perhaps some statistical models may be developed that would aid the choice.
The possibility of treating clustering as an iterative process with the results of one clustering used to improve the next has just begun to be explored and will surely play an essential part in any thorough clustering procedure. Ihm (1965) illustrates one use of this idea to deal with the choice of weights. Ball (1965) describes a composite procedure that is highly iterative.
Relation to factor analysis
Factor analysis has frequently been used in lieu of direct clustering procedures. Both methods of analysis are attempts to discover and describe structure in an unstructured rectangular array of data. The methods seek different types of structure, although there is much overlap. Described in conventional statistical terms, factor analysis is an attempt to find a few “independent” variables such that regression on those variables fits the observed data. Geometrically, it is an attempt to locate a linear subspace that nearly contains the data [seeFactor analysis].
In contrast, cluster analysis is an attempt to find an analysis of variance classification of the units (a one-way or nested design) that fits the observed data, that is, that reduces unexplained variation [seeLinear hypotheses, article onanalysis of variance].
Any cluster structure can be explained by a factor structure, although generally as many factors are needed as there are clusters. However, if two or three factors explain most of the variation, then by estimating factor scores for each unit, the units can be represented as points in two or three dimensions and clusters determined visually. This is one of the oldest routes for finding clusters.
Interchanging the role of units and variables in a factor analysis is common when clusters are sought. Then the process of rotation to “simple structure” may show directly the presence of clusters of units. These approaches permit the use of the older, better developed techniques of factor analysis.
Role of statistics
Clustering methodology has not yet advanced to the stage where sources of error and variation are considered formally; consequently, formal statistical methods play almost no role at present.
Unfortunately, even informal consideration of statistical variation has been neglected, in part because, typically, neither units nor variables are simple random samples, so conventional statistical theory is not directly applicable. Thus, although measures of similarity may be algebraically equivalent to correlation coefficients, conventional sampling theory and tests of hypotheses are not applicable or relevant.
Statistical variation. Several levels of statistical variation require consideration. At the lowest level are variation and error in the measurements. At this level there may be measurement error, as in determining the median income in a census tract, or there may be intraunit variability, as when units are Indian castes and tribes being clustered on the basis of physical measurements on individual members. Measurement error and intraunit variability, even on binary attributes, are present more often than is hoped or admitted, and they degrade the quality of clustering. Sometimes the variation must be accepted, but often it can be reduced by improved measurement technique or by inclusion of more than one instance of each unit.
A second level is variation of units within a cluster. This variation cannot be estimated until clusters are at least tentatively determined, but this is the important variability for determining statistical scales and weights, as would be used, for example, in classifying new units into established clusters.
Variation between clusters is the largest source and is the variation explainable by the clustering. Its magnitude depends heavily on the selection of units. Unfortunately, scaling variables according to over-all variability reflects largely this cluster-to-cluster variability.
Statistical models for variability in clustering are not yet highly developed. The fundamental paper by Rao (1948) and current work by Ihm (1965) are of note. One statistical model for clustering is very old. The distributions of observed data like those illustrated in figures 1 and 2 are naturally rep-resented as mixtures whose component distributions correspond to the clusters. This formulation has greater conceptual than practical value; direct numerical use is troublesome even for a single variable [seeDistributions, statistical, article onmixtures of distributions].
Development and applications
After a long, slow history in social science, notably in psychology and anthropology, numerical clustering methods underwent a rebirth of rapid, serious development —first through work in numerical taxonomy, both in conventional biology and in microbiology, and later in automatic classification and pattern recognition. These developments have spread throughout the social and natural sciences.
The book Principles of Numerical Taxonomy by Sokal and Sneath (1963) provides the most comprehensive presentation of methods and principles and a thorough bibliography. Later developments in the biological area are represented in such periodicals as the Journal of Ecology, Journal of Bacteriology, Journal of General Microbiology, and the newsletter Taxometrics. The symposium of the Systematics Association (Heywood & McNeill 1964) is valuable.
The history of clustering in anthropology and linguistics has been surveyed by Driver (1965) in a detailed discussion of principles and results. [Applications in geography are described in Geography, article onstatistical geography.]
Clustering methods in psychology, often as an adjunct of factor analysis and often under the name “pattern analysis,” have developed from Zubin (1938) and Tryon (1939). Many grouping ideas were introduced by McQuitty (1954; 1964) in a long series of papers.
Work on political districting is exemplified by Weaver and Hess (1963) and Kaiser (1966). Developments and applications in pattern recognition, information retrieval, and automatic classification are in large part not yet in the public literature (Needham 1965). Ball (1965) surveys approaches taken in many areas.
David L. Wallace
Ball, Geoffrey H. 1965 Data Analysis in the Social Sciences. Volume 27, part 1, pages 533–560 in American Federation of Information Processing Societies Conference, Proceedings. Fall Joint Computer Conference. Washington: Spartan Books; London: Macmillan.
Driver, Harold E. 1965 Survey of Numerical Classification in Anthropology. Pages 301–344 in Dell Hymes (editor), The Use of Computers in Anthropology. The Hague: Mouton.
Fisher, Walter D. 1958 On Grouping for Maximum Homogeneity. Journal of the American Statistical Association 53:789–798.
Heywood, Vernon H.; and Mcneill, J. (editors) 1964 Phenetic and Phylogenetic Classification. London: Systematics Association.
Ihm, Peter 1965 Automatic Classification in Anthropology. Pages 357–376 in Dell Hymes (editor), The Use of Computers in Anthropology. The Hague: Mouton.
Kaiser, Henry F. 1966 An Objective Method for Establishing Legislative Districts. Midwest Journal of Political Science 10, no. 2:200–213.
Lance, G. N.; and Williams, W. T. 1965 Computer Programs for Monothetic Classification (“Association Analysis”). Computer Journal 8:246–249.
McQuitty, Louis L. 1954 Pattern Analysis Illustrated in Classifying Patients and Normals. Educational and Psychological Measurement 14:598–604.
Mcquitty, Louis L. 1964 Capabilities and Improvements of Linkage Analysis as a Clustering Method. Educational and Psychological Measurement 24:441–456.
Needham, R. M. 1965 Computer Methods for Classification and Grouping. Pages 345–356 in Dell Hymes (editor), The Use of Computers in Anthropology. The Hague: Mouton.
Rao, C. Radhakrishna 1948 The Utilization of MultipleMeasurements in Problems of Biological Classification. Journal of the Royal Statistical Society Series B 10: 159–193. → Pages 194–203 contain an especially interesting discussion of this paper.
Sokal, Robert R.; and Sneath, Peter H. A. 1963 Principles of Numerical Taxonomy. San Francisco: Freeman.
Tryon, Robert C. 1939 Cluster Analysis: Correlation Profile and Orthometric (Factor) Analysis for the Isolation of Unities in Mind and Personality. Ann Arbor, Mich.: Edwards.
Ward, Joe H. JR. 1963 Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association 58:236–244.
Weaver, James B.; and Hess, Sidney W. 1963 A Procedure for Nonpartisan Districting: Development of Computer Techniques. Yale Law Journal 73:288–308.
Zubin, Joseph 1938 A Technique for Measuring Likemindedness. Journal of Abnormal and Social Psychology 33:508–516