## Cluster analysis

Quantitative social science often involves measurements of several variables for a number of cases (individuals or subjects). Searching for groupings, or *clusters*, is an important exploratory technique. Grouping can provide a means for summarizing data, identifying outliers, or suggesting questions to study.

A well-known clustering is that of stars into a main sequence, red giants, and white dwarfs, according to temperature and luminosity. The military has used cluster analysis of anthropometric data to reduce the number of different uniform sizes kept in inventory. Cluster analysis in marketing is called *market segmentation*; consumers are clustered according to psychographic, demographic, and purchasing behavior variables. The United States has been divided into a number of clusters according to lifestyle and buying habits.

Establishing the *profile* of a case, an observational unit, is the first step in cluster analysis. The profile of a case is its pattern of scores across a set of correlated variables. Cases with similar profiles should be in the same cluster; cases with disparate profiles, in different clusters. The mean profile of a cluster is the *centroid*, the set of means of the variables, for the individuals in that cluster. Cluster profiles provide a good summary of the data. Examining them provides insight as to what the clusters mean. A cluster’s profile can suggest an interpretation and a name for it.
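As a minimal sketch of the centroid idea, assuming each profile is simply a list of numeric scores of equal length, the mean profile of a cluster can be computed variable by variable. The function name and the scores below are illustrative and not part of the original entry.

```python
from statistics import mean

def centroid(profiles):
    """Return the variable-by-variable mean of a list of equal-length profiles."""
    # zip(*profiles) regroups the scores by variable instead of by case.
    return [mean(column) for column in zip(*profiles)]

# Hypothetical scores on three variables for the three cases in one cluster.
cluster = [
    [2.0, 4.0, 6.0],
    [4.0, 6.0, 8.0],
    [3.0, 5.0, 10.0],
]
cluster_centroid = centroid(cluster)
print(cluster_centroid)
```

Comparing such mean profiles across clusters is one way to see which variables distinguish one cluster from another.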

There are two broad types of clustering algorithms: hierarchical clustering and nonhierarchical clustering (partitioning). Hierarchical clustering follows one of two approaches. *Agglomerative clustering* starts with each case as a unique cluster, and with each step combines cases to form larger clusters until there is only one or a few larger clusters. *Divisive clustering* begins with one large cluster and splits it into smaller clusters.

There are several ways to define intercluster distance. This can be done by forming all pairs of objects, with one object in one cluster and one in the other, and computing the distances between the members of these pairs. *Single linkage* is based on the shortest of these; *complete linkage* on the longest; and *average linkage* on their mean. Joe Ward’s method (1963) is based on the sum of squares between the two clusters, summed over all variables. The centroid method is based on the distance between cluster centroids.
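The intercluster distances just described can be sketched as follows, assuming Euclidean distance between cases and clusters held as lists of coordinate tuples. The function names are illustrative. The Ward-style quantity is written here in its usual form as the between-cluster sum of squares for the pair, which equals the product of the two cluster sizes divided by their sum, times the squared distance between centroids.

```python
from itertools import product
from math import dist
from statistics import mean

def pair_distances(a, b):
    """Distances for all pairs with one case in cluster a and one in cluster b."""
    return [dist(p, q) for p, q in product(a, b)]

def single_linkage(a, b):
    return min(pair_distances(a, b))      # shortest pairwise distance

def complete_linkage(a, b):
    return max(pair_distances(a, b))      # longest pairwise distance

def average_linkage(a, b):
    return mean(pair_distances(a, b))     # mean pairwise distance

def centroid(points):
    return [mean(column) for column in zip(*points)]

def centroid_distance(a, b):
    return dist(centroid(a), centroid(b))

def ward_distance(a, b):
    """Between-cluster sum of squares for the pair, summed over all variables."""
    n_a, n_b = len(a), len(b)
    return n_a * n_b / (n_a + n_b) * dist(centroid(a), centroid(b)) ** 2

# Two hypothetical clusters of two cases each, measured on two variables.
a = [(0, 0), (0, 2)]
b = [(3, 0), (3, 2)]
print(single_linkage(a, b), complete_linkage(a, b), average_linkage(a, b))
print(centroid_distance(a, b), ward_distance(a, b))
```

In an agglomerative procedure, the pair of clusters with the smallest value of the chosen measure is the pair merged at each step.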

Nonhierarchical clustering is partitioning of the sample. The *K*-means algorithm assigns each case to the cluster having the nearest centroid. The process begins by partitioning the cases into *K* initial clusters and assigning each case to the cluster whose centroid is nearest. The centroids of the cluster receiving the new case and the cluster losing the case are updated. This is repeated until no more reassignments take place. The ISODATA algorithm is similar to *K*-means, except one loops through all cases before the centroids are updated. An alternative to starting with an initial clustering is to start with an initialization of the centroids, for example as the first *K* cases in the dataset or as *K* cases randomly chosen from it.
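The partitioning procedure can be sketched in a few lines. This is an illustrative batch variant rather than MacQueen's case-by-case version: centroids are recomputed only after a full pass through the cases, which is the update schedule attributed above to ISODATA. It assumes Euclidean distance and takes the first *K* cases as the initial centroids; the function name, the iteration cap, and the data are hypothetical.

```python
from math import dist
from statistics import mean

def k_means(cases, k, max_iter=100):
    """Batch K-means sketch: assign all cases, then update all centroids."""
    centroids = [list(case) for case in cases[:k]]  # first K cases as seeds
    labels = None
    for _ in range(max_iter):
        # Assign each case to the cluster whose centroid is nearest.
        new_labels = [
            min(range(k), key=lambda j: dist(case, centroids[j]))
            for case in cases
        ]
        if new_labels == labels:  # no reassignments took place: stop
            break
        labels = new_labels
        # Recompute each centroid as the mean profile of its members.
        for j in range(k):
            members = [c for c, lab in zip(cases, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [mean(column) for column in zip(*members)]
    return labels, centroids

# Hypothetical cases on two variables, forming two visibly separate groups.
cases = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
labels, centroids = k_means(cases, 2)
print(labels, centroids)
```

Because the result can depend on the starting centroids, it is common to rerun such a procedure from several initializations and compare the partitions.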

The notion of *nearest* requires a notion of *distance*. Often, rightly or wrongly, researchers use *Euclidean distance*, the straight-line distance between two points: the square root of the sum of squared differences across the variables. Euclidean distance is appropriate for variables that are uncorrelated and have equal variances. Standardization of the data is needed if the range or scale of one variable is much larger than that of others. *Mahalanobis distance* (statistical distance), which adjusts for different variances and for the correlations among the variables, may be preferred when these conditions do not hold.
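The contrast between the two distances can be illustrated with a small sketch. It is restricted to two variables so that the covariance matrix can be inverted by hand, and the covariance matrix itself is assumed known; the function name and the numbers are hypothetical.

```python
from math import dist, sqrt

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between two 2-D points for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - y[0], x[1] - y[1]]
    quad = sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))
    return sqrt(quad)

# Hypothetical covariance: the first variable has four times the variance
# of the second, and the two are uncorrelated.
cov = [[4.0, 0.0], [0.0, 1.0]]
origin = (0.0, 0.0)
along_first = (2.0, 0.0)
along_second = (0.0, 2.0)

# Both points are the same Euclidean distance from the origin ...
print(dist(origin, along_first), dist(origin, along_second))
# ... but a 2-unit difference on the high-variance variable counts for less.
print(mahalanobis_2d(origin, along_first, cov),
      mahalanobis_2d(origin, along_second, cov))
```

With an identity covariance matrix the two distances coincide, which is the sense in which Euclidean distance suits uncorrelated variables with equal variances.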

It is sometimes suggested that researchers start with hierarchical clustering to generate initial centroids, and then use nonhierarchical clustering. A conceptual model for clustering is that the sample comes from a mixture of several populations. This leads to a mathematical probability model called the *finite mixture model*. If the within-cluster type of distribution is specified (such as multivariate normal), then the *method of maximum likelihood* can be used to estimate the parameters. This is done with an iterative algorithm.
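In symbols, the mixture model can be written as follows. The notation is an assumption introduced here for illustration: $K$ clusters, mixing proportions $\pi_k$, within-cluster densities $f_k$ with parameters $\theta_k$, and $n$ cases $x_1, \dots, x_n$.

```latex
\begin{align}
f(x) &= \sum_{k=1}^{K} \pi_k \, f_k(x;\theta_k),
  \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1 \\
\ell(\pi,\theta) &= \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, f_k(x_i;\theta_k) \\
\tau_{ik} &= \frac{\pi_k \, f_k(x_i;\theta_k)}{\sum_{j=1}^{K} \pi_j \, f_j(x_i;\theta_j)}
\end{align}
```

The first line is the mixture density, the second is the log-likelihood that maximum likelihood estimation maximizes, and the third is the estimated probability that case $i$ belongs to cluster $k$. Iterative schemes of the kind mentioned above, commonly the EM algorithm, alternate between computing the $\tau_{ik}$ and updating the parameters; a case can then be assigned to the cluster for which its $\tau_{ik}$ is largest.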

There are several procedures for determining the number of clusters. This task should be guided by substantive theory and the practicality of the results. A criterion such as between-groups sum of squares or likelihood can be plotted against the number of clusters in a *scree plot*. When a normal mixture model is used, model selection criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) can be used.
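A sketch of how AIC and BIC might be compared across candidate numbers of clusters follows. It uses the standard definitions, with smaller values preferred, and counts parameters for a normal mixture with unrestricted covariance matrices. The maximized log-likelihood values are invented purely for illustration; in practice they would come from fitting the mixture model for each number of clusters.

```python
from math import log

def n_parameters(k, d):
    """Free parameters of a k-component, d-variate normal mixture
    with an unrestricted covariance matrix in each component."""
    mixing = k - 1                      # proportions sum to one
    means = k * d
    covariances = k * d * (d + 1) // 2
    return mixing + means + covariances

def aic(log_lik, p):
    return -2 * log_lik + 2 * p

def bic(log_lik, p, n):
    return -2 * log_lik + p * log(n)

# HYPOTHETICAL maximized log-likelihoods for 1, 2, and 3 clusters.
log_liks = {1: -520.0, 2: -480.0, 3: -476.0}
n_cases, n_vars = 200, 2

bic_by_k = {k: bic(ll, n_parameters(k, n_vars), n_cases) for k, ll in log_liks.items()}
aic_by_k = {k: aic(ll, n_parameters(k, n_vars)) for k, ll in log_liks.items()}
best_k = min(bic_by_k, key=bic_by_k.get)   # smallest BIC is preferred
print(bic_by_k, aic_by_k, best_k)
```

In this made-up example the third cluster improves the likelihood only slightly, so the penalty for its extra parameters outweighs the gain.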

Once the clusters are formed, researchers can use *discriminant analysis* to determine which variables account for the clustering and to classify new cases into the clusters. Some cluster techniques operate on distances or similarities rather than raw data. Variables can be clustered using their correlations as similarities. Simultaneous clustering of cases and variables is called *block clustering*. If a subset of the cases has similar values on a subset of the variables, these cases and variables form a block.
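To illustrate clustering variables by their correlations, one simple convention, assumed here, is to convert each correlation r into the dissimilarity 1 - r, so that strongly positively correlated variables are close. The variable names and scores are hypothetical, and other conversions (such as 1 - |r|) are equally possible.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_dissimilarity(variables):
    """Map each pair of variable names (in alphabetical order) to 1 - r."""
    names = sorted(variables)
    return {
        (u, v): 1 - pearson(variables[u], variables[v])
        for i, u in enumerate(names)
        for v in names[i + 1:]
    }

# Hypothetical scores of five cases on three variables.
variables = {
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "spending": [2.0, 4.0, 6.0, 8.0, 10.0],
    "debt": [5.0, 4.0, 3.0, 2.0, 1.0],
}
dissimilarity = correlation_dissimilarity(variables)
print(dissimilarity)
```

The resulting table of pairwise dissimilarities can be passed to any clustering technique that operates on distances rather than raw data.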

James MacQueen’s development of his *K*-means algorithm (1967) was a milestone in the development of cluster analysis. John Wolfe (1970) was the first to program maximum likelihood clustering for the finite normal mixture model. John Hartigan’s *Clustering Algorithms* (1975) did much to stimulate interest in cluster analysis. Geoff McLachlan and David Peel’s *Finite Mixture Models* (2000) is a comprehensive presentation of model-based clustering.

## BIBLIOGRAPHY

Hartigan, John A. 1975. *Clustering Algorithms*. New York: Wiley.

MacQueen, James B. 1967. Some Methods for Classification and Analysis of Multivariate Observations. In *Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability*. Vol. 1: *Theory of Statistics*, ed. Lucien M. LeCam and Jerzy Neyman, 281-297. Berkeley: University of California Press.

McLachlan, Geoffrey, and David Peel. 2000. *Finite Mixture Models*. New York: Wiley.

Ward, Joe H., Jr. 1963. Hierarchical Grouping to Optimize an Objective Function. *Journal of the American Statistical Association* 58: 236-244.

Wolfe, John H. 1970. Pattern Clustering by Multivariate Mixture Analysis. *Multivariate Behavioral Research* 5: 329-350.

*Stanley L. Sclove*

## cluster analysis

**cluster analysis** A form of multivariate analysis, of which the purpose is to divide a set of objects (such as variables or individuals), characterized by a number of attributes, into a set of clusters or classes, in such a way that the objects in a class are maximally similar to each other and maximally different to the other objects, with reference to a selected list of descriptive indicators and characteristics which form the basis of the analysis. In biology the technique is known as numerical taxonomy.

Cluster analysis was among the multivariate statistical techniques developed by Eshref Shevky and Wendell Bell (*Social Area Analysis*, 1955) for analysing census data. It is applied to census small-area statistics and social indicators in social area analysis to create area typologies, either focusing on particular urban or metropolitan areas, or covering the country as a whole. Cluster analysis has found a wide range of applications in other areas, including developmental work with opinion statements or questions from which an attitude scale will be formed; exploratory work to identify underlying patterns in large data-sets; analytical work to measure significant similarities and differences between individuals, social groups, companies, or other types of organization, nation-states, types of event, and so forth; and the development of classifications and typologies.

Different ways of defining similarity and difference give rise to distinct methods of clustering. Alternative ways of determining how well the solution fits the data will generally give rise to somewhat disparate results. Most classification procedures begin with a table of association or dis/similarity coefficients between each pair of objects and then proceed in one of two ways: bottom up (where the objects are successively merged into larger clusters) or top down (where the entire set of objects is divided into increasingly small clusters). These yield as a solution a hierarchical clustering scheme (HCS), which is represented by a dendrogram, or tree. An HCS is also often represented as a set of contours within a multi-dimensional scaling solution of the same data. The most common clustering method is stepwise hierarchical clustering, with output displayed in a dendrogram. The figure clearly identifies any outlier cases, which remain separate from other cases until the final stage of the clustering process, when all cases are combined in a single group, usually after three or more intermediate levels of aggregation.
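The bottom-up route can be sketched directly from a table of pairwise dissimilarities. The sketch below assumes single linkage (the distance between two clusters is the smallest dissimilarity between their members) and returns the merge history, which is the information a dendrogram displays. The object labels and dissimilarities are hypothetical.

```python
from itertools import combinations

def agglomerate(labels, dissim):
    """Bottom-up single-linkage merging from a pairwise dissimilarity table.

    Returns a list of (left cluster, right cluster, merge height) tuples.
    """
    def d(a, b):
        # The table stores each unordered pair once.
        return dissim[(a, b)] if (a, b) in dissim else dissim[(b, a)]

    clusters = [(label,) for label in labels]   # every object starts alone
    history = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        height, left, right = min(
            ((min(d(a, b) for a in c1 for b in c2), c1, c2)
             for c1, c2 in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
        history.append((left, right, height))
    return history

# Hypothetical dissimilarities: D lies far from the other three objects.
dissim = {
    ("A", "B"): 1.0, ("A", "C"): 2.5, ("B", "C"): 2.0,
    ("A", "D"): 9.0, ("B", "D"): 8.0, ("C", "D"): 7.0,
}
history = agglomerate(["A", "B", "C", "D"], dissim)
for left, right, height in history:
    print(left, right, height)
```

In this example D joins only at the last and highest merge, which is how an outlier shows up in the tree.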

Recent developments in this field include additive overlapping clustering (where each cluster has a measure of its importance), additive trees (where the length of the path between points represents the data dissimilarity), and rectangular clustering (where both the individuals and the variables of the data are clustered jointly).

## cluster analysis

**cluster analysis** Any statistical technique for grouping a set of units into clusters of similar units on the basis of observed qualitative and/or quantitative measurements, usually on several variables. Cluster analysis aims to fulfill simultaneously the conditions that units in the same cluster should be similar, and that units in different clusters should be dissimilar. It is not usually possible to satisfy both conditions fully, and no single method can be recommended as best for all sets of data. Another desirable property is that some variables should be constant for all units within a cluster, which makes it possible to provide a simple scheme for identifying units in terms of clusters.

Most cluster analysis methods require a *similarity* or *distance* measure to be defined between each pair of units, so that the units similar to a given unit may be identified. Similarity measures have been proposed for both quantitative (continuous) variables and qualitative (discrete) variables, using a weighted mean of similarity scores over all variables considered. The term distance comes from a geometric representation of data as points in multidimensional space: small distances correspond to large similarities.

*Hierarchical cluster analysis* methods form clusters in sequence, either by amalgamation of units into clusters and clusters into larger clusters, or by subdivision of clusters into smaller clusters and single units. Whichever direction is chosen, the results can be represented by a *dendrogram* or family tree in which the units at one level are nested within units at all higher levels.

*Nonhierarchical cluster analysis* methods allocate units to a fixed number of clusters so as to optimize some criterion representing a desired property of clusters. Such methods may be iterative, involving transfer of units between clusters until no further improvement can be achieved. The solution for a given number of clusters need bear little relation to the solution for a larger or smaller number.
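A similarity measure built as a weighted mean of per-variable scores can be sketched as follows. The scoring rules are assumptions chosen for illustration: a quantitative variable scores one minus the absolute difference divided by that variable's range, and a qualitative variable scores 1 for a match and 0 otherwise. The function name, units, ranges, and weights are hypothetical.

```python
def similarity(u, v, kinds, ranges, weights):
    """Weighted mean of per-variable similarity scores for two units."""
    scores = []
    for a, b, kind, value_range in zip(u, v, kinds, ranges):
        if kind == "quantitative":
            # 1 for identical values, 0 for values a full range apart.
            scores.append(1 - abs(a - b) / value_range)
        else:
            # Qualitative variables either match or they do not.
            scores.append(1.0 if a == b else 0.0)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Two hypothetical units: age in years, type of area, car ownership.
kinds = ("quantitative", "qualitative", "qualitative")
ranges = (50, None, None)   # age is assumed to span 50 years in the data
weights = (1, 1, 1)         # equal weights
unit_1 = (30, "urban", "yes")
unit_2 = (40, "urban", "no")
score = similarity(unit_1, unit_2, kinds, ranges, weights)
print(score)
```

A similarity of this kind lies between 0 and 1, so one minus the similarity can serve as a distance for methods that expect one.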

Cluster analysis is often used in conjunction with other methods of multivariate analysis to describe the structure of a complex set of data.