Cluster Analysis Tool

Created: 6/24/2002
Description:

This sample contains a command that uses a cluster analysis engine (available as a separate sample) to classify data into a predetermined number of classes using multiple numeric attributes. The command adds a field containing the classification to the source data table and generates a dendrogram to help interpret the classification. A detailed discussion of cluster analysis and the use of this sample follows.

Cluster Analysis

Cluster analysis is a technique for classifying numerical data using multiple attributes. It is an exploratory technique, whose goal is to help you to better understand what patterns exist in a given data set, and to propose explanations for those patterns. Cluster analysis can be particularly useful when combined with mapping, because the clusters that emerge may form geographic patterns that lead to insights about connections between patterns in attribute data and the spatial context within which those patterns formed.

To understand how cluster analysis works, it's useful to think about the technique spatially. If cluster analysis is used on two variables, it can be thought of as finding clusters in two-dimensional space. For example, in the small map below, cluster analysis is performed on the 50 states and the District of Columbia using the two variables of median rent and median home value.

The Scatterplot

The graph below is called a scatterplot. A scatterplot contains a point for each record in the dataset (in this case, a point for each state). The X-axis of the scatterplot represents median home value, and the Y-axis represents median rent, thus creating an attribute space within which we can identify clusters. Points in the upper right-hand corner of the scatterplot are states that have high median rent and high median home value, while points in the lower left are states that have low median rent and low median home value. (Note: The scatterplot pictured here was generated using the Scatterplot Tool, which can be found in ArcObjects Developer Help under Samples > Analysis and Visualization > Scatterplot Tool.)

Cluster analysis looks at the patterns that these points form in data space. The type of cluster analysis that we used here starts out by placing each point in the scatterplot in its own cluster. It then looks to see which two points are closest to one another (in data space). Those two points are added together to create a new, larger cluster. The process then repeats itself, finding the two clusters that are closest to one another, and "lumping" them together into a larger cluster. The process is complete when all individuals in the data set have been lumped together into one big cluster.
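The merging process described above can be sketched in a few lines of Python. This is only an illustration, not the code used by the Clustering Engine sample. The function name agglomerate, the sample values, and the choice of single linkage (the distance between two clusters is taken to be the distance between their two closest members) are all assumptions; this description does not say how the sample measures the distance between clusters.

```python
from itertools import combinations
from math import dist


def agglomerate(points):
    """Merge the closest pair of clusters until one cluster remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a frozenset of indices into `points`.
    """
    # Every point starts out in its own cluster.
    clusters = [frozenset([i]) for i in range(len(points))]

    def gap(pair):
        # Assumption: single linkage, i.e. the distance between two clusters
        # is the distance between their two closest members.
        a, b = pair
        return min(dist(points[i], points[j]) for i in a for j in b)

    merges = []
    while len(clusters) > 1:
        # Find the two clusters that are closest in data space and lump them.
        a, b = min(combinations(clusters, 2), key=gap)
        merges.append((a, b, gap((a, b))))
        clusters = [c for c in clusters if c != a and c != b] + [a | b]
    return merges


# Hypothetical (x, y) values in attribute space, not real state data.
sample = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.0, 7.0), (20.0, 20.0)]
for a, b, d in agglomerate(sample):
    print(sorted(a), "+", sorted(b), "at distance", round(d, 2))
```

A data set of n points always produces n - 1 merges, and the recorded merge distances are what a dendrogram plots along its distance axis.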

The Dendrogram

One product of cluster analysis is a tree diagram representing the entire process of going from individual points to one big cluster. This diagram is called a dendrogram, and is illustrated below. Once the cluster analysis algorithm has been run, the user must decide how many clusters he or she wants to explore (this is sometimes referred to as "pruning" the dendrogram). In this example we have chosen to look at four clusters (symbolized in red, yellow, green, and blue).
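One way to think about choosing a number of clusters is to stop the merging process early: if you stop when k clusters remain, you get the same classes that you would get by cutting the dendrogram at that level. The sketch below illustrates this idea under stated assumptions. The function name classify, the 1-based class numbers, and the single-linkage distance rule are hypothetical choices made for the example, not details taken from the sample.

```python
from itertools import combinations
from math import dist


def classify(points, k):
    """Return one class number (1..k) per point by merging until k clusters remain."""
    # Each cluster is a list of indices into `points`.
    clusters = [[i] for i in range(len(points))]

    def gap(pair):
        # Assumption: single linkage (closest pair of members).
        a, b = pair
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=gap)
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]

    # Turn the surviving clusters into a class number for every point.
    labels = [0] * len(points)
    for class_id, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = class_id
    return labels


# Hypothetical attribute values, not real data.
sample = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (5.0, 7.0), (20.0, 20.0)]
print(classify(sample, 3))
```

The returned list plays the role of the classification field that the command adds to the source table. Calling the function with different values of k is a quick way to compare classifications.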

Deciding the number of clusters to map can be aided by looking at the dendrogram. There are three key pieces of information that you can get from the dendrogram: the weight, the compactness, and the distinctness of each cluster. In the dendrogram above, the yellow cluster is labeled so that you can see the parts of it that represent these pieces of information.

The weight of each cluster is represented by the number of leaves that its branch of the dendrogram leads to. Because the leaves are equally spaced along the Y-axis of the dendrogram, the weight of a cluster is the percentage of the dendrogram's total height that the cluster occupies.

The compactness of a cluster represents the minimum distance at which the cluster comes into existence. The horizontal axis of the dendrogram measures the distance between clusters. If a cluster contains only one observation, its compactness is 0. This is why all the leaves line up on the left-hand side of the dendrogram. The relative compactness of the yellow cluster can be estimated by looking at the point at which all of its branches merge together, and the relative distance of that point from the left-hand side of the dendrogram.

The distinctness of a cluster is the distance along the X-axis from the point at which it comes into existence to the point at which it is aggregated into a larger cluster. Distinctness can be seen on the dendrogram as the length of a branch along the horizontal axis.
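The three measures can be read directly off a merge tree. The following sketch assumes a very simple, hypothetical tree structure (the Node class and its fields are invented for this example and are not the data model used by the sample) in which every node stores the distance at which it came into existence.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    height: float = 0.0  # distance at which the cluster forms; 0 for a leaf
    children: list = field(default_factory=list)


def count_leaves(node):
    """Number of leaves (observations) that this branch leads to."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


def describe(node, parent, root):
    """Weight, compactness, and distinctness of `node` within the tree."""
    return {
        # Share of all leaves, i.e. share of the dendrogram's height.
        "weight": count_leaves(node) / count_leaves(root),
        # Distance at which the cluster comes into existence.
        "compactness": node.height,
        # Length of the branch before it is absorbed into its parent.
        "distinctness": parent.height - node.height,
    }


# Hypothetical four-leaf dendrogram with made-up merge distances.
a, b, c, d = Node("a"), Node("b"), Node("c"), Node("d")
ab = Node("ab", 1.0, [a, b])
cd = Node("cd", 2.0, [c, d])
root = Node("root", 6.0, [ab, cd])

print(describe(ab, root, root))
print(describe(cd, root, root))
```

In this made-up tree, the cluster "ab" is more compact than "cd" (it forms at a smaller distance) and also more distinct (its branch is longer before it merges into the root).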

When choosing a classification, you will want to choose clusters that are as compact and distinct as possible. In the example above, there are four very distinct clusters. However, we could have chosen to break the green cluster into its two components, which are both fairly distinct clusters as well, and are more compact. You may wish to run the cluster analysis a number of times, choosing different numbers of clusters, and exploring how the mapped patterns of those clusters are distributed.

Interpreting the Clusters

Once you have chosen the clusters that you will use to symbolize your map, go back to the scatterplot to gain a better understanding of what those clusters mean. In this case, the red cluster represents states with high median rent and high median home value. The blue cluster represents states with low median rent and low median home value. The yellow and green clusters fall somewhere in between.
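A numeric complement to reading the scatterplot is to average each attribute within each cluster. The sketch below is a minimal illustration; the function name summarize and the attribute values are hypothetical and are not taken from the census data used in the example.

```python
from collections import defaultdict
from statistics import mean


def summarize(records, labels):
    """Average every attribute within each cluster.

    `records` holds one tuple of numeric attributes per feature, and
    `labels` holds the class assigned to the feature at the same position.
    """
    groups = defaultdict(list)
    for record, label in zip(records, labels):
        groups[label].append(record)
    # zip(*rows) regroups the values column by column (attribute by attribute).
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in groups.items()
    }


# Hypothetical (rent, home value) pairs and class numbers.
records = [(400.0, 90000.0), (500.0, 110000.0), (900.0, 250000.0)]
labels = [1, 1, 2]
for label, averages in sorted(summarize(records, labels).items()):
    print("class", label, "averages", averages)
```

A cluster whose averages are high on every attribute corresponds to points in the upper right of the scatterplot, and one whose averages are low corresponds to points in the lower left.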

While it is fairly easy to interpret cluster analysis when it is performed on just two variables, it becomes much more difficult as more variables are added. One tool that you can use for interpreting a more complex run of cluster analysis is a scatterplot matrix (pictured below). A scatterplot matrix can also be generated with the Scatterplot Tool, which can be found in ArcObjects Developer Help under Samples > Analysis and Visualization > Scatterplot Tool. The example below shows a scatterplot matrix of cities in the United States with a population greater than fifteen thousand, classified using the ethnicity variables from the 2000 census.

The User Interface

The user interface for the cluster analysis sample consists of a context command that must be placed in the Feature Layer context menu. When executed, that command opens a form that allows you to specify parameters and then to run the cluster analysis algorithm. Running the algorithm produces a dendrogram and a new field in the source table storing the new classification. The dendrogram consists of a dataframe with three layers: two point layers (the first contains the nodes and the second the leaves) and a line layer (contains the branches). Each dendrogram has one leaf corresponding to each feature in the map (in the example above, each leaf represents a state). The leaf feature layer contains all of the data associated with the source feature layer. The nodes and branches both have the following fields containing results from the cluster analysis:

The parameters in the form include the following:

Additional Information

This sample consists of a form and two classes (listed below). It also requires that you install the Clustering Engine sample, which is used to perform the cluster analysis. The Clustering Engine sample can be found in ArcObjects Developer Help under Samples > Analysis and Visualization > Clustering Engine.



How to use:
  1. Install the Clustering Engine sample, which contains functionality necessary for this sample to run. The Clustering Engine sample can be found in ArcObjects Developer Help under Samples > Analysis and Visualization > Clustering Engine.
  2. Register this sample's DLL by compiling the sample or by using Regsvr32.exe. (Note: If you choose to compile a new DLL on your machine, you will need to re-register the Clustering Engine DLL in the ClusterAnalysis.vbp file.)
  3. In ArcMap, open the Customize dialog by selecting Tools-->Customize...
  4. In the Toolbars tab, make sure Context Menus is checked.
  5. In the Commands tab select the Add from file... button, and add the ClusterAnalysis.dll file in the subsequent dialog.
  6. You should now be in the Analysis Samples category in the left pane of the Commands tab. Select the Cluster Analysis command from the right pane, and drag and drop it into the Feature Layer context menu. Close the customization dialog.
  7. To apply the command, right-click on a feature layer, and select it from the context menu.

Application:
ArcMap

Requires: An ArcMap session with a feature layer and the Clustering Engine sample.

Difficulty: Advanced


Visual Basic

File                  Description
DendroGenUI.cls       Implements the ICommand interface to add the cluster analysis tool to the UI.
DendroGenerator.cls   Uses the cluster analysis engine to run a cluster analysis and build a dendrogram.
ClusterAnalysis.vbp   Visual Basic project file.
ClusterAnalysis.dll   The compiled component.
frmMakeDendro.frm     Dialog box for setting parameters and running cluster analysis to generate a dendrogram.


Key CoClasses: Table, Field
Key Interfaces: ICommand, ICursor, ITable, IField
Key Members: AddField