Chapter 6: Density Methods.

Useful Concepts.

These methods regard clusters as patches of “high-density” in attribute space, surrounded by regions of lower density. To devise methods using this concept, we need some means of calculating density. The simplest is to take the number of objects in a unit volume of attribute space, defining a unit volume as that enclosed by a unit step in each attribute. This assumes the attributes are not scaled to lie in the range (0,1), since a single unit step would then cover the whole range, but it does require that they all be scaled to the same range of values, say 0-100.
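As an illustration of this measure, the following Python fragment (the language and function names are illustrative only, not part of any published method) scales the attributes to a common range and then counts, for each object, how many objects lie within half a unit step of it on every attribute, i.e. inside the unit volume centred on that object.

import numpy as np

def scale_to_range(data, lo=0.0, hi=100.0):
    # bring every attribute to the same range, e.g. 0-100
    # (assumes each attribute actually varies across the data)
    data = np.asarray(data, dtype=float)
    mins, maxs = data.min(axis=0), data.max(axis=0)
    return lo + (data - mins) * (hi - lo) / (maxs - mins)

def unit_volume_counts(data):
    # for each object, count the objects (including itself) lying
    # within half a unit step of it on every attribute, i.e. inside
    # the unit hypercube centred on that object
    data = np.asarray(data, dtype=float)
    return [int(np.all(np.abs(data - x) < 0.5, axis=1).sum())
            for x in data]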

In discussing these methods, there are several concepts associated with density that we shall find useful:

1. Straggliness.

Straggliness implies a much greater extension of the group in a few dimensions than in all the others. A simple example is a “sausage-shaped” cluster in three dimensions.

2. Gaps and Moats.

A “moat” is a volume of low density surrounding a high-density cluster. A gap is also a volume of low density, but it need not surround a cluster. The distance from each member of a cluster to the closest object in another cluster may be computed, and this gives a measure of the “separateness” or “isolation” of the clusters.
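This calculation is easily sketched, assuming the distances are held as a matrix (a list of lists) and that each object already carries a cluster label; both assumptions are mine, for illustration only.

def separateness(dist, labels, cluster):
    # distance from each member of `cluster` to the closest object
    # belonging to any other cluster; the smallest of these values
    # may be taken as the width of the moat
    members = [i for i, c in enumerate(labels) if c == cluster]
    others = [i for i, c in enumerate(labels) if c != cluster]
    return {i: min(dist[i][j] for j in others) for i in members}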

3. Connectivity.

Two objects are said to be “connected” if their distance apart in attribute space is less than a certain specified value. In some presentations of results, such objects are shown joined by a straight line.

A cluster is said to be “fully-connected” if every object in the cluster is directly connected to all other objects in the cluster. This occurs when the greatest distance between any two objects in the cluster is less than the specified connectivity value.

A cluster is said to be “minimally-connected” if each object is connected to all the others, either directly or via a chain, and there are t-1 connections joining the t objects. Thus severing any one connection will split the cluster into two separate clusters.

A cluster anywhere between these two extremes is said to be partially-connected, and this is the most common state of affairs.
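These definitions are easily turned into a test. The sketch below, given a distance matrix, the members of a cluster and a connection limit, reports whether the cluster is fully-, minimally- or partially-connected; the names and data layout are assumptions of the sketch.

def classify_connectivity(dist, members, limit):
    # classify a cluster as fully-, minimally- or partially-connected
    # for a given connection distance `limit`
    t = len(members)
    edges = [(a, b) for i, a in enumerate(members)
             for b in members[i + 1:] if dist[a][b] < limit]
    adj = {m: set() for m in members}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    # simple graph search to check the cluster is connected at all
    seen, stack = {members[0]}, [members[0]]
    while stack:
        for n in adj[stack.pop()]:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    if len(seen) < t:
        return "not connected"
    if len(edges) == t * (t - 1) // 2:
        return "fully"      # every pair directly connected
    if len(edges) == t - 1:
        return "minimally"  # severing any one link splits the cluster
    return "partially"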

Carmichael's TAXMAP Method.

TAXMAP stands for “Taxonomic Mapping” and is an extension of the nearest-neighbour method. It attempts to reduce the effects of chaining by testing for the presence of a moat before each new object is added. This is done by evaluating the discontinuity at each step, where
“discontinuity” = “new average similarity” – “drop in average similarity”.
This measure has been found to decrease smoothly until there is a discontinuity, unlike the drop in average similarity, which may vary widely. A worked example of this method may be found in the text by Everitt.
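The sketch below shows one reading of this measure in Python. It is not Carmichael's published algorithm; it assumes a similarity matrix sim (a list of lists) and an order of addition already chosen by the nearest-neighbour rule, and simply evaluates the formula above at each step.

def discontinuities(sim, order):
    # evaluate the discontinuity as each object in `order` is added
    # to a single growing cluster:
    #   discontinuity = new average similarity - drop in average similarity
    cluster = [order[0], order[1]]
    prev_avg = sim[order[0]][order[1]]
    values = []
    for obj in order[2:]:
        cluster.append(obj)
        pairs = [(a, b) for i, a in enumerate(cluster)
                 for b in cluster[i + 1:]]
        new_avg = sum(sim[a][b] for a, b in pairs) / len(pairs)
        drop = prev_avg - new_avg
        values.append(new_avg - drop)   # a sharp fall signals a moat
        prev_avg = new_avg
    return values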

Cartet-Count Method.

This technique was described by Cattell and Coulter in 1966. The basic idea was to partition the attribute space into “cartets” or hypercubes and then count the number of objects in each cartet. By choosing suitable boundary values, it is possible to identify the clusters in the distribution.
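The counting step is easily sketched. The fragment below partitions attribute space into hypercubes of a chosen side and counts the objects in each; it illustrates the idea only and is not Cattell and Coulter's program. The function names and the dictionary representation of the cartets are assumptions of the sketch.

from collections import Counter

def cartet_counts(data, side):
    # partition attribute space into hypercubes ("cartets") of the
    # given side and count the objects falling into each one
    counts = Counter()
    for row in data:
        counts[tuple(int(v // side) for v in row)] += 1
    return counts

def dense_cartets(counts, boundary):
    # keep only the cartets whose count exceeds the boundary value;
    # adjacent dense cartets would then be joined to form clusters
    return {cell: n for cell, n in counts.items() if n > boundary}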

Hypersphere Method.

This method, which I developed to deal with a particular problem, uses a similar concept to the Cartet-count method. However, my method was designed to deal with cases where a distance or similarity matrix is available but the original data are not. This occurs in a number of cases of published archaeological data. Here we have the values of the distances between objects, but no information on the direction of the lines joining them.

Thus if we take one object, somewhere in the data, as the centre, we know that another object at distance 0.1 from it lies somewhere on the hypersphere of radius 0.1. Figure 6.1 shows a set of clusters in two dimensions and successive circles, centred on one of the points, with increasing radii. Note that for this illustration we have two-dimensional data and the distribution is known.

In the general case, we would start with the distance (or similarity) matrix and deduce the distribution in n-dimensional attribute space. The number of points lying in each annulus is plotted in the histogram in Figure 6.2. This suggests an approach along the following lines.

1. Find a starting point. Choose some value d for the increment in distance. For example, if the distances are normalised to lie in the range (0,1), then d=0.1 would be a reasonable value for a first attempt.

For each object in the distance matrix, count the number of other objects lying within a distance d of this object. Select the object with the maximum number of other objects close to it. If there are any clusters in the data, then this object must lie near the centre of one such cluster.

2. Calculate a histogram centred on this object. For each annulus of width d, count the number of objects lying in the annulus. Then plot the histogram, giving a result similar to Figure 6.3.

3. Identify the high-density cluster and the low-density moat from this histogram. If the edge of the moat is at distance D from the seed-point, then remove the seed-point and all other objects within distance D from it and take these as one cluster.

A histogram centred on the seed-point will probably not show the other clusters very clearly. However, the whole method may easily be re-applied to the remainder of the data, generating histograms centred on other points, each of which allows one further cluster to be clearly identified. Sometimes different values of d must be tried to give clear resolution of the edge of the moat.
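The three steps may be sketched in Python, working directly from the distance matrix as the method requires. The function names are illustrative, and the edge of the moat is assumed to have been read off the histogram by eye.

import numpy as np

def seed_point(dist, d):
    # step 1: the object with the most other objects within distance d
    dist = np.asarray(dist, dtype=float)
    neighbours = (dist < d).sum(axis=1) - 1   # exclude the object itself
    return int(neighbours.argmax())

def annulus_histogram(dist, seed, d):
    # step 2: count the objects lying in each annulus of width d
    # centred on the seed-point
    dist = np.asarray(dist, dtype=float)
    row = np.delete(dist[seed], seed)         # distances to the others
    edges = np.arange(0.0, row.max() + d, d)
    counts, _ = np.histogram(row, bins=edges)
    return counts

def first_cluster(dist, seed, D):
    # step 3: the edge of the moat has been read off the histogram as
    # distance D; the seed-point and everything within D of it are
    # removed together as one cluster
    return [i for i, r in enumerate(dist[seed]) if r <= D]

The procedure is then re-applied to the objects that remain, as described above.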

The raw histogram gives a poor indication of the density of objects in the outer annuli, because the area over which the points are spread is so much larger. One solution is to divide the number of points by the area of the annulus. In two dimensions, the area of the annulus is

a(i) = pi ( r(i)^2 - r(i-1)^2 )

and so the density of the annulus from r(i-1) to r(i) is given by

D(i) = n(i) / a(i)

where n(i) is the number of objects counted in that annulus.

If the whole distribution contains N objects spread over a circle of radius R, then the average density D is given by

D = N / (pi R^2)

Thus the relative density of each annulus is given by D(i)/D. If we plot this against the distance from the seed-point, we would expect higher than average values of density within the cluster, lower than average values in the surrounding moat, and values settling down to the average (relative density close to 1.0) as we get further from the seed-point.

The plot in Figure 6.5 shows this graph for the data-set used in the earlier examples, and you will note that it has a very large value at the centre of the cluster and much lower values for the other two clusters. With this small set of data, there is no sign of the values settling down to the average, but it is not a typical problem: there are only 50 values in all, instead of the several hundred in most real problems, and they are organised into three clearly defined clusters. Many real problems have a scatter of points which do not belong to any cluster, and this will blur both the histogram and the density plot.
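Using these formulas, the relative densities D(i)/D may be computed from the annulus counts as follows. This is a sketch only; in n dimensions the annulus area a(i) would be replaced by the corresponding volume between two hyperspheres.

import math

def relative_densities(counts, d, N, R):
    # counts[i-1] holds n(i), the number of objects in the annulus
    # from r(i-1) = (i-1)*d to r(i) = i*d
    overall = N / (math.pi * R ** 2)        # average density D
    rel = []
    for i, n in enumerate(counts, start=1):
        area = math.pi * ((i * d) ** 2 - ((i - 1) * d) ** 2)
        rel.append((n / area) / overall)    # relative density D(i)/D
    return rel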

Mode Analysis.

This method is supplied in the Clustan package in place of the density methods described above. It is described in terms of distance coefficients and requires a parameter k. For each object in the distance matrix, we calculate the average of the 2k smallest distances to other objects; call this a(i).

If a(i) is small, then many other objects lie close to object i, and so it occupies a dense region of attribute space. Sort the objects into increasing order of a(i), so that the “densest” objects are placed at the top of the array and treated first.

Now work in terms of a threshold R, which is reset to R = a(i) as each new object i is taken in turn. For each new object there are four possibilities:

1. The next object i is separated from all other “dense” points by a distance greater than R. In this case, this object starts a new cluster.

2. The next object i is within distance R of only one cluster. In this case, add object i to this cluster.

3. The next object is within distance R of more than one cluster. In this case the clusters are merged together and the object is added to the new combined cluster.

4. Each time the distance R is changed, all the clusters have to be checked and any which are now less than distance R apart are merged together.

The cluster nuclei are defined as “dense” points, and so are any other objects within distance R of them. All other objects are “unclassified” at this stage. The process continues until some finishing criterion has been satisfied: this may be an upper limit on R, a percentage of objects to be classified, or a minimum number of clusters required in the final configuration.
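The procedure may be sketched as follows. This is a simplified reading of the description above, not the Clustan code itself; the finishing criterion shown is an upper limit r_max on the threshold, and the distance matrix is assumed to be held as a list of lists with at least 2k+1 objects.

def mode_analysis(dist, k, r_max):
    n = len(dist)
    # a(i) = average of the 2k smallest distances from object i
    a = []
    for i in range(n):
        row = sorted(dist[i][j] for j in range(n) if j != i)
        a.append(sum(row[:2 * k]) / (2 * k))
    order = sorted(range(n), key=lambda i: a[i])   # densest first
    clusters = []
    for i in order:
        R = a[i]                                   # reset the threshold
        if R > r_max:
            break                                  # finishing criterion
        # which existing clusters hold a dense point within R of i?
        near = [c for c in clusters
                if any(dist[i][j] < R for j in c)]
        if not near:
            clusters.append([i])                   # case 1: new cluster
        else:
            merged = [i]                           # cases 2 and 3
            for c in near:
                merged.extend(c)
                clusters.remove(c)
            clusters.append(merged)
        # case 4: merge any clusters now less than R apart
        changed = True
        while changed:
            changed = False
            for p in range(len(clusters)):
                for q in range(p + 1, len(clusters)):
                    if min(dist[x][y] for x in clusters[p]
                           for y in clusters[q]) < R:
                        clusters[p].extend(clusters[q])
                        del clusters[q]
                        changed = True
                        break
                if changed:
                    break
    return clusters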

References.

Clustan package, sections MODE and DENSITY.
Everitt, B.S. Cluster Analysis, 2nd edition.

Copyright (c) Susan Laflin 1998.