Commonly, when speaking about clustering objects in GIS, people mean grouping by spatial location (see the ConcaveHull QGIS plugin, SciPy clustering). However, clustering by attributes is equally interesting: here the spatial position of an object plays no role (unless it is formalized as an attribute), and what matters are some of its numerical characteristics. Imagine that each district of the Leningrad region is described by two values, A and B (these can be anything: area, an indicator of prestige or of conflict, the number of museums, the average concentration of some substance in the atmosphere, etc.; in this case, the values were generated randomly).

Since the spatial position of the objects is of no interest to us, let us look at the same objects in a two-dimensional space whose coordinate axes are these indicators, A and B:

In this case, since the indicators are heterogeneous (i.e. they have different metrics: conventionally, A is an environmental indicator and B is an indicator associated with people), they were normalized to make them comparable (the maximum value of each indicator is taken as 1, and the other values are scaled proportionally). That is why the axes of the graph show relative rather than absolute values of A and B. Normalization is not a mandatory procedure, but it ensures a roughly equal contribution of each indicator to the clustering, regardless of its metric. Sometimes, though, normalization can do harm: for example, when clustering is carried out on "point" variables already given in the scale the researcher needs.
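The normalization described here (dividing each indicator by its maximum so that the largest value becomes 1) can be sketched in a few lines of Python; the sample values below are made up for illustration:

```python
def normalize(values):
    """Scale a list of non-negative numbers so the maximum becomes 1
    and the others shrink proportionally."""
    peak = max(values)
    if peak == 0:
        return list(values)          # all zeros: nothing to scale
    return [v / peak for v in values]

a = [12.0, 30.0, 6.0]                # indicator A (made-up values)
b = [0.2, 0.5, 1.0]                  # indicator B (made-up values)
print(normalize(a))                  # [0.4, 1.0, 0.2]
print(normalize(b))                  # already has maximum 1, unchanged
```

After this step both indicators lie in the same 0..1 range, so neither dominates the distance calculations merely because of its units.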

Returning to the resulting plot, let us aim to distinguish four categories and assign each object to one of them based on proximity in this two-dimensional feature space A - B. In this case it is visually easy to identify the four groups.

By assigning to the relevant districts the numbers of the groups they fall into, we solve the clustering problem. However, developing the idea further, we quickly run into complications: with many more objects, the groups will no longer stand out so clearly. What if we group not by two values but by N (in an N-dimensional feature space)? And the desired number of clusters can vary widely. A visual graph is no longer of any help!

Weighted hierarchical clustering

The simplest algorithm, and the most understandable from the point of view of logic, is hierarchical clustering. Its essence lies in sequentially merging the objects closest to each other into groups until only as many groups remain as clusters are required. Let us describe the principle using the same districts of the Leningrad region. The first step is to consider each individual object as belonging to its own independent group (in our case, we start with 18). The distances between all these groups are calculated and... Stop, what distance exactly? The easiest option is the distance in N-dimensional Euclidean space, calculated by the formula:
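The formula itself was given as an image in the original post; a plausible sketch of it, as weighted Euclidean distance with the weights applied to the squared differences (an assumption about where the weights enter), looks like this:

```python
import math

def weighted_euclidean(p, q, weights=None):
    """Distance between two points in N-dimensional feature space,
    with an optional weight per dimension (all weights 1 = the plain
    Euclidean case)."""
    if weights is None:
        weights = [1.0] * len(p)
    return math.sqrt(sum(w * (a - b) ** 2
                         for a, b, w in zip(p, q, weights)))

print(weighted_euclidean((0, 0), (3, 4)))              # 5.0
print(weighted_euclidean((0, 0), (3, 4), [0.5, 0.5]))  # sqrt(12.5)
```

A weight below 1 shrinks a dimension's contribution, which is exactly how the plugin lets one attribute matter less than another.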

The result of the calculation using this formula (here with all weights equal to 1) is a distance matrix (to save resources, only one half of it is used; the other half would be symmetric). Now we find the minimum distance in it: 0.06. It corresponds to the two objects that are closest to each other in our feature space (objects 1 and 15). We merge them into one group, taking their average position as the new center. The number of clusters has thus decreased by one: two points have merged into one. Look at the chart again:

It is clearly visible that the two closest points have been merged. We return to calculating the distances (new ones now, because the situation has changed) and build the matrix again. Its dimension, of course, decreases by 1. We repeat the search for the minimum distance and the merging of points. After a while, we find that only 4 points remain:
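The merge loop just described can be sketched naively as follows. This is a minimal illustration of the idea (closest pair of centers merged, new center at their average position, repeated until the target count is reached), not the plugin's actual code, which may differ in details:

```python
def hierarchical(points, k):
    """Naively merge the closest pair of group centers until k groups
    remain; return a cluster label for every input point."""
    # each group: (center coordinates, indices of original points)
    groups = [(list(p), [i]) for i, p in enumerate(points)]
    while len(groups) > k:
        # O(n^2) scan for the closest pair of centers
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = sum((a - b) ** 2
                        for a, b in zip(groups[i][0], groups[j][0]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        ci, mi = groups[i]
        cj, mj = groups[j]
        # merge: average the two centers, pool the member indices
        groups[i] = ([(a + b) / 2 for a, b in zip(ci, cj)], mi + mj)
        del groups[j]
    labels = [0] * len(points)
    for num, (_, members) in enumerate(groups):
        for m in members:
            labels[m] = num
    return labels

pts = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0)]
print(hierarchical(pts, 3))   # [0, 0, 1, 1, 2]
```

Each pass rebuilds the distances among the current centers, which is exactly why the method gets expensive for large layers, as discussed below.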

These are, in fact, the centers of the required clusters. By tracking which of those 4 points each original point ended up in, we get the list of points belonging to each group, and this is the desired result. Coloring the districts of the Leningrad region according to their group membership, we can visualize the clustering results:

The algorithm will handle almost any number of attributes (parameters) by which the researcher needs to cluster the objects, and the ability to set weights makes it quite flexible. But there is a significant downside: the need to compute the distance matrix (of size NxN, where N is the current number of groups). If your dataset contains 1000 objects, the matrix will initially contain 1,000,000 cells, for ~500,000 of which distances will have to be computed (with minimal optimization). When working with the developed module, which implements the hierarchical method, layers with hundreds up to the first few thousands of objects can be handled relatively comfortably, as will be shown below. The main advantage, in the author's opinion, is the method's maximal intelligibility and the naturalness of what happens, i.e. its simplicity.

K-Means

K-Means is a classic method widely described in the literature and in articles on the Internet. The general idea is as follows: initially, N centers are selected (where N is the number of required groups), and they are then iteratively recalculated so as to minimize the total squared deviation of the points assigned to those centers. The existing implementations (the module uses scipy.cluster) provide very fast processing of large datasets; however, this performance has a price: instability caused by a strong dependence on the initial choice of cluster centers (in this case, random). This instability can only be compensated for by increasing the number of iterations of recalculating the cluster centers, which obviously hurts that same performance. When working with large datasets, when hierarchical clustering requires too many resources, K-Means helps out and shows good results. Besides the instability, among the disadvantages we can mention that the scipy implementation has no support for weights.
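A minimal illustration of K-Means with SciPy, here using scipy.cluster.vq.kmeans2 on synthetic data (the exact routine and options the plugin uses are an assumption; the random initialization is precisely the source of the instability mentioned above, so the global seed is fixed for reproducibility):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

np.random.seed(42)                        # make the random start reproducible
# two well-separated synthetic blobs of 20 points each
data = np.vstack([np.random.normal(0, 0.1, (20, 2)),
                  np.random.normal(5, 0.1, (20, 2))])

# k = 2 centers; extra iterations partly compensate for the dependence
# on the initial choice of centers
centers, labels = kmeans2(data, 2, iter=20, minit='++')
print(centers)          # one center near (0, 0), one near (5, 5)
print(labels)           # cluster number (0 or 1) for each of the 40 points
```

Note that, unlike the hierarchical implementation above, there is no way to pass per-attribute weights here; scaling the input columns beforehand is the usual workaround.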

Attribute based clustering QGIS Plugin

The developed module implements attribute-based clustering using the methods described above. Hierarchical clustering is native and requires no additional libraries; K-Means requires SciPy. The plugin is licensed under the GNU GPL v2 and supports QGIS 2.0 and above.

The main result of this module is a new (or overwritten existing) attribute of the vector layer, with the cluster number written to every object. The user selects the vector layer under study, specifies the attributes (numeric only) to be used as parameters for clustering, assigns them weights if necessary, chooses a clustering method, specifies the number of clusters and the name of the attribute for recording the results, and sets or clears the flag indicating whether to normalize the attribute values. Two clustering algorithms are supported:

- K-Means (unweighted, scipy, fast)

- Hierarchical (weighted, native implementation, slow)

The interface is very simple:

Having chosen the layer, the user selects attributes one by one and, by clicking "--->", adds them to the table. Added attributes can be removed, and weights can be assigned to them (for hierarchical clustering). In the image below, attributes A and B of the Leningrad region districts layer are selected as clustering features, and the weight of attribute A is reduced to 0.5. Hierarchical clustering is selected, the goal is 6 classes, and normalization of values is enabled.

As an example, let's try clustering the regions of Germany according to socio-economic indicators.

Let's use a layer assembled from public sources (author: Anna Ivanova) containing the districts of Germany (444 objects) with parameters characterizing the development of agriculture, industry, and the service sector in each district.

Using the hierarchical clustering algorithm, let us distinguish five types of districts based on these three indicators, with the service sector being of less interest to us than the other features. We set the appropriate options:

The result is a new attribute in the layer containing the cluster number for each object. After customizing the style, we get a map of the typification of German districts into 5 categories:

Having carefully studied the resulting map, you may notice that one of the clusters contains only a single object. Looks like a mistake?

However, if we look at the 3-dimensional feature space (after all, we used three attributes), we see that this lone object really is well separated from the rest:

...and the clustering algorithm did everything quite reasonably.

Silent Eddie, 2016