Reciprocal Nearest Neighbor Clustering


Important implementation notes

How to determine the final clusters

  • in Leibe's paper you will find pseudo source code for the RNN clustering algorithm
  • but: the main point is not mentioned! what are the final clusters? Quite funny that neither they do mention it in the pseudo code, nor in the paper…
  • for this issue have a look into López-Sastre's paper
  • it looks similar to the pseudo-code mentioned in Leibe's paper, doesn't it? but compare the discard NN chain lines…
  • in López-Sastre's paper you will find the important hint, that each time you discard a NN-chain, you can put all the clusters currently in the NN-chain into the list of final clusters

Special cases

  • both paper's don't tell you what to do in special cases!
  • here are the two most important special cases, which are not covered by both Leibe's and López-Sastre's pseudo code
  • a cluster can be in exactly one of these three lists:
    • the NN-chain list L
    • R, the list of open clusters still to consider
    • the final cluster list C
  • special case #1: the algorithm stops with R is empty, but in the NN-chain list L there are still some clusters –> we have to check whether we can still merge them further, if not, we discard the list and put the remaining clusters into the final cluster list C
  • special case #2: the algorithm stops with an empty NN-chain L and R has exactly one cluster v inside –> we cannot find a NN v2 for this cluster from the list of open clusters R (since there is only v inside)! so we put v into the final cluster list C

Reference Implementation

What's the key idea?

  • RNN clustering is based on the construction of reciprocal nearest neighbor pairs (RNN pairs),that is pairs of clusters a and b, such that a is b’s nearest neighbor and vice versa
  • why are RNNs so cool?
  • the merging of the corresponding clusters a,b of a reciprocal nearest-neighbor pair to a 3rd new cluster ab does not alter the nearest-neighbor relations of any other clusters c and d
  • i.e. after merging a,b to a new cluster ab we just have to recompute the distances between ab and all other clusters c, but that's it!
  • BUT: this is only true if we make sure Bruynooghe’s reducibility property is fulfilled!
  • this criterion demands a cluster similarity measure Sim() such that when two clusters a and b are agglomerated, the similarity of the merged cluster ab to any third cluster c may only decrease – compared to the state before the merging action
  • and which cluster similarity measure Sim() fulfill this criteron?
  • among others:
    • group average criterion – regardless of the used VecSim() measure
    • centroid criterion based on correlation – but not on Euclidian distances!

How can we design a computational efficient clustering based on RNNs?

  • the basic idea here is to construct a nearest-neighbor chain
  • what's that?
  • it consists of an arbitrary start cluster, which is followed by its nearest neighbor, which is again followed by its nearest neighbor from among the remaining points, and so on
  • note that when the nearest neighbor for the last chain element n is the element n-1, we stop with the generation with the NN chain
  • we have found a RNN pair and can agglomerate it
  • and now comes the actual point!
  • the nearest-neighbor assignments stay valid for the remaining chain members, which can thus be reused for the next iteration of agglomeration
  • if we do not find any RNN pair more in the remaining NN chain, we create a new one for – again – a random start point

Which similarity measure Sim() to use?

  • we can recompute new distances very fast – 'fast' means: not depending on the number of vecs assigned to a cluster, but in constant time
  • for this we use the group average Sim() measure
  • and represent the 'centroid' of a cluster by the mean and the variance, which can be updated iteratively when merging clusters
  • for details see (Leibe et al.)'s paper

RNN clustering demo #1

2D Vectors to be clustered

Cluster steps

Final clusters

RNN clustering demo #2

This demo will show you the RNN clustering procedure for a larger set of sample vectors to be clustered. Note that the resulting cluster size heavily depends on the threshold t to be defined by the user to launch the clustering process with.

2D Vectors to be clustered

Cluster steps

Final clusters

public/reciprocal_nearest_neighbour_clustering_rnn.txt · Last modified: 2012/06/21 16:51 (external edit) · []
Recent changes RSS feed Powered by PHP Valid XHTML 1.0 Valid CSS Driven by DokuWiki