# Reciprocal Nearest Neighbor Clustering

## Important implementation notes

### How to determine the final clusters

• in Leibe's paper you will find pseudocode for the RNN clustering algorithm
• but: the main point is not mentioned! What are the final clusters? Oddly, they mention it neither in the pseudocode nor in the paper…
• for this issue have a look into López-Sastre's paper
• it looks similar to the pseudocode in Leibe's paper, doesn't it? but compare the `discard NN chain` lines…
• in López-Sastre's paper you will find the important hint that each time you discard an NN chain, you can put all the clusters currently in the chain into the list of final clusters
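This discard step is tiny but easy to miss; as a minimal Python sketch (`chain` and `final_clusters` are hypothetical names, clusters can be any objects):

```python
def discard_chain(chain, final_clusters):
    # López-Sastre's hint: when the NN chain is discarded, every
    # cluster still sitting in the chain becomes a final cluster.
    final_clusters.extend(chain)
    chain.clear()
```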

### Special cases

• neither paper tells you what to do in special cases!
• here are the two most important special cases, covered by neither Leibe's nor López-Sastre's pseudocode
• a cluster can be in exactly one of these three lists:
• the NN-chain list `L`
• `R`, the list of open clusters still to consider
• the final cluster list `C`
• special case #1: the algorithm stops with `R` empty, but the NN-chain list `L` still contains some clusters –> we have to check whether we can merge them further; if not, we discard the chain and put the remaining clusters into the final cluster list `C`
• special case #2: the algorithm stops with an empty NN-chain `L` and exactly one cluster `v` left in `R` –> we cannot find an NN `v2` for this cluster among the open clusters in `R` (since only `v` is left)! so we put `v` into the final cluster list `C`
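Putting both special cases together with the discard rule, the whole loop can be sketched as follows (my own sketch, not the papers' pseudocode; clusters are represented as lists of 1-D points and `t` is a distance threshold, both assumptions for illustration):

```python
def group_avg_dist(a, b):
    # group-average distance between two clusters of 1-D points
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))

def rnn_cluster(points, t):
    R = [[p] for p in points]   # open clusters still to consider
    L = []                      # the NN chain
    C = []                      # final clusters
    while R or L:
        if not L:
            if len(R) == 1:     # special case #2: lone cluster left in R
                C.append(R.pop())
                continue
            L.append(R.pop())   # start a new chain from an arbitrary cluster
        v = L[-1]
        d_nn, nn = float("inf"), None
        if R:                   # nearest neighbor of v among open clusters
            nn = min(R, key=lambda c: group_avg_dist(v, c))
            d_nn = group_avg_dist(v, nn)
        # with R empty this also covers "can we still merge the chain?"
        if len(L) >= 2 and group_avg_dist(v, L[-2]) <= d_nn:
            a, b = L.pop(), L.pop()       # RNN pair found
            if group_avg_dist(a, b) < t:
                R.append(a + b)           # merge and reopen the new cluster
            else:                         # cannot merge: discard the chain,
                C.extend(L + [a, b])      # all its clusters become final
                L = []
        elif nn is not None:
            R.remove(nn)
            L.append(nn)        # grow the chain
        else:                   # special case #1: R empty, no further merge
            C.extend(L)
            L = []
    return C
```

For example, `rnn_cluster([0.0, 0.1, 5.0], t=1.0)` merges `0.0` and `0.1` but leaves `5.0` as its own final cluster.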

## What's the key idea?

• RNN clustering is based on the construction of reciprocal nearest-neighbor pairs (RNN pairs), that is, pairs of clusters `a` and `b` such that `a` is `b`'s nearest neighbor and vice versa
• why are RNNs so cool?
• merging the two clusters `a, b` of a reciprocal nearest-neighbor pair into a new cluster `ab` does not alter the nearest-neighbor relations of any other clusters `c` and `d`
• i.e. after merging `a, b` into a new cluster `ab` we just have to recompute the distances between `ab` and all other clusters `c`, but that's it!
• BUT: this is only true if we make sure Bruynooghe’s reducibility property is fulfilled!
• this criterion demands a cluster similarity measure `Sim()` such that when two clusters `a` and `b` are agglomerated, the similarity of the merged cluster `ab` to any third cluster `c` may only decrease – compared to the state before the merging action
• and which cluster similarity measures `Sim()` fulfill this criterion?
• among others:
• group average criterion – regardless of the underlying `VecSim()` measure
• centroid criterion based on correlation – but not on Euclidean distances!
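As a sanity check (my own sketch, 1-D points for simplicity), reducibility for the group-average criterion can be verified numerically: the distance from the merged cluster `ab` to any third cluster `c` is a size-weighted mean of the distances from `a` and from `b`, so it can never drop below the smaller of the two:

```python
import random

def group_avg_dist(a, b):
    # group-average distance between two clusters of 1-D points
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))

random.seed(0)
for _ in range(1000):
    a, b, c = ([random.random() for _ in range(random.randint(1, 4))]
               for _ in range(3))
    # reducibility: merging a and b never brings the result closer to c
    assert group_avg_dist(a + b, c) >= min(group_avg_dist(a, c),
                                           group_avg_dist(b, c)) - 1e-12
```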

## How can we design a computationally efficient clustering based on RNNs?

• the basic idea here is to construct a nearest-neighbor chain
• what's that?
• it consists of an arbitrary start cluster, which is followed by its nearest neighbor, which is again followed by its nearest neighbor from among the remaining points, and so on
• note that when the nearest neighbor of the last chain element `n` is the element `n-1`, we stop growing the NN chain
• we have found a RNN pair and can agglomerate it
• and now comes the actual point!
• the nearest-neighbor assignments stay valid for the remaining chain members, which can thus be reused in the next agglomeration iteration
• if we do not find any further RNN pair in the remaining NN chain, we create a new chain from – again – an arbitrary start cluster
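The chain construction can be sketched as a standalone helper (hypothetical names; for illustration the 'clusters' are plain floats and `dist` is any distance function):

```python
def build_nn_chain(start, R, dist):
    # Grow an NN chain from `start`, consuming clusters from the open
    # list R, until the last two elements are reciprocal nearest
    # neighbors (or R runs out – cf. special case #1 above).
    chain = [start]
    while R:
        v = chain[-1]
        nn = min(R, key=lambda c: dist(v, c))
        if len(chain) >= 2 and dist(v, chain[-2]) <= dist(v, nn):
            break               # chain[-2] and chain[-1] form an RNN pair
        R.remove(nn)
        chain.append(nn)
    return chain
```

For example, `build_nn_chain(0.0, [1.0, 1.1, 5.0], lambda x, y: abs(x - y))` returns `[0.0, 1.0, 1.1]`; the last two elements `1.0, 1.1` are the RNN pair to agglomerate.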

## Which similarity measure Sim() to use?

• we can recompute new distances very fast – 'fast' means: not depending on the number of vectors assigned to a cluster, but in constant time
• for this we use the group average `Sim()` measure
• and represent the 'centroid' of a cluster by the mean and the variance, which can be updated iteratively when merging clusters
• for details see (Leibe et al.)'s paper
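The constant-time update itself uses only standard running-statistics formulas (a sketch, not copied from Leibe et al.; shown for scalar values, the per-dimension case is identical):

```python
def merge_stats(n_a, mean_a, var_a, n_b, mean_b, var_b):
    # Merge size, mean and (population) variance of two clusters in O(1),
    # i.e. independent of how many vectors each cluster already contains.
    n = n_a + n_b
    mean = (n_a * mean_a + n_b * mean_b) / n
    # recover E[x^2] per cluster, combine, then var = E[x^2] - mean^2
    ex2 = (n_a * (var_a + mean_a**2) + n_b * (var_b + mean_b**2)) / n
    return n, mean, ex2 - mean**2
```

For example, merging the stats of `[1, 2, 3]` (n=3, mean=2, var=2/3) with those of `[4, 8]` (n=2, mean=6, var=4) yields n=5, mean=3.6, var=5.84 – exactly the stats of `[1, 2, 3, 4, 8]`.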

## RNN clustering demo #2

This demo will show you the RNN clustering procedure for a larger set of sample vectors. Note that the resulting cluster sizes heavily depend on the user-defined threshold `t` with which the clustering process is launched.