GitHub - srirambaskaran/efficient-hierarchical-clustering

Efficient Hierarchical Clustering

This is an efficient implementation of hierarchical agglomerative clustering algorithm. This is more intuitive than other versions of this algorithm. This algorithm has been extracted and modified from the book chapter from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze (reference here). The original description of the algorithm is given below.

for n = 1 to N
    for i = 1 to N
        do C[n][i] = SIM(d[n],d[i])
    I[n] = true //setting active clusters

for k = 1 to N-1
    <i,m> = argmax (I[i] && I[m] && i!= m ) C[i][m]
    mergedCluster = merge((i,m))

    for j = 1 to N
        C[i][j] = C[j][i] = SIM(mergedCluster, d[j])

    I[m] = false //deactivating

The distance measure chosen for this implementation is centroid based, i.e., we choose the average of all points as the representative point for a cluster.

As you can see, the algorithm takes more time and repeats the identification of argmax a lot of times. An efficient implementation is the use of priority queue that always keeps the max value at the top of the heap and we pick the top of the heap and rearrange the heap using an efficient heapify algorithm.

Q = [] //priority queue
for n = 1 to N
    for i = 1 to N
        Q.enqueue(SIM(d[i], d[n]), (i, n))

for k = 1 to N-1
    <i,m> = Q.dequeue()
    mergedCluster = merge((i,m))

    Q.remove((_,m)) //remove any similarity that includes m

    for j = 1 to N
        Q.update((i,j), SIM(mergedCluster, d[j]))

There is a repeated removal from Q when i,m are merged. This can be avoided with a new implementation where we maintain multiple priority queues and find the maximum of each priority queue and pick the one with max value. We can ignore the priority queue for m. When considering amortized analysis of this algorithm, this efficiently performs the deletion and restricts the number of calls to Heapify algorithm.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
src		src
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Efficient Hierarchical Clustering

About

Releases

Packages

Languages

srirambaskaran/efficient-hierarchical-clustering

Folders and files

Latest commit

History

Repository files navigation

Efficient Hierarchical Clustering

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages