BIRCH

BIRCH is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources. In most cases, BIRCH only requires a single scan of the database.
Its inventors claim BIRCH to be the "first clustering algorithm proposed in the database area to handle 'noise' effectively", beating DBSCAN by two months. The algorithm received the SIGMOD 10 year test of time award in 2006.

Problem with previous methods

Previous clustering algorithms performed less effectively over very large databases and did not adequately consider the case wherein a data-set was too large to fit in main memory. As a result, there was a lot of overhead maintaining high clustering quality while minimizing the cost of additional IO operations. Furthermore, most of BIRCH's predecessors inspect all data points equally for each 'clustering decision' and do not perform heuristic weighting based on the distance between these data points.

Advantages with BIRCH

It is local in that each clustering decision is made without scanning all data points and currently existing clusters.
It exploits the observation that the data space is not usually uniformly occupied and not every data point is equally important.
It makes full use of available memory to derive the finest possible sub-clusters while minimizing I/O costs.
It is also an incremental method that does not require the whole data set in advance.

Algorithm

The BIRCH algorithm takes as input a set of data points, represented as real-valued vectors, and a desired number of clusters. It operates in four phases, the second of which is optional.
The first phase builds a clustering feature tree out of the data points, a height-balanced tree data structure, defined as follows:

Given a set of N d-dimensional data points, the clustering feature of the set is defined as the triple, where is the linear sum and is the square sum of data points.
Clustering features are organized in a CF tree, a height-balanced tree with two parameters: branching factor and threshold. Each non-leaf node contains at most entries of the form, where is a pointer to its th child node and the clustering feature representing the associated subcluster. A leaf node contains at most entries each of the form . It also has two pointers prev and next which are used to chain all leaf nodes together. The tree size depends on the parameter. A node is required to fit in a page of size. and are determined by. So can be varied for performance tuning. It is a very compact representation of the dataset because each entry in a leaf node is not a single data point but a subcluster.

In the second step, the algorithm scans all the leaf entries in the initial tree to rebuild a smaller tree, while removing outliers and grouping crowded subclusters into larger ones. This step is marked optional in the original presentation of BIRCH.
In step three an existing clustering algorithm is used to cluster all leaf entries. Here an agglomerative hierarchical clustering algorithm is applied directly to the subclusters represented by their vectors. It also provides the flexibility of allowing the user to specify either the desired number of clusters or the desired diameter threshold for clusters. After this step a set of clusters is obtained that captures major distribution pattern in the data. However, there might exist minor and localized inaccuracies which can be handled by an optional step 4. In step 4 the centroids of the clusters produced in step 3 are used as seeds and redistribute the data points to its closest seeds to obtain a new set of clusters. Step 4 also provides us with an option of discarding outliers. That is a point which is too far from its closest seed can be treated as an outlier.

Calculations with the cluster features

Given only the, the same measures can be calculated without the knowledge of the underlying actual values.

Centroid:
Radius:
Average Linkage Distance between clusters and :

In multidimensional cases the square root should be replaced with a suitable norm.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...