Binning (metagenomics)

In metagenomics, binning is the process of grouping reads or contigs and assigning them to operational taxonomic units. Binning methods can be based on either compositional features or alignment, or both.

Introduction

Metagenomic samples can contain reads from a huge number of organisms. For example, in a single gram of soil, there can be up to 18000 different types of organisms, each with its own genome. Metagenomic studies sample DNA from the whole community, and make it available as nucleotide sequences of certain length. In most cases, the incomplete nature of the obtained sequences makes it hard to assemble individual genes, much less recovering the full genomes of each organism. Thus, binning techniques represent a "best effort" to identify reads or contigs with certain groups of organisms designated as operational taxonomic units.
The first studies that sampled DNA from multiple organisms used specific genes to assess diversity and origin of each sample. These marker genes had been previously sequenced from clonal cultures from known organisms, so, whenever one of such genes appeared in a read or contig from the metagenomic sample that read could be assigned to a known species or to the OTU of that species. The problem with this method was that only a tiny fraction of the sequences carried a marker gene, leaving most of the data unassigned.
Modern binning techniques use both previously available information independent from the sample and intrinsic information present in the sample. Depending on the diversity and complexity of the sample, their degree of success vary: in some cases they can resolve the sequences up to individual species, while in some others the sequences are identified at best with very broad taxonomic groups.

Algorithms

Binning algorithms can employ previous information, and thus act as supervised classifiers, or they can try to find new groups, those act as unsupervised classifiers. Many, of course, do both. The classifiers exploit the previously known sequences by performing alignments against databases, and try to separate sequence based in organism-specific characteristics of the DNA, like GC-content.
Mande et al., provides a review of the premise, methodologies, advantages, limitations and challenges of various methods available for binning of metagenomic datasets obtained using the shotgun sequencing approach. Some of the prominent binning algorithms are described below.

TETRA

TETRA is a statistical classifier that uses tetranucleotide usage patterns in genomic fragments. There are four possible nucleotides in DNA, therefore there can be different fragments of four consecutive nucleotides; these fragments are called tetramers. TETRA works by tabulating the frequencies of each tetramer for a given sequence. From these frequencies z-scores are then calculated, which indicate how over- or under-represented the tetramer is in contraposition with what would be expected by looking to individual nucleotide compositions. The z-scores for each tetramer are assembled in a vector, and the vectors corresponding to different sequences are compared pair-wise, to yield a measure of how similar different sequences from the sample are. It is expected that the most similar sequences belong to organisms in the same OTU.

MEGAN

In the DIAMOND+MEGAN approach, all reads are first aligned against a protein reference database, such as NCBI-nr, and then the resulting alignments are analyzed using the naive LCA algorithm, which places a read on the lowest taxonomic node in the NCBI taxonomy that lies above all taxa to which the read has a significant alignment. Here, an alignment is usually deemed "significant", if its bit score lies above a given threshold and is within 10%, say, of the best score seen for that read. The rationale of using protein reference sequences, rather than DNA reference sequences, is that current DNA reference databases only cover a small fraction of the true diversity of genomes that exist in the environment.

Phylopythia

Phylopythia is one supervised classifier developed by researchers at IBM labs, and is basically a support vector machine trained with DNA-kmers from known sequences.

SOrt-ITEMS

SOrt-ITEMS is an alignment-based binning algorithm developed by Innovations Labs of Tata Consultancy Services Ltd., India. Users need to perform a similarity search of the input metagenomic sequences against the nr protein database using BLASTx search. The generated blastx output is then taken as input by the SOrt-ITEMS program. The method uses a range of BLAST alignment parameter thresholds to first identify an appropriate taxonomic level where the read can be assigned. An orthology-based approach is then adopted for the final assignment of the metagenomic read. Other alignment-based binning algorithms developed by the Innovation Labs of Tata Consultancy Services include DiScRIBinATE, ProViDE and SPHINX. The methodologies of these algorithms are summarized below.

DiScRIBinATE

DiScRIBinATE is an alignment-based binning algorithms developed by the Innovations Labs of Tata Consultancy Services Ltd., India. DiScRIBinATE replaces the orthology approach of SOrt-ITEMS with a quicker 'alignment-free' approach. Incorporating this alternate strategy was observed to reduce the binning time by half without any significant loss in the accuracy and specificity of assignments. Besides, a novel reclassification strategy incorporated in DiScRIBinATE was seem to reduce the overall misclassification rate.

ProViDE

ProViDE is an alignment-based binning approach developed by the Innovation Labs of Tata Consultancy Services Ltd. for the estimation of viral diversity in metagenomic samples. ProViDE adopts the reverse orthlogy based approach similar to SOrt-ITEMS for the taxonomic classification of metagenomic sequences obtained from virome datasets. It a customized set of BLAST parameter thresholds, specifically suited for viral metagenomic sequences. These thresholds capture the pattern of sequence divergence and the non-uniform taxonomic hierarchy observed within/across various taxonomic groups of the viral kingdom.

PCAHIER

PCAHIER, another binning algorithm developed by the Georgia Institute of Technology., employs n-mer oligonucleotide frequencies as the features and adopts a hierarchical classifier for binning short metagenomic fragments. The principal component analysis was used to reduce the high dimensionality of the feature space. The effectiveness of the PCAHIER was demonstrated through comparisons against a non-hierarchical classifier, and two existing binning algorithms.

SPHINX

SPHINX, another binning algorithm developed by the Innovation Labs of Tata Consultancy Services Ltd., adopts a hybrid strategy that achieves high binning efficiency by utilizing the principles of both 'composition'- and 'alignment'-based binning algorithms. The approach was designed with the objective of analyzing metagenomic datasets as rapidly as composition-based approaches, but nevertheless with the accuracy and specificity of alignment-based algorithms. SPHINX was observed to classify metagenomic sequences as rapidly as composition-based algorithms. In addition, the binning efficiency of SPHINX was observed to be comparable with results obtained using alignment-based algorithms.

INDUS{{Cite journal

| title = INDUS - a composition-based approach for rapid and accurate taxonomic classification of metagenomic sequences.

Other algorithms

This list is not exhaustive:

TACOA
Parallel-META
PhyloPythiaS
RITA
BiMeta
MetaPhlAn
SeMeta
Quikr
Taxoner

All these algorithms employ different schemes for binning sequences, such as hierarchical classification, and operate in either a supervised or unsupervised manner. These algorithms provide a global view of how diverse the samples are, and can potentially connect community composition and function in metagenomes.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...