Machine learning in bioinformatics


Machine learning, a subfield of computer science involving the development of algorithms that learn how to make predictions based on data, has a number of emerging applications in the field of bioinformatics. Bioinformatics deals with computational and mathematical approaches for understanding and processing biological data.
Prior to the emergence of machine learning algorithms, bioinformatics algorithms had to be explicitly programmed by hand which, for problems such as protein structure prediction, proves extremely difficult. Machine learning techniques such as deep learning enable the algorithm to make use of automatic feature learning which means that based on the dataset alone, the algorithm can learn how to combine multiple features of the input data into a more abstract set of features from which to conduct further learning. This multi-layered approach to learning patterns in the input data allows such systems to make quite complex predictions when trained on large datasets. In recent years, the size and number of available biological datasets have skyrocketed, enabling bioinformatics researchers to make use of these machine learning systems. Machine learning has been applied to six biological domains: genomics, proteomics, microarrays, systems biology, evolution, and text mining.

Applications

Genomics

involves the study of the genome, the complete DNA sequence, of organisms. While genomic sequence data has historically been sparse due to the technical difficulty in sequencing a piece of DNA, the number of available sequences is growing exponentially. However, while raw data is becoming increasingly available and accessible, the biological interpretation of this data is occurring at a much slower pace. Therefore, there is an increasing need for the development of machine learning systems that can automatically determine the location of protein-encoding genes within a given DNA sequence. This is a problem in computational biology known as gene prediction.
Gene prediction is commonly performed through a combination of what are known as extrinsic and intrinsic searches. For the extrinsic search, the input DNA sequence is run through a large database of sequences whose genes have been previously discovered and their locations annotated. A number of the sequence's genes can be identified by determining which strings of bases within the sequence are homologous to known gene sequences. However, given the limitation in size of the database of known and annotated gene sequences, not all the genes in a given input sequence can be identified through homology alone. Therefore, an intrinsic search is needed where a gene prediction program attempts to identify the remaining genes from the DNA sequence alone.
Machine learning has also been used for the problem of multiple sequence alignment which involves aligning many DNA or amino acid sequences in order to determine regions of similarity that could indicate a shared evolutionary history.
It can also be used to detect and visualize genome rearrangements.

Proteomics

s, strings of amino acids, gain much of their function from protein folding in which they conform into a three-dimensional structure. This structure is composed of a number of layers of folding, including the primary structure, the secondary structure, the tertiary structure, and the quartenary structure.
Protein secondary structure prediction is a main focus of this subfield as the further protein foldings are determined based on the secondary structure. Solving the true structure of a protein is an incredibly expensive and time-intensive process, furthering the need for systems that can accurately predict the structure of a protein by analyzing the amino acid sequence directly. Prior to machine learning, researchers needed to conduct this prediction manually. This trend began in 1951 when Pauling and Corey released their work on predicting the hydrogen bond configurations of a protein from a polypeptide chain. Today, through the use of automatic feature learning, the best machine learning techniques are able to achieve an accuracy of 82-84%. The current state-of-the-art in secondary structure prediction uses a system called DeepCNF which relies on the machine learning model of artificial neural networks to achieve an accuracy of approximately 84% when tasked to classify the amino acids of a protein sequence into one of three structural classes. The theoretical limit for three-state protein secondary structure is 88–90%.
Machine learning has also been applied to proteomics problems such as protein side-chain prediction, protein loop modeling, and protein contact map prediction.

Microarrays

Microarrays, a type of lab-on-a-chip, are used for automatically collecting data about large amounts of biological material. Machine learning can aid in the analysis of this data, and it has been applied to expression pattern identification, classification, and genetic network induction.
This technology is especially useful for monitoring the expression of genes within a genome, aiding in diagnosing different types of cancer based on which genes are expressed. One of the main problems in this field is identifying which genes are expressed based on the collected data. In addition, due to the huge number of genes on which data is collected by the microarray, there is a large amount of irrelevant data to the task of expressed gene identification, further complicating this problem. Machine learning presents a potential solution to this problem as various classification methods can be used to perform this identification. The most commonly used methods are radial basis function networks, deep learning, Bayesian classification, decision trees, and random forest.

Systems biology

Systems biology focuses on the study of the emergent behaviors from complex interactions of simple biological components in a system. Such components can include molecules such as DNA, RNA, proteins, and metabolites.
Machine learning has been used to aid in the modelling of these complex interactions in biological systems in domains such as genetic networks, signal transduction networks, and metabolic pathways. Probabilistic graphical models, a machine learning technique for determining the structure between different variables, are one of the most commonly used methods for modeling genetic networks. In addition, machine learning has been applied to systems biology problems such as identifying transcription factor binding sites using a technique known as Markov chain optimization. Genetic algorithms, machine learning techniques which are based on the natural process of evolution, have been used to model genetic networks and regulatory structures.
Other systems biology applications of machine learning include the task of enzyme function prediction, high throughput microarray data analysis, analysis of genome-wide association studies to better understand markers of disease, protein function prediction.

Stroke Diagnosis

Machine learning methods for analysis of neuroimaging data are used to help diagnose stroke. Three-dimensional CNN and SVM methods are often used.

Text mining

The increase in available biological publications led to the issue of the increase in difficulty in searching through and compiling all the relevant available information on a given topic across all sources. This task is known as knowledge extraction. This is necessary for biological data collection which can then in turn be fed into machine learning algorithms to generate new biological knowledge. Machine learning can be used for this knowledge extraction task using techniques such as natural language processing to extract the useful information from human-generated reports in a database. Text Nailing, an alternative approach to machine learning, capable of extracting features from clinical narrative notes was introduced in 2017.
This technique has been applied to the search for novel drug targets, as this task requires the examination of information stored in biological databases and journals. Annotations of proteins in protein databases often do not reflect the complete known set of knowledge of each protein, so additional information must be extracted from biomedical literature. Machine learning has been applied to automatic annotation of the function of genes and proteins, determination of the subcellular localization of a protein, analysis of DNA-expression arrays, large-scale protein interaction analysis, and molecule interaction analysis.
Another application of text mining is the detection and visualization of distinct DNA regions given sufficient reference data.