Structural bioinformatics


Structural bioinformatics is the branch of bioinformatics concerned with the analysis and prediction of the three-dimensional structure of biological macromolecules such as proteins, RNA, and DNA. It deals with generalizations about macromolecular 3D structures, such as comparisons of overall folds and local motifs, principles of molecular folding, evolution, binding interactions, and structure/function relationships, working both from experimentally solved structures and from computational models. The term structural has the same meaning as in structural biology, and structural bioinformatics can be seen as a part of computational structural biology. The main objective of structural bioinformatics is the creation of new methods for handling biological macromolecular data in order to solve problems in biology and generate new knowledge.

Introduction

Protein structure

The structure of a protein is directly related to its function. The presence of certain chemical groups in specific locations allows proteins to act as enzymes, catalyzing chemical reactions. In general, protein structures are classified into four levels: primary, secondary, tertiary, and quaternary. Structural bioinformatics mainly addresses interactions among structures, taking into consideration their spatial coordinates. Thus, the primary structure (the sequence) is better analyzed in traditional branches of bioinformatics. However, the sequence imposes restrictions that allow the formation of conserved local conformations of the polypeptide chain, such as alpha helices, beta sheets, and loops (the secondary structure). In addition, weak interactions stabilize the protein fold. Interactions can be intrachain, i.e., occurring between parts of the same protein monomer, or interchain, i.e., occurring between different chains.

Structure visualization

Protein structure visualization is an important topic in structural bioinformatics. It allows users to observe static or dynamic representations of molecules and to detect interactions that can be used to make inferences about the molecular mechanisms under study. The most common types of visualization are:

DNA structure

The classic DNA duplex structure was first described by Watson and Crick. Each nucleotide of the DNA molecule is composed of three components: a phosphate group, a pentose, and a nitrogenous base. The DNA double helix is stabilized by hydrogen bonds formed between base pairs: adenine with thymine and cytosine with guanine. Many structural bioinformatics studies have focused on understanding interactions between DNA and small molecules, which have been the subject of several drug design studies.

Interactions

Interactions are contacts established between parts of molecules at different levels. They are responsible for stabilizing protein structures and underlie a wide range of activities. In biochemistry, interactions are characterized by the proximity of atom groups or molecular regions that exert an effect upon one another, such as electrostatic forces, hydrogen bonding, and the hydrophobic effect. Proteins can take part in several types of interactions, such as protein-protein, protein-peptide, protein-ligand, and protein-DNA interactions.

Calculating contacts

Calculating contacts is an important task in structural bioinformatics. It is needed for the correct prediction of protein structure and folding, thermodynamic stability, protein-protein and protein-ligand interactions, and for docking and molecular dynamics analyses, among others.
Traditionally, computational methods have used a threshold distance between atoms to detect possible interactions. This detection is based on the Euclidean distance and the angles between atoms of given types. However, most methods based on simple Euclidean distance cannot detect occluded contacts. Hence, cutoff-free methods, such as Delaunay triangulation, have gained prominence in recent years. In addition, a combination of criteria, for example physicochemical properties, distance, geometry, and angles, has been used to improve contact determination.
Type                       Max distance criteria
Hydrogen bond              3.9 Å
Hydrophobic interaction    5 Å
Ionic interaction          6 Å
Aromatic stacking          6 Å
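
The threshold-based detection described above can be illustrated with a minimal Python sketch. It uses only the maximum distances listed in the table and plain Euclidean distance; the function names and the coordinate tuples are hypothetical, and atom types, angles, and occlusion, which real contact-detection tools also consider, are deliberately ignored here.

```python
import math

# Maximum distances in ångströms, taken from the table above.
MAX_DISTANCE = {
    "hydrogen bond": 3.9,
    "hydrophobic interaction": 5.0,
    "ionic interaction": 6.0,
    "aromatic stacking": 6.0,
}


def within_cutoff(atom_a, atom_b, interaction_type):
    """Return True if two (x, y, z) points are within the cutoff for the type."""
    return math.dist(atom_a, atom_b) <= MAX_DISTANCE[interaction_type]


def find_contacts(atoms_a, atoms_b, interaction_type):
    """List index pairs (i, j) of atoms from two groups that satisfy the cutoff.

    This is a brute-force scan over all pairs; it does not check atom types
    or angles, so it only flags *candidate* contacts.
    """
    return [
        (i, j)
        for i, a in enumerate(atoms_a)
        for j, b in enumerate(atoms_b)
        if within_cutoff(a, b, interaction_type)
    ]


donors = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
acceptors = [(3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
candidates = find_contacts(donors, acceptors, "hydrogen bond")
```

Because every pair is compared, the cost grows with the product of the two group sizes; this is acceptable for a sketch but is one reason practical tools rely on spatial indexing or cutoff-free approaches.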

Protein Data Bank (PDB)

The Protein Data Bank (PDB) is a database of 3D structure data for large biological molecules, such as proteins, DNA, and RNA. The PDB is managed by an international organization called the Worldwide Protein Data Bank (wwPDB), which is composed of several member organizations, such as PDBe, PDBj, RCSB, and BMRB. They are responsible for keeping copies of PDB data available on the internet at no charge. The number of structures available in the PDB has increased each year; they are typically obtained by X-ray crystallography, NMR spectroscopy, or cryo-electron microscopy.

Data format

The PDB format is the legacy text file format used by the Protein Data Bank to store information on the three-dimensional structures of macromolecules. Because of restrictions in the design of the format, it cannot represent large structures containing more than 62 chains or 99999 atom records.
PDBx/mmCIF is a standard text file format for representing crystallographic information. In 2014, it replaced the PDB format as the standard distribution format of the PDB archive. While the PDB format contains a set of records, each identified by a keyword of up to six characters, the PDBx/mmCIF format uses a key-value structure, where the key is a name that identifies some feature and the value is the variable information.
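
The difference between the two layouts can be sketched in Python as follows. The column positions used for the ATOM record follow the fixed-width layout of the legacy PDB format; the function names, the example coordinate values, and the example mmCIF item are illustrative assumptions, and real files should be read with an established parser rather than with this sketch.

```python
def parse_atom_line(line):
    """Read selected fixed-width fields of a legacy PDB ATOM/HETATM record."""
    return {
        "record": line[0:6].strip(),     # columns 1-6: record keyword
        "serial": int(line[6:11]),       # columns 7-11: atom serial number
        "name": line[12:16].strip(),     # columns 13-16: atom name
        "res_name": line[17:20].strip(), # columns 18-20: residue name
        "chain": line[21],               # column 22: chain identifier
        "res_seq": int(line[22:26]),     # columns 23-26: residue number
        "x": float(line[30:38]),         # columns 31-38: x coordinate (Å)
        "y": float(line[38:46]),         # columns 39-46: y coordinate (Å)
        "z": float(line[46:54]),         # columns 47-54: z coordinate (Å)
    }


def parse_mmcif_item(line):
    """Split a simple one-line mmCIF item into its key and its value."""
    key, value = line.split(None, 1)
    return key, value.strip()


# Illustrative record assembled field by field so the column widths are explicit.
EXAMPLE_ATOM = (
    "ATOM  " "    1" " " " N  " " " "MET" " " "A" "   1" "    "
    "  27.340" "  24.430" "   2.614"
)

atom = parse_atom_line(EXAMPLE_ATOM)
key, value = parse_mmcif_item("_cell.length_a   50.840")
```

The five-column serial field and the single-character chain field make the limits mentioned above visible: the fixed widths leave no room for more atoms or for longer chain identifiers, whereas a key-value layout has no such positional constraint.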

Other structural databases

In addition to the Protein Data Bank, there are several databases of protein structures and other macromolecules. Examples include:

Structural alignment

Structural alignment is a method for comparing 3D structures based on their shape and conformation. It can be used to infer the evolutionary relationship among a set of proteins even when sequence similarity is low. Structural alignment involves superimposing one 3D structure onto a second one, rotating and translating it so that atoms in corresponding positions are brought together. Usually, the alignment quality is evaluated based on the root-mean-square deviation (RMSD) of atomic positions, i.e., the average distance between atoms after superimposition:

$$\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \delta_i^2}$$

where δ_i is the distance between atom i and either the corresponding reference atom in the other structure or the mean coordinate of the N equivalent atoms. In general, the RMSD is reported in ångströms (Å), where 1 Å is equivalent to 10⁻¹⁰ m. The nearer the RMSD value is to zero, the more similar the structures are.
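
As a minimal sketch, the RMSD described above can be computed in Python as follows. The function name is hypothetical, and the sketch assumes that the two structures have already been superimposed and that their atoms are listed in corresponding order; finding the optimal rotation and translation is a separate step that is not shown.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z).

    Assumes the structures are already superimposed and that the i-th atom of
    one list corresponds to the i-th atom of the other.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = [math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b)]
    return math.sqrt(sum(squared) / len(squared))


structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(structure_a, structure_b)  # both atoms are shifted by 1 Å
```

If the coordinates are given in ångströms, the result is in ångströms as well.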

Graph-based structural signatures

Structural signatures, also called fingerprints, are representations of macromolecular patterns that can be used to infer similarities and differences. Comparing a large set of proteins using RMSD is still a challenge because of the high computational cost of structural alignments. Structural signatures based on graph distance patterns among atom pairs have been used to determine protein identifying vectors and to detect non-trivial information. Furthermore, linear algebra and machine learning can be used for clustering protein signatures, detecting protein-ligand interactions, predicting ΔΔG, and proposing mutations based on Euclidean distance.
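
The idea of turning distance patterns among atom pairs into a comparable vector can be illustrated with a deliberately simplified Python sketch. It is not the method of any specific published tool: the cumulative pair counts, the choice of cutoffs, and the function names are assumptions made for illustration only.

```python
import math
from itertools import combinations


def distance_signature(coords, cutoffs):
    """Count, for each cutoff, the atom pairs whose distance is within it.

    The result is a fixed-length vector that does not depend on how the
    structure is oriented in space, so no superimposition is required.
    """
    distances = [math.dist(a, b) for a, b in combinations(coords, 2)]
    return [sum(1 for d in distances if d <= cutoff) for cutoff in cutoffs]


def signature_distance(signature_a, signature_b):
    """Euclidean distance between two signature vectors of equal length."""
    return math.dist(signature_a, signature_b)


atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
shifted = [(x + 5.0, y, z) for x, y, z in atoms]  # same shape, translated
signature = distance_signature(atoms, cutoffs=[1.5, 2.5, 3.5])
```

Because comparing two vectors is much cheaper than aligning two structures, signatures of this kind lend themselves to clustering or machine learning over large sets of proteins.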

Structure prediction

The atomic structures of molecules can be obtained by several methods, such as X-ray crystallography, NMR spectroscopy, and 3D electron microscopy; however, these processes can be costly, and some structures, such as those of membrane proteins, are hard to determine. Hence, computational approaches are needed for determining the 3D structures of macromolecules. Structure prediction methods are classified into comparative modeling and de novo modeling.

Comparative modeling

Comparative modeling, also known as homology modeling, is the methodology for constructing three-dimensional structures from the amino acid sequence of a target protein and a template with a known structure. The literature has described that evolutionarily related proteins tend to present a conserved three-dimensional structure. In addition, sequences of distantly related proteins with identity lower than 20% can present different folds.

De novo modeling

In structural bioinformatics, de novo modeling, also known as ab initio modeling, refers to approaches for obtaining three-dimensional structures from sequences without the need for a homologous known 3D structure. Despite the new algorithms and methods proposed in recent years, de novo protein structure prediction is still considered one of the remaining outstanding issues in modern science.

Structure validation

After structure modeling, an additional step of structure validation is necessary, since many comparative and de novo modeling algorithms and tools use heuristics to assemble the 3D structure, which can introduce many errors. Some validation strategies consist of calculating energy scores and comparing them with those of experimentally determined structures. For example, the DOPE score is an energy score used by the MODELLER tool for determining the best model.
Another validation strategy is to calculate the φ and ψ backbone dihedral angles of all residues and construct a Ramachandran plot. The side chains of the amino acids and the nature of the interactions in the backbone restrict these two angles, and thus the allowed conformations can be visualized on the Ramachandran plot. A large number of amino acids located in non-permissive regions of the plot indicates a low-quality model.
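
The dihedral angles used in a Ramachandran plot can be computed from four consecutive backbone atoms. The following Python sketch shows one common way to do this with an atan2-based formula; the function and helper names are hypothetical. For residue i, φ is the dihedral defined by C(i-1), N(i), CA(i), C(i), and ψ is the dihedral defined by N(i), CA(i), C(i), N(i+1).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four (x, y, z) points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    b2_length = math.sqrt(_dot(b2, b2))
    y = _dot(b2, _cross(n1, n2)) / b2_length
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Four points chosen so that the two planes are perpendicular.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Plotting the (φ, ψ) pair of every residue and checking how many fall outside the permitted regions gives the quality indication described above; the region boundaries themselves are not part of this sketch.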

Prediction tools

A list of commonly used software tools for protein structure prediction, including comparative modeling, protein threading, de novo protein structure prediction, and secondary structure prediction, is available in the list of protein structure prediction software.

Molecular docking

Molecular docking is a method used to predict the orientation and coordinates of a molecule when it is bound to another one. It aims to predict possible poses of the ligand when it interacts with specific regions of the receptor, generally restricted by a box. Docking tools can use force fields to estimate a score for ranking the poses, so that those with more favorable interactions are ranked best.
In general, docking protocols are used to predict the interactions between small molecules and proteins. However, docking can also be used to detect associations and binding modes among proteins, peptides, DNA or RNA molecules, carbohydrates, and other macromolecules.

Virtual screening

Virtual screening is a computational approach used for the fast screening of large compound libraries in drug discovery. Usually, virtual screening uses docking algorithms to rank small molecules by their predicted affinity for a target receptor.
In recent times, several tools have been used to evaluate the use of virtual screening in the process of discovering new drugs. However, problems such as missing information, an inaccurate understanding of the properties of drug-like molecules, weak scoring functions, or insufficient docking strategies hinder the docking process. Hence, the literature has described virtual screening as not yet being a mature technology.

Molecular dynamics

Molecular dynamics (MD) is a computational method for simulating interactions between molecules and their atoms during a given period of time. This method allows the observation of the behavior of molecules and their interactions, considering the system as a whole. To calculate the behavior of the system and thus determine the trajectories, an MD simulation can use Newton's equations of motion, together with molecular mechanics methods to estimate the forces that act between particles.
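
How Newton's equations of motion yield a trajectory can be illustrated with a toy Python sketch. It integrates a single particle in one dimension with the velocity Verlet scheme, which is one commonly used integrator and is chosen here as an assumption, since the text does not name a specific one. The harmonic force stands in for a molecular mechanics force field, and all names and parameter values are illustrative.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equation m * a = F(x) and return the trajectory.

    Returns a list of (position, velocity) tuples, including the start state.
    """
    a = force(x) / mass
    trajectory = [(x, v)]
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt  # advance the position
        a_new = force(x) / mass             # force at the new position
        v = v + 0.5 * (a + a_new) * dt      # advance the velocity
        a = a_new
        trajectory.append((x, v))
    return trajectory


SPRING_CONSTANT = 1.0  # toy value, arbitrary units


def harmonic_force(x):
    """Restoring force of a spring; a stand-in for a real force field."""
    return -SPRING_CONSTANT * x


def total_energy(x, v, mass=1.0):
    """Kinetic plus potential energy of the toy system."""
    return 0.5 * mass * v * v + 0.5 * SPRING_CONSTANT * x * x


trajectory = velocity_verlet(
    x=1.0, v=0.0, force=harmonic_force, mass=1.0, dt=0.01, n_steps=1000
)
```

In an actual MD simulation the same update is applied to every atom in three dimensions, and the force on each atom is derived from the molecular mechanics potential of the whole system.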

Applications

Approaches used in structural bioinformatics include: