Biomedical text mining

Biomedical text mining refers to the methods and study of how text mining may be applied to texts and literature of the biomedical and molecular biology domains. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies developed through studies in this field are frequently applied to the biomedical and molecular biology literature available through services such as PubMed.

Considerations

Applying text mining approaches to biomedical text requires specific considerations common to the domain.

Availability of annotated text data

Large annotated corpora used in the development and training of general purpose text mining methods are not specific to biomedical language. While they may provide evidence of general text properties such as parts of speech, they rarely contain concepts of interest to biologists or clinicians. Development of new methods to identify features specific to biomedical documents therefore requires assembly of specialized corpora. Resources designed to aid in building new biomedical text mining methods have been developed through the Informatics for Integrating Biology and the Bedside (i2b2) challenges and by biomedical informatics researchers. Text mining researchers frequently combine these corpora with the controlled vocabularies and ontologies available through the National Library of Medicine's Unified Medical Language System and Medical Subject Headings.
Machine learning-based methods often require very large data sets as training data to build useful models. Manual annotation of large text corpora is not realistically possible. Training data may therefore be products of weak supervision or purely statistical methods.
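
One way to picture weak supervision is as a set of noisy heuristic "labeling functions" whose votes are combined into a training label. The sketch below is only illustrative: the function names, the drug-mention task, and the simple majority vote are assumptions made for this example, not a method described in the sources above.

```python
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote

# Hypothetical labeling functions: each votes True (sentence mentions a drug),
# False (it does not), or ABSTAIN.
def lf_dosage_unit(sentence):
    tokens = [t.strip(".,") for t in sentence.lower().split()]
    return True if any(t in {"mg", "ml", "mcg"} for t in tokens) else ABSTAIN

def lf_drug_suffix(sentence):
    suffixes = ("cillin", "statin", "prazole")
    words = [w.strip(".,").lower() for w in sentence.split()]
    return True if any(w.endswith(suffixes) for w in words) else ABSTAIN

def lf_short_without_digits(sentence):
    # Weak negative signal: short sentences with no numbers rarely give dosing.
    short = len(sentence.split()) < 6
    has_digit = any(ch.isdigit() for ch in sentence)
    return False if short and not has_digit else ABSTAIN

LABELING_FUNCTIONS = [lf_dosage_unit, lf_drug_suffix, lf_short_without_digits]

def weak_label(sentence, lfs=LABELING_FUNCTIONS):
    """Combine labeling-function votes by simple majority; ties abstain."""
    votes = [v for v in (lf(sentence) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes)
    top_label, top_count = counts.most_common(1)[0]
    if list(counts.values()).count(top_count) > 1:
        return ABSTAIN
    return top_label
```

Sentences labeled this way could then serve as noisy training data for a statistical model. Weak supervision frameworks typically weight the functions by estimated accuracy rather than counting votes equally.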

Data structure variation

Like other text documents, biomedical documents contain unstructured data. Research publications follow different formats, contain different types of information, and are interspersed with figures, tables, and other non-text content. Both unstructured text and semi-structured document elements, such as tables, may contain important information that should be text mined. Clinical documents may vary in structure and language between departments and locations. Other types of biomedical text, such as drug labels, may follow general structural guidelines but lack further details.

Uncertainty

Biomedical literature contains statements about observations that may not be statements of fact. This text may express uncertainty or skepticism about claims. Without specific adaptations, text mining approaches designed to identify claims within text may mis-characterize these "hedged" statements as facts.

Supporting clinical needs

Biomedical text mining applications developed for clinical use should ideally reflect the needs and demands of clinicians. This is a concern in environments where clinical decision support is expected to be informative and accurate.

Interoperability with clinical systems

New text mining systems must work with existing standards, electronic medical records, and databases. Methods for interfacing with clinical systems such as LOINC have been developed but require extensive organizational effort to implement and maintain.

Patient privacy

Text mining systems operating with private medical data must respect its security and ensure it is rendered anonymous where appropriate.
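
As a rough illustration of rendering text anonymous, the sketch below replaces a few identifier patterns with placeholder tags. The patterns and tag names are assumptions chosen for this example. Real de-identification has to cover many more identifier types, such as names and addresses, and usually combines rules with trained models.

```python
import re

# Hypothetical patterns for three identifier types; order matters only
# in that each pattern is applied to the output of the previous one.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("MRN", re.compile(r"\bMRN[:\s]*\d+\b")),
]

def scrub(text):
    """Replace every match of each pattern with a bracketed placeholder."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text
```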

Processes

Specific subtasks are of particular concern when processing biomedical text.

Named entity recognition

Developments in biomedical text mining have incorporated identification of biological entities with named entity recognition, or NER. Names and identifiers for biomolecules such as proteins and genes, chemical compounds and drugs, and disease names have all been used as entities. Most entity recognition methods are supported by pre-defined linguistic features or vocabularies, though methods incorporating deep learning and word embeddings have also been successful at biomedical NER.
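
A vocabulary-supported approach can be sketched as a dictionary lookup over the text. The tiny vocabulary below is a made-up assumption for illustration. In practice the terms would come from a resource such as a controlled vocabulary, and a lookup like this cannot resolve ambiguous or unseen names.

```python
import re

# Hypothetical toy vocabulary: lowercase surface form -> entity type.
VOCAB = {
    "tp53": "GENE",
    "brca1": "GENE",
    "tamoxifen": "CHEMICAL",
    "breast cancer": "DISEASE",
}

def find_entities(text, vocab=VOCAB):
    """Return (start, end, surface, type) tuples for vocabulary matches."""
    # Longest terms first so multi-word names win over shorter alternatives.
    terms = sorted(vocab, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), vocab[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]
```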

Document classification and clustering

Biomedical documents may be classified or clustered based on their contents and topics. In classification, document categories are specified manually, while in clustering, documents form algorithm-dependent, distinct groups. These two tasks are representative of supervised and unsupervised methods, respectively, yet the goal of both is to produce subsets of documents based on their distinguishing features. Methods for biomedical document clustering have relied upon k-means clustering.
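
The clustering side can be sketched with a small k-means implementation over bag-of-words vectors. This is a minimal illustration under simplifying assumptions: whitespace tokenization, length-normalized raw counts rather than a weighting such as TF-IDF, and initial centroids chosen by the caller so the result is deterministic.

```python
from collections import Counter
import math

def vectorize(docs):
    """Turn documents into length-normalized bag-of-words vectors."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vec = [counts[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, seeds, iterations=10):
    """Cluster vectors; `seeds` are indices of the initial centroids."""
    centroids = [list(vectors[i]) for i in seeds]
    labels = []
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

docs = [
    "gene expression in tumor cells",
    "tumor gene mutation and expression",
    "patient blood pressure medication",
    "blood pressure in elderly patient",
]
labels = kmeans(vectorize(docs), seeds=[0, 2])
```

Here the two molecular biology sentences and the two clinical sentences end up in separate groups. With other seeds or larger collections the grouping depends on initialization, which is why the clusters are described as algorithm-dependent.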

Relationship discovery

Biomedical documents describe connections between concepts, whether they are interactions between biomolecules, events occurring subsequently over time, or causal relationships. Text mining methods may perform relation discovery to identify these connections, often in concert with named entity recognition.

Hedge cue detection

The challenge of identifying uncertain or "hedged" statements has been addressed through hedge cue detection in biomedical literature.
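
In its simplest form, hedge cue detection can be approximated by matching a list of cue words and phrases. The cue list below is a small assumed sample for illustration. It ignores cue scope and ambiguity (for example, "May" as a month), which annotated corpora and trained models are meant to address.

```python
import re

# Hypothetical sample of hedge cues; real cue lexicons are much larger.
HEDGE_CUES = [
    "may", "might", "could", "suggest", "suggests",
    "possibly", "appears to", "is thought to", "putative",
]

def hedge_cues(sentence, cues=HEDGE_CUES):
    """Return the cues found in the sentence, in order of appearance."""
    found = []
    for cue in cues:
        pattern = r"\b" + re.escape(cue) + r"\b"
        for m in re.finditer(pattern, sentence, re.IGNORECASE):
            found.append((m.start(), cue))
    return [cue for _, cue in sorted(found)]

def is_hedged(sentence):
    return bool(hedge_cues(sentence))
```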

Claim detection

Multiple researchers have developed methods to identify specific scientific claims from literature. In practice, this process involves both isolating phrases and sentences denoting the core arguments made by the authors of a document and comparing claims to find potential contradictions between them.

Information extraction

Information extraction, or IE, is the process of automatically identifying structured information from unstructured or partially structured text. IE processes can involve several or all of the above activities, including named entity recognition, relationship discovery, and document classification, with the overall goal of translating text to a more structured form, such as the contents of a template or knowledge base. In the biomedical domain, IE is used to generate links between concepts described in text, such as "gene A inhibits gene B" and "gene C is involved in disease G". Biomedical knowledge bases containing this type of information are generally products of extensive manual curation, so replacement of manual efforts with automated methods remains a compelling area of research.
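
Translating text into a structured form can be sketched as extracting subject, relation, and object triples and collecting them in a small knowledge base. The pattern, the verb list, and the function names below are assumptions made for this example. A single regular expression like this covers only the simplest sentence shapes.

```python
import re

# Hypothetical pattern: a capitalized symbol, a relation verb, another symbol.
RELATION_PATTERN = re.compile(
    r"\b(?P<subject>[A-Z][A-Za-z0-9-]*)\s+"
    r"(?P<relation>inhibits|activates|binds|phosphorylates)\s+"
    r"(?P<object>[A-Z][A-Za-z0-9-]*)\b"
)

def extract_triples(text):
    """Return (subject, relation, object) tuples found in the text."""
    return [
        (m["subject"], m["relation"], m["object"])
        for m in RELATION_PATTERN.finditer(text)
    ]

def build_knowledge_base(sentences):
    """Group extracted relations by subject entity."""
    kb = {}
    for sentence in sentences:
        for subj, rel, obj in extract_triples(sentence):
            kb.setdefault(subj, set()).add((rel, obj))
    return kb
```

A more realistic pipeline would first run named entity recognition and then classify the relation between each entity pair, rather than relying on fixed word order.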

Information retrieval and question answering

Biomedical text mining supports applications for identifying documents and concepts matching search queries. Search engines such as PubMed search allow users to query literature databases with words or phrases present in document contents, metadata, or indices such as MeSH. Similar approaches may be used for medical literature retrieval. For more fine-grained results, some applications permit users to search with natural language queries and identify specific biomedical relationships.
On 16 March 2020, the National Library of Medicine and others launched the COVID-19 Open Research Dataset to enable text mining of the current literature on the novel virus. The dataset is hosted by the Semantic Scholar project of the Allen Institute for AI. Other participants include Google, Microsoft Research, the Center for Security and Emerging Technology, and the Chan Zuckerberg Initiative.
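
As a rough illustration of querying a literature database by free text and MeSH index terms, the sketch below builds a search URL for the NCBI E-utilities `esearch` endpoint without sending it. The helper name and its parameters are assumptions for this example. Check rate limits and usage guidelines in the E-utilities documentation before issuing real requests.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_pubmed_query(free_text, mesh_terms=(), retmax=20):
    """Build (but do not send) a PubMed search URL.

    Free text and MeSH terms are combined with AND; each MeSH term is
    restricted to the MeSH index with the [MeSH Terms] field tag.
    """
    clauses = [free_text] if free_text else []
    clauses += [f'"{term}"[MeSH Terms]' for term in mesh_terms]
    params = {
        "db": "pubmed",
        "term": " AND ".join(clauses),
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

The resulting URL could be fetched with any HTTP client. The response lists matching PubMed identifiers, which can then be passed to other E-utilities calls to retrieve abstracts.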

Resources

Corpora

The following table lists a selection of biomedical text corpora and their contents. These items include annotated corpora, sources of biomedical research literature, and resources frequently used as vocabulary and/or ontology references, such as MeSH. Items marked "Yes" under "Freely Available" can be downloaded from a publicly accessible location.

Corpus Name	Authors or Group	Contents	Freely Available	Citation
2006 i2b2 Deidentification and Smoking Challenge	i2b2	889 de-identified medical discharge summaries annotated for patient identification and smoking status features.	Yes, with registration
2008 i2b2 Obesity Challenge	i2b2	1,237 de-identified medical discharge summaries annotated for presence or absence of comorbidities of obesity.	Yes, with registration
2009 i2b2 Medication Challenge	i2b2	1,243 de-identified medical discharge summaries annotated for names and details of medications, including dosage, mode, frequency, duration, reason, and presence in a list or narrative structure.	Yes, with registration
2010 i2b2 Relations Challenge	i2b2	Medical discharge summaries annotated for medical problems, tests, treatments, and the relations among these concepts. Only a subset of these data records are available for research use due to IRB limitations.	Yes, with registration
2011 i2b2 Coreference Challenge	i2b2	978 de-identified medical discharge summaries, progress notes, and other clinical reports annotated with concepts and coreferences. Includes the ODIE corpus.	Yes, with registration
2012 i2b2 Temporal Relations Challenge	i2b2	310 de-identified medical discharge summaries annotated for events and temporal relations.	Yes, with registration
2014 i2b2 De-identification Challenge	i2b2	1,304 de-identified longitudinal medical records annotated for protected health information.	Yes, with registration
2014 i2b2 Heart Disease Risk Factors Challenge	i2b2	1,304 de-identified longitudinal medical records annotated for risk factors for coronary artery disease.	Yes, with registration
AIMed	Bunescu et al.	200 abstracts annotated for protein–protein interactions, as well as negative example abstracts containing no protein-protein interactions.	Yes
BioC-BioGRID	BioCreAtIvE	120 full text research articles annotated for protein–protein interactions.	Yes
BioCreAtIvE 1	BioCreAtIvE	15,000 sentences annotated for protein and gene names. 1,000 full text biomedical research articles annotated with protein names and Gene Ontology terms.	Yes
BioCreAtIvE 2	BioCreAtIvE	15,000 sentences annotated for protein and gene names. 542 abstracts linked to EntrezGene identifiers. A variety of research articles annotated for features of protein–protein interactions.	Yes
BioCreative V CDR Task Corpus	BioCreAtIvE	1,500 articles published in 2014 or later, annotated for 4,409 chemicals, 5,818 diseases and 3,116 chemical–disease interactions.	Yes
BioInfer	Pyysalo et al.	1,100 sentences from biomedical research abstracts annotated for relationships, named entities, and syntactic dependencies.	No
BioScope	Vincze et al.	1,954 clinical reports, 9 papers, and 1,273 abstracts annotated for linguistic scope and terms denoting negation or uncertainty.	Yes
BioText Recognizing Abbreviation Definitions	BioText Project	1,000 abstracts on the subject of "yeast", annotated for abbreviations and their meanings.	Yes
BioText Protein–Protein Interaction Data	BioText Project	1,322 sentences describing protein–protein interactions between HIV-1 and human proteins, annotated with interaction types.	Yes
Comparative Toxicogenomics Database	Davis et al.	A database of manually-curated associations between chemicals, gene products, phenotypes, diseases, and environmental exposures.	Yes
CRAFT	Verspoor et al.	97 full-text biomedical publications annotated with linguistic structures and biological concepts.	Yes
GENIA Corpus	GENIA Project	1,999 biomedical research abstracts on the topics "human", "blood cells", and "transcription factors", annotated for parts of speech, syntax, terms, events, relations, and coreferences.	Yes
FamPlex	Bachman et al.	Protein names and families linked to unique identifiers. Includes affix sets.	Yes
FlySlip Abstracts	FlySlip	82 research abstracts on Drosophila annotated with gene names.	Yes
FlySlip Full Papers	FlySlip	5 research papers on Drosophila annotated with anaphoric relations between noun phrases referring to genes and biologically related entities.	Yes
FlySlip Speculative Sentences	FlySlip	More than 1,500 sentences annotated as speculative or not speculative. Includes annotations of clauses.	Yes
IEPA	Ding et al.	486 sentences from biomedical research abstracts annotated for pairs of co-occurring chemicals, including proteins.	No
JNLPBA corpus	Kim et al.	An extended version of version 3 of the GENIA corpus for NER tasks.	No
Learning Language in Logic	Nédellec et al.	77 sentences from research articles about the bacterium Bacillus subtilis, annotated for protein–gene interactions.	Yes
Medical Subject Headings	National Library of Medicine	Hierarchically-organized terminology for indexing and cataloging biomedical documents.	Yes
Metathesaurus	National Library of Medicine / UMLS	3.67 million concepts and 14 million concept names, mapped between more than 200 sources of biomedical vocabulary and identifiers.	Yes, with UMLS License Agreement
MIMIC-III	MIT Lab for Computational Physiology	De-identified data associated with 53,423 distinct hospital admissions for adult patients.	Requires training and formal access request
ODIE Corpus	Savova et al.	180 clinical notes annotated with 5,992 coreference pairs.	No
OHSUMED	Hersh et al.	348,566 biomedical research abstracts and indexing information from MEDLINE, including MeSH.	Yes
PMC Open Access Subset	National Library of Medicine / PubMed Central	More than 2 million research articles, updated weekly.	Yes
RxNorm	National Library of Medicine / UMLS	Normalized names for clinical drugs and drug packs, with combined ingredients, strengths, and form, and assigned types from the Semantic Network.	Yes, with UMLS License Agreement
Semantic Network	National Library of Medicine / UMLS	Lists of 133 semantic types and 54 semantic relationships covering biomedical concepts and vocabulary.	Yes, with UMLS License Agreement
SPECIALIST Lexicon	National Library of Medicine / UMLS	A syntactic lexicon of biomedical and general English.	Yes
Word Sense Disambiguation	National Library of Medicine / UMLS	203 ambiguous words and 37,888 automatically extracted instances of their use in biomedical research publications.	Yes, with UMLS License Agreement
Yapex	Franzén et al.	200 biomedical research abstracts annotated with protein names.	No

Word embeddings

Several groups have developed sets of biomedical vocabulary mapped to vectors of real numbers, known as word vectors or word embeddings. Sources of pre-trained embeddings specific to biomedical vocabulary are listed in the table below. The majority are results of the word2vec model developed by Mikolov et al. or variants of word2vec.

Set Name	Authors or Group	Contents and Source	Citation
BioASQword2vec	BioASQ	Vectors produced by word2vec from 10,876,004 English PubMed abstracts.
bio.nlplab.org resources	Pyysalo et al.	A collection of word vectors produced by different approaches, trained on text from PubMed and PubMed Central.
BioVec	Asgari and Mofrad	Vectors for gene and protein sequences, trained using Swiss-Prot.
RadiologyReportEmbedding	Banerjee et al.	Vectors produced by word2vec from the text of 10,000 radiology reports.
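
Word vectors such as those listed above are commonly compared with cosine similarity. The sketch below uses made-up four-dimensional vectors purely for illustration. Pre-trained sets typically have hundreds of dimensions and are loaded from files.

```python
import math

# Hypothetical toy vectors; the values are invented for this example.
EMBEDDINGS = {
    "aspirin": [0.9, 0.1, 0.0, 0.2],
    "ibuprofen": [0.8, 0.2, 0.1, 0.3],
    "kinase": [0.0, 0.9, 0.7, 0.1],
}

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(word, embeddings=EMBEDDINGS):
    """Return the other vocabulary word with the highest cosine similarity."""
    query = embeddings[word]
    scored = [(cosine(query, vec), w) for w, vec in embeddings.items() if w != word]
    return max(scored)[1]
```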

Applications

Text mining applications in the biomedical field include computational approaches to assist with studies in protein docking, protein interactions, and protein-disease associations.

Gene cluster identification

Methods for determining the association of gene clusters obtained by microarray experiments with the biological context provided by the corresponding literature have been developed.

Protein interactions

Automatic extraction of protein interactions and associations of proteins to functional concepts has been explored. The search engine PIE was developed to identify and return protein–protein interaction mentions from MEDLINE-indexed articles. The extraction of kinetic parameters and of the subcellular location of proteins from text has also been addressed by information extraction and text mining technology.

Gene-disease associations

Text mining can aid in gene prioritization, or identification of genes most likely to contribute to genetic disease. One group compared several vocabularies, representations and ranking algorithms to develop gene prioritization benchmarks.
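
A very simple form of literature-based gene prioritization ranks candidate genes by how often they are mentioned together with a disease. The sketch below is an assumed baseline for illustration only. It is not the benchmark method of the group mentioned above, and it relies on naive string matching rather than entity recognition.

```python
from collections import Counter

def prioritize_genes(abstracts, genes, disease):
    """Rank genes by the number of abstracts that mention both gene and disease.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    disease = disease.lower()
    scores = Counter()
    for text in abstracts:
        lowered = text.lower()
        if disease not in lowered:
            continue
        tokens = set(lowered.replace(",", " ").replace(".", " ").split())
        for gene in genes:
            if gene.lower() in tokens:
                scores[gene] += 1
    return sorted(genes, key=lambda g: (-scores[g], g))

abstracts = [
    "BRCA1 and TP53 variants in breast cancer.",
    "BRCA1 loss drives breast cancer progression.",
    "EGFR signalling in lung cancer.",
]
ranking = prioritize_genes(abstracts, ["EGFR", "TP53", "BRCA1"], "breast cancer")
```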

Gene-trait associations

An agricultural genomics group identified genes related to bovine reproductive traits using text mining, among other approaches.

Protein-disease associations

Text mining enables an unbiased evaluation of protein-disease relationships within a vast quantity of unstructured textual data.

Applications of phrase mining to disease associations

A text mining study assembled a collection of 709 core extracellular matrix proteins and associated proteins based on two databases: MatrixDB and UniProt. This set of proteins had a manageable size and a rich body of associated information, making it suitable for the application of text mining tools. The researchers conducted phrase-mining analysis to cross-examine individual extracellular matrix proteins across the biomedical literature concerned with six categories of cardiovascular diseases. They used a phrase-mining pipeline, Context-aware Semantic Online Analytical Processing (CaseOLAP), to semantically score all 709 proteins according to their Integrity, Popularity, and Distinctiveness. The text mining study validated existing relationships and highlighted previously unrecognized biological processes in cardiovascular pathophysiology.

Software tools

Search engines

Search engines designed to retrieve biomedical literature relevant to a user-provided query frequently rely upon text mining approaches. Publicly available tools specific for research literature include PubMed search, Europe PubMed Central search, GeneView, and APSE. Similarly, search engines and indexing systems specific for biomedical data have been developed, including DataMed and OmicsDI.
Some search engines, such as Essie, OncoSearch, PubGene, and GoPubMed, were previously public but have since been discontinued, rendered obsolete, or integrated into commercial products.

Medical record analysis systems

Electronic medical records and electronic health records are collected by clinical staff in the course of diagnosis and treatment. Though these records generally include structured components with predictable formats and data types, the remainder of the reports are often free-text. Numerous complete systems and tools have been developed to analyse these free-text portions. The MedLEE system was originally developed for analysis of chest radiology reports but later extended to other report topics. The clinical Text Analysis and Knowledge Extraction System, or cTAKES, annotates clinical text using a dictionary of concepts. The CLAMP system offers similar functionality with a user-friendly interface.

Frameworks

Frameworks have been developed to rapidly build tools for biomedical text mining tasks. SwellShark is a framework for biomedical NER that requires no human-labeled data but does make use of resources for weak supervision. The SparkText framework uses Apache Spark data streaming, a NoSQL database, and basic machine learning methods to build predictive models from scientific articles.

APIs

Some biomedical text mining and natural language processing tools are available through application programming interfaces, or APIs. NOBLE Coder performs concept recognition through an API.

Conferences

The following academic conferences and workshops host discussions and presentations on advances in biomedical text mining. Most publish proceedings.

Conference Name	Session	Proceedings
Association for Computational Linguistics annual meeting	plenary session and as part of the BioNLP workshop
ACL BioNLP workshop
American Medical Informatics Association annual meeting	in plenary session
Intelligent Systems for Molecular Biology	in plenary session and in the BioLINK and Bio-ontologies workshops
International Conference on Bioinformatics and Biomedicine
International Conference on Information and Knowledge Management	within International Workshop on Data and Text Mining in Biomedical Informatics
North American Association for Computational Linguistics annual meeting	plenary session and as part of the BioNLP workshop
Pacific Symposium on Biocomputing	in plenary session
Practical Applications of Computational Biology & Bioinformatics
Text REtrieval Conference	formerly as part of TREC Genomics track; as of 2018 part of Precision Medicine Track

Journals

A variety of academic journals publishing manuscripts on biology and medicine include topics in text mining and natural language processing software. Some journals, including the Journal of the American Medical Informatics Association and the Journal of Biomedical Informatics, are popular publications for these topics.
