WELCOME TO CREX, THE COLLABORATIVE RESEARCH EXCHANGE FOR THE NIH INTRAMURAL RESEARCH PROGRAM

Bioinformatics, Metabolomics and Extracting Value From Data

This blog post was written by Metabolon, whose scalable, customizable metabolomics solutions provide support across the research continuum from discovery through clinical trials, to in-market life cycle management.

What is Bioinformatics?

Bioinformatics is the combined discipline of biology and computation. It typically focuses on the analysis and manipulation of large, and often highly complex, biological datasets. Since the advent of genome sequencing, new data types covering diverse areas of biology (such as metabolomic, transcriptomic and proteomic data) have become prominent. The amount of data captured per method can vary from tens to thousands of unique molecular signals per sample. As a result, the need for specialized statistical analysis methods tailored to individual data types has never been greater. Because multiple processing steps are typically required, often with considerable computational challenges, research groups frequently rely on bioinformaticians to make use of the data.

What is a Bioinformatician’s Workflow?

The most common role of a bioinformatician is to extract valuable insights from biological data using analysis pipelines. Common pipeline stages include data processing (cleaning and transformation), data integration and aggregation (adding an informative external dataset) and data analysis (scientific insight generation). Biological data size can vary from megabytes (a few minutes of music in MP3 format) to petabytes (the entire printed collection of the U.S. Library of Congress, the world’s largest library). Handling and integrating data sources that vary so widely in size is often difficult, but it forms a crucial part of product development pipelines in the biotechnology and pharmaceutical industries. The primary technologies that drive most of these pipelines are genomics and transcriptomics. Despite their long history, these technologies continue to advance rapidly, so a lab-level understanding of physical sample processing is important for adequately handling the data they produce. State-of-the-art data types often have quirks that stem from the technologies used to generate them, which is why a bioinformatician is often better placed to handle them than a more general practitioner, such as a data scientist.
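The three pipeline stages described above can be sketched in a few lines of Python. This is a minimal illustration only: the sample names, metabolite intensities, pathway labels and function names are all hypothetical, and a real pipeline would use dedicated libraries and far larger inputs.

```python
import math

# Hypothetical raw measurements: sample -> {metabolite: intensity, None if missing}
raw = {
    "sample_1": {"glucose": 1200.0, "lactate": 800.0, "caffeine": None},
    "sample_2": {"glucose": 1500.0, "lactate": 400.0, "caffeine": 64.0},
}

# Hypothetical external annotation (the "informative external dataset")
pathway_of = {
    "glucose": "glycolysis",
    "lactate": "glycolysis",
    "caffeine": "xenobiotics",
}


def process(samples):
    """Processing: drop missing values and log2-transform the rest."""
    return {
        sample: {m: math.log2(v) for m, v in values.items() if v is not None}
        for sample, values in samples.items()
    }


def integrate(samples, annotation):
    """Integration: attach a pathway label to every measurement."""
    return [
        (sample, metabolite, annotation.get(metabolite, "unknown"), value)
        for sample, values in samples.items()
        for metabolite, value in values.items()
    ]


def analyze(records):
    """Analysis: mean log2 intensity per (sample, pathway) pair."""
    grouped = {}
    for sample, _metabolite, pathway, value in records:
        grouped.setdefault((sample, pathway), []).append(value)
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


summary = analyze(integrate(process(raw), pathway_of))
```

Keeping each stage as a separate function mirrors the way the stages are described in the text and makes it easier to swap in a different cleaning rule or annotation source without touching the analysis step.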

Data Sources for Bioinformatics

To aid in overcoming these difficulties, several collaborative initiatives have produced extensive and readily accessible bodies of biological knowledge. These are often open-source databases, most of which can be combined with each other to better enable insights. Examples include gene-to-disease and gene-to-pathway association databases, such as MSigDB,1 and drug-response knowledge bases such as PharmGKB.2 These data sources have been essential utilities for investigators seeking to get value from large-scale biological experiments. This knowledge is increasingly curated and made computationally accessible via APIs, which allows bioinformaticians to create relevant and useful visualizations of meta-information that inform the understanding of data. The longevity of genomics technologies also means that their processing and analysis can be standardized for scalability, so that over time extracting value from them will require less work and may eventually be performed automatically via bioinformatics platforms. However, because these data sources are becoming increasingly vast, and individual data elements are often substantial in size, a specific research goal must be set before any data processing begins in order to achieve a fruitful outcome. This allows the correct data to be selected and processed appropriately to address that goal.

Examples of Targeted Research Goals in Omics Studies

An example of a targeted research goal could be using the data provided to profile and understand disease frequencies within a specific population. This process could involve the creation of models comparing diseased individuals with healthy controls. These models yield sets of genes and disease-associated gene variants that can then be investigated to develop an understanding of the disease. The models can range from simple binary comparisons to more complex comparisons with many groups (such as placebos or different disease phenotypes) in which elements of the population structure, such as age, sex or BMI, are considered. Where rarer diseases with fewer available cases are studied, experimental designs that explore the relationship between molecular readouts, such as genes or metabolites, and the symptoms of disease can be used.
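As a rough illustration of the simplest case, a binary comparison of diseased individuals with healthy controls, the sketch below ranks genes by Welch's t statistic. The gene names and expression values are invented for the example, and the choice of Welch's t is an assumption made here rather than a method named in this post. Models that account for age, sex or BMI would need a regression framework, which is not shown.

```python
import math
from statistics import mean, variance


def welch_t(cases, controls):
    """Welch's t statistic for two groups with possibly unequal variances."""
    standard_error = math.sqrt(
        variance(cases) / len(cases) + variance(controls) / len(controls)
    )
    return (mean(cases) - mean(controls)) / standard_error


# Hypothetical log-scale expression values for two genes
expression = {
    "GENE_A": {"cases": [8.1, 7.9, 8.4, 8.0], "controls": [6.0, 6.2, 5.9, 6.1]},
    "GENE_B": {"cases": [5.0, 5.2, 4.9, 5.1], "controls": [5.1, 4.9, 5.0, 5.2]},
}

# Rank genes by the absolute size of the case-control difference
ranked = sorted(
    expression,
    key=lambda gene: abs(welch_t(**expression[gene])),
    reverse=True,
)
```

In this toy data GENE_A separates the two groups clearly while GENE_B does not, so GENE_A ranks first. A real analysis would also convert the statistic to a p-value and correct for testing many genes at once.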

Integrating raw captured data with additional data sources gives more informed context to a given dataset. It often plays a large part in highlighting the elements of the biological response that are crucial to a given question, such as metabolic pathway membership and disease association, and in controlling for biases in the analysis, such as those introduced by drug and stimulant use.

From these examples, it is clear that the analytical methodology is heavily dependent on the data collection process, further highlighting the importance of a well-defined research goal prior to investigation.

Real Industry Examples of Gene-based Technologies Used for Product Development

There are countless examples of the utility that gene-based technologies have brought to product development in the biotechnology and pharmaceutical industries. One of the earliest examples of the power of bioinformatics in target discovery used DNA sequence comparisons (a bioinformatic exercise): similarities to platelet-derived growth factor (PDGF) helped identify the simian sarcoma virus oncogene, v-sis.3 This seminal finding led to a broader understanding of the important influence of viral components on gene expression in disease causality (e.g., cancer) and to mechanism-based drug discovery programs.4 The continuum from target identification through drug discovery and development to approval is an expensive proposition, estimated to cost over $2.0 billion.5 The combination of bioinformatic tools and high-quality, gene-based datasets has been critical in directly supporting clinical trials and the successful approval of new drugs.6

The increasing prevalence of established and easily automatable methods ensures the continued place of gene-based technologies in drug discovery and development workflows. Newer omics technologies, such as metabolomics, proteomics, and spatial and single-cell methods, do not yet have the same consensus on analysis. This makes them more complicated to use and creates a continued need for research and development investment to ensure that the data they generate are analyzed with the correct methods.

Benefits of Metabolomics

Established technologies like genomics and transcriptomics have gaps and disadvantages that technologies such as metabolomics may be well suited to fill. Much of genomics relies upon the alignment of DNA or RNA molecules to a reference genome. This method inherently discards information not attributable to human systems (unless bespoke pipelines and methods are applied to hunt down this information). Mass spectrometry approaches, such as those used in Metabolon’s technology, allow for the direct quantification of molecules derived from non-human organisms such as bacteria, fungi and viruses. This allows their effects to be quantified more directly, without relying on the imprecise proxies for their detection (such as upregulation of immune response genes) that transcriptomic and proteomic analyses often depend on. Metabolomics is therefore better suited than other methods to detecting highly context-specific signals that may be instrumental in understanding the effects of drugs or diseases. The metabolome represents the integrated sum of all biological activities directly associated with observed phenotypes, and so offers important insights that fall well outside the scope of gene-centric knowledge bases. In addition, comprehensive metabolomic profiling provides chemical identifiers from nutritional, environmental, microbial and other exposures. These exposures have a direct impact on the metabolic status, or metabotype, of an individual, and account for inherent biological variation within a population that cannot be explained by genetic make-up alone.7 The use of metabolomics to establish individual metabotypes, in combination with genetics, provides a particularly robust basis for personalizing drug treatments in precision medicine.

Metabolomics, more than other omics technologies, enables the detection of by-products of highly specific processes that may only be employed in tightly constrained contexts, may be disease or drug-specific and may either act as highly specific markers of disease or potentially even play causal roles. The unconstrained nature of mass spectrometry means that by-products of all processes may be detected and quantified. This feature of metabolomics leads to its high degree of sensitivity to environmental factors, such as diet, pharmaceutical use and exposure to things like smoke.

This increased sensitivity of metabolomics technology to environmental factors gives it advantages over gene-based approaches in many contexts. The particulars of an individual’s diet or lifestyle are readily detectable in metabolomic data, while they are not always apparent in transcriptomic data. This can provide valuable extra context to the relationship between lifestyle and disease that might not be discovered using a gene-centric approach. When used together with gene-centric technologies, metabolomic data can provide a valuable interfacing layer for quantifying what might be driving the effects of lifestyle on the body’s other systems and, ultimately, on health.

From a bioinformatics perspective, these advantages bring increased value to experimental analysis, but they also create challenges for the analysis and interpretation of metabolomics data. For example, the sensitivity of metabolomics data to lifestyle factors can introduce variation that makes unsupervised analyses, such as clustering or dimensionality reduction, less informative (at least in scenarios where lifestyle factors are already known), because lifestyle factors can dominate the signal. Similarly, the many potential origins of a metabolite in an experiment raise the knowledge barrier for analyzing metabolomics data compared with other omics. Metabolomics has a greater need for the kind of detailed functional and pathway annotation and ontology information that is easily accessible in genomics through centralized databases such as MSigDB, which have been integrated into easy-to-use APIs and analysis and visualization packages in R and Python.
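One possible way to limit the influence of a lifestyle factor that is already known is to regress it out of each metabolite before clustering or dimensionality reduction. The sketch below does this with ordinary least squares for a single covariate. The covariate (cups of coffee per day), the metabolite values and the function name are hypothetical, and this is offered as one common option rather than a method prescribed in this post.

```python
from statistics import mean


def residualize(y, x):
    """Remove the linear effect of covariate x from y using least squares."""
    x_bar, y_bar = mean(x), mean(y)
    sum_sq_x = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum_sq_x
    intercept = y_bar - slope * x_bar
    # What remains is the part of y not explained by the covariate
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical data: a metabolite that closely tracks daily coffee intake
coffee_cups = [0, 1, 2, 3, 4, 5]
metabolite = [1.0, 2.1, 2.9, 4.2, 5.0, 5.9]

adjusted = residualize(metabolite, coffee_cups)
```

The adjusted values are uncorrelated with the covariate by construction, so any structure that an unsupervised method finds in them is no longer driven by that lifestyle factor. The trade-off is that biology genuinely linked to the factor is removed along with it.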

Discussion on Multiomics

One of the areas of bioinformatics where consensus on best approaches remains relatively poor is the analysis of data combining multiple omics technologies. One reason for this is the incredible diversity of omics combinations that researchers face. For many combinations of omics data types, there are distinct methods that can be used to draw useful information from the interface between them. The binding site analysis made possible by joining transcriptome and methylome data is the most obvious example, but quantifying the effects of mutations on gene expression, and of gene expression on protein expression, is also common. Recent metabolomics research has demonstrated methods that relate the presence or absence of gene variants to the levels of key disease metabolites. Omic-technology-agnostic data integration methods are also commonly used in the analysis of multiomics data. These methods typically rely on “embedding” type approaches, which create simpler “stand-in” variables by combining highly correlated variables that represent a single fundamental process. Each “stand-in” variable may then represent the strength of a given biological process within an individual.
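The idea of a “stand-in” variable can be illustrated with a deliberately simplified sketch: features that correlate strongly with a chosen anchor feature are z-scored and averaged into one composite score per individual. The feature names, values, anchor and correlation threshold are all hypothetical. Real embedding methods (principal component analysis or factor analysis, for example) learn these combinations from the data rather than relying on a hand-picked anchor.

```python
from statistics import mean, pstdev


def zscore(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def pearson(a, b):
    """Pearson correlation computed as the mean product of z-scores."""
    return mean(x * y for x, y in zip(zscore(a), zscore(b)))


def composite_score(features, names):
    """Average the z-scores of the named features for each individual."""
    scaled = [zscore(features[name]) for name in names]
    return [mean(column) for column in zip(*scaled)]


# Hypothetical measurements for five individuals across two omics layers
features = {
    "glycolysis_gene": [1.0, 2.0, 3.0, 4.0, 5.0],
    "lactate": [2.1, 3.9, 6.2, 8.0, 9.8],
    "unrelated": [5.0, 1.0, 4.0, 2.0, 3.0],
}

anchor = "glycolysis_gene"  # assumed starting feature
threshold = 0.9             # assumed cut-off for "highly correlated"

group = [
    name for name in features
    if pearson(features[anchor], features[name]) >= threshold
]
score = composite_score(features, group)
```

Here the gene and the metabolite move together and are merged into one score that rises from the first individual to the last, while the unrelated feature is left out.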

The above-mentioned advantages of metabolomics are also the key sources of value that it brings to multiomics contexts. Because DNA, RNA and protein are inherently related, there is naturally a high degree of redundancy in signals derived from these sources. Because metabolomics can detect chemicals derived from the interaction between these processes and the environment, it brings an additional source of information to an experimental analysis. This can be instrumental in providing context to information gleaned from the other technologies, while also providing extremely valuable information that is independent of them. One published example using Metabolon data showed that supplementing transcriptomics with metabolomics yielded additional information on the relationship between glycolysis and the immune response in the context of azithromycin treatment after allogeneic stem cell transplantation. Integrating metabolomics into the multiomics analyses revealed changes in cell energy metabolism pathways.8

Metabolomics may provide additional value in multiomics contexts by offering more precise proxies for factors that are often reduced, by necessity, to simple binary labels. For example, whether a participant received analgesics in the previous 24 hours is often used as a key variable in modeling. The same information may be captured through the levels of the metabolites derived from these drugs, which are easily detectable in metabolomic data. Because these levels more accurately represent the state of drug processing in the body, they could bring significant improvements to modeling that uses other omics technologies.

One of the key goals of multiomics approaches is to create a more complete picture of the biology underlying a given experiment. Novel ways of layering omics data are applied so that this picture is free from the deficiencies and biases of any single technology. For more depth on the specific ways metabolomic data can be integrated and used in multiomics contexts, please see the accompanying blog post in our bioinformatics series, where we cover the common approaches used in multiomics integration and analysis.

Final Conclusions

In summary, metabolomics is sensitive to xenobiotic and environmental factors and can detect molecules produced in highly specific contexts. This brings valuable additional biological context to the analyses often performed by bioinformaticians, adding layers of information that aren’t easily captured by genomics, transcriptomics or proteomics. This makes metabolomics particularly valuable in multiomics analyses. Further development of the databases and infrastructure surrounding metabolomics in the coming years will make it easier to integrate metabolomics effectively with other omics, cementing its place in biotechnology and pharmaceutical development pipelines.

References
  1. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. Oct 25 2005;102(43):15545-50. doi:10.1073/pnas.0506580102
  2. Whirl‐Carrillo M, Huddart R, Gong L, et al. An evidence‐based framework for evaluating pharmacogenomics knowledge for personalized medicine. Clinical Pharmacology & Therapeutics. 2021;110(3):563-572.
  3. Doolittle RF, Hunkapiller MW, Hood LE, et al. Simian sarcoma virus onc gene, v-sis, is derived from the gene (or genes) encoding a platelet-derived growth factor. Science. 1983;221(4607):275-277.
  4. Gibbs JB. Mechanism-based target identification and drug discovery in cancer research. Science. 2000;287(5460):1969-1973.
  5. DiMasi JA, Grabowski HG, Hansen RW. Innovation in the pharmaceutical industry: new estimates of R&D costs. Journal of Health Economics. 2016;47:20-33.
  6. Ochoa D, Karim M, Ghoussaini M, Hulcoop DG, McDonagh EM, Dunham I. Human genetics evidence supports two-thirds of the 2021 FDA-approved drugs. Nat Rev Drug Discov. 2022;21(8):551.
  7. Hillesheim E, Brennan L. Metabotyping: A tool for identifying subgroups for tailored nutrition advice. Proceedings of the Nutrition Society. 2023;82(2):130-141.
  8. Vallet N, Le Grand S, Bondeelle L, et al. Azithromycin promotes relapse by disrupting immune and metabolic networks after allogeneic stem cell transplantation. Blood. 2022;140(23):2500-2513.