Computational Approaches to Support Identification of Chemicals in the Environment

Co-authored by Andrew D. McEachran and Antony J. Williams
Published in Research Data

The number of chemicals detected in the environment continues to increase. These range from expected pollutants such as pesticides and pharmaceuticals (for example, opioids and cannabinoids) to metabolites and degradants. The rapid identification of small molecules in environmental monitoring studies generally relies on high-resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) techniques. NTA generally combines the acquisition of HRMS spectral signatures for hundreds to thousands of chemicals with informatics approaches that search databases of “known” chemicals.


Freely available public online databases can contain tens of millions of chemicals (for example, the PubChem and ChemSpider databases contained 96 million and 74 million substances, respectively, as of August 2019). While these large databases are useful for broad chemical searching, more focused databases are better suited to identifying chemicals in the environment. At the US EPA we have been building a more focused data collection, the DSSTox database, to support our computational toxicology research for almost 20 years, and it now contains over 875,000 substances (as of August 2019). The “CompTox Chemicals Dashboard” is a freely available web interface to the data contained in DSSTox and has specific functionality that supports our mass spectrometry analyses and the identification of “known unknowns”.


When attempting to identify an unknown chemical in an environmental sample, most search techniques use either a generated molecular formula or an observed molecular mass to determine the candidate chemicals for that unknown. In many cases tens to hundreds of chemicals in the database match a given molecular formula or mass. For example, the molecular formula of bisphenol A (BPA, which many of us will know from the emphasis on “BPA-free” in commerce) corresponds to over 200 chemicals in the collection of 875,000 substances. The challenge is to identify which of these chemicals are the more likely candidates. One approach that has proven valuable to date is “metadata ranking”, which uses available data, such as the number of consumer products containing the chemical or the number of scientific articles in PubMed mentioning the chemical, to prioritize the candidates.
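The two steps described above, a mass search followed by metadata ranking, can be sketched in a few lines of Python. This is only an illustrative sketch and not the Dashboard's implementation: the function names, the 5 ppm tolerance, the scoring rule (a plain sum of the two counts), and every count in the toy database are assumptions made up for the example. Only the BPA formula (C15H16O2) and the element masses are real values.

```python
from dataclasses import dataclass

# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503207, "O": 15.99491461956}


@dataclass
class Candidate:
    name: str
    formula: dict       # element symbol -> atom count
    product_count: int  # consumer products containing the chemical (placeholder)
    pubmed_count: int   # PubMed articles mentioning the chemical (placeholder)


def monoisotopic_mass(formula):
    """Sum the monoisotopic masses of all atoms in the formula."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def search_by_mass(candidates, observed_mass, ppm=5.0):
    """Return candidates whose mass lies within +/- ppm of the observed mass."""
    tolerance = observed_mass * ppm / 1e6
    return [c for c in candidates
            if abs(monoisotopic_mass(c.formula) - observed_mass) <= tolerance]


def rank_by_metadata(candidates):
    """Order candidates by a simple, assumed score: the sum of both counts."""
    return sorted(candidates,
                  key=lambda c: c.product_count + c.pubmed_count,
                  reverse=True)


# Toy database. All counts are invented placeholders, not real Dashboard data.
TOY_DATABASE = [
    Candidate("hypothetical isomer", {"C": 15, "H": 16, "O": 2}, 0, 3),
    Candidate("bisphenol A", {"C": 15, "H": 16, "O": 2}, 120, 9000),
    Candidate("hypothetical decoy", {"C": 14, "H": 12, "O": 3}, 5, 40),
]

hits = search_by_mass(TOY_DATABASE, observed_mass=228.1150, ppm=5.0)
for candidate in rank_by_metadata(hits):
    print(candidate.name, round(monoisotopic_mass(candidate.formula), 5))
```

The decoy differs from the observed mass by roughly 0.036 Da, well outside the 5 ppm window, so only the two C15H16O2 entries survive the search. The ranking step then places the entry with the larger placeholder counts first.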


To increase confidence in an identification beyond what metadata can provide, researchers use spectral “fragmentation patterns” (how a chemical structure breaks apart in a high-energy collision) to match what was observed on an analytical instrument to what has previously been observed for that same structure. These data, when available, can boost confidence in identifying chemicals, and a growing number of spectral databases, such as MassBank, are freely available online. Overall, however, fragmentation data are scarce, which limits their generalized high-throughput use in routine identifications. The goal of our reported work was to fill this crucial gap by predicting and storing fragmentation patterns for the entirety of the EPA’s DSSTox database, giving easy access to both the rich metadata and the fragmentation patterns for broad, high-throughput use to boost confidence in chemical identifications. We hope that individuals, research groups, and analytical chemistry vendors will find the data valuable, informative, and effective.
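To make the idea of matching an observed fragmentation pattern against a stored or predicted one concrete, here is a minimal sketch that scores two peak lists with a binned cosine similarity. The cosine score is a commonly used spectral similarity measure, but the post does not say which scoring method was used, so treat the choice of score, the 0.01 m/z bin width, the function names, and all peak values below as assumptions for illustration.

```python
import math


def bin_spectrum(peaks, bin_width=0.01):
    """Group (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(spectrum_a, spectrum_b, bin_width=0.01):
    """Cosine of the angle between two binned spectra (0 = no overlap, 1 = same shape)."""
    a = bin_spectrum(spectrum_a, bin_width)
    b = bin_spectrum(spectrum_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented peak lists (m/z, relative intensity); not measured or predicted data.
observed = [(93.034, 20.0), (133.064, 45.0), (212.084, 100.0)]
predicted_similar = [(93.033, 25.0), (133.062, 40.0), (212.083, 100.0)]
predicted_unrelated = [(77.039, 100.0), (105.034, 60.0)]

print(cosine_similarity(observed, predicted_similar))    # close to 1
print(cosine_similarity(observed, predicted_unrelated))  # no shared bins
```

A candidate whose predicted spectrum scores near 1 against the observed spectrum is better supported than one scoring near 0. Fixed-width binning is the simplest option: peaks that fall on either side of a bin edge are not matched, which is why practical tools often use tolerance-based peak matching instead.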


Disclaimer: The views expressed in this paper are those of the authors and do not necessarily reflect the views or policies of the U.S. Environmental Protection Agency. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.
