Molecular Fingerprints with Persistent Homology for Machine Learning Applications in Chemistry

A new molecular representation based on persistent homology, an applied branch of topology, for efficient screening of large molecular databases.
Published in Chemistry
Molecular Fingerprints with Persistent Homology for Machine Learning Applications in Chemistry

Share this post

Choose a social network to share with, or copy the shortened URL to share elsewhere

This is a representation of how your post may appear on social media. The actual post will vary between social networks

Machine learning applications for chemical problems are rapidly increasing the past few years. Their popularity is justified since they have led to the discovery of new molecules and materials with enhanced properties, new reactions, or have contributed to the reduction of computational effort needed of complex calculations and simulations. These are just a few examples about the success of data-driven approaches in chemistry but an open question that has not fully yet addressed is how a computational algorithm can efficiently “read” and “learn” patterns from molecular structures.

 In this collaborative work between Jacob Townsend, John Hymel and Konstantinos Vogiatzis (Chemistry, University of Tennessee) and Cassie Micucci and Vasileios Maroulas (Mathematics, University of Tennessee), we are presenting a novel molecular representation method based on persistent homology, an applied branch of topology, which encodes the atomistic structure of molecules. Precisely, a molecule is mapped into a persistence diagram, a two-dimensional point summary, which demystifies the connected components and the empty space that exist in a molecule based on the atom types and the distances among them. A persistence diagram is further vectorized to a persistence image (PI), a weighted representation of the diagram, which captures the chemically driven uncertainty. The PI in that sense is a “molecular fingerprint”, and when used with machine learning, offers an efficient and reliable approach to screen large molecular databases when compared to other popular molecular representation schemes. The efficiency arises from the low computation effort needed to compare a large number of fingerprints, and the similar-size representations that are generated, independently of the molecular sizes.

We demonstrated the applicability of the PI method by screening a large molecular database (GDB-9) with 133,885 organic molecules. Our target was to identify novel molecular units that selectively interact with COand can be used as building blocks of materials, such as polymeric membranes. We began our study by computing with density functional theory (DFT) the CO2 interaction energies of 100 organic molecules. Since the initial, limited 100 data points were not capturing the diversity of the GDB-9 database, we applied a technique called active learning in order to incrementally obtain data which helped us efficiently screen the 133,885 molecules. We found out that the combination of PIs with active learning performed well with data (interaction energies) from only 220 molecules in order to identify new molecules with stronger CO2 binding. Finally, our data-driven methodology was able to identify molecular patterns previously unknown to us that increase the CO2 affinity of organic molecules.

The details of the PI molecular representation can be found on the article published in Nature Communications, 11, 3230.

Please sign in or register for FREE

If you are a registered user on Research Communities by Springer Nature, please sign in

Follow the Topic

Physical Sciences > Chemistry

Related Collections

With collections, you can get published faster and increase your visibility.

Cancer and aging

This cross-journal Collection invites original research that explicitly explores the role of aging in cancer and vice versa, from the bench to the bedside.

Publishing Model: Hybrid

Deadline: Jul 31, 2024

Applied Sciences

This collection highlights research and commentary in applied science. The range of topics is large, spanning all scientific disciplines, with the unifying factor being the goal to turn scientific knowledge into positive benefits for society.

Publishing Model: Open Access

Deadline: Ongoing