PubChem chemical structure standardization

Volker D. Hähnke, Sunghwan Kim & Evan E. Bolton
Journal of Cheminformatics volume 10, Article number: 36 (2018)

Background: PubChem is a chemical information repository, consisting of three primary databases: Substance,
Compound, and BioAssay. When individual data contributors submit chemical substance descriptions to Substance,
the unique chemical structures are extracted and stored into Compound through an automated process called
structure standardization. The present study describes the PubChem standardization approaches and analyzes them
for their success rates, reasons that cause structures to be rejected, and modifications applied to structures during
the standardization process. Furthermore, the PubChem standardization is compared to the structure normalization
of the IUPAC International Chemical Identifier (InChI) software, as manifested by conversion of the InChI back into a
chemical structure.

Results: The observed rejection rate for substances processed by PubChem standardization was 0.36%, which is
predominantly attributed to structures with invalid atom valences that cannot be readily corrected without additional
information from contributors. Of all structures that pass standardization, 44% are modified in the process, reducing
the count of unique structures from 53,574,724 in Substance to 45,808,881 in Compound as identified by
de-aromatized canonical isomeric SMILES. Even though the processing time is very low on average (only 0.4% of structures
have individual standardization time above 0.1 s), total standardization time is completely dominated by edge cases:
90% of the time to standardize all structures in PubChem Substance is spent on the 2.05% of structures with the
highest individual standardization time. It is worth noting that 60% of the structures obtained from PubChem structure
standardization are not identical to the chemical structure resulting from the InChI (primarily due to preferences for a
different tautomeric form).
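
The two unique-structure counts quoted above imply a specific reduction, which the following minimal Python sketch works out. The counts are taken from the Results paragraph; the variable and function names are illustrative only and do not come from the paper or from any PubChem software.

```python
# Unique-structure counts quoted in the Results paragraph.
unique_in_substance = 53_574_724
unique_in_compound = 45_808_881


def reduction(before, after):
    """Return the absolute and relative drop from `before` to `after`."""
    removed = before - after
    return removed, removed / before


removed, fraction = reduction(unique_in_substance, unique_in_compound)
print(f"{removed:,} fewer unique structures ({fraction:.1%} reduction)")
# prints: 7,765,843 fewer unique structures (14.5% reduction)
```

This figure describes distinct input structures that converge on the same standardized form. It is separate from the 0.36% rejection rate, which counts substances that fail standardization.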

Conclusions: Standardization of chemical structures is complicated by the diversity of chemical information
and of the approaches used to represent it. The PubChem standardization is an effective and efficient tool to account
for molecular diversity and to eliminate invalid/incomplete structures. Further development will concentrate on
improved tautomer consideration and an expanded stereocenter definition. Modifications are difficult to thoroughly
validate, with slight changes often affecting many thousands of structures and various edge cases. The PubChem
structure standardization service is accessible as a public resource (https://pubchem.ncbi.nlm.nih.gov/standardize),
and via programmatic interfaces.

Information
Content Type OER
Author Volker D. Hähnke, Sunghwan Kim, Evan E. Bolton
DOI https://doi.org/10.1186/s13321-018-0293-8
Content Link https://jcheminf.biomedcentral.com/track/pdf/10.1186/s13321-018-0293-8
License Open Access
Date Published August 10, 2018
Content Tags Cheminformatics, Content type, Data Extraction, Data Management, InChI Applications, InChI Key, Publication, Search, Software, Toolkits