pyMCM is a set of tools that lets you pull data from the MCM website into Python as a pandas DataFrame, query it with RDKit, and add information from published papers about the compounds' physical properties. The result is an easy way to get a lot of information about MCM compounds in Python: for example, finding all MCM species that are PANs, or all that have a tertiary nitrate. See the script examples.py for other ways you might use this!
Dependencies:
Python 3
numpy
pandas
requests
bs4.BeautifulSoup
rdkit (RDKit can be tricky to install correctly. With conda, I got the best results from the conda-forge package: https://anaconda.org/conda-forge/rdkit)
- Scrape data from the MCM webpage for any given list of species, or for all of the compounds in the MCM.
Given just a list of the MCM names you'd like data about, running
MCM_data_scraper()
will pull down each compound's formula, molecular weight, SMILES, InChI, synonyms, and even an image of the compound.
import pandas as pd
import numpy as np
from MCM_web_scraper import *
# Folder where I want to save things...
pth = 'C:/Users/Jhask/OneDrive/Desktop/fldr/'
# A short list of species we want info about...
species_list=['APINENE', 'C5H8', 'BPINENE', 'BCARY']
# Scrape info about all the species in our mechanism from the MCM website & NIST.
MCM_df0 = MCM_data_scraper(species_list, filename='example_scrape',
                           get_image=True, display=False, savepath=pth)
- The scraper saves your data as an .xlsx Excel file and as an HTML file. It's easy enough to convert to other formats with pandas, but if you change the output to a delimited text format, you must use a tab delimiter, because InChI codes contain commas! Saving the data as HTML lets you open everything you scraped as a nice little table in your web browser (which will let you see the images you scraped!)
# Reload the saved Excel data (reading .xlsx files requires the openpyxl package).
df = pd.read_excel(pth + 'example_scrape.xlsx', engine="openpyxl", index_col=0)
# View the images you scraped using the html doc.
display_MCM_table(pth+'example_scrape.html')
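If you do want a delimited text copy of the data, a minimal sketch of a tab-delimited round trip with pandas might look like the following. The two-row DataFrame and its column layout are stand-ins for the scraper's real output (assumptions, not the actual schema), and an in-memory buffer takes the place of a file on disk.

```python
import io
import pandas as pd

# Stand-in for scraped output; the column layout here is an assumption.
demo = pd.DataFrame(
    {"InChI": ["InChI=1S/C5H8/c1-4-5(2)3/h4H,1-2H2,3H3",
               "InChI=1S/CH4/h1H4"]},
    index=pd.Index(["C5H8", "CH4"], name="MCM_Name"),
)

# Write with a tab delimiter so commas inside InChI strings are never
# mistaken for column breaks, then read the text back the same way.
buf = io.StringIO()
demo.to_csv(buf, sep="\t")
buf.seek(0)
restored = pd.read_csv(buf, sep="\t", index_col=0)
```

To write to disk instead, pass a file path in place of the buffer to both `to_csv` and `read_csv`.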
- Query your MCM dataset using the rdkit chemical library. Simply run
query_rdkit_info()
to get the number of each functional group in each MCM compound. Functional groups are currently matched using a set of SMARTS fragments defined in the /Data/ folder. Please open that up for more info! (These tend to be relevant to atmospheric science, obviously.)
# Pass the scraped dataframe to rdkit and get more info about the molecule from the SMILES strings
MCM_df1 = query_rdkit_info(MCM_df0, add_functional_groups=True, overwrite_with_RDKIT=True,
                           verbose=True, save=False)
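Once functional group counts sit in DataFrame columns, questions like "which species are PANs?" become ordinary pandas filters. The sketch below uses a toy table in place of the real output of query_rdkit_info(); the species names, column names, and counts are all hypothetical, so check your own DataFrame's columns for the actual names.

```python
import pandas as pd

# Toy stand-in for the output of query_rdkit_info(); the species names,
# column names, and counts are all made up for illustration.
toy = pd.DataFrame({
    "MCM_Name": ["SPEC_A", "SPEC_B", "SPEC_C"],
    "pan_groups": [1, 0, 0],
    "tertiary_nitrate_groups": [0, 2, 0],
})

# Species with at least one PAN group.
pans = toy.loc[toy["pan_groups"] > 0, "MCM_Name"].tolist()

# Species with at least one tertiary nitrate group.
tert_nitrates = toy.loc[toy["tertiary_nitrate_groups"] > 0, "MCM_Name"].tolist()
```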
- Add information to your MCM dataset from Wang et al., 2017 about the COSMO-Therm predicted gas-particle partitioning coefficients of different MCM compounds.
# Add partitioning coefficient data from Wang et al., 2017 Supplement
MCM_df2 = add_Wang_et_al_info(MCM_df1,'MCM_Name', save=False)
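The internals of add_Wang_et_al_info() are not shown here, but the general pattern of attaching extra data by a shared name column can be sketched with a pandas left join. Everything in this example (species names, the `partition_coeff` column, and the numbers) is a placeholder and not taken from Wang et al., 2017.

```python
import pandas as pd

# Placeholder tables; names, columns, and numbers are invented.
mcm = pd.DataFrame({"MCM_Name": ["SPEC_A", "SPEC_B", "SPEC_C"]})
extra = pd.DataFrame({
    "MCM_Name": ["SPEC_A", "SPEC_C"],
    "partition_coeff": [1.5, 2.5],
})

# A left join keeps every scraped species; those with no matching row in
# the extra table get NaN in the new column.
merged = mcm.merge(extra, on="MCM_Name", how="left")
```

A left join is used so that species missing from the supplementary table still appear in the result, which makes gaps in coverage easy to spot.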
- Get a list of "precursors" to each compound in the MCM. This is stored in a file in /Data/, but is generated by parsing a number of MCM mechanisms and looking at the declared species in each.
# Use precursors file generated above to assign them in our df.
MCM_df3 = assign_precursors(MCM_df2, 'MCM_Name', savepath=pth,
                            filename='example_scrape_plus')
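The idea of deriving precursors from the declared species in each mechanism can be sketched as a simple inversion of a mapping. This is only an illustration of the concept, not the code that generated the file in /Data/; the input dictionary and the helper name `invert_declared_species` are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: the species declared in each precursor's mechanism.
declared = {
    "PRECURSOR_1": ["SPEC_A", "SPEC_B"],
    "PRECURSOR_2": ["SPEC_B", "SPEC_C"],
}

def invert_declared_species(declared_by_precursor):
    """Map each species to the sorted list of precursors that declare it."""
    found = defaultdict(set)
    for precursor, species_list in declared_by_precursor.items():
        for species in species_list:
            found[species].add(precursor)
    return {species: sorted(names) for species, names in found.items()}

precursors_of = invert_declared_species(declared)
```

A species declared in several mechanisms ends up with several precursors, which is why each value is a list.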
NOTE: All data in Data files are for MCM v 3.3.1 as of 10/29/2021.