python tools for using the Master Chemical Mechanism (MCM)

jhaskinsPhD/pyMCM

pyMCM

About

pyMCM is a set of tools for pulling data from the MCM website into Python as a pandas DataFrame, querying it with RDKit, and adding information from published papers about the compounds' physical properties. The result is an easy way to look up information on MCM compounds in Python, for example finding all MCM species that are PANs or that contain a tertiary nitrate. See the script examples.py for other ways you might use this!

Dependencies

Python 3
numpy
pandas
requests
bs4.BeautifulSoup
rdkit (RDKit can be tricky to install correctly. If using conda, I got the best results with the conda-forge package: https://anaconda.org/conda-forge/rdkit)

Scraping Data

  1. Scrape data from the MCM webpage for a given list of species, or for all of the compounds in the MCM. Given just a list of the MCM names you'd like data about, MCM_data_scraper() will pull down each compound's formula, molecular weight, SMILES, InChI, synonyms, and even an image of the compound.
import pandas as pd 
import numpy as np 
from MCM_web_scraper import *

# Folder where I want to save things... 
pth = 'C:/Users/Jhask/OneDrive/Desktop/fldr/'

# A short list of species we want info about... 
species_list=['APINENE', 'C5H8', 'BPINENE', 'BCARY']

# Scrape info about all the species in our mechanism from the MCM website & NIST.
MCM_df0= MCM_data_scraper(species_list, filename='example_scrape',
                            get_image=True, display=False, savepath=pth)

Viewing Data

  1. The scraper saves your data as an .xlsx Excel file and as an HTML file. It's easy enough to convert to other formats with pandas, but note that if you change the output to a delimited text format, you should choose a tab delimiter, because InChI codes have commas in them! Saving the data as HTML lets you open everything you scraped as a nice little table in your web browser (which will let you see the images you scraped!).
# Reload the saved Excel data (pandas needs the openpyxl package for this engine).
df = pd.read_excel(pth + 'example_scrape.xlsx', engine="openpyxl", index_col=0)

# View the images you scraped using the html doc.
display_MCM_table(pth+'example_scrape.html')
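If you do want a delimited text copy of the table, the tab-delimiter advice above can be sketched as follows. This is a minimal illustration, not part of pyMCM: the toy DataFrame and its column names ("Formula", "InChI") are assumptions standing in for a real scrape, and an in-memory buffer is used in place of a file on disk.

```python
import io

import pandas as pd

# Toy stand-in for a scraped table. The column names are assumptions;
# check the columns of your own scraped DataFrame.
df = pd.DataFrame(
    {
        "Formula": ["C5H8"],
        "InChI": ["InChI=1S/C5H8/c1-4-5(2)3/h4H,1-2H2,3H3"],
    },
    index=pd.Index(["C5H8"], name="MCM_Name"),
)

# Write tab-delimited text so the commas inside InChI strings
# can never be mistaken for field separators.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t")

# Read it back, telling pandas that the file is tab-delimited.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t", index_col=0)
```

To write to disk instead, pass a file path (for example pth + 'example_scrape.tsv') to to_csv and read_csv in place of the buffer.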

Using RDKit to get more information!

  1. Query your MCM dataset using the RDKit chemistry library. Simply run query_rdkit_info() to count the functional groups of each type in each MCM compound. Functional groups are currently matched using a set of SMARTS fragments defined in the /Data/ folder. Please open that up for more info! (These tend to be relevant to atmospheric science, obviously.)
# Pass the scraped dataframe to rdkit and get more info about the molecule from the SMILES strings  
MCM_df1=query_rdkit_info(MCM_df0,  add_functional_groups=True,overwrite_with_RDKIT=True,
                          verbose=True, save=False)
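For readers new to SMARTS matching, here is a rough sketch of how functional groups can be counted with RDKit. It is not the implementation of query_rdkit_info(): the two SMARTS patterns, the PATTERNS dictionary, and the count_groups() helper are hypothetical and written only for this illustration. The patterns pyMCM actually uses are the ones in the /Data/ folder.

```python
from rdkit import Chem

# Hypothetical SMARTS patterns, for illustration only.
# pyMCM's own definitions live in the /Data/ folder.
PATTERNS = {
    "hydroxyl": Chem.MolFromSmarts("[CX4][OX2H]"),
    "nitrate": Chem.MolFromSmarts("[#6][OX2][N+](=O)[O-]"),
}


def count_groups(smiles):
    """Count matches of each pattern in the molecule given by a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    return {
        name: len(mol.GetSubstructMatches(pattern))
        for name, pattern in PATTERNS.items()
    }


# A small molecule with one hydroxyl group and one nitrate group.
counts = count_groups("CC(O)CO[N+](=O)[O-]")
```

Applied to a column of SMILES strings, a helper like this gives one count per functional group per compound, which is the kind of table the step above describes.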

Add COSMO-Therm predicted gas-particle partitioning coefficients

  1. Add information to your MCM dataset from Wang et al., 2017 about the COSMO-Therm predicted gas-particle partitioning coefficients of different MCM compounds.
# Add partitioning coefficient data from Wang et al., 2017 Supplement 
MCM_df2 = add_Wang_et_al_info(MCM_df1,'MCM_Name', save=False)
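Conceptually, adding literature data keyed on the MCM name resembles a left join in pandas. The sketch below shows only that general idea and is an assumption about the approach, not the code inside add_Wang_et_al_info(). The column "placeholder_value" and its numbers are made-up placeholders and are not taken from Wang et al., 2017.

```python
import pandas as pd

# Stand-in for the scraped table, keyed on the MCM name.
scraped = pd.DataFrame({"MCM_Name": ["APINENE", "C5H8", "BPINENE"]})

# Stand-in for a supplementary table. The values are placeholders,
# NOT real partitioning coefficients.
extra = pd.DataFrame(
    {
        "MCM_Name": ["APINENE", "BPINENE"],
        "placeholder_value": [1.0, 2.0],
    }
)

# A left join keeps every scraped species; species that are missing
# from the supplementary table get NaN in the new column.
merged = scraped.merge(extra, on="MCM_Name", how="left")
```

Using how="left" means no scraped compound is dropped just because the literature table has no entry for it.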

Add information about MCM Precursors

  1. Get a list of "precursors" to each compound in the MCM. This is stored in a file in /Data/, but is generated by parsing a number of MCM mechanisms and looking at the declared species in each.
# Use precursors file generated above to assign them in our df. 
MCM_df3= assign_precursors(MCM_df2, 'MCM_Name',  savepath=pth, 
                        filename='example_scrape_plus') 
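The idea of deriving precursors from the species declared in each mechanism can be sketched in plain Python. This is an illustration under assumptions, not the code that generated the file in /Data/: the invert_declared_species() helper is hypothetical, and the precursor and species names are dummy labels rather than real MCM species.

```python
from collections import defaultdict

# Dummy input: for each precursor's mechanism, the set of species it declares.
declared = {
    "PRECURSOR_1": {"SPEC_A", "SPEC_B"},
    "PRECURSOR_2": {"SPEC_B", "SPEC_C"},
}


def invert_declared_species(declared):
    """Map each species to the sorted list of precursors whose mechanism declares it."""
    precursors_of = defaultdict(list)
    for precursor in sorted(declared):
        for species in sorted(declared[precursor]):
            precursors_of[species].append(precursor)
    return dict(precursors_of)


precursors_of = invert_declared_species(declared)
```

A species declared in several mechanisms (SPEC_B here) ends up with several precursors.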

NOTE: All data in Data files are for MCM v 3.3.1 as of 10/29/2021.
