extract PubChem section's content #385

francoiskroll · 2022-11-21T20:23:22Z

Thanks for an amazing package! Incredibly helpful.

I tagged this as "bug" but it might be a "Database suggestion", depends on the answer...

As an example, let's take the PubChem page for aspirin (CID=2244), section "Associated Disorders and Diseases": https://pubchem.ncbi.nlm.nih.gov/compound/2244#section=Associated-Disorders-and-Diseases

Is there any way I can extract 'useful' data from this sort of section? Namely the list of diseases here.

I tried

pc_sect(id='2244',
        section='Associated Disorders and Diseases',
        domain='compound')

It does run, but it returns:

# A tibble: 3 × 5
  CID   Name    Result               SourceName                                SourceID         
  <chr> <chr>   <chr>                <chr>                                     <chr>            
1 2244  Aspirin ctd_chemical_disease Comparative Toxicogenomics Database (CTD) D001241::Compound
2 2244  Aspirin collection=ttd_dd    Therapeutic Target Database (TTD)         D07DPI           
3 2244  Aspirin collection=ttd_dd    Therapeutic Target Database (TTD)         D0GY5Z

which is not really what I am interested in...

Am I missing something?

Thanks!

stitam · 2022-11-24T07:55:26Z

Many thanks @francoiskroll for raising this issue! You are not missing anything: you see this response because this data field is not handled appropriately by the pc_sect() function.

PubChem pages are quite complex in terms of data structure: we can resolve some fields, but we still need to figure out others. It seems some of the data fields we see on the website do not "live" on the webpage; the website only points to and renders data from another data source. Because of this, figures in general, for example, are also a problem for pc_sect().

I will mark this as "enhancement" and try to allocate time to fix it, but I cannot promise this will be resolved soon. If you can come up with a solution we would be more than happy to incorporate it in the package!

In case you want to work on a solution yourself: the pc_sect() function is a convenience wrapper around two other functions which are not exported: pc_page() downloads the section of the PubChem page in a convenient format and pc_extract() attempts to further extract the data from it.

francoiskroll · 2022-11-24T11:31:01Z

Thanks a lot for your answer. Ok great, I might look into it.

Not an answer in the context of your package, but in case it's useful for another user: a solution I found specifically for the data from the Therapeutic Target Database (TTD) is to download it directly from them:
http://idrblab.net/web/full-data-download; Drug to disease mapping with ICD identifiers.

The file is small (a few MB) and the format is fairly simple. It uses TTD IDs though, so you will need to convert the PubChem CIDs (or whatever you use). Luckily, TTD provides the necessary data as well:
http://idrblab.net/web/full-data-download; Cross-matching ID between TTD drugs and public databases.
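
The two-step lookup described above (PubChem CID to TTD ID, then TTD ID to disease) can be sketched in base R roughly as follows. This is only an illustration of the join, not the actual TTD file format: the data frames, the column names (ttd_id, pubchem_cid, disease) and the placeholder disease names are all assumptions, and the real download files would first need to be parsed into a similar shape. Only the CID 2244 and the TTD IDs D07DPI and D0GY5Z come from the output above.

```r
# Toy stand-ins for the two TTD downloads. All column names and the
# placeholder rows are hypothetical; they only show the shape of the join.
crossmatch <- data.frame(
  ttd_id      = c("D07DPI", "D0GY5Z", "D0XXXX"),
  pubchem_cid = c("2244", "2244", "9999"),
  stringsAsFactors = FALSE
)
drug_disease <- data.frame(
  ttd_id  = c("D07DPI", "D07DPI", "D0GY5Z", "D0XXXX"),
  disease = c("disease A", "disease B", "disease A", "disease C"),
  stringsAsFactors = FALSE
)

# Step 1: find the TTD IDs that correspond to the PubChem CID of interest.
cid <- "2244"
ttd_ids <- crossmatch$ttd_id[crossmatch$pubchem_cid == cid]

# Step 2: join the matching cross-reference rows to the disease mapping.
hits <- merge(
  crossmatch[crossmatch$ttd_id %in% ttd_ids, ],
  drug_disease,
  by = "ttd_id"
)

# Step 3: collapse to a unique, sorted list of diseases for the compound.
diseases <- sort(unique(hits$disease))
print(diseases)
```

Because one CID can map to several TTD IDs (as in the pc_sect() output above), the join is done on all matching IDs and the result is de-duplicated at the end.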

Happy to share more details/code if it's useful to anyone. Get in touch.

stitam added the enhancement label Nov 24, 2022
