extract PubChem section's content #385

Open
francoiskroll opened this issue Nov 21, 2022 · 2 comments
Labels
enhancement New feature or enhancement

Comments

francoiskroll commented Nov 21, 2022

Thanks for an amazing package! Incredibly helpful.

I tagged this as "bug" but it might be a "Database suggestion", depends on the answer...

As an example, let's take the PubChem page for aspirin (CID=2244), section "Associated Disorders and Diseases": https://pubchem.ncbi.nlm.nih.gov/compound/2244#section=Associated-Disorders-and-Diseases

Is there any way I can extract 'useful' data from this sort of section? Namely the list of diseases here.

I tried

pc_sect(id='2244',
        section='Associated Disorders and Diseases',
        domain='compound')

It does run, but it returns:

# A tibble: 3 × 5
  CID   Name    Result               SourceName                                SourceID         
  <chr> <chr>   <chr>                <chr>                                     <chr>            
1 2244  Aspirin ctd_chemical_disease Comparative Toxicogenomics Database (CTD) D001241::Compound
2 2244  Aspirin collection=ttd_dd    Therapeutic Target Database (TTD)         D07DPI           
3 2244  Aspirin collection=ttd_dd    Therapeutic Target Database (TTD)         D0GY5Z  

which is not really what I am interested in...

Am I missing something?

Thanks!

stitam (Contributor) commented Nov 24, 2022

Many thanks @francoiskroll for raising this issue! You are not missing anything: you see this response because the pc_sect() function does not handle this data field appropriately.

PubChem pages are quite complex in terms of data structure: we can resolve some fields, but we still need to figure out others. It seems that some of the data fields we see on the website do not "live" on the page itself; the website only points to and renders data from another data source. For the same reason, figures in general are also a problem for pc_sect().

I will mark this as "enhancement" and try to allocate time to fix it, but I cannot promise this will be resolved soon. If you can come up with a solution we would be more than happy to incorporate it in the package!

In case you want to work on a solution yourself: the pc_sect() function is a convenience wrapper around two other functions which are not exported: pc_page() downloads the section of the PubChem page in a convenient format and pc_extract() attempts to further extract the data from it.
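For anyone who wants to look at those two internals, here is a minimal sketch of how non-exported functions can be fetched from a package namespace in R. The helper `get_internal()` is hypothetical (it is not part of webchem), and the sketch deliberately makes no assumption about the arguments of pc_page() or pc_extract(); it only prints whatever signature the installed version has.

```r
# Hypothetical helper (not part of webchem): fetch a function from a package
# namespace by name, whether or not it is exported. Returns NULL if the
# package is not installed or the name does not exist in its namespace.
get_internal <- function(name, pkg = "webchem") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(NULL)
  }
  ns <- asNamespace(pkg)
  if (!exists(name, envir = ns, inherits = FALSE)) {
    return(NULL)
  }
  get(name, envir = ns, inherits = FALSE)
}

# Inspect the signatures before calling anything; the arguments are not
# assumed here because internals can change between versions.
for (fn_name in c("pc_page", "pc_extract")) {
  fn <- get_internal(fn_name)
  if (is.null(fn)) {
    message(fn_name, ": not found (is webchem installed?)")
  } else {
    print(args(fn))
  }
}
```

The same lookup can be written interactively as `webchem:::pc_page`, which is handy for reading the source while experimenting.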

stitam added the enhancement label Nov 24, 2022

francoiskroll (Author) commented

Thanks a lot for your answer. Ok great, I might look into it.

Not an answer in the context of your package, but in case it's useful for another user: a solution I found, specifically for the data from the Therapeutic Target Database (TTD), is to download the data directly from them:
http://idrblab.net/web/full-data-download; Drug to disease mapping with ICD identifiers.

The file is small (a few MB) and the format is fairly simple. It uses TTD IDs though, so you will need to convert the PubChem CIDs (or whatever you use). Luckily, TTD provides the necessary data as well:
http://idrblab.net/web/full-data-download; Cross-matching ID between TTD drugs and public databases.
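A rough sketch of the two-step lookup described above (CID to TTD ID, then TTD ID to disease), in base R. It assumes both TTD files have already been parsed into data frames; the column names (`ttd_id`, `pubchem_cid`, `disease`), the function `diseases_for_cid()`, the ID `D0XXXX`, and the disease labels are all placeholders for illustration, not the real TTD layout or content. Only the two aspirin TTD IDs are taken from the pc_sect() output earlier in this thread.

```r
# Toy stand-in for the TTD cross-matching file (assumed column names).
crossmatch <- data.frame(
  ttd_id      = c("D07DPI", "D0GY5Z", "D0XXXX"),  # D0XXXX is a made-up ID
  pubchem_cid = c("2244", "2244", "9999"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the drug-to-disease mapping file (placeholder diseases).
drug_disease <- data.frame(
  ttd_id  = c("D07DPI", "D0GY5Z", "D0XXXX"),
  disease = c("placeholder disease A", "placeholder disease B",
              "placeholder disease C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the disease rows linked to one PubChem CID.
diseases_for_cid <- function(cid, crossmatch, drug_disease) {
  # Step 1: PubChem CID -> TTD drug IDs (compared as strings).
  ids <- crossmatch$ttd_id[crossmatch$pubchem_cid == as.character(cid)]
  # Step 2: TTD drug IDs -> disease rows.
  hits <- drug_disease[drug_disease$ttd_id %in% ids, , drop = FALSE]
  rownames(hits) <- NULL
  hits
}

res <- diseases_for_cid(2244, crossmatch, drug_disease)
print(res)
```

With the real files, only the parsing step and the column names should need adapting; the lookup itself stays the same.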

Happy to share more details/code if it's useful to anyone. Get in touch.
