get_wdid() searches all of wikidata, not just chemicals #238

Open
Aariq opened this issue Apr 15, 2020 · 7 comments
Labels
bug Unexpected problem or unintended behavior

Comments

@Aariq
Collaborator

Aariq commented Apr 15, 2020

Currently get_wdid() searches more than just chemicals:

 get_wdid("Horse", verbose = FALSE)
       id match distance query
1 Q869595 Horse        0 Horse

This is a problem for any query that matches both a chemical and something else, especially acronyms like DDT, which returns wdids for "Duffy's Tavern Airport" and "Dark Dance Treffen".

However, there is a note in the code that suggests it may be possible to narrow the search:

#! Use SPARQL to search of chemical compounds (P31)?! For a finer / better search?

SPARQL is used in wd_ident() and that's all I know about it!

@Aariq Aariq added the enhancement New feature or enhancement label Apr 15, 2020
@Aariq
Collaborator Author

Aariq commented Apr 15, 2020

related to #82

@andschar
Contributor

Indeed, I also saw the comment about SPARQL a while ago and started working on functions to improve the Wikidata query. I am almost done and will push a PR next week.

@Aariq
Collaborator Author

Aariq commented Apr 16, 2020

Wonderful! I'm concurrently working on a PR to standardize the input and output of all the get_() functions, and unfortunately get_wdid() is one of the functions whose code I changed the most (https://github.com/Aariq/webchem/tree/git-consistency).* Maybe take a look and see whether you'd rather I go first with my PR?

*"git" was a typo in the branch name. It's supposed to be "get-consistency".

@Aariq Aariq added bug Unexpected problem or unintended behavior and removed enhancement New feature or enhancement labels Apr 16, 2020
@andschar
Contributor

Yes, go ahead. Once your PR is merged, I will change the code within the function, leaving the standardized structure intact.

@Aariq
Collaborator Author

Aariq commented Apr 28, 2020

PR #242 is now merged

@andschar
Contributor

andschar commented Apr 29, 2020

Great! I will file a PR this or next week as suggested above.

@jvfe
Contributor

jvfe commented Oct 2, 2020

Hi @andschar, how's the work on this coming along? As a Wikidata editor, I think I could help out a bit with this one, if it's not solved yet.

I mostly wanted to chime in to say that searching by item name with "standard" SPARQL is not particularly efficient and would probably time out a lot, see this for reference.

That being said, there is a workaround which uses a mashup of SPARQL and the MediaWiki API, for example:

SELECT ?item ?itemLabel WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
      wikibase:api "EntitySearch";
      mwapi:search "pyridine";
      mwapi:language "en".
    ?item wikibase:apiOutputItem mwapi:item.
  }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
  }
  ?item wdt:P31 wd:Q11173 # Guarantees items are 'instances of' a chemical compound
}

Results for this query

The query above searches all item names and aliases for the string "pyridine" and keeps only results that are "instances of" (P31) "chemical compound" (Q11173), which should filter out unwanted matches such as the ones reported for "Horse" and "DDT".
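
To make the idea concrete, here is a rough, language-agnostic sketch (written in Python rather than R, purely for illustration) of how the query above could be parameterised by search term and how the returned JSON could be reduced to Wikidata IDs. The function names (`build_query`, `parse_results`, `search_chemicals`) and the User-Agent string are made up for this example and are not part of webchem. The endpoint URL is the public Wikidata Query Service, and the parser assumes the standard SPARQL JSON results layout. Treat it as a starting point under those assumptions, not as the eventual implementation.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Same query as above, with the search term and language as placeholders.
# Doubled braces are literal braces for str.format().
QUERY_TEMPLATE = """\
SELECT ?item ?itemLabel WHERE {{
  SERVICE wikibase:mwapi {{
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
      wikibase:api "EntitySearch";
      mwapi:search "{term}";
      mwapi:language "{lang}".
    ?item wikibase:apiOutputItem mwapi:item.
  }}
  SERVICE wikibase:label {{
    bd:serviceParam wikibase:language "{lang}".
  }}
  ?item wdt:P31 wd:Q11173 .
}}
"""


def build_query(term, lang="en"):
    """Return the SPARQL query text for one search term (hypothetical helper)."""
    # Escape backslashes and double quotes so the term stays inside the literal.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return QUERY_TEMPLATE.format(term=escaped, lang=lang)


def parse_results(payload):
    """Turn a SPARQL JSON result document into a list of (wdid, label) tuples."""
    rows = []
    for binding in payload.get("results", {}).get("bindings", []):
        uri = binding["item"]["value"]        # e.g. http://www.wikidata.org/entity/Q...
        wdid = uri.rsplit("/", 1)[-1]         # keep only the trailing Q-identifier
        label = binding.get("itemLabel", {}).get("value", "")
        rows.append((wdid, label))
    return rows


def search_chemicals(term, lang="en", timeout=30):
    """Send the query to the endpoint and return parsed rows (needs network access)."""
    url = WDQS_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": build_query(term, lang), "format": "json"}
    )
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            # Placeholder User-Agent; a real client should identify itself properly.
            "User-Agent": "wdid-sketch/0.1 (example only)",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_results(json.load(response))
```

Calling `search_chemicals("DDT")` would then return only items typed as chemical compounds. One caveat worth checking before relying on this: compounds that are not direct instances of Q11173 would be dropped by the P31 filter.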


3 participants