Better detection test for whether a file is ingested #113

Closed
kuriwaki opened this issue Jan 7, 2022 · 7 comments · Fixed by #114

@kuriwaki

kuriwaki commented Jan 7, 2022

The current method for detecting whether a file is ingested (is_ingested, introduced in v0.3.0) is problematic: it only checks whether a metadata file is associated with the fileid. But I guess some files, e.g. those with ingestion warnings, don't have a metadata file. This can lead to the wrong download format, as in #80.

If I have a dataset id or name, I now know how to check whether something is ingested: check if the entry originalFileFormat exists (e.g. this JSON).
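A minimal sketch of that check, assuming the dataset JSON has already been parsed into a list of file entries that each carry a dataFile element. The helper name is_ingested_from_files and the mock entries are made up for illustration; they are not part of the dataverse package or real API output.

```r
# Hypothetical helper: decide whether a file is ingested by looking for
# `originalFileFormat` in its entry within the parent dataset's file list.
# `files` is assumed to be a list of entries, each with a `dataFile` element.
is_ingested_from_files <- function(files, fileid) {
  for (f in files) {
    if (identical(as.integer(f$dataFile$id), as.integer(fileid))) {
      # Ingested files are assumed to record the format they were uploaded in
      return(!is.null(f$dataFile$originalFileFormat))
    }
  }
  NA  # the file id does not appear in this dataset
}

# Mock entries, made up for illustration only
files <- list(
  list(dataFile = list(id = 1L, filename = "survey.tab",
                       originalFileFormat = "application/x-stata")),
  list(dataFile = list(id = 2L, filename = "codebook.pdf"))
)

is_ingested_from_files(files, 1)  # entry has originalFileFormat
is_ingested_from_files(files, 2)  # entry lacks originalFileFormat
```

Returning NA for an unknown id keeps "not ingested" distinct from "not in this dataset".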

However, at certain stages in the client I don't have a dataset identifier, only the numeric fileid and the server. This happens, for example, with get_*_by_doi, where the user only provides a file DOI. @landreev pointed out that the Dataverse api/files API apparently does not return info like originalFileFormat, perhaps for legacy reasons.

For now, what is the best way to access the parent dataset JSON with only the numeric file id in hand? (@pdurbin ?). In the above example, how would I obtain the dataset id doi:10.70122/FK2/PPIAXE knowing only the file id = 1734017 and server = demo.dataverse.org?

kuriwaki changed the title from "Better method for is_ingest" to "Better detection test for whether a file is ingested" on Jan 7, 2022
@pdurbin

pdurbin commented Jan 7, 2022

For now, what is the best way to access the parent dataset JSON with only the numeric file id in hand?

My first thought is to get https://demo.dataverse.org/api/search?q=fileId:1734017 and find a dataset_persistent_id of "doi:10.70122/FK2/PPIAXE".
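A rough sketch of the second half of that idea, pulling dataset_persistent_id out of an already parsed search response. The helper parent_dataset_id is hypothetical, and the mock response only assumes a data$items layout with the fields named here; the ids are the ones from this thread.

```r
# Hypothetical helper (not part of the dataverse package): return the parent
# dataset's persistent id from a parsed Search API response.
parent_dataset_id <- function(res) {
  items <- res$data$items
  if (length(items) == 0) {
    return(NA_character_)  # nothing matched the query
  }
  items[[1]]$dataset_persistent_id
}

# Mock of a parsed response, roughly what
# jsonlite::fromJSON(url, simplifyVector = FALSE) might give back.
# The field layout is an assumption.
res <- list(
  status = "OK",
  data = list(
    total_count = 1L,
    items = list(
      list(type = "file", file_id = "1734017",
           dataset_persistent_id = "doi:10.70122/FK2/PPIAXE")
    )
  )
)

parent_dataset_id(res)
```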

@kuriwaki

kuriwaki commented Jan 8, 2022

@pdurbin this looks promising. Our function dataverse_search() could possibly mimic this. But when I tried searching for fileId=3123547, which I expected to be this CCES file, I got something completely different: https://dataverse.harvard.edu/api/search?q=fileId:3123547. Do you know why this occurs, and how to fix the query so I get the CCES file instead?

Here is the query confirming that at least the id of the CCES file of interest is 3123547.
https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.7910/DVN/GDF6Z0

Even though this is a search, for the purpose of this issue I need it to be a strict match on the file id. That is, it should return a single entry if the file id exists and 0 results if it does not.
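One way to get that behavior on the client side, whatever the search returns, is to filter the items afterwards. This is only a sketch: strict_file_match is a hypothetical helper, the file_id and type fields are assumed to be present on each item, and the mock file names are invented.

```r
# Hypothetical helper: keep a search hit only if it is a file whose id
# matches exactly; return NULL for zero or ambiguous matches.
strict_file_match <- function(items, fileid) {
  want <- sprintf("%d", as.integer(fileid))
  hits <- Filter(function(it) {
    identical(it$type, "file") && identical(as.character(it$file_id), want)
  }, items)
  if (length(hits) == 1) hits[[1]] else NULL
}

# Mock search items (names invented for illustration)
items <- list(
  list(type = "file", file_id = "3123547", name = "wanted.tab"),
  list(type = "file", file_id = "999", name = "unrelated.tab")
)

strict_file_match(items, 3123547)$name  # the single exact match
strict_file_match(items, 42)            # no match, so NULL
```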

@kuriwaki

kuriwaki commented Jan 8, 2022

Add-on Re:

For now, what is the best way to access the parent dataset JSON with only the numeric file id in hand?

We would also need a method that can get the dataset JSON with only the file DOI (persistentId) in hand, for use in get_*_by_doi. Using the same example dataset, doi:10.70122/FK2/PPIAXE, we'd want to recover the dataset id knowing only persistentId=doi:10.70122/FK2/PPIAXE/MHDB0O and the server.

@pdurbin

pdurbin commented Jan 12, 2022

@kuriwaki huh. fileId works on the demo server but not on Harvard Dataverse or my dev laptop. Can you please try id:datafile_NNN instead, like the example below?

https://dataverse.harvard.edu/api/search?q=id:datafile_3123547
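For reference, a small sketch of building that URL from a numeric file id and a server name. datafile_search_url is a hypothetical helper, not a function in the dataverse package.

```r
# Hypothetical helper: build the id:datafile_NNN search URL.
# as.integer() plus %d avoids scientific notation for large ids.
datafile_search_url <- function(fileid, server) {
  sprintf("https://%s/api/search?q=id:datafile_%d", server, as.integer(fileid))
}

datafile_search_url(3123547, "dataverse.harvard.edu")
# "https://dataverse.harvard.edu/api/search?q=id:datafile_3123547"
```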

pdurbin self-assigned this on Jan 12, 2022
@pdurbin

pdurbin commented Jan 12, 2022

I don't think that "MHDB0O" file is indexed. https://dataverse.harvard.edu/api/search?q=id:datafile_1734017 should find it but it doesn't. Can you please open an issue in https://github.com/IQSS/dataverse.harvard.edu/issues about this?

For a file that is properly indexed, like the CCES file we've been talking about ( https://dataverse.harvard.edu/api/search?q=id:datafile_3123547 ), you should be able to search for it by DOI like this (note the quotes around the DOI): https://dataverse.harvard.edu/api/search?q=filePersistentId:%22doi:10.7910/DVN/GDF6Z0/JPMOZZ%22
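The quoting can be sketched like this: the DOI is wrapped in double quotes, written as %22 in the URL. file_doi_search_url is a hypothetical helper name, and the sketch assumes the DOI itself needs no further escaping.

```r
# Hypothetical helper: build a filePersistentId search URL with the DOI
# wrapped in URL-encoded double quotes (%22). "%%" is a literal "%" in sprintf.
file_doi_search_url <- function(file_doi, server) {
  sprintf("https://%s/api/search?q=filePersistentId:%%22%s%%22", server, file_doi)
}

file_doi_search_url("doi:10.7910/DVN/GDF6Z0/JPMOZZ", "dataverse.harvard.edu")
# "https://dataverse.harvard.edu/api/search?q=filePersistentId:%22doi:10.7910/DVN/GDF6Z0/JPMOZZ%22"
```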

@kuriwaki

kuriwaki commented Jan 12, 2022

(for numeric ids)

id:datafile_NNN

This is great. The following three examples work as intended: each returns a single entry. I will try implementing it on dev.

library(dataverse)

# rds
dataverse_search(id = "datafile_1734017", server = "demo.dataverse.org", type = "file")$name

# CCES problematic dta
dataverse_search(id = "datafile_3123547", server = "dataverse.harvard.edu", type = "file")$name

# other dataverse
dataverse_search(id = "datafile_204446", server = "dataverse.nl", type = "file")$name

@kuriwaki

I don't think that "MHDB0O" file is indexed.

That actually came from the demo dataverse, not Harvard dataverse. This one works great: https://demo.dataverse.org/api/search?q=id:datafile_1734017

For a file that is properly indexed, like the CCES file we've been talking about, you should be able to search for it by DOI like this (note the quotes around the DOI)

Thank you. This seems to work in the two examples below, with the quotes escaped:

# CCES
dataverse_search(filePersistentId = "\"doi:10.7910/DVN/GDF6Z0/JPMOZZ\"", server = "dataverse.harvard.edu")$name

# demo.dataverse
dataverse_search(filePersistentId = "\"doi:10.70122/FK2/HXJVJU/SA3Z2V\"", server = "demo.dataverse.org")$name
