This repository has been archived by the owner on May 10, 2022. It is now read-only.

Default behaviors for "pit of good practice" of provenance and reproducibility #3

Open
noamross opened this issue Jan 4, 2018 · 0 comments


noamross commented Jan 4, 2018

While a single command to read in data is appealing, it is also true that (a) the form of data may vary a great deal, and (b) people often want to have the downloaded file on hand for other reasons. read_csv(doidata()) or doidata() %>% read_csv() are still fairly minimal and intuitive. Drawing ideas from fulltext::ft_get_si(), here's a scheme for default behavior that I think is still intuitive and drives users towards best practice in maintaining data provenance and credit, while keeping the downloaded files accessible:

  • doidata("doi/filename") downloads data and always returns a file path to the downloaded data
  • default download location (destfile) is WORKDIR/data/doi_10.123_figshare123/filename. People like to inspect data and use it for purposes other than a single script, so hiding it away in some cache doesn't make sense.
  • doidata maintains an internal database of DOIs, file paths, and file hashes. If there is already a file at the location with the same name, it checks against the hash and just returns the path if they match.
  • If the hashes do not match, it returns an error unless overwrite=TRUE
  • The returned path carries citation and version information as attributes, which are used to print an informative message with a citation if verbose=TRUE (default)
  • The internal database also keeps the citation/version/origin information, so doidata_cite(file) can return the citation/version information of a file previously downloaded by checking the file hash. doidata_cite(url), doidata_cite(doi), or doidata_cite(doi/filename) all work, too, and work offline for previously-downloaded data.
  • doidata_url("doi/filename") returns the download url for the data
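
To make the scheme above concrete, here is a rough sketch of the path, hash-check, and citation logic in base R. Everything in it is hypothetical: doidata_sketch(), doidata_cite_sketch(), the in-memory registry, and the injected fetch function are stand-ins I made up for illustration, not an existing API. The actual download is passed in as a function so the logic can be tried offline, and MD5 via tools::md5sum() is just a convenient base-R hash, not a recommendation.

```r
# Hypothetical sketch of the proposed defaults; not an actual implementation.
registry <- new.env()  # stand-in for the internal database, keyed by file path

doidata_sketch <- function(doi_file, fetch, workdir = ".",
                           overwrite = FALSE, verbose = TRUE) {
  doi      <- dirname(doi_file)    # "10.123/figshare123"
  filename <- basename(doi_file)   # "data.csv"
  dir_name <- paste0("doi_", gsub("/", "_", doi, fixed = TRUE))
  destfile <- file.path(workdir, "data", dir_name, filename)

  if (file.exists(destfile)) {
    recorded <- registry[[destfile]]
    current  <- unname(tools::md5sum(destfile))
    if (!is.null(recorded) && identical(recorded$hash, current)) {
      # File on disk matches the recorded hash: return the path, no download
      if (verbose) message("Using existing file. Please cite: ", recorded$citation)
      return(structure(destfile, doi = doi, citation = recorded$citation))
    }
    if (!overwrite) {
      stop("File at ", destfile, " does not match the recorded hash; ",
           "use overwrite = TRUE to replace it.")
    }
  }

  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
  # `fetch` is assumed to write the file to destfile and return a citation string
  citation <- fetch(doi_file, destfile)
  registry[[destfile]] <- list(doi = doi, citation = citation,
                               hash = unname(tools::md5sum(destfile)))
  if (verbose) message("Downloaded ", doi_file, ". Please cite: ", citation)
  structure(destfile, doi = doi, citation = citation)
}

# Look up a citation offline by hashing the file and searching the registry
doidata_cite_sketch <- function(file) {
  h <- unname(tools::md5sum(file))
  for (key in ls(registry, all.names = TRUE)) {
    if (identical(registry[[key]]$hash, h)) return(registry[[key]]$citation)
  }
  NULL
}
```

A real version would need the registry to persist across sessions (a file under WORKDIR/data, say), which this sketch leaves out.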

Another question is what the appropriate behavior should be for versioned data. For instance, Zenodo and Figshare have a DOI that always points to the latest version of the data, plus a separate DOI for each version. One possibility:

  • When a latest-version DOI is provided, print a message/warning that includes the fixed-version DOI and suggests the user use the versioned DOI for reproducibility.
  • Store the fixed-version DOI in the internal database
  • When the latest-version DOI no longer resolves to the fixed version stored in the internal database, doidata() will error, unless overwrite=TRUE.
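
The version check could look roughly like the sketch below. check_version_sketch() and the resolve argument are hypothetical names: resolve stands in for whatever lookup turns a latest-version DOI into the fixed-version DOI it currently points to, and the pinned list stands in for the internal database. The DOIs in the usage lines are made-up placeholders.

```r
# Hypothetical sketch of the versioned-DOI behavior; not an actual implementation.
# `resolve` is an assumed function: latest-version DOI -> current fixed-version DOI.
# `pinned` is a named list mapping the DOIs the user supplied to fixed-version DOIs.
check_version_sketch <- function(doi, resolve, pinned = list(), overwrite = FALSE) {
  fixed <- resolve(doi)

  if (!identical(fixed, doi)) {
    message("'", doi, "' always points to the latest version. ",
            "For reproducibility, consider using the versioned DOI '", fixed, "'.")
  }

  previous <- pinned[[doi]]
  if (!is.null(previous) && !identical(previous, fixed) && !overwrite) {
    stop("'", doi, "' now resolves to '", fixed, "' but '", previous,
         "' was recorded earlier; use overwrite = TRUE to update.")
  }

  pinned[[doi]] <- fixed  # store the fixed-version DOI
  pinned                  # return the updated record
}

# Usage with a made-up resolver and placeholder DOIs
resolve_v1 <- function(doi) "10.0000/example.v1"
db <- check_version_sketch("10.0000/example", resolve_v1)
```

Returning the updated list keeps the sketch free of side effects; in practice the record would live in the same internal database as the file hashes.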