submission: piggyback #220

Closed
14 of 19 tasks
cboettig opened this issue May 30, 2018 · 48 comments

@cboettig
Member

Summary

  • Allow large and binary data files to "piggyback" on top of your existing repositories: push and pull large-ish (< 2 GB) data files to and from GitHub repositories as attachments to a GitHub release;

  • Paste the full DESCRIPTION file inside a code block below:

Package: piggyback
Version: 0.0.0.9000
Title: Managing Larger Data on a GitHub Repository
Description: Because larger (> 50 MB) data files cannot easily be committed to git,
  a different approach is required to manage data associated with an analysis in a
  GitHub repository.  This package provides a simple work-around by allowing larger
  (up to 2 GB) data files to piggyback on a repository as assets attached to individual
  GitHub releases.  These files are not handled by git in any way, but instead are
  uploaded, downloaded, or edited directly by calls through the GitHub API. These
  data files can be versioned manually by creating different releases.  This approach
  works equally well with public or private repositories.  Data can be uploaded
  and downloaded programmatically from scripts. No authentication is required to
  download data from public repositories.
Authors@R: person("Carl", "Boettiger",
                  email = "cboettig@gmail.com",
                  role = c("aut", "cre", "cph"),
                  comment=c(ORCID = "0000-0002-1642-628X"))
URL: https://github.com/cboettig/piggyback
BugReports: https://github.com/cboettig/piggyback/issues
License: GPL-3
Encoding: UTF-8
LazyData: true
ByteCompile: true
Imports:
    gh,
    httr,
    jsonlite,
    git2r,
    fs,
    usethis,
    crayon,
    clisymbols
Suggests:
    readr,
    covr,
    testthat,
    datasets,
    knitr,
    rmarkdown
VignetteBuilder: knitr
RoxygenNote: 6.0.1.9000
Roxygen: list(markdown = TRUE)

  • URL for the package (the development repository, not a stylized html page):

https://github.com/cboettig/piggyback

  • Please indicate which category or categories from our package fit policies this package falls under and why (e.g., data retrieval, reproducibility). If you are unsure, we suggest you make a pre-submission inquiry:

reproducibility, because accessing the data being analyzed is essential for reproducible workflows, and yet we have no good solution for workflows with unpublished data or private workflows once the data is too large for version control (e.g. files > 50 MB).

  • Who is the target audience and what are scientific applications of this package?

The target audience is anyone working with data files on GitHub.

datastorr on ropenscilabs is the closest match, which takes a very different approach (from the user perspective -- on the back end both store data as GitHub assets) to essentially the same problem. The Intro vignette discusses at greater length many of the alternative possible strategies and why I feel they have all fallen short of my needs, which led me to create this package.

  • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov.
  • I agree to abide by ROpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package has an obvious research application according to JOSS's definition.
    • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)
  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:
    • The package is novel and will be of interest to the broad readership of the journal.
    • The manuscript describing the package is no longer than 3000 words.
    • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
    • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
    • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
    • (Please do not submit your package separately to Methods in Ecology and Evolution)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

No errors, notes, or warnings.

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

Rich FitzJohn, @richfitz, would be great based on his experience in this area and with datastorr. Jenny Bryan, @jennybc, since this package makes heavy use of usethis and GitHub interactions.

@annakrystalli
Contributor

annakrystalli commented Jun 11, 2018

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Hi @cboettig 👋, thanks for the submission and great to be working with another of your packages. I believe I remember you talking about this problem. A nice solution to it!

Anyways, see below the result of my initial checks. I'm sure it's nothing that can't be fixed so I'm going to go ahead and approach reviewers. I'll take your suggestions onboard.

One small suggestion on the README: for storing the GITHUB_PAT system variable, pointing users to usethis::edit_r_environ() and storing it there would also be nice.
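
For example (a minimal sketch; the token value is just a placeholder):

usethis::edit_r_environ()
# then add a line such as
#   GITHUB_PAT=xxxxxxxxxxxxxxxxxxxx
# and restart R so the variable is picked up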

checks

OK apart from 1 failing test.

tests

  • One test failing for me. Here's the relevant output: Any ideas?
Testing piggyback
✔ | OK F W S | Context
✖ |  2 1   1 | Requiring Authentication [1.5 s]
──────────────────────────────────────────────────────────────────────────
test-with-auth.R:10: error: We can upload data
GitHub API error (404): 404 Not Found
  Not Found
1: pb_upload(repo = "cboettig/piggyback", file = "iris.tsv.gz", tag = "v0.0.1",
       overwrite = TRUE) at /Users/Anna/Documents/workflows/rOpenSci/editorials/piggyback/tests/testthat/test-with-auth.R:10
2: gh("DELETE /repos/:owner/:repo/releases/assets/:id", owner = x$owner, repo = x$repo,
       id = ids[i], .token = .token) at /Users/Anna/Documents/workflows/rOpenSci/editorials/piggyback/R/gh.R:157
3: gh_process_response(raw)

══ Results ═══════════════════════════════════════════════════════════════
Duration: 9.2 s

OK:       15
Failed:   1
Warnings: 0
Skipped:  1

I believe in you!

spell check

flagging potential typos:

  WORD               FOUND IN
commitish          pb_new_release.Rd:17
delimeters         pb_list.Rd:26
gess               pb_delete.Rd:10, pb_upload.Rd:13
prerelease         pb_new_release.Rd:32
transfered         pb_pull.Rd:27

coverage

Unable to check because of test failure.

@annakrystalli
Contributor

Could you also add the review badge to the README? 👍

[![](https://badges.ropensci.org/220_status.svg)](https://github.com/ropensci/onboarding/issues/220)

@cboettig
Member Author

cboettig commented Jun 11, 2018

@annakrystalli Thanks!

  • I just learned about devtools::spell_check(), very cool. Spelling errors fixed.
  • review badge added
  • I added a mention of usethis::edit_r_environ() to README
  • That test should be getting skipped if you're running locally, because you don't have my GitHub token... I'm guessing you have your own GITHUB_TOKEN or GITHUB_PAT already set? Can you try running checks without a .Renviron file that sets either env var?

I'm not sure how to change the skip command to handle that case; currently it just skips the authentication-requiring tests if the token is undefined ("") (i.e. so it will skip on CRAN); Travis runs these tests since I can give Travis an encrypted token. Note the coverage will be lower than what Codecov shows if you test locally due to these skips.

@annakrystalli
Contributor

annakrystalli commented Jun 12, 2018

Aha 🤔, not sure how feasible it is but it feels like the only people that test should run for are the repo owner and any collaborators. So perhaps the test condition check could include a check that the gh user matches one of owner or collaborator?

Also, all tests passing after disabling GITHUB_PAT 👌

@annakrystalli
Contributor

annakrystalli commented Jun 12, 2018

Also, current coverage,

piggyback Coverage: 0.36%
R/push-pull.R: 0.00%
R/utils.R: 0.00%
R/gh.R: 0.77%

As you mentioned, some tests didn't run. What sort of coverage are you getting?

@annakrystalli
Contributor

Sorry just found link to codecov. Ignore comment above.

@richfitz
Member

I'm not sure how to change the skip command to handle that case; currently it just skips the authentication-requiring tests if the token is undefined ("") (i.e. so it will skip on CRAN; travis runs these tests since I can give travis an encrypted token

@cboettig - I use "opt-in" environment variables for tests like this, something like <PACKAGENAME>_USE_<FEATURE>, set in my .Renviron and .travis.yml; then the testthat helper functions have an equivalent skip_if_not_<feature> function that skips the test if the variable is not set to true.
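
A minimal sketch of that pattern (the PIGGYBACK_USE_AUTH variable and the helper name here are just hypothetical examples):

# tests/testthat/helper.R
skip_if_no_auth <- function() {
  if (!identical(Sys.getenv("PIGGYBACK_USE_AUTH"), "true")) {
    testthat::skip("PIGGYBACK_USE_AUTH is not set to 'true'; skipping authenticated tests")
  }
}

# then at the top of any test that needs the owner's token:
# test_that("We can upload data", { skip_if_no_auth(); ... })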

@mpadge
Member

mpadge commented Jun 12, 2018

Or universal local ones: I just set MPADGE_LOCAL=true. If I've got that, then GITHUB_PAT must be mine.

@cboettig
Member Author

Thanks all, that's brilliant. And after a few boneheaded attempts I managed to pass the secure + public env vars to travis correctly this time.

@coatless

coatless commented Jun 13, 2018

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
For packages co-submitting to JOSS

The package contains a paper.md matching JOSS's requirements with:

  • A short summary describing the high-level functionality of the software
  • Authors: A list of authors with their affiliations
  • A statement of need clearly stating problems the software is designed to solve and its target audience.
  • References: with DOIs for all those that have one (e.g. papers, datasets, software).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 1.5hrs


Review Comments

Environment Variable Access

The environment variable piggyback uses for the Personal Access Token (PAT), GITHUB_TOKEN, is not the name under which the PAT is commonly stored across development packages. Generally, the PAT environment variable is GITHUB_PAT. Thus, an additional system environment check is warranted.

Note that gh and devtools both prefer GITHUB_PAT.

https://github.com/r-lib/gh/blob/8677f7d287ef76c33f33831ea09a6df89b31f88e/R/gh_request.R#L143-L146

https://github.com/r-lib/devtools/blob/26c507b128fdaa1911503348fedcf20d2dd30a1d/R/github.R#L86-L106

I think it would also be advantageous to mimic the github_pat function from devtools in place of get_token():

https://github.com/cboettig/piggyback/blob/5c92c8753eda0c50f721ee2b39d0f03c8bc24b80/R/gh.R#L281-L283
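
Something along these lines could work (a sketch only; pb_token is a hypothetical name, not piggyback's actual internal function):

# Prefer GITHUB_PAT, fall back to GITHUB_TOKEN
pb_token <- function(quiet = TRUE) {
  pat <- Sys.getenv("GITHUB_PAT")
  if (identical(pat, "")) pat <- Sys.getenv("GITHUB_TOKEN")
  if (!quiet && !identical(pat, "")) message("Using GitHub PAT from environment variable")
  invisible(pat)
}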

Package API

The API for the package dances back'n'forth between treating the repository as the data and the data as a file. I can understand the use-cases for both; but, would it be better practice to emphasize the repository as the data? This would simplify the interface to mimic pre-existing APIs, e.g. devtools::install_* and download.file(url, dest). Alternatively, consider making the API more verbose, e.g. pb_download_repo(repo) or pb_download_repo_file(repo, file).

Package API:

  library("piggyback")
  lsf.str("package:piggyback")
#> pb_delete : function (repo = guess_repo(), tag = "latest", file, .token = get_token())  
#> pb_download : function (file = NULL, dest = ".", repo = guess_repo(), tag = "latest", 
#>     overwrite = TRUE, ignore = "manifest.json")  
#> pb_download_url : function (file = NULL, repo = guess_repo(), tag = "latest")  
#> pb_list : function (repo = guess_repo(), tag = "latest", ignore = "manifest.json")  
#> pb_new_release : function (repo = guess_repo(), tag, commit = "master", name = tag, 
#>     body = "Data release", draft = FALSE, prerelease = FALSE, .token = get_token())  
#> pb_pull : function (repo = guess_repo(), tag = "latest", overwrite = TRUE, manifest = ".manifest.json")  
#> pb_push : function (repo = guess_repo(), tag = "latest", overwrite = TRUE, manifest = ".manifest.json")  
#> pb_track : function (glob)  
#> pb_upload : function (file, repo = guess_repo(), tag = "latest", name = NULL, 
#>     overwrite = FALSE, .token = get_token())

Function Parameters

Another notable issue is the placement of a required positional parameter after parameters with defaults.

e.g. file in pb_delete

pb_delete(repo = guess_repo(), tag = "latest", file, .token = get_token())

Consider putting file first or indicating missingness with file = NA.
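
e.g. a reordered signature could look like (sketch only):

pb_delete(file, repo = guess_repo(), tag = "latest", .token = get_token())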

Function Exports

It may be beneficial to also export and document guess_repo() as that is prominently featured in function signatures.

https://github.com/cboettig/piggyback/blob/5c92c8753eda0c50f721ee2b39d0f03c8bc24b80/R/push-pull.R#L261-L268

Documentation

Outside of that, it seems like there are a few artifacts of an older API in the documentation examples, and some exported functions are missing documentation:

Missing ex:

Outdated:

@annakrystalli
Contributor

Wow! This must be some sort of record @coatless! 👏 Thanks so much for such a super speedy response!

@cboettig I'm still looking for the second reviewer. Feel free to wait and respond to both reviews together, or respond to @coatless anytime before.

@coatless

@annakrystalli apologies for jumping the gun. I had a use case in mind when I saw the package. So, I just quickly took down a few notes.

@annakrystalli
Contributor

No worries at all @coatless! I'm not going to complain about having your review in! In fact, having a relevant use case handy often makes for a really useful review.

@cboettig
Member Author

@coatless thanks, this is great! How'd your use case go?

I've fixed the missing and outdated examples. I've made most of the examples into donttest and made sure they run successfully with devtools::check(cran = FALSE, run_dont_test = TRUE), though that requires my PAT since the tests upload data to cboettig/piggyback. Not sure if there's a better strategy for documentation than that -- obviously this means copy-pasting examples won't work, but examples need a repo they can upload to. Ideas / suggestions welcome.

Re the "file" vs "repo" interface, that's very intentional -- but perhaps I need to document it better. If I'm just trying to debug an example on some data, I'd rather use a file-based interface to quickly share the file over, say, a generic "scratch" repository (regardless of my current working repo, or if I'm not "in" any repo). Locally I can do:

pb_upload("mtcars.tsv.gz", repo  = "cboettig/scratch")
pb_download_url("mtcars.tsv.gz", repo  = "cboettig/scratch")

Then I can post a question like: "Hey Jim, why does this fail:"

read.table("https://github.com/cboettig/scratch/releases/download/data/mtcars.tsv.gz")

This needs the more file-based workflow of upload/download/download_url/list/delete. Other times, I'm working in a particular repository with data connected to that project, and I really want the more "git-lfs" style workflow with the ability to track, say, all *.tif or *.tsv.gz files. Does that make sense? Can I do a better job of explaining that somewhere?

Re GITHUB_PAT -- as you probably saw, piggyback will recognize that name too. But I do think that choice was one of Hadley's rare mistakes: GITHUB_TOKEN appears in some 86K code files on GitHub vs only 2K that use GITHUB_PAT, and the gh package also recognizes both. I probably should do something like devtools does to provide a default public token though. I would import the devtools method directly, but I'm hesitant to do so while the devtools diaspora still seems underway; not sure where that routine will end up. I've tried instead to build on usethis routines here whenever possible. Perhaps @jennybc has insight here?

Re exporting guess_repo(), I agree that's a useful pattern, I wish I could import it from usethis or git2r myself!, but I think I should keep it internal precisely because it's quite general -- I really wouldn't want people introducing a revdep on piggyback just for that function, and I don't want to guarantee that method is robust and general, rather than an internal method whose behavior might change.

@jennybc
Member

jennybc commented Jun 15, 2018

Re: GitHub PAT

FWIW in usethis I have basically inlined gh::gh_token() (which is now exported by dev gh):

https://github.com/r-lib/usethis/blob/3cc50ae7c4877d5f10747d0e8f25469897206cfe/R/github.R#L261-L266

It's not great but I don't know what else to do.

Re: guess_repo()

FWIW the tidyverse/r-lib team has Git remote URL parsing all over the place, which we would dearly love to centralize. We plan to make a mini-package for this (https://github.com/r-lib/ghurl/issues). It seems incredibly specific, but actually many other languages have such a thing. It comes up a lot! I think you could use your own function for now, but not encourage external use, and hopefully we'll make this other thing available to ourselves and all.

@cboettig
Member Author

Thanks @jennybc, this is great. Yeah, I have almost the same inline call in

https://github.com/cboettig/piggyback/blob/ac5acf06f85e768347407c46b505792621fdb591/R/gh.R#L304-L306

I wonder if it would be worth including a default "public" token, the way devtools::github_pat() does, so that tasks not requiring auth don't run afoul of rate limiting on tokenless calls, or if that would create more confusion than benefit...

I'm 💯% for mini-packages for common utilities. Much better than duplicating the same utility code internally or exporting that functionality from a package in which it's orthogonal to the actual purpose/API of the package. ghurl 👀

@annakrystalli
Contributor

Thanks to @mpadge for volunteering and using the package already, we now have both reviewers! 🎉


Reviewers: @coatless @mpadge
Due date: 2018-07-12

@annakrystalli
Contributor

@cboettig also, I can't see the Under Review badge in the README anymore?

@cboettig
Member Author

@annakrystalli Karthik says badge server got restarted for upgrades, looks like it is back now.

@cboettig
Member Author

Hey @annakrystalli , thanks for the ping. I think this is mostly waiting for me to finish out ropensci/piggyback#8, which is about half-done on the timestamps branch (along with about 40% of the error message handling added there as well).

I take the thumbs up from @mpadge to mean he's okay with me leaving in push/pull and upload/download as separate APIs, and not having the low-level API wade into the additional complexity and possible failure points of dealing with a manifest, but maybe I'm just wishing that interpretation on it.

Though they may not be an ideal solution, support for timestamps will work on the low-level API (as well as the high-level); might even be better than the manifest/hash approach of the push/pull API, effectively making it the simplification Mark is looking for?

I'll ping again once I've merged that and dealt with error messages, or if I have any additional questions about the error messages. Mark's list of specific failure modes is super helpful so I think I can at least improve on the errors and have a revised version to review by, say, the end of the month.

@cboettig
Member Author

@mpadge Okay, I think I've hit all of these issues at this point.

You should get more helpful error messages now, see test-errors.R for examples.

  • In some cases I have opted for warnings rather than errors (where I can imagine things being looped over and not wanting breaking errors), maybe that should be enforced more strictly?
  • the messages should be much more helpful, but some are short of your ideal. Mostly this is because I wanted to avoid introducing additional API calls merely to debug, since that both slows things down and introduces more possible errors.

Regarding the unified interface, I think the support for timestamps by default in pb_upload / pb_download accomplishes much of the interface you might want out of pb_pull() / pb_push(). To take this one step further towards making pb_push/pb_pull dispensable, I've also modified pb_track: pb_track() now invisibly returns a list of all tracked files, and can optionally be called with no arguments. This might facilitate common workflows such as:

pb_track("*.csv") %>% pb_upload()

Since it still writes to .pbattributes it will remember previously tracked patterns:

pb_track("*.csv")
pb_track("data/*")
write_csv(mtcars, "data/mtcars.csv") # note file doesn't have to exist when pattern was added

pb_track() %>% pb_upload()

So the only thing pull and push do that you cannot get with the above workflow is the manifest hashes, which I'm still not sure are sufficiently robust to be that useful. Maybe pull and push should be dropped at this point, but I'm still on the fence. Feedback from you, @annakrystalli, @coatless etc. would be great on that call as well.

I think I've hit all the little things mentioned as well. I have a few references in the paper.md (see paper.bib), using pandoc.

@mpadge
Member

mpadge commented Jul 27, 2018

Thanks @cboettig. I'll take it out for a test ride in the coming week and report back then.

@coatless

@cboettig thanks for the update. I'll check it out this weekend.

@annakrystalli
Contributor

Hello @mpadge and @coatless! Any news on your test drives of the changes? Please let me know if you are happy that the changes address all your concerns.

@cboettig my personal feeling on:

So the only thing pull and push do that you cannot get with the above workflow is the manifest hashes, which I'm still not sure is sufficiently robust to be that useful. Maybe pull and push should be dropped at this point, but I'm still on the fence.

is to remove pull and push if they are largely redundant, and maybe return them later on if you are happy they are robust enough? If not, perhaps noting this background info somewhere in the documentation (so users can easily make an informed choice on which workflow best suits their needs) could also work. Ultimately I'd weigh @mpadge's and @coatless's feedback on this more though.

@mpadge
Member

mpadge commented Aug 15, 2018

Thanks @cboettig for the great improvements to the package, and sorry for the slight delay getting back to you. The following comments are mere suggestions that, in the subjective opinion of this reviewer, might just help to make it even more useful and/or easy to use.

Error messages and tags

As indicated by @cboettig, most previously hard-to-decipher error messages have been addressed, and the package is as a result considerably more user friendly. There nevertheless remain a couple of unhelpful error messages prompted by passing inappropriate arguments through to package dependencies, primarily httr. One simple one of these is:

pb_upload ("non-existent-file.Rds", repo = "mpadge/myrepo")

This still passes the non-existent file name through to httr and returns file.exists(path) is not TRUE. Please insert a simple catch at the start of pb_upload():

if (!file.exists (file))
    stop ("file ", file, " does not exist")

The remainder may be more effectively addressed through first extracting all tags for a repository and then using those throughout to check each tag parameter. The extraction of tags is very simple using the graphql interface. My suggestion would be to ask @maelle if you might kindly borrow her create_client function from the ghrecipes package:

create_client <- function(){
  token <- Sys.getenv("GITHUB_GRAPHQL_TOKEN")
  if (token == "")
  {
    stop ("Please set your GITHUB_GRAPHQL_TOKEN environment variable.")
  } else
  {
    if (!exists ("ghql_gh_cli")) {
      ghql_gh_cli <- ghql::GraphqlClient$new (
        url = "https://api.github.com/graphql",
        headers = httr::add_headers (Authorization =
                                     paste0 ("Bearer ", token))
      )
      ghql_gh_cli$load_schema()
      ghql_gh_cli <<- ghql_gh_cli
    }
    return (ghql_gh_cli)
  }
}

A get_tags function could then simply be written like this:

get_tags <- function (owner, repo)
{
    qtxt <- paste0 ('{
                    repository(owner:"', owner, '", name:"', repo, '") {
                        tags:refs(refPrefix:"refs/tags/", last:50) {
                            edges {
                                tag:node {
                                    name
                                }
                            }
                        }
                    }
    }')
    qry <- ghql::Query$new()
    qry$query('gettags', qtxt)
    dat <- create_client()$exec(qry$queries$gettags) %>% # `%>%` from magrittr
        jsonlite::fromJSON ()
    if (is.null(dat$data$repository)) stop(dat$errors$message)

    dat$data$repository$tags$edges
}
get_tags ("mpadge", "junk")

This obviously introduces another dependency (ghql), but I suspect this package will likely be improved at some stage through incorporating ghql calls anyway. These tags can be cached somehow on the first call for a given repo, and then re-used throughout a session. A simple

get_tags <- memoise::memoise (get_tags)

would do the trick (sorry, requiring yet another dependency, but I also really think the package will at some stage require some more sophisticated caching like this).

This would then allow the kind of prompt I suggested in the initial review along the lines of

pb_upload (..., tag = "non-existent-tag")
# Tag 'non-existent-tag' does not exist in that repository. Would you like it to be created?

And it would happen instantly, without recourse to any API calls.
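
A sketch of that check (this assumes get_tags() as written above; the exact shape of the parsed result may differ, and askYesNo is just one possible way to prompt):

tag_exists <- function (owner, repo, tag)
{
    tags <- get_tags (owner, repo)
    tag %in% tags$tag$name
}

# then, before hitting the API:
# if (!tag_exists (owner, repo, tag))
#     create_it <- utils::askYesNo (paste0 ("Tag '", tag, "' does not exist. Create it?"))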

This would also enable error messages for the following call:

pb_new_release(repo = "mpadge/junk", tag = "this-tag-already-exists")

At the moment this still returns the entirely illegible:

# Error in pb_new_release(repo = "mpadge/junk", tag = "this-tag-already-exists") :
#   Unprocessable Entity (WebDAV; RFC 4918) (HTTP 422).

other stuff

pb_list

This function still does not return tag information - did you intend to incorporate that? I would reiterate most of the comments I previously made regarding this function.

pb_delete

This call now returns an appropriate error message:

pb_delete (file = "non-existent-data", repo = "mpadge/junk", tag = "v1")
# non-existent-data not found on GitHub

As previously stated, it would still be nice to be able to delete an entire release. At the moment this is not possible:

pb_delete (repo = "mpadge/junk", tag = "v1")
# Error in asset_filename(file) :
#   argument "file" is missing, with no default

Any comments on potentially enabling that?

pb_track

This does indeed work as @cboettig suggests, but I note that pb_upload is not vectorised:

pb_track ("data/*") # where * matches >= 2 files
pb_track () %>% pb_upload (repo = "mpadge/junk", tag = "v1")
# Error in vapply(elements, encode, character(1)) :
#   values must be length 1,
#  but FUN(X[[1]]) result is length 2

It would be very useful to simply turn the current pb_upload into some kind of pb_upload_one function, and then have pb_upload be a simple vapply call to that.
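
Roughly something like this (a sketch: pb_upload_one is hypothetical, extra arguments are elided, and lapply is used here since the return value of pb_upload_one need not be a fixed type):

pb_upload <- function(file, repo = guess_repo(), tag = "latest", ...) {
  # loop the single-file implementation over the vector of files
  invisible(lapply(file, pb_upload_one, repo = repo, tag = tag, ...))
}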

I'll leave my comments there for the moment, and await feedback from @cboettig

@cboettig
Member Author

Thanks @mpadge, really appreciate the detailed feedback and agree these things should be done. My apologies that most of the things in "other stuff" look like oversights on my part; they should be straightforward to implement.

The only thing I'm on the fence on is the tags etc, not because I disagree about doing it, but because I think it's going to require more thought on the right way to do this. Like you say, ultimately this package should really be caching some relatively detailed metadata about the package. Of course this will require some more invasive surgery to implement...

I'm not sure memoise is the way to do this, since memoise invalidates the cache only when the function arguments change, and here the repo information is likely to change without that happening. Still, it would be a convenient way to do so for sure, effectively making the cache persist through a session, so it might be a decent starting point.

I think I can get all the tags easily enough from the standard REST API, instead of ghql. Will have to dig into this...

Mark, I'd also like to ask your permission to list you as a contributor to the package, if that's okay with you? Your detailed input has made a substantial contribution to the package.

@mpadge
Member

mpadge commented Aug 17, 2018

👍 Re: "ctb": 👍 of course! Thanks!

Re: tags + caching: Yeah, some invasive surgery will be required, and yeah, memoise is perhaps not ideal for the reasons you state, but then again pb_new_release is the only function that should change tags (at least until, or if, pb_delete(tag = ...) is incorporated), so consistent cache updating would just require re-running memoise(get_tags()) within pb_new_release.

And yeah, API3 would also be entirely sufficient, because tag info will likely never run into rate limits. I just got personally converted to API4 through @maelle's graphql evangelising, and don't plan on going back.

I hope all this back-and-forthing is okay with you, and I hope this doesn't come across as me being too picky. I really do just want to help make an already supremely useful package even cleaner and easier to use, and I strongly suspect that at least finding an initial system for interim caching will help make future package development much easier and smoother.

@annakrystalli
Contributor

Hi @cboettig! Really happy to see the discussions and development of this package so far. Feels like it is on a good road. Was just wondering whether you and @mpadge feel that an acceptable version of the package has been reached, and that we can therefore complete the review process, with further discussed development to be noted for future versions?

@cboettig
Member Author

@annakrystalli I think I still need to tick off the above items, at least the easy ones, and would like to run them by @mpadge one more time. Shouldn't be too much work, I've just been 🌊 by the start of the semester and need to get 🏊 again first!

@annakrystalli
Contributor

OK, not to worry. I'll keep an eye on the thread for progress. 💪 🏊‍♂️

@cboettig
Member Author

cboettig commented Sep 4, 2018

@mpadge Okay, think I've hit all of these. A few strategic questions remain before we close this out.

First, I now have just about all functions call the internal method, pb_info(), which can get all assets associated with all tags in a single API call. Still, I haven't done any caching on this, because I cannot figure out a robust strategy for that. The only time I think caching this would be safe is within a vectorized version of the functions, and maybe I should handle that case explicitly? Otherwise, one can never be sure if a user, for instance, creates or deletes a release using the GitHub web interface in between two consecutive function calls in an interactive session. Thoughts on this?

Second, I'm warming more and more to the idea of just dropping the whole pb_push, pb_pull functions (while maybe keeping pb_track()?) e.g. this would eliminate the use of a manifest.json and hashes to know if a file had changed -- relying totally on timestamps instead. Yes / no?

@coatless , any thoughts on these?

@coatless

coatless commented Sep 6, 2018

@cboettig a few. I need to sit down and type them up. Unfortunately, I've been slammed with the semester start and meetings. I'm going to try to carve out time on Friday.

@coatless

In response to the latest points raised...


Caching

Regarding caches, we must mention them in the frame of:

There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton

Having said this, I'm not sure caching needs to be attempted given the current rate limits. If the user is authenticated, which is a key requirement of many of piggyback's functions, the API can be queried 5000 times per hour. Users who do not supply a GH_TOKEN have 60 queries per hour (c.f. Rate Limiting Overview, Rate Limit Rules).

If you feel you must do some level of caching, the two heavy users of pb_info() are gh_download_asset() used in pb_download() and pb_upload_file() called by pb_upload(). The easiest way to cache results would be with memoise as @mpadge initially indicated. I say this because an option does exist to supply a timeout (c.f. timeout()).

As an example consider:

library(memoise)

# Create a memoized pb_info function that expires after 10 seconds
# Note: The presently exposed pb_info is renamed to local_pb_info in this case
pb_info <- memoise(local_pb_info, ~timeout(10))

# Grab repo information
pb_info("cboettig/piggyback")

Though, to avoid another package dependency, the cache setup could be done by maintaining your own closure with a timeout procedure. Slight code sketch:

# Makeshift caching
repo_cache <- function(timeout = getOption("pb.cache_timeout", 120)) {
  
  # Create internal tracking variables
  start_time <- repo_data <- repo <- tag <- .token <- NULL
  
  # Initialize Clock
  start = function() { start_time <<- Sys.time() }
  
  # Split Time
  time_point <- function() {
    # Difference between NOW and start in seconds
    difftime(Sys.time(), start_time, units = "secs")
  }
  
  # Store the repo data
  cache_data <- function() {
    repo_data <<- piggyback:::pb_info(repo = repo, tag = tag, .token = .token)
    start()

    repo_data
  }

  # Create initial cache (public-facing)
  set_cache_info <- function(repo = guess_repo(), tag = NULL, .token = piggyback:::get_token()) {
    repo <<- repo; tag <<- tag; .token <<- .token
    
    cache_data()
  }
  
  # Compare to time stamp
  # If bad, then flush cache
  check_cache_state <- function() {
    if( time_point() > timeout ){
       cache_data()
    }
  }
  
  # Retrieve value and reset if need be
  get_cache <- function() {
    check_cache_state()
    repo_data
  }
  
  # Return functions
  list(set_cache_info = set_cache_info, get_cache = get_cache )
}

# Allocate the closure
pb_cache = repo_cache()

# Setup repo caching
pb_cache$set_cache_info("cboettig/piggyback")
pb_cache$get_cache()

Misc note: Glancing over memoise, it provides an impressive cloud backend for storing information on S3 + GCP that could potentially be harnessed; however, I feel that is out of the initial scope for piggyback.


Narrowing the API

I'm largely in agreement with sunsetting both pb_push() and pb_pull(). These functions rely heavily on an underlying mental model of git. At that point, I think the user would likely just commit to git already, since the barrier to entry is significantly higher due to the need for a manifest.json file.


Package style

Style-wise, styler::style_pkg() found a few issues. I've spun up a PR for you to include the changes in ropensci/piggyback#14

Examples

Part of the piggyback repository is serving as a "checkpoint" for data (c.f. data tag). I wonder if such an example would be better suited inside a secondary repository. The secondary repository could then be linked from a code example in the vignette and/or README.md as an example of what is possible.


I have a few more remarks, but the night is no longer young. I'll aim to finish round 2 later in the week.

@cboettig
Member Author

cboettig commented Sep 18, 2018

Thanks @coatless for the detailed review.

  • Caching: memoise with timeout sounds very compelling to me! I think I should just do this with a very short time-out. That would give you the benefit of caching when it matters most, like looping over a large number of piggyback calls; while not caching if someone just has the same R session open for a long time.

  • Sunsetting pb_push / pb_pull -- Yes, I largely agree with this. I wouldn't agree with "user should just commit to git" because I think the user should probably already do that if files are << 50 MB. If they are greater than 50 MB, committing to git immediately breaks one's ability to push to GitHub, and is not easy to remedy either; a mistake I've seen dozens of times and was hoping piggyback would fix (users don't have to know manifest.json exists -- the overhead involved there is more of a problem for me the developer).

I think this really comes down to what semantics are clearest for the user now: functionally, pb_track() %>% pb_upload() is now the moral equivalent of pb_push(). I thought invoking the git metaphor would be helpful to inspire users to make use of functionality like pb_track() to automatically list, say, all *.csv files in a project. pb_upload() seems to imply that a user somehow has to remember which files need to be uploaded. If we do kill pb_push, I think fewer users may use pb_track() to create the .pbattributes file of all tracked data patterns. But maybe I can address this with sufficient documentation.

  • styler PR merged, thanks! (I cannot get my styler installation to automatically create linebreaks... not sure what's up there, but glad I now know about this package!)

Let me know your remaining remarks, looking forward to it!

@cboettig
Member Author

Okay, caching turned out to be a bit more of an adventure than I expected with memoise.

A 1 second cache on pb_info breaks a good number of tests and vignettes. Many of these issues were fixed as soon as I made sure functions that update the GitHub release (pb_upload, pb_release) would also break the cache automatically.

I also wanted a way to turn off caching; it turns out memoise::timeout(0) doesn't do what you expect, so I hacked around that. Now cache duration can be set as an env var, piggyback_cache_duration, and setting it to 0 will turn caching off.
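
For example (assuming the variable is read at call time; it could equally go in .Renviron):

# turn caching off for the current session (duration is in seconds)
Sys.setenv(piggyback_cache_duration = 0)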

I've also sunset pb_push and pb_pull. I've updated the docs to reflect this, also making the README briefer and creating two separate vignettes: one with the longer intro to functions that was largely in the README, and a separate vignette on alternative approaches to piggyback.

Remaining questions

One possibly significant issue I worry about for CRAN onboarding is the "write to temporary files" complaint. This is particularly likely to bite on pb_download() and I'm not sure how best to address it without losing the convenient default behavior based on the git repo root. Maybe since it defaults to usethis::proj_get() and not "." they will be okay with it or won't notice? (Arguably, by setting an RStudio Project the user has declared a location?) proj_get() will throw an error if the user is not in a project, so my argument is that this is not an actual default location (unlike here(), which doesn't error). Thoughts on this?

@mpadge
Member

mpadge commented Sep 19, 2018

Thanks @cboettig for doing the hard work for us all on memoise::timeout() - very useful to know! And as you know, I think the retirement of pb_push/pull is a strong boost for ease and consistency of use - thanks.

As for file writing, I don't have any great insights, and would love to be assured that what you do is permissible. I note only that:

  1. I just scoured the code for usethis, and it seems they exclusively use file_temp() for writing with default tmp_dir = tempdir() at all times (yes, even when testing proj_get(), as in this example); and
  2. CRAN have very strong auto-checks for any attempts to write to non-standard locations which have caught me out previously, so current usage will definitely be noticed. Have you tried testing on all rhub platforms to see whether any of those flag anything?

An after-thought: How about having the default location be usethis::proj_get(), but enabling manual override so all tests could then use tempdir()?

@cboettig
Member Author

@mpadge Thanks for your thoughts on this. Yeah, pb_download already supports specifying dest = tempdir() while defaulting to dest = usethis::proj_get().
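
e.g. a typical call in the tests now looks roughly like this (sketch):

pb_download(repo = "cboettig/piggyback-tests", dest = tempdir())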

I've tried to consistently make sure tests, examples, and vignettes use tempdir in the calls (well, at least tests; I probably need another pass over this to add it to examples and vignettes). I don't really like this because it makes the examples look more cumbersome than they really are, but that is somewhat unavoidable I guess. (It's not just tempdir(): following @coatless's very excellent point that I should use a different repo for testing, all functions now include a manual declaration of repo = "cboettig/piggyback-tests" as well.)

My question is though, do you think that's good enough? I'm worried that having any default location at all is technically a violation of policy. (e.g. in the base function download.file(), dest has no default.)

(Though I still feel this is not technically a default location, since proj_get() throws an error unless the user has set a project).

@cboettig
Member Author

@annakrystalli @mpadge @coatless

Okay, I think I basically have things where I want them now in response to all the super excellent feedback. You have all made this a way more usable, useful, and robust package. I'm sure it's not perfect, but I think we've hammered out a nice API and functionality to get people started.

I think we have the 👍 from Mark to onboard, but we're still waiting on final confirmation from James, yes?

@annakrystalli Remind me of any additional steps I need to do for the JOSS part? I just go over there and open an issue pointing back here? Mark/Jim, are you 👍 with the JOSS submission?

@coatless

👍 to onboarding.

@annakrystalli
Contributor

annakrystalli commented Sep 21, 2018

Approved! So happy to see so much great work done on this package through review. Many thanks @cboettig for submitting and @coatless & @mpadge for your reviews! 😺

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.

To-dos:

  • Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You'll be made admin once you do.
  • Add the rOpenSci footer to the bottom of your README
    " [![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)"
  • Fix any links in badges for CI and coverage to point to the ropensci URL. We no longer transfer Appveyor projects to ropensci Appveyor account so after transfer of your repo to rOpenSci's "ropensci" GitHub organization the badge should be [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pkgname?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/pkgname).
  • We're starting to roll out software metadata files to all ropensci packages via the Codemeta initiative, see https://github.com/ropensci/codemetar/#codemetar for how to include it in your package. After installing the package, it should be as easy as running codemetar::write_codemeta() in the root of your package.
  • Activate Zenodo watching the repo
  • Tag and create a release so as to create a Zenodo version and DOI
  • Submit to JOSS using the Zenodo DOI. We will tag it for expedited review.

As for more details on the JOSS process, once you've got your DOI from Zenodo, use it to submit to JOSS here http://joss.theoj.org/papers/new.

I believe this triggers a [PRE-REVIEW] issue. I'll flag it as accepted to ropensci once the issue is created. 👍

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or long-form post with more narrative about its development. (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.

We've started putting together a gitbook with our best practice and tips; this chapter starts the 3rd section, which covers guidance for after onboarding. Please tell us what could be improved; the corresponding repo is here.

@mpadge
Member

mpadge commented Sep 21, 2018

Thanks so much for the outstanding improvements @cboettig - I'm really looking forward to more tightly integrating this package within my workflow, and to contributing further issues to continue the constructive conversation. Congratulations!

@stefaniebutland
Member

@cboettig please do let me know if you'd like to do a post or tech note on this or arkdb!

@cboettig
Member Author

@stefaniebutland Thanks, yes I'd love to do a blog post on piggyback, maybe combined with a bit about arkdb. I was thinking I might focus on why I wrote these two little packages, more than the technical details (each package basically has only two functions, essentially data_in and data_out), so they are conceptually simple anyway.

@stefaniebutland
Member

why I wrote these two little packages

yes!!

Right now Tuesday 2018-10-16 or subsequent Tuesdays are available. Take your pick and then please submit via pull request a week prior to the publication date. Some guidelines are here: https://github.com/ropensci/roweb2#contributing-a-blog-post
