submission: piggyback #220
Editor checks:

**Editor comments**

Hi @cboettig 👋, thanks for the submission and great to be working with another of your packages. I believe I remember you talking about this problem. A nice solution to it! Anyway, see below the result of my initial checks. I'm sure it's nothing that can't be fixed, so I'm going to go ahead and approach reviewers. I'll take your suggestions on board. One small suggestion on the README, for storing

**checks**: OK apart from 1 failing test.

**tests**

**spell check**: flagging potential typos:

**coverage**: Unable to check because of test failure.

Also, could you add the review badge to the README? 👍
@annakrystalli Thanks!
I'm not sure how to change the |
Aha 🤔, not sure how feasible it is, but it feels like the only people that test should run for are the repo owner and any collaborators. So perhaps the test condition check could include a check that the gh user matches one of

Also, all tests passing after disabling
Also, current coverage,
As you mentioned, some tests didn't run. What sort of coverage are you getting? |
Sorry just found link to codecov. Ignore comment above. |
@cboettig - I use "opt in" environment variables for tests like this, something like |
Or universal local ones: I just set |
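The opt-in pattern described above can be sketched like this (the variable name `PIGGYBACK_LONG_TESTS` and the helper are hypothetical; the actual names used in the thread are elided):

```r
# Hypothetical opt-in guard: tests that need write access run only when
# the developer has explicitly set an environment variable, so forks,
# CI on untrusted PRs, and CRAN all skip them automatically.
run_long_tests <- function() {
  identical(Sys.getenv("PIGGYBACK_LONG_TESTS"), "true")
}

# In a testthat file, each guarded test would start with something like:
# if (!run_long_tests()) testthat::skip("set PIGGYBACK_LONG_TESTS=true to run")
```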
Thanks all, that's brilliant. And after a few boneheaded attempts I managed to pass the secure+public env vars to Travis correctly this time.
**Package Review**

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

**Documentation**

The package includes all the following forms of documentation:

**Functionality**

**Final approval (post-review)**
Estimated hours spent reviewing: 1.5 hrs

**Review Comments**

**Environment Variable Access**

The environment variable for the Personal Access Token (PAT),

Note that I think it would also be advantageous to mimic the

**Package API**

The API for the package dances back and forth between treating the repository as the data and the data as a file. I can understand the use cases for both; but would it be better practice to emphasize the repository as the data? This would simplify the interface to mimic pre-existing APIs, e.g.

Package API:

```r
library("piggyback")
lsf.str("package:piggyback")
#> pb_delete : function (repo = guess_repo(), tag = "latest", file, .token = get_token())
#> pb_download : function (file = NULL, dest = ".", repo = guess_repo(), tag = "latest",
#>     overwrite = TRUE, ignore = "manifest.json")
#> pb_download_url : function (file = NULL, repo = guess_repo(), tag = "latest")
#> pb_list : function (repo = guess_repo(), tag = "latest", ignore = "manifest.json")
#> pb_new_release : function (repo = guess_repo(), tag, commit = "master", name = tag,
#>     body = "Data release", draft = FALSE, prerelease = FALSE, .token = get_token())
#> pb_pull : function (repo = guess_repo(), tag = "latest", overwrite = TRUE, manifest = ".manifest.json")
#> pb_push : function (repo = guess_repo(), tag = "latest", overwrite = TRUE, manifest = ".manifest.json")
#> pb_track : function (glob)
#> pb_upload : function (file, repo = guess_repo(), tag = "latest", name = NULL,
#>     overwrite = FALSE, .token = get_token())
```

**Function Parameters**

Another notable issue is the placement of the positional parameter after defaults, e.g.

```r
pb_delete(repo = guess_repo(), tag = "latest", file, .token = get_token())
```

Consider putting

**Function Exports**

It may be beneficial to also export and document

**Documentation**

Outside of that, it seems like there are a few artifacts of an older API in the documentation examples, and some exported functions are missing documentation:

Missing ex:

Outdated:
@annakrystalli apologies for jumping the gun. I had a use case in mind when I saw the package. So, I just quickly took down a few notes. |
No worries at all @coatless! I'm not going to complain about having your review in! In fact, having a relevant use case handy often makes for a really useful review. |
@coatless thanks, this is great! How'd your use case go? I've fixed the missing and outdated examples. I've made most of the examples into

Re the "file" vs "repo" interface, that's very intentional -- but perhaps I need to document it better. If I'm just trying to debug an example on some data, I'd rather use a file-based interface to quickly share the file over, say, a generic "scratch" repository (regardless of my current working repo, or if I'm not "in" any repo). Locally I can do:

```r
pb_upload("mtcars.tsv.gz", repo = "cboettig/scratch")
pb_download_url("mtcars.tsv.gz", repo = "cboettig/scratch")
```

Then I can post a question like: "Hey Jim, why does this fail:"

```r
read.table("https://github.com/cboettig/scratch/releases/download/data/mtcars.tsv.gz")
```

This needs the more file-based workflow of upload/download/download_url/list/delete. Other times, I'm working in a particular repository with data connected to that project, and I really want the more "git-lfs"-style workflow with the ability to track, say, all

Re

Re exporting
Re: GitHub PAT

FWIW in usethis I have basically inlined

It's not great but I don't know what else to do.

Re:

FWIW the tidyverse/r-lib team has Git remote URL parsing all over the place, which we would dearly love to centralize. We plan to make a mini-package for this (https://github.com/r-lib/ghurl/issues). It seems incredibly specific, but actually many other languages have such a thing. It comes up a lot! I think you could use your own function for now, but not encourage external use, and hopefully we'll make this other thing available to ourselves and all.
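A minimal version of the kind of remote-URL parsing under discussion might look like this (the function name and regex are illustrative only, not ghurl's or usethis's actual implementation):

```r
# Extract "owner/repo" from common GitHub remote URL forms:
# ssh   git@github.com:owner/repo.git
# https https://github.com/owner/repo
parse_gh_remote <- function(url) {
  m <- regmatches(url, regexec("github\\.com[:/]([^/]+)/([^/.]+)", url))[[1]]
  if (length(m) < 3) stop("Not a recognized GitHub remote: ", url)
  paste0(m[2], "/", m[3])
}

parse_gh_remote("git@github.com:cboettig/piggyback.git")
parse_gh_remote("https://github.com/cboettig/piggyback")
```

Both calls above return `"cboettig/piggyback"`; the real problem has many more URL variants, which is exactly why centralizing it in one mini-package is attractive.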
Thanks @jennybc, this is great. Yeah, I have almost the same inline call in

I wonder if it would be worth including a default "public" token, the way

I'm 💯% for mini-packages for common utilities: much better than duplicating the same utility code internally, or exporting that functionality from a package in which it's orthogonal to the actual purpose/API of the package. ghurl 👀
@cboettig also, I can't see the Under Review badge in the README anymore? |
@annakrystalli Karthik says badge server got restarted for upgrades, looks like it is back now. |
Hey @annakrystalli, thanks for the ping. I think this is mostly waiting for me to finish out ropensci/piggyback#8, which is about half-done on the

I take the thumbs up from @mpadge to mean he's okay with me leaving in

Though they may not be an ideal solution, support for timestamps will work on the low-level API (as well as the high-level); it might even be better than the manifest/hash approach of the push/pull API, effectively making it the simplification Mark is looking for?

I'll ping again once I've merged that and dealt with error messages, or if I have any additional questions about the error messages. Mark's list of specific failure modes is super helpful, so I think I can at least improve on the errors and have a revised version to review by, say, the end of the month.
@mpadge Okay, I think I've hit all of these issues at this point. You should get more helpful error messages now, see

Regarding the unified interface, I think the support for timestamps by default in

Since it still writes to

```r
pb_track("*.csv")
pb_track("data/*")
write_csv(mtcars, "data/mtcars.csv") # note: the file doesn't have to exist when the pattern is added
pb_track() %>% pb_upload()
```

So the only thing

I think I've hit all the little things mentioned as well. I have a few references in the paper.md (see paper.bib), using pandoc.
Thanks @cboettig. I'll take it out for a test ride in the coming week and report back then.
@cboettig thanks for the update. I'll check it out this weekend. |
Hello @mpadge and @coatless! Any news on your test drives of the changes? Please let me know if you are happy that the changes address all your concerns. @cboettig my personal feeling on:
is to remove |
Thanks @cboettig for the great improvements to the package, and sorry for the slight delay getting back to you. The following comments are mere suggestions that, in the subjective opinion of this reviewer, might just help to make it even more useful and/or easier to use.

**Error messages and tags**

As indicated by @cboettig, most previously hard-to-decipher error messages have been addressed, and the package is as a result considerably more user friendly. There nevertheless remain a couple of unhelpful error messages prompted by passing inappropriate arguments through to package dependencies, primarily
This still passes a non-existent file name through to
The remainder may be more effectively addressed through first extracting all tags for a repository and then using those throughout to check each
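The tag-check idea could be sketched as follows (`check_tag` is a hypothetical helper; a real implementation would fetch `known_tags` once from the GitHub API and reuse them throughout):

```r
# Validate a user-supplied tag against the repository's known tags,
# suggesting the closest approximate match in the error message.
check_tag <- function(tag, known_tags) {
  if (tag %in% known_tags) return(invisible(tag))
  # agrep does approximate (fuzzy) matching against the known tags
  nearest <- agrep(tag, known_tags, value = TRUE, max.distance = 0.3)
  hint <- if (length(nearest) > 0) paste0(" Did you mean '", nearest[1], "'?") else ""
  stop("No release tag '", tag, "' found.", hint, call. = FALSE)
}
```

With the tags cached locally, a typo like `"latst"` would fail immediately with a "Did you mean 'latest'?" prompt instead of an opaque API error.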
A
This obviously introduces another dependency (
would do the trick (sorry, requiring yet another dependency, but I also really think the package will at some stage require some more sophisticated caching like this). This would then allow the kind of prompt I suggested in the initial review along the lines of
And it would happen instantly, without recourse to any API calls. This would also enable error messages for the following call:
At the moment this still returns the entirely illegible:
**other stuff**

**pb_list**

This function still does not return tag information; did you intend to incorporate that? I would reiterate most of the comments I previously made regarding this function.

**pb_delete**

This call now returns an appropriate error message:
As previously stated, it would still be nice to be able to delete an entire release. At the moment this is not possible:
Any comments on potentially enabling that?

**pb_track**

This does indeed work as @cboettig suggests, but I note that

It would be very useful to simply make current

I'll leave my comments there for the moment, and await feedback from @cboettig.
Thanks @mpadge, really appreciate the detailed feedback, and agree these things should be done. My apologies that most of the things in "other stuff" look like oversights on my part; they should be straightforward to implement.

The only thing I'm on the fence about is the tags etc., not because I disagree about doing it, but because I think it's going to require more thought on the right way to do it. Like you say, ultimately this package should really be caching some relatively detailed metadata about the package. Of course this will require some more invasive surgery to implement... I'm not sure

I think I can get all the tags easily enough from the standard REST API, instead of

Mark, I'd also like to ask your permission to list you as a contributor to the package, if that's okay with you? Your detailed input has made a substantial contribution to the package.
👍 Re: "ctb": 👍 of course! Thanks! Re: tags + caching: Yeah, some invasive surgery will be required, and yeah, And yeah, API3 would also be entirely sufficient, because tag info will likely never run into rate restriction limits. I just got personally converted to API4 through @maelle's graphql evangelising, and don't plan on going back. I hope all this back-and-forthing is okay with you, and I hope this doesn't come across as me being too picky. I really do just want to help make an already supremely useful package even cleaner and easier to use, and I strongly suspect that at least finding an initial system for interim caching will help make future package development much easier and smoother. |
Hi @cboettig! Really happy to see the discussions and development of this package so far. Feels like it is on a good road. Was just wondering whether you and @mpadge feel that an acceptable version of the package has been reached and therefore can complete the review process, with further discussed development to be noted for future versions? |
@annakrystalli I think I still need to tick off the above items, at least the easy ones, and would like to run them by @mpadge one more time. Shouldn't be too much work, I've just been 🌊 by the start of the semester and need to get 🏊 again first! |
OK, not to worry. I'll keep an eye on the thread for progress. 💪 🏊♂️ |
@mpadge Okay, think I've hit all of these. A few strategic questions remain before we close this out.

First, I now have just about all functions call the internal method,

Second, I'm warming more and more to the idea of just dropping the whole

@coatless, any thoughts on these?
@cboettig a few. I need to sit down and type them up. Unfortunately, I've been slammed with the semester start and meetings. I'm going to try to carve out time on Friday. |
In response to the latest points raised...

**Caching**

Regarding caches, we must mention them in the frame of:
Having said this, I'm not sure caching needs to be attempted given the current rate limits. If the user is authenticated, which is a key requirement of many of

If you feel you must do some level of caching, the two heavy users of

As an example, consider:

```r
library(memoise)

# Create a memoized pb_info function that expires after 10 seconds.
# Note: the presently exposed pb_info is renamed to local_pb_info in this case.
pb_info <- memoise(local_pb_info, ~timeout(10))

# Grab repo information
pb_info("cboettig/piggyback")
```

Though, to avoid another package dependency, the cache setup could be done by maintaining your own closure with a timeout procedure. Slight code sketch:

```r
# Makeshift caching
repo_cache <- function(timeout = getOption("pb.cache_timeout", 120)) {
  # Create internal tracking variables
  start_time <- repo_data <- repo <- tag <- .token <- NULL

  # Initialize clock
  start <- function() { start_time <<- Sys.time() }

  # Split time: difference between NOW and start, in seconds
  time_point <- function() {
    difftime(Sys.time(), start_time, units = "secs")
  }

  # Store the repo data
  cache_data <- function() {
    repo_data <<- piggyback:::pb_info(repo = repo, tag = tag, .token = .token)
    start()
    repo_data
  }

  # Create initial cache (public-facing)
  set_cache_info <- function(repo = guess_repo(), tag = NULL, .token = piggyback:::get_token()) {
    repo <<- repo; tag <<- tag; .token <<- .token
    cache_data()
  }

  # Compare to time stamp; if stale, flush the cache
  check_cache_state <- function() {
    if (time_point() > timeout) {
      cache_data()
    }
  }

  # Retrieve value, refreshing if need be
  get_cache <- function() {
    check_cache_state()
    repo_data
  }

  # Return functions
  list(set_cache_info = set_cache_info, get_cache = get_cache)
}

# Allocate the closure
pb_cache <- repo_cache()

# Set up repo caching
pb_cache$set_cache_info("cboettig/piggyback")
pb_cache$get_cache()
```

Misc note: glancing over

**Narrowing the API**

I'm largely in agreement with sunsetting both

**Package style**

Style-wise, the

**Examples**

Part of the

I have a few more remarks, but the night is no longer young. I'll aim to finish round 2 later in the week.
Thanks @coatless for the detailed review.
I think this really comes down to what semantics are clearest for the user now: functionally,
Let me know your remaining remarks, looking forward to it! |
Okay, caching turned out to be a bit more of an adventure than I expected with memoise. A 1-second cache on

I also wanted a way to turn off caching; turns out

I've also sunset

**Remaining questions**

One possibly significant issue I worry about for CRAN onboarding is the "write to temporary files" complaint. This is particularly likely to bite on
Thanks @cboettig for doing the hard work for us all on

As for file writing, I don't have any great insights, and would love to be assured that what you do is permissible. I note only that:
An afterthought: how about having the default location be
@mpadge Thanks for your thoughts on this. Yeah, I've tried to consistently make sure tests, examples, and vignettes use tempdir in the calls (well, at least tests; I probably need another pass over this to add it to examples and vignettes). I don't really like this because it makes the examples look more cumbersome than they really are, but that is somewhat unavoidable, I guess. (It's not just

My question is, though: do you think that's good enough? I'm worried that having any default location at all is technically a violation of policy (e.g. in base function

(Though I still feel this is not technically a default location, since
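One way to frame the tempdir question in code (the helper name `safe_dest` is hypothetical, not part of piggyback):

```r
# Sketch: resolve a download destination inside the session temp
# directory by default, so examples and tests never write to the
# user's working directory unless a dest is given explicitly.
safe_dest <- function(file, dest = tempdir()) {
  file.path(dest, basename(file))
}

safe_dest("data/mtcars.tsv.gz")         # lands under tempdir()
safe_dest("mtcars.tsv.gz", dest = ".")  # explicit opt-in to the working directory
```

Whether a `tempdir()` default in a function signature counts as a "default location" under CRAN policy is exactly the open question here.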
@annakrystalli @mpadge @coatless Okay, I think I basically have things where I want them now, in response to all the super excellent feedback. You have all made this a way more usable, useful, and robust package. I'm sure it's not perfect, but I think we've hammered out a nice API and functionality to get people started.

I think we have the 👍 from Mark to onboard, but we're still waiting on final confirmation from James, yes?

@annakrystalli Remind me of any additional steps I need to do for the JOSS part? Do I just go over there and open an issue pointing back here? Mark/Jim, are you 👍 with the JOSS submission?
👍 to onboarding. |
Approved! So happy to see so much great work done on this package through review. Many thanks @cboettig for submitting, and @coatless & @mpadge for your reviews! 😺

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them

To-dos:
As for more details on the JOSS process: once you've got your DOI from Zenodo, use it to submit to JOSS here: http://joss.theoj.org/papers/new. I believe this triggers a [PRE-REVIEW] issue. I'll flag it as accepted to ropensci once the issue is created. 👍 Welcome aboard!

We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.

We've started putting together a gitbook with our best practices and tips; this chapter starts the third section, which is about guidance for after onboarding. Please tell us what could be improved; the corresponding repo is here.
Thanks so much for the outstanding improvements @cboettig - I'm really looking forward to more tightly integrating this package within my workflow, and to contributing further issues to continue the constructive conversation. Congratulations! |
@cboettig please do let me know if you'd like to do a post or tech note on this or |
@stefaniebutland Thanks, yes I'd love to do a blog post on |
yes!! right now Tuesday 2018-10-16 or subsequent Tuesdays are available. Take your pick and then please submit via pull request a week prior to publication date. Some guidelines are here: https://github.com/ropensci/roweb2#contributing-a-blog-post |
Summary
Allow large and binary data files to "piggyback" on top of your existing repositories.
`push` and `pull` large-ish (< 2GB) data files to & from GitHub repositories as attachments to a GitHub release.

Paste the full DESCRIPTION file inside a code block below:
https://github.com/cboettig/piggyback
`reproducibility`, because accessing the data being analyzed is essential for reproducible workflows, and yet we have no good solution for workflows with unpublished data, or private workflows, once the data is too large for version control (e.g. files > 50 MB).

The target audience is anyone working with data files on GitHub.
yours differ or meet our criteria for best-in-category?
`datastorr` on `ropenscilabs` is the closest match; it takes a very different approach (from the user perspective -- on the back end both store data on GitHub assets) to essentially the same problem. The Intro vignette discusses at greater length many of the alternative possible strategies and why I feel they have all fallen short of my needs and led me to create this package.

Requirements
Confirm each of the following by checking the box. This package:
Publication options

A `paper.md` matching JOSS's requirements, with a high-level description, in the package root or in `inst/`.

Detail
Does `R CMD check` (or `devtools::check()`) succeed? Paste and describe any errors or warnings:

No errors, notes, or warnings.
Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
Rich FitzJohn, @richfitz, would be great based on his experience in this area and with `datastorr`. Jenny Bryan, @jennybc, since this package makes heavy use of `usethis` and GitHub interactions.