Bowerbird #139
Editor checks:
Editor comments

Thanks for your submission @raymondben! I'm currently seeking one reviewer (already found one)! Here is the output of goodpractice:

It is good practice to
✖ write unit tests for all functions, and all package code in general. 79% of
code lines are covered by test cases.
R/aadc_eds_handler.R:14:NA
R/aadc_eds_handler.R:15:NA
R/aadc_eds_handler.R:16:NA
R/aadc_eds_handler.R:17:NA
R/aadc_eds_handler.R:27:NA
... and 385 more lines
✖ write short and simple functions. These functions have high cyclomatic
complexity: bb_source (56).
✖ avoid long code lines, it is bad for readability. Also, many people prefer
editor windows that are about 80 characters wide. Try to make your lines
shorter than 80 characters
R\aadc_eds_handler.R:28:1
R\aadc_eds_handler.R:32:1
R\aadc_eds_handler.R:36:1
R\aadc_eds_handler.R:38:1
R\aadc_eds_handler.R:40:1
... and 480 more lines
✖ avoid calling setwd(), it changes the global environment. If you need it,
consider using on.exit() to restore the working directory.
R\aadc_eds_handler.R:44:5
R\amps_handler.R:44:9
R\provenance.R:50:5
R\sync.R:63:5
R\utils.R:77:5
... and 2 more lines
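The `on.exit()` safeguard that goodpractice recommends can be sketched generically as follows (this shows the pattern only, not bowerbird's actual code; the function and argument names are illustrative):

```r
## Save the caller's working directory and guarantee it is restored on
## any exit from the function, including an error partway through
with_dir <- function(path, expr) {
    old_wd <- getwd()
    on.exit(setwd(old_wd), add = TRUE)  # restore even if expr errors
    setwd(path)
    expr  # lazily evaluated here, with `path` as the working directory
}

## e.g. list files in a temporary directory without permanently
## changing the session's working directory
with_dir(tempdir(), list.files())
```

This is essentially what `withr::with_dir()` provides, if adding a dependency is acceptable.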
Reviewers: @MilesMcBain @lwasser
Minor comment on the goodpractice output, which I meant to include in the initial submission:
OK, makes sense @raymondben, thanks for this comment!
Thanks for agreeing to review this package @MilesMcBain @lwasser! 😸 Your reviews are due on 2017-09-04. As a reminder here are links to the recently updated reviewing and packaging guides and to the review template.
With the Editor's permission, I am posting some feedback ahead of a full review in the interests of discussing potential structural changes as early as possible. The issue of separating the core functionality from data sources was raised by the editors and the more I poke around under the hood, the more I am convinced this will also be my recommendation. Part of my motivation is selfish. The number of sources
Thanks @MilesMcBain!
Thanks @MilesMcBain. I'll confess I'm not 100% sold on the splitting idea, but I'm also not vehemently against it and your points are good. @mdsumner and I will have a chat and see what we come up with. In the meantime, I suggest that you and @lwasser don't spend too much more time on it, since it's likely to look fairly different if we do split it.
OK, so: on reflection, we agree that splitting is probably a good idea. @maelle, what's the preference here? Do you want to close this issue and we resubmit, or do we just carry on here? (The split version --- not entirely finalized yet --- is currently sitting in the "split" branch.) Note that we'll only submit the core part of the package at this stage, including a few example data sources. The extended collection of data sources will be hived off into a separate package for further work when time permits, and possible submission. For completeness, some thoughts on Miles' comments above:
Moving sources into one or more other packages doesn't reduce the maintenance burden, it just spreads it across packages. We see occasional but not overwhelming changes to remote servers that require tweaks to the handlers, but over the 10+ years that we've been doing this (in various guises) we haven't found it to be a showstopper. However, splitting the package (into "core" plus other data-themed package(s) that depend on it) should indeed mean that the core components will remain more stable. That alone seems like a pretty good argument in favour of the split.
Yes, agreed, but it's worth exploring this a bit. The aim of bowerbird is to be flexible enough to be able to handle a wide range of data sources, and to allow new data sources to be added fairly easily. Hence bowerbird itself doesn't generally attempt to wrap syntactic sugar around the wget arguments, because it's not possible to do it in a general manner. It's not just dates you might want to subset, it could also be geographic space, or vertical levels within a climate model, or different file formats, or any number of other facets depending on the data set --- and different data providers will structure their data directories and files differently. There's simply not enough consistency. But in most cases this sort of subsetting isn't necessary, and the wget calls are fairly straightforward --- so that aim of bowerbird (ability to add new data sources fairly easily) generally seems to hold up. More complicated examples will always arise, but that's just the nature of the task. Now, for sure, nicer function arguments could be part of more targeted packages (that might themselves depend on bowerbird) --- but doing so immediately makes the maintenance burden high because of the complexity and diversity of those options, and hence such packages tend to focus on a small number of data providers. So where is it best to invest that effort? Is it at the data-fetching level? Maybe. But a data-fetching package might well be used to maintain a data library on behalf of a number of users (this is what we do, anyway). In this scenario you're generally likely to want all files that are part of a given collection, because it's probably impossible to know a priori which ones your users will want. And if it's a centrally-maintained library within an organisation, then only a small number of admin users need to know about the data-fetching internals.
Perhaps investing effort in polishing these interfaces is not going to benefit a huge number of people, and it might be better to invest the nicer-function-argument effort at the next level, at the stage of reading data from those files into R. General-purpose packages like Anyway (sorry, that was a bit of a monologue!) none of that argues against splitting the package. But it does encourage some thought as to the best way to build upwards.
@raymondben thanks! I first read this too quickly and thought we'd have two submissions instead of one which isn't the case. I'll put the submission on hold while you do the splitting, and after that @MilesMcBain & @lwasser can review the core package. |
Thanks for your response @raymondben. Firstly, one small point about the maintenance: With the split design you get much more freedom. You may not even put all your datasources on CRAN, you can choose to leave some on GitHub where the expectations about maintenance are much lower. In this way the maintenance burden could be reduced. Secondly, I did get some of the way through my review before we put this on hold, so I thought it might be helpful to have the higher level feedback now while things are being actively worked on: Following bowerbird.Rmd
This kicked off a 2.5Gb download that blocked my R session which I had to eventually kill. Not a great first user experience ;). I know down the page you point out download size for this source can be reduced. But the total size of the source is not mentioned until the source list at the bottom. I would like to make the following recommendations:
A question: is there a way to check sync status in interactive mode without fetching these large repos?

Programming model
Unlike I would also make the comment that I do not think this is the idiomatic Finally, the
Thanks @MilesMcBain - a quick question on the config arrangement (and, yes, it is unfortunately easy to unknowingly lose the attributes). The reason for doing it that way is because those attributes apply to the whole configuration (and thus should only be stored once, not in every row) whereas the rows hold data-source-specific information that is not necessarily common across rows. An alternative structure that we considered would be to create a list object that has something like a
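The fragility being discussed is generic to attribute-carrying data frames: most data.frame operations silently drop user-set attributes. A minimal illustration (the attribute name here is made up for the example, not bowerbird's actual config structure):

```r
## Custom attributes on a data.frame do not survive common operations
## such as row subsetting
cf <- data.frame(name = c("sourceA", "sourceB"))
attr(cf, "local_file_root") <- "~/data/library"   # repo-wide setting

cf_sub <- cf[cf$name == "sourceA", , drop = FALSE]  # keep one row
is.null(attr(cf_sub, "local_file_root"))            # TRUE: attribute lost
```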
We might commit more deeply to a nested-tibble model for this one-to-many attribute problem, but I'm not sure that covers everything, definitely something that we haven't been sure about so the feedback is very welcome. Also, @MilesMcBain I'm very sympathetic to this user-problem for a 2Gb download, but we are in danger of diluting the entire point of the abstraction of bowerbird - in that it's not ever going to be immediately obvious from a user-perspective why you'd want this package, since the value comes after you've decided to commit to getting all or most of a particular collection - especially those parts that aren't yet published. Note that while the first-time sync-run is inconvenient, if it successfully completes then when it runs next time, tomorrow, next week it will be hugely fast as it will only check to see that it already has all but the handful of brand-new files from the source, and it will dutifully obtain those new files without any intervention by a human. We could choose a weekly SST data set, that would be easily downloadable in a single file - the initial commitment for download is minutes or less - and the real value of bowerbird will become apparent next week when that file is updated by the provider, and the automated bowerbird process (that you configure to run nightly, or so) kicks in and detects a change. But, not many or most users will care about SST as a variable of interest, and those that do will want much higher resolution in time and space, before they care. So, by definition and rather unfortunately the examples that are most compelling and immediately-reproducible are not going to have an actual audience, though they should have an abstract appeal in terms of what else can be done with this tool that is "relevant to me". I do want to ensure we discuss this audience definition, since we are ultimately targeting people like us, those that commit to a large and continuously maintained collection for shared use.
(We are also trying to highlight that committing to this is very valuable, an actual game changer for the way we think and approach these issues). I definitely agree about the "yesno" prompt for whether you really want to download a potentially huge collection, but again that interactivity, normal for "usual use", kind of gets in the way of the task of maintaining a huge collection of automatically updated data. We do see huge value in specific package applications, to provide smarts about available dimensional and variable filters, but we unavoidably end up in very specific areas. I am a bit wary of illustrating the value of bowerbird by exploring a particular data set to the level of actually reading it because as soon as we go that far we've moved away from the bowerbird abstraction of being an automatable, configure-anything, data-getter. Thanks for the discussion and feedback! (I do think the weekly SST example is a good one, and @raymondben and I should look carefully at a use-case for illustrating a full workflow - and probably we need to tell the story of the pending updates that an easily-configured bowerbird robot will sort out for us tomorrow, next week, in the months ahead).
@raymondben @mdsumner just checking in. 😸 Have you made enough progress on the splitting for @MilesMcBain & @lwasser to soon review the package? Thanks!
Hi guys, sorry, we've been held up - in part by a bug in sys but Jeroen has kindly resolved that now. I don't think we're far off, hopefully get back to you this week.
Ok thanks for the update!
Sorry for the hiatus, think we're OK to resume the review process now. The major change is that the data source definitions have been removed, except for a few representative examples. (The full set of definitions has moved to https://github.com/AustralianAntarcticDivision/blueant. This is a separate package and not part of this ropensci submission, but it's there should you want to get some idea of how a themed-data-package might make use of bowerbird.)
Some replies to Miles' partial review:
Yah, fair enough. The example is now a small one, and in addition we've switched from using
No, it's not exclusively this. See updated comments in README. Also, we can't give progress indicators (also discussed in README).
I'm not sure what you mean by this, sorry.
That structure has now changed and doesn't rely on attributes to carry the repo-wide settings. All exported function names are now prepended with bb_.

Update: fix found for Travis issue.
Thanks @raymondben! @MilesMcBain & @lwasser, you can now proceed with the review. 😺
Note: I've updated the reviews due date to 2017-10-30.
Thanks for the update @raymondben. I look forward to reviewing the new version. Regarding this comment:
Since the package is intended to be used interactively I thought it would be helpful if I can check I am on the latest version without initiating a large download. If it turns out I need to sync, I'd probably like to tee it up in a separate session so that it doesn't block my main work environment. It might be this is already possible and I did not see the function.
Hi @maelle - think we've done most/all of the major code changes (see postrev branch), now a matter of revising all documentation & vignettes. Unfortunately this isn't high priority work right now, so it may not be fast turnaround ...
@maelle actually probably not far off now, we have given it a good thrashing in the last few days. In the next week or so, with a bit of luck?
@raymondben @mdsumner just checking up on this submission, do you think you'll soon be able to give an update to reviewers? Thank you! 😺
Response to reviews

@MilesMcBain @lwasser - thanks again for your reviews. Responses to each of your points appear below. Please note that the revised version of bowerbird is in its postrev branch. It will be merged into master once the revisions are complete and dependent packages have been updated to match. Other notes: there is still not a lot of documentation to help power users who want to write their own handler functions (there is a brief section in the vignette). However, since this is likely to be a vanishingly small number of users we don't feel it's a showstopper right now. There is an issue for this though so that it doesn't drop off the radar (ropensci/bowerbird#15).

Leah - Review Comments
Various changes have been made to address this, notably the "Users: level of usage and expected knowledge" section has been added to the README and vignette, along with an expanded set of examples and guidance on writing data source definitions.
Yup. All function documentation has been revised.
Done.
Ooops, should have been https, sorry. Whatever browser you were using should have redirected, but obviously it didn't. (Changed)
Style is to some extent personal preference (indeed, the tidyverse style guide itself says that "All style guides are fundamentally opinionated ... Some decisions genuinely do make code easier to use ... but many decisions are arbitrary."). Unless it's mandatory for ropensci (we don't believe that it is?) we would prefer not to follow it.
All the wget-related stuff has been revised with more platform-specific instructions, and clear notes on
We do that check in
That's unfortunately the nature of the beast. For what it's worth, we don't particularly enjoy the complexity either. But in order to be a versatile, flexible tool that can cope with arbitrarily-structured repositories from different data providers, there are a lot of arguments to cover different situations.
The vignette now has four examples of writing a data source, stepping through various levels of detail and demonstrating all three of the packaged handler functions.
Not that we are aware of.
As far as we are aware there is no way to have separate sections of arguments with roxygen. However, each mandatory function parameter here is clearly labelled as "required".
The use of
Documentation has had a rewrite, as mentioned above. We feel that detailed tutorials on using data provider websites are beyond the scope of package documentation. Nevertheless, the vignette now includes a fuller example of writing an Earthdata data source. For the record, that URL you give above is your data source URL: it points directly to that data set.
Yes,
Bowerbirds (real ones) collect shiny things to decorate their bowers with. Bowerbird (the package) lets you collect nice shiny data sets and add them to your collection.
Bowerbird just downloads whatever the data provider makes available, following whatever the data source configuration parameters are. As explained in the vignette, bowerbird doesn't know in detail about the data structures or anything else. If they change, the new version will be downloaded.
Each data source has a
A
Yes, as a general statement it's probably reasonable to encourage users to avoid individually downloading large data sets (especially multiple individuals within the same institution downloading the same large data set). But at some point someone has to store data locally. For example, we routinely run analyses that require the entire satellite data record of some parameter (e.g. sea ice or SST). This can't be done by issuing remote data requests, it would be cripplingly slow. Yes, we can run these analyses "in the cloud", and we have virtual servers for exactly this purpose ... but then where does that virtual server get its data from? That data storage ideally needs to be on the same local network as the compute facility so that access is fast. Someone, somewhere, has to maintain that collection of data files. This is one use case for bowerbird. And having one data librarian maintain a collection of data sets on behalf of a number of local users will reduce the need for individuals to download their own copies.
The bugreports URL is in the DESCRIPTION file. Maintainer is also there (but it's hidden! The "cre" (creator) role in Authors@R gets interpreted as maintainer). CONTRIBUTING.md and code of conduct files have been added.
No, it's not. (You can see that in the submission details in the onboarding issue - I don't know if as a reviewer you get any special notification of that though).
There is no NEWS file yet: to date, the package has been in an active development stage with everything liable to change, so it doesn't make sense to document all this via NEWS. We will add a NEWS file once the onboarding process is complete and things seem stable enough to warrant it.

Miles - Review Comments
It now has a subheading.
See response to Leah's review above - the README/vignette now has four examples of increasing complexity, and demonstrating all three of the packaged handler functions.
General improvements all round have been made.
Function documentation now explains its role a little more clearly.
This function is now not exported - it is useful internally but probably not for the user.
For these functions, as well as the handler functions (bb_handler_wget, bb_handler_oceandata, bb_handler_earthdata) there are several arguments that are passed automatically when bb_sync is run, and which don't need to be specified by the user. Previously these were exposed as regular function arguments, which was indeed confusing. These have now been masked from regular users by use of the dots argument. Note that developers who wish to write their own handler functions will need to know about these arguments: there is a section of the vignette that will explain this. But regular users, who will almost certainly be the overwhelming majority of the user base, will find the new structure much less confusing. The documentation for the handler and other functions that are called by bb_sync now also makes it clear that these functions are not intended to be called directly by the user, but specified as part of a data source definition.
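The masking pattern described here is a standard R idiom: arguments supplied by the calling machinery travel through the dots, so they no longer clutter the user-facing signature. A generic sketch (the argument names are hypothetical, not bowerbird's actual handler internals):

```r
## The visible signature shows only what a regular user should set;
## anything the sync engine supplies arrives via `...`
example_handler <- function(..., verbose = FALSE) {
    internal <- list(...)  # e.g. a config object, passed automatically
    if (verbose) message("received ", length(internal), " internal argument(s)")
    invisible(internal)
}

## a regular user never supplies the internal arguments; the sync
## function would make a call along these lines:
res <- example_handler(config = list(), verbose = TRUE)
```

A developer writing a new handler only needs to know which names to expect in `list(...)`, which is what the vignette section mentioned above documents.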
Data provenance, and frameworks to help manage it, are evolving areas. The
Is essentially an internally-used argument, which (like the config and other automatically-passed arguments to handler functions) has now been hidden from the user.
Function documentation has in general been improved, and the vignette examples give more detail.
See above: now hidden from the user.
dplyr has been ditched!
Now removed: the parts of data source definitions that need to specify functions (and possibly arguments) now do so via a
The
We like your optimism! wget still gets us scratching our heads on occasion. But for all its faults and complexity, it is a powerful tool.
There are a number of wget flags that can clash, and other than trying them all in combination we are not aware of any way of knowing how those clashes are resolved. That's really an issue with wget itself.
Added.
This is a little more complicated than it may seem. First, the majority of these
There are, however, some instances where we use
For functions outside of the
We've gone through all usages of
We've added a few more error path tests, and will continue to add appropriate tests to the suite as development progresses.
Actually there is - the "cre" (creator) role in Authors@R gets interpreted as maintainer.
Thanks! @lwasser @MilesMcBain could you please have a look at the answer? Reg. the
hi @maelle i think (without actually going through another review) that their feedback looks very thorough and patient considering all of the review comments :) i also didn't know about
Thanks a lot @lwasser! Yes @raymondben I forgot to tell you to add the rOpenSci review badge to the README! 🙈
Hi All, just making a note that I'm picking this up again now. I should be able to respond fully within the next week. 👍
Thanks @MilesMcBain!
Thanks @raymondben and @mdsumner! An excellent job on the documentation and vignettes. The main vignette is now a stand out - one of the best I have seen. It gives this package a great chance of getting some use about the place. 👏 😄 I am also happy with the way you addressed my other comments. Thank you for the explanation on the use of
I've updated my review block above with the final ✔️
One minor comment: When reviewing some of the changes in the code I noticed a lot of old code sitting around in commented-out blocks. This was true in maybe two thirds of
Otherwise, congratulations on polishing up this package and it has been my pleasure to review it.
Thanks indeed @MilesMcBain. Re: commented-out code, yes, I'll take the blame for that, I do rather have a tendency to leave it littered around. I'll have a purge ...
Thanks a lot @MilesMcBain! @raymondben please update this thread when you've done that, so that I might take a last look before approval. Reg. purging comments you could count the number of lines you've suppressed by using
@maelle, I've already cleaned them out. (master branch)
Awesome, I'll have a look later today/this week!
I have started looking, really great docs as Miles says! I was wondering whether it'd be good to split the README into a more minimal README and a few vignettes linked from the README? E.g. "Defining data sources" could be a vignette. Or add a table of contents at the top of the README? (but the website might still be easier to browse?)
Approved! 👏 Thanks a lot @raymondben @mdsumner @lwasser @MilesMcBain for your work! A very productive review process IMO! I have three suggestions:
Now here is the list of things you have to do before I close this issue 😉
Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.
@raymondben @mdsumner We'd love to publish a blog post about bowerbird. Was just looking back at a discussion of "the journey" from code for my own use, to code that I want others to find useful, where @noamross suggested a blog post using
Either way, this is optional and only if you have the capacity and interest to do this. Discussion reminded me that I need to think about how we can help authors and reviewers with this process. Let me know what you think. No rush.
👋 @raymondben @mdsumner could you please soon do the different items of the checklist above including transferring your repo? Thanks!
@maelle - done. Sorry, was waiting for a revised footer to be finalized, but looks like that may take a while so I've gone ahead and transferred now.
Cool, thanks! Could you also add the review badge mentioned in the checklist? I've activated the repo in Appveyor.
Note that the badges do not render now, but this issue will soon be fixed, so please add it to your README before I close this issue. :-)
The review badge is in the README, just not rendering. It was showing under our org, but not now.
Aaah thanks and sorry! Perfect! I'm closing the issue but this doesn't prevent the discussion of blog posts to continue. 😸
Summary
A package for maintaining a local collection of data sets from a range of data providers. Bowerbird can mirror an entire remote collection of files, using wget's recursive download functionality. Bowerbird also provides some functions around data provenance and versioning (it doesn't fundamentally solve these issues, but goes some way towards solutions).

URL for the package (the development repository, not a stylized html page):
https://github.com/AustralianAntarcticDivision/bowerbird
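For context, the kind of recursive, timestamp-aware wget invocation that bowerbird builds on looks roughly like this (an illustrative command with placeholder paths and URL, not one generated by the package):

```shell
# Mirror a remote directory tree; --timestamping re-fetches only files
# whose remote timestamps are newer than the local copies, which is
# what makes repeated syncs cheap after the initial download
wget --recursive --no-parent --timestamping \
     --directory-prefix=/path/to/local/data/library \
     https://data.example.org/collection/
```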
Please indicate which category or categories from our package fit policies this package falls under *and why*? (e.g., data retrieval, reproducibility. If you are unsure, we suggest you make a pre-submission inquiry.):
Data retrieval, but also venturing into reproducibility. Primarily it is intended as a mechanism for maintaining a local data collection (from remote data providers), but could also be used as a wrapper to allow others to reproduce your work (e.g. "you'll need these 100GB of files installed locally; here's the bowerbird script to do so"). Bowerbird also has a few functions to help with data provenance, see vignette("data_provenance").
Who is the target audience?
Research scientists/technicians/data managers who want to maintain a local library of data files (either for their own use, or perhaps a single shared library on behalf of a number of local users, as we do). Researchers who want to share work that relies on local copies of data. Also potentially package developers who need some sort of data retrieval that isn't easily accomplished by existing tools (e.g. recursive download of a whole collection of data files from a satellite data provider).
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
Nothing to our knowledge that really does the same thing. Some similarity to https://github.com/ropensci/rdataretriever, though rdataretriever seems to be angled towards biodiversity data sets in particular and creating sensible local database structures for them. Bowerbird is focused on mirroring remote data to a local file system, and providing some functions around data provenance. Passing overlap with http packages (httr, crul) but these are generally intended for single-transaction sort of usage. Jeroen's curl package (not an ropensci one?) is also similar to bowerbird in that it wraps an underlying http client: bowerbird typically uses wget under the hood to accomplish its web traffic, whereas curl binds to libcurl. AFAIK curl doesn't support mirroring of external sites (which wget
does, and which bowerbird relies heavily on).

Requirements
Confirm each of the following by checking the box. This package:
Publication options

paper.md with a high-level description in the package root or in inst/.

Detail
Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings: [more or less]

Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:
cat() for printing progress, despite the packaging guide suggestions to use message instead. This is because (a) progress information doesn't really strike me as a "condition", which message is intended for, (b) all cat-issued messages can be turned off by specifying bb_sync(..., verbose = FALSE), (c) an anticipated common use for bowerbird is for unattended (cron-job) updates to a local data library, in which case it's likely the user will want to sink() all output to a log file. Using cat() means that a simple sink() will catch everything, including output from wget calls (if they are made). I think this becomes less reliable if message is used (you'd have to sink(..., type = "message") but then I'm not sure it'd catch wget output, but admittedly haven't tried this).

If this is a resubmission following rejection, please explain the change in circumstances:
If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:
Additional notes re: our presubmission enquiry
General note: the package is not in a final polished state, but we think far enough advanced (and stable enough) to be a good point for onboarding consideration.
Fair suggestion, and one that we've considered - and maybe that split is still reasonable to consider down the track. But for now, at least, we think it's better to keep core-functionality and the data-source-definitions bundled together.
It's up to the user where they want to put their data. We do make a suggestion in the README and vignette for users to consider rappdirs.