Adding `assertr` to ROpenSci #23

tonyfischetti · 2015-12-24T19:17:53Z

1. What does this package do? (explain in 50 words or less)
  The assertr package supplies a suite of functions designed to verify assumptions about data early in an analysis pipeline to protect against common data errors and instances of bad data.
1. Paste the full DESCRIPTION file inside a code block (bounded by ``` on either end).

Package: assertr
Type: Package
Title: Assertive Programming for R Analysis Pipelines
Version: 1.0.0
Authors@R: person("Tony", "Fischetti", email="tony.fischetti@gmail.com",
  role = c("aut", "cre"))
Maintainer: Tony Fischetti <tony.fischetti@gmail.com>
Description: Provides functionality to assert conditions
    that have to be met so that errors in data used in
    analysis pipelines can fail quickly. Similar to
    'stopifnot()' but more powerful, friendly, and easier
    for use in pipelines.
URL: https://github.com/tonyfischetti/assertr
BugReports: https://github.com/tonyfischetti/assertr/issues
License: MIT + file LICENSE
LazyData: TRUE
Imports:
    dplyr,
    MASS,
    lazyeval
Suggests:
    knitr,
    testthat,
    magrittr
VignetteBuilder: knitr

richfitz · 2015-12-24T19:40:33Z

I have a use case for this and have already looked through the code, so am happy to review if that is useful (at the same time I can't review until the 4th January at the earliest as I will be travelling over the break).

sckott · 2015-12-24T19:42:50Z

Reviewers: @richfitz @jennybc

jennybc · 2015-12-24T19:48:48Z

I've been meaning to do one, so I can be the second.

sckott · 2015-12-24T19:51:59Z

thanks jenny, assigned

karthik · 2015-12-24T19:53:12Z

@tonyfischetti Excellent! Thanks for submitting! Looking forward to the reviews and adding this to the suite. 😃

sckott · 2016-01-22T20:30:25Z

@richfitz @jennybc - hey there, it's been 29.0 days, please get your review in soon, thanks 😺

richfitz · 2016-01-22T21:02:12Z

Sorry - I have been meaning to (and also #25). Next week for both I hope.

sckott · 2016-01-22T21:03:23Z

cool - (p.s. that comment was your friendly heroku robot https://github.com/ropenscilabs/heythere)

richfitz · 2016-01-22T21:06:03Z

The descent towards manuscript central begins... 😁

sckott · 2016-01-22T21:07:50Z

unless you're volunteering to remind everyone manually

richfitz · 2016-01-22T21:10:05Z

Definitely not! I think it's great. I actually thought it was you, which you could never say for MS central.

jennybc · 2016-01-22T21:17:20Z

Maybe we should send hand-written notes?

jennybc · 2016-01-22T21:17:39Z

And yes, duly noted, that I need to bust a move on this.

sckott · 2016-01-22T21:22:34Z

> Definitely not! I think it's great. I actually thought it was you, which you could never say for MS central.

heythere != MS central

> Maybe we should send hand-written notes?

Yes!

richfitz · 2016-01-25T09:45:41Z

General comments

The assertr package provides a generalised framework for defensive programming around data.frames. This sits somewhere between stopifnot and testthat in terms of flexibility and complexity and as such forms a useful building block for data analysis workflow, which I believe is under-tooled at the moment. I really like the idea of having packages that are primarily focussed on data workflows rather than restricting people to ideas that were developed for software engineering (such as formal unit tests).

The package is very tight -- it exports the minimum set of functionality and conforms to the "do one thing and do it well" school of thought. The functions are well documented, the vignette is readable and less dry than most. I appreciate the split into NSE and SE versions of all core functions.

Accordingly, most of my comments focus on design decisions and therefore may all be out of line because the author will have thought about this more than I have.

The main entrypoints are difficult to differentiate

My biggest concern is that I found the three main entry points very difficult to keep straight. And when I put the package down over Christmas I had to re-remember them again.

assert(data, predicate, ...)
verify(data, expr, ...)
insist(data, predicate_generator, ...)

The difference is primarily in the properties of the second argument and I wonder if there's a way of specifying that some other way than three functions that have such similar names? The current approach is extremely elegant but at the cost of being a bit too opaque to the user -- especially because the three function names are essentially synonyms of each other there is nothing to jog your memory. I presume the problem is it is difficult to detect the difference between the three argument types before they are evaluated (and the correct evaluation depends on the type).

The custom handler routine is inflexible

The custom handler is a really great addition, but could be improved. testthat has a similar handler approach that allows storage of a bunch of repeated assertions (pass or fail). My use-case for this package is to replicate something like this, so I'd want to pass the same handler in to all the functions in a pipeline:

mtcars %>%
  verify(nrow(mtcars) > 10, error_fun=my_error_fun) %>%
  verify(mpg > 0, error_fun=my_error_fun) %>%
  insist(within_n_sds(4), mpg, error_fun=my_error_fun) %>%
  assert(in_set(0,1), am, vs, error_fun=my_error_fun) %>%
  group_by(cyl) %>%
  summarise(avg.mpg=mean(mpg))

What would be heaps nicer is if I could register a handler; change the assert function to something like:

assert <- function(data, predicate, ..., error_fun=getOption("assertr.handler", assertr_stop)) {
}

(or equivalently use the package-environment trick like testthat does). This would allow:

options(assertr.handler=my_error_fun)
mtcars %>%
  verify(nrow(mtcars) > 10) %>%
  verify(mpg > 0) %>%
  insist(within_n_sds(4), mpg) %>%
  assert(in_set(0,1), am, vs) %>%
  group_by(cyl) %>%
  summarise(avg.mpg=mean(mpg))

(As a related comment, the usage definitions reference the unexported function assertr_stop which some may find confusing. Additionally, is there a reason why verify uses error_fun=stop not assertr_stop?)
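A minimal base-R sketch of the registered-handler idea, not assertr's actual code. The names `check_positive`, `default_handler`, and the option `demo.handler` are hypothetical stand-ins; the only point is that a default argument of the form `getOption(...)` is looked up at call time, so a handler registered once applies to every later call.

```r
# Hypothetical default handler: stop with the supplied message.
default_handler <- function(message) stop(message, call. = FALSE)

# Hypothetical check. The default for error_fun is evaluated lazily,
# so the option is read each time the function is called.
check_positive <- function(data, column,
                           error_fun = getOption("demo.handler", default_handler)) {
  if (any(data[[column]] <= 0)) {
    error_fun(sprintf("column '%s' has non-positive values", column))
  }
  invisible(data)
}

# Register a collecting handler once instead of passing it to every call.
failures <- character(0)
old <- options(demo.handler = function(message) {
  failures <<- c(failures, message)
})

df <- data.frame(a = c(1, -2), b = c(3, 4))
check_positive(df, "a")  # failure is recorded, execution continues
check_positive(df, "b")  # passes, nothing recorded
options(old)             # restore the previous option state
```

After `options(old)` the option is unset again, so the same call falls back to `default_handler` and stops on the first failure.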

Minor comments

The dplyr dependency, which is used solely for dplyr::select_, seems a potentially heavy dependency for one function; if it is straightforward to swap out for an independent implementation, that would decrease the package footprint (I can imagine using this in contexts where I do not have dplyr installed, such as container-based workflows). But I can totally see the advantage of sticking with something that is known to work.
Think about the non-pipe people. All the examples and the vignette make heavy use of the %>% operator (which is fine), but as someone who uses this little or who imagines using this mostly in packages where I'd be avoiding so much weird evaluation, I would appreciate a few more pipe-less examples. This is potentially confusing in places like:

our.data %>% assert(within_bounds(0,Inf), mpg) # and so on

when ?assert says that usage is

assert(data, predicate, ..., error_fun = assertr_stop)

This confused me because in usage it looks like predicate, data, but of course the data argument comes from the pipe and mpg is the column name that is passed through to the ... argument. While the use with pipes is very elegant, I think the package has use outside of that scope too. The examples within the package are actually really good like this.

Classed exceptions would make the error handling more flexible. Related to the second major point above; it would be nice to distinguish between errors that were because the input to assertr was incorrect, and errors that are raised because the data failed the assertions. R's classed errors provide a nice framework for this. Then tryCatch and withCallingHandlers can dispatch appropriately based on the sort of error.
A reporting framework would be fantastic. If you don't write this, I will -- but given you have written this package I figure you should get right of refusal. Related to the point above, I would like to use the underlying bits you have here in a package for automated testing of upstream data sources that tend to be misbehaved. I can imagine a testthat-like reporting framework where a bunch of tests are run and the failures reported.
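To illustrate the classed-exceptions comment, here is a small base-R sketch, assuming a condition class called `data_failure` and a check called `check_no_na` (both hypothetical, neither is part of assertr). It shows how tryCatch can tell "the data failed the assertion" apart from "the check was called incorrectly".

```r
# Hypothetical constructor for a classed error condition.
data_failure <- function(message) {
  structure(
    class = c("data_failure", "error", "condition"),
    list(message = message, call = NULL)
  )
}

# Hypothetical check: misuse raises a plain error, bad data a classed one.
check_no_na <- function(data, column) {
  if (!column %in% names(data)) {
    stop("no such column: ", column)  # usage error
  }
  if (anyNA(data[[column]])) {
    stop(data_failure(paste0("NA found in '", column, "'")))
  }
  invisible(data)
}

# Dispatch on the condition class; the more specific handler is listed first.
classify <- function(expr) {
  tryCatch(
    { expr; "ok" },
    data_failure = function(e) "data problem",
    error = function(e) "usage problem"
  )
}

df <- data.frame(x = c(1, NA), y = c(1, 2))
r_data  <- classify(check_no_na(df, "x"))  # data failed the check
r_usage <- classify(check_no_na(df, "z"))  # column does not exist
r_ok    <- classify(check_no_na(df, "y"))  # check passes
```

A reporting layer could use the same distinction: collect `data_failure` conditions into a report and let everything else propagate as a genuine bug.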

tonyfischetti · 2016-02-20T20:15:06Z

Thanks for the kind words and really great feedback @richfitz.

The main entrypoints are difficult to differentiate

As pointed out, this is somewhat a consequence of my making the API as elegant as possible but at the expense of some opaque-ness. The problem is (again, as you pointed out) there's no elegant way that I can see to programmatically detect the argument types. So we have verbs that are literally synonyms (I chose their names using a thesaurus). Unfortunately, I can't think of a better naming mechanism without really long function names like take_a_predicate_generator_and_apply_to_each_column(). I'm open to suggestions, but it may not be the end of the world since there are only three main verbs and the docs are good.

The custom handler routine is inflexible

I like the idea of using a testthat-style options mechanism. I'll implement this.

> Additionally, is there a reason why verify uses error_fun=stop not assertr_stop?

Nope, that's an error. Good catch!

The dplyr dependency

That was a difficult choice. It's a hell of a heavy dependency, but I was wary of implementing dplyr::select myself. Especially since I would have to reference dplyr's implementation so heavily that it would likely constitute code theft. I'd love to drop the dependency though, if anyone thinks they can implement it without copying code.

A reporting framework would be fantastic

I need this feature, and I have a few great ideas on how to implement it. I'd like to talk to you more (@richfitz) about your particular use case, in case my solution only suits my own. I think this feature can potentially be the most powerful and useful capability of assertr.

tonyfischetti · 2016-03-12T22:39:37Z

So I'm going to run into a little more free time in the near future, and I'd like to come back from my learning hiatus and get into improving assertr again, particularly because people are telling me it's really useful for them. Because of the learning hiatus, I have some fresh new ideas for improvement, but I'd like to run them past some of you for further input...

The main entrypoints are difficult to differentiate

As mentioned before, there are assert, insist, insist_rows, and assert_rows for representing a wide range of tasks. However, if I wanted to add the ability to, say, declare that the whole data set should have no more than 15% missing values (and I do want to add that), the semantics of that would require another specialized function... and I'm running out of synonyms for "assert"!

Principled though that solution was, I'm not sure it's the correct thing to do going forward; it's always been one of R's strengths from a user's point of view to use a familiar generic function (mean, plot, etc...) with all sorts of input and have the object system dynamically dispatch the correct functionality.

So how about this... creating an ~~S3~~ S4 generic function (proclaim perhaps?) that will handle all of the different semantics for the user. Concretely, the function returned from within_n_mads can be labeled with class assertr_dynamic (because the predicate is dynamically generated). Then, proclaim would dispatch on the second argument (the first arg is the data frame) and call what is currently referred to as insist. In the same way, maha_dist would be classed something like assertr_dynamic_rows and a user calling proclaim(df, maha_dist, within_n_mads(10), ...) would transparently dispatch insist_rows(maha_dist, within_n_mads(10), ...).

This would improve assertr's extensibility greatly; for example, adding semantics to check the supplied data.frame as a whole would only require writing another ~~S3~~ S4 method of the proclaim generic and making sure that the predicate function was correctly classed.

This solution is perhaps a little unconventional, but it completely obviates the requirement that a useR remember all the verbs.
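A rough base-R sketch of the dispatch-on-the-second-argument idea. It uses S3 rather than S4 for brevity, and `proclaim`, its `cols` argument, and the `within_sds` stand-in are illustrative assumptions, not assertr functions. A plain predicate is applied directly; a function classed `assertr_dynamic` is first called on the column to generate the predicate.

```r
# Hypothetical generic: dispatch on `predicate`, not on `data`.
proclaim <- function(data, predicate, ...) UseMethod("proclaim", predicate)

# Plain predicate (implicit class "function"): apply it to each column.
proclaim.function <- function(data, predicate, cols, ...) {
  for (col in cols) {
    if (!all(predicate(data[[col]]))) stop("violation in column ", col)
  }
  invisible(data)
}

# Predicate generator: build the predicate from the column, then apply it.
proclaim.assertr_dynamic <- function(data, predicate, cols, ...) {
  for (col in cols) {
    check <- predicate(data[[col]])
    if (!all(check(data[[col]]))) stop("violation in column ", col)
  }
  invisible(data)
}

# Stand-in for a generator such as within_n_sds(), returned with a class.
within_sds <- function(n) {
  gen <- function(x) {
    m <- mean(x)
    s <- sd(x)
    function(v) abs(v - m) <= n * s
  }
  class(gen) <- c("assertr_dynamic", "function")
  gen
}

df <- data.frame(mpg = c(21, 22.8, 18.7, 30.4))
proclaim(df, function(x) x > 0, cols = "mpg")  # dispatches to proclaim.function
proclaim(df, within_sds(4), cols = "mpg")      # dispatches to proclaim.assertr_dynamic
```

The caller uses one verb in both cases; which semantics apply is decided entirely by the class attached to the second argument.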

The custom handler routine is inflexible

The most common suggestion for assertr is to be able to warn (not error) on violation. Additionally, even if it does error on violation, there should be semantics in place for the entire chain of assertions to run so that the final error message will contain the complete report of the data errors.

The reason I needed to take my time with this one is that I need the warnings to be concatenated through an assertr chain in a principled manner. To do this without using dynamic variables (eww), the assertr verbs need to return the warnings along with the given data frame, so that they can be concatenated with the warnings further down the assertion pipeline. Up until recently, I thought that I would have to implement something that would be tantamount to a Haskell monad in order to do this. Another possibility is to use S4 (hence the S3 strikeouts in the paragraphs above) in order to get proclaim (or whatever it is) to dispatch on both the second argument and the first argument. If the first argument is a data.frame, the current semantics stand... if the first argument is something else (some class that holds a data.frame and a running list of errors), then the wanted behavior can be dispatched.

The big complication is what happens at the end of the chain. There needs to be something that tells assertr that the chain is ending so that it can take the data.frame out of the composite data.frame/error_log object and finally display the error or warning. Any ideas?
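One way to picture the composite data.frame/error_log idea is the base-R sketch below. It is only an illustration under simplifying assumptions: the log is carried as an attribute named `error_log` rather than in an S4 class, the verbs `check_that` and `end_chain` are hypothetical, and the end of the chain is marked explicitly by the caller rather than detected.

```r
# Hypothetical verb: record a failure on the data instead of stopping.
check_that <- function(data, ok, label) {
  if (!isTRUE(ok)) {
    attr(data, "error_log") <- c(attr(data, "error_log"), label)
  }
  data
}

# Hypothetical terminator: strip the log and report everything at once.
end_chain <- function(data) {
  log <- attr(data, "error_log")
  attr(data, "error_log") <- NULL
  if (length(log) > 0) {
    stop("assertions failed:\n", paste0(" - ", log, collapse = "\n"),
         call. = FALSE)
  }
  data
}

df <- data.frame(x = c(1, -2, 3), y = c(0, 1, 5))
step1 <- check_that(df, all(df$x > 0), "x must be positive")
step2 <- check_that(step1, nrow(step1) > 2, "need more than 2 rows")
step3 <- check_that(step2, all(step2$y %in% c(0, 1)), "y must be 0 or 1")

# Both failures are reported together when the chain ends.
result <- tryCatch(end_chain(step3), error = function(e) conditionMessage(e))
```

An attribute is the lightest possible carrier, but it may not survive intermediate transformations of the data frame, which is one argument for a dedicated class that wraps the data and the log together.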

I'd appreciate any feedback on these ideas for two reasons (a) it's now (or will hopefully be soon) ROpenSci's project not just mine, and (b) I'd like to get the input of some talented developers :)

tonyfischetti · 2016-03-12T23:09:21Z

To review, none of the proposed additions have to break backwards compatibility :) It would just make everything much easier for the user.... the example in the README would go from this:

  mtcars %>%
      verify(nrow(.) > 10) %>%
      verify(mpg > 0) %>%
      insist(within_n_sds(4), mpg) %>%
      assert(in_set(0,1), am, vs) %>%
      assert_rows(num_row_NAs, within_bounds(0,2), everything()) %>%
      insist_rows(maha_dist, within_n_mads(10), everything()) %>%
      group_by(cyl) %>%
      summarise(avg.mpg=mean(mpg))

to this

  mtcars %>%
      verify(nrow(.) > 10) %>%
      verify(mpg > 0) %>%
      assert(within_n_sds(4), mpg) %>%
      assert(in_set(0,1), am, vs) %>%
      assert(num_row_NAs, within_bounds(0,2), everything()) %>%
      assert(maha_dist, within_n_mads(10), everything()) %>%
      group_by(cyl) %>%
      summarise(avg.mpg=mean(mpg))

(verify wouldn't be able to be replicated under an assert S4 generic)

jennybc · 2016-03-12T23:12:12Z

Since I'm the one who is yet to review ... @tonyfischetti do you have recommendations of a good dataset to run through assertr? I.e. one that you think should show it off but ... there's enough uncertainty that it would be interesting to see how things go? Also to see how a new user manages with it. I have one idea that I'll fall back on if nothing immediately comes to mind.

tonyfischetti · 2016-03-12T23:17:00Z

@jennybc Nothing immediately comes to mind but I'm sure I can dig up one of the examples that inspired me to get into this package in the first place :)

jennybc · 2016-03-12T23:41:55Z

@aammd politely reminded me I have some really ugly data asserting/cleaning code in the private STAT 545 instructors repo, so that's my plan B 😬.

aammd · 2016-03-13T00:20:03Z

@jennybc wellll i would not say ugly, but rather "pre-assertr". it represents "how we did this before assertr" and highlights improvements in the UI of this package (improvements I tried to show in my lesson about assertr)

sckott · 2016-03-14T23:19:34Z

@tonyfischetti some thoughts on:

> The big complication is what happens at the end of the chain. There needs to be something that tells assertr that the chain is ending so that it can take the data.frame out of the composite data.frame/error_log object and finally display the error or warning. Any ideas?

with help of @smbache - we have a way to detect whether a piped command is the last one or not. If it is the last, do X (e.g., execute some other fxn, print data, etc.) instead of passing to the next command. You can see the helper fxns here https://github.com/ropensci/jqr/blob/master/R/pipe_helpers.R and usage here https://github.com/ropensci/jqr/blob/master/R/index.R#L42

sckott · 2016-03-22T00:30:22Z

@jennybc - hey there, it's been 89 days, please get your review in soon, thanks 😺

jennybc · 2016-03-22T04:22:56Z

OK I promise you I will not be able to face you at unconf w/o this being totally done.

sckott · 2016-05-23T14:43:38Z

@jennybc - hey there, it's been 151 days, please get your review in soon, thanks 😺 (ropensci-bot)

smbache · 2016-05-23T15:14:59Z

It must be the temptation of ensurer that is causing delay 😆 Hehe

jennybc · 2016-05-23T15:33:08Z

I am, and have been, halfway done for ages. I have a PR ready for the vignette. It's the incredibly insightful overall comments that need to be written. 😳 Will do.

tonyfischetti · 2016-05-25T14:13:57Z

@smbache I didn't know ensurer was being considered

smbache · 2016-05-25T14:14:36Z

It's not. I was joking ;)

sckott · 2016-05-31T00:31:05Z

@jennybc - hey there, it's been 159 days, please get your review in soon, thanks 😺 (ropensci-bot)

sckott · 2016-06-08T16:50:43Z

@tonyfischetti approved!

Add the footer to your README:

[![ropensci\_footer](http://ropensci.org/public_images/github_footer.png)](http://ropensci.org)

Update installation of dev versions to ropenscilabs/assertr and any urls for the github repo to ropenscilabs instead of tonyfischetti
Update any links to the package from tonyfischetti/assertr to ropenscilabs/assertr (though even if they aren't github will redirect to the new location :) )
Go to the Repo Settings --> Transfer Ownership and transfer to ropenscilabs - Note that all our newer pkgs go to ropenscilabs first, then when more mature we'll move to ropensci

tonyfischetti · 2016-06-14T21:14:47Z

I tried to do the last thing and it says I don't have admin rights to ropenscilabs :(

sckott · 2016-06-14T21:17:03Z

you should have received an invitation from ropenscilabs, did you get that email?

tonyfischetti · 2016-06-14T21:21:07Z

Idiotic move on my part not checking the mail :) It's done

sckott · 2016-06-14T21:30:07Z

sweet!

sckott added package 2/seeking-reviewer(s) 1/editor-checks labels Dec 24, 2015

karthik removed the 2/seeking-reviewer(s) label Dec 24, 2015

Robinlovelace mentioned this issue Jan 25, 2016

Use assertr to ensure correct data input npct/pct-scripts#27

Closed

sckott mentioned this issue Mar 22, 2016

Be smart about detecting if a review has been submitted ropensci-archive/heythere#4

Open

sckott added the 6/approved label Jun 8, 2016

sckott closed this as completed Jun 14, 2016

noamross added topic:reproducibility and removed 1/editor-checks labels Jan 17, 2017

noamross assigned sckott Jan 17, 2017

iqis mentioned this issue May 2, 2019

Submission to rOpenSci? wendtke/psyphr#15

Closed
