GForce shutoff for non-standard classes #3546
Two old tests were altered/eliminated here; see comments on each
DT = data.table(id=1:2, val1=6:1, val2=6:1) # 5380
test(1199, DT[, sum(.SD), by=id], error="GForce sum can only be applied to columns, not .SD or similar.*looking for.*lapply\\(.SD")
#DT = data.table(id=1:2, val1=6:1, val2=6:1) # 5380
#test(1199, DT[, sum(.SD), by=id], error="GForce sum can only be applied to columns, not .SD or similar.*looking for.*lapply\\(.SD")
I'm not sure this ever should have been an error... Behavior in this PR is to degrade from trying to GForce sum(.SD) and erroring to just never trying to GForce in the first place.
The error is nice in that it provides feedback that the user may be doing it wrong...
Potentially we could add some more checks for this specific case and continue to pass this test?
Codecov Report
@@            Coverage Diff            @@
##           master    #3546     +/-   ##
=========================================
+ Coverage   97.45%   97.45%   +<.01%
=========================================
  Files          66       66
  Lines       12801    12809       +8
=========================================
+ Hits        12475    12483       +8
  Misses        326      326
@@ -7039,6 +7039,11 @@ test(1521, x[, b := 5], data.table(a=c(1,2), b=5))

# Fix for #1160, fastmean retaining attributes
x = data.table(a = c(2,2,1,1,2), b=setattr(1:5, 'class', c('bla', 'integer')))
This test is building explicitly wrong (IMO) behavior by expecting fastmean to run on bla-class objects.
Essentially it's forcing mean.default to be run on all classes & then returning the object with the old class... As we've seen in #3533 etc, this implicit bypassing of the method dispatch can be ill-advised.
IIUC that's why mean.default will coerce to numeric instead of retaining c('bla', 'integer') -- especially the integer part feels explicitly wrong.
x = data.table(a = c(2,2,1,1,2), b=1:5)
setattr(x$b, 'class', c('bla', 'integer'))
class(mean(x$b))
# [1] "numeric"
The original issue of #1160 would have been fixed by this PR as well (just not using fastmean), so leaving the test in and adjusting by providing a mean.bla method. @arunsrinivasan made the original PR, any thoughts?
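To make the dispatch point concrete, here is a minimal base-R sketch. The mean.bla method below is hypothetical and written only for illustration; it is not the method added to the test suite, and it assumes that a 'bla' author would want the class kept on the result.

```r
# Hypothetical S3 method for the 'bla' class used in the test above.
mean.bla <- function(x, ...) {
  # Drop the class, take the ordinary mean, then put the class back on the result.
  structure(mean(unclass(x), ...), class = "bla")
}

b <- structure(1:5, class = c("bla", "integer"))

# mean() is generic, so it dispatches to mean.bla rather than mean.default.
res <- mean(b)
class(res)
```

With the method in place, the class of the result is decided by the class author and not by whichever internal routine happens to compute the mean.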
only that there is one pending issue related to fast mean #2799
But that issue is very unusual in that mean(c(20,350)) is not 185 but 5 (in the
The cause of #3079 is we're missing a
I think I disagree pretty strongly here, this leads to bad user experience, and also there's no guarantee for
I believe the approach here is much more user-friendly -- expert users can Is there a compromise approach? Perhaps using this PR but also issuing a
This PR turns off gforce optimization for any object with a class, doesn't it? Just because of one case (
I wrote that the cause of #3079 is we're missing a copyMostAttrib in gmedian which is inconsistent with almost all the other gforce functions which already do that copy. On #1876 I don't quite follow that one yet, but I don't think it's proof (yet) that gforce needs turning off for any vector with a class. I mentioned both here: #3533 (comment). Assuming #1876 is different, it's one case (#3533, circular) not 3.
I think this is a bit unfair to write.
Yes, but we also don't really know that
Only a few of those are exported:
"the mean is not the mean" is not for us to decide. It's very heavy-handed of us to bypass method dispatch and say "your class wanted to do a mean this way, but trust us, the arithmetic mean is what you want". It's easy to imagine creating a class that would default to using geometric or harmonic means, for example, which are perfectly valid "mean"s, statistically speaking.
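As a rough illustration of that scenario, the sketch below defines a hypothetical class "geo" whose mean method is the geometric mean. The class name and method are assumptions made up for this example; the point is only the gap between the dispatched result and what a dispatch-bypassing routine would compute.

```r
# Hypothetical class whose author wants mean() to be the geometric mean.
# unclass() is needed so the inner mean() does not dispatch back to mean.geo.
mean.geo <- function(x, ...) exp(mean(log(unclass(x)), ...))

x <- structure(c(1, 10, 100), class = "geo")

dispatched <- mean(x)           # geometric mean, via mean.geo
bypassed   <- mean(unclass(x))  # arithmetic mean, what skipping dispatch computes

c(dispatched = dispatched, bypassed = bypassed)
```

Both numbers are valid means of the same data. They differ only in which definition the class author asked for.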
It was strongly worded & a bit flip, but the kernel holds true. Bypassing method dispatch for speed automatically is saying that accuracy is not paramount; speed is. We spent a lot of effort in #3056 trying to prevent users from potentially getting silently wrong results; does that diligence not apply here?
Agree with Michael that we should not automatically ignore method dispatch. It would be safer and still user-friendly if we exported
You realize this is a big change in ethos and will require a lot of people to change. Do I understand your suggestion correctly? How is it still user friendly to require (a lot of) users to start to use new
Because nobody has proposed a solution yet that doesn't have a greater cost. The title of this PR refers to "non-standard" classes, but really it shuts off gforce for any class, including lots of standard classes. In general, data.table does break some eggs to make an omelette. For example, fastmean was added probably 10 years ago, and gforce is maybe 5 years old now. Both have always done it this way and in all that time it's only recently that we got one report where it doesn't work: How about adding a blacklist of classes for which gforce should be turned off?
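One way to picture the blacklist idea is the base-R sketch below. Everything in it is hypothetical: the names gforce_blacklist and gforce_ok, and the class names in the list, are placeholders invented for illustration and do not correspond to anything in data.table.

```r
# Hypothetical blacklist; the class names are placeholders, not a proposal.
gforce_blacklist <- c("bla", "geo")

# Hypothetical helper: TRUE when the optimised path may be used for column x,
# i.e. when x does not inherit from any blacklisted class.
gforce_ok <- function(x, blacklist = gforce_blacklist) {
  !inherits(x, blacklist)
}

gforce_ok(1:5)                                           # plain integer vector
gforce_ok(Sys.Date())                                    # standard class, not listed
gforce_ok(structure(1:5, class = c("bla", "integer")))   # blacklisted class
```

Under this scheme standard classes such as Date keep the optimisation, and only classes known to misbehave are sent through regular method dispatch. The trade-off raised in the thread still applies: a class that is not on the list yet can still produce a silently different result.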
What I meant is that
I'm not seeing the cost that much. We also only have one test in the suite that relies on Getting Another alternative is to use
The worry is still that unaware users can end up with silently wrong results. Whereas with
I'm not following what you mean by "non-numeric class". This issue and PR isn't about character data per se, for example. is.vector() is used in this PR for its feature of considering a vector with a class attribute not to be a vector. Maybe you mean non-vector class rather than non-numeric class. The fact we're missing tests just shows that we thought one test was sufficient to make sure the class attribute was retained by gforce functions. But over time the gforce abilities have been extended and one (gmedian) was missed. So we need more tests.
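For readers unfamiliar with that is.vector() behaviour, a short base-R sketch follows. It only demonstrates the base function; how exactly the PR uses the check is taken from the comment above and not shown here.

```r
plain   <- 1:5
classed <- structure(1:5, class = c("bla", "integer"))
dates   <- as.Date("2019-05-01") + 0:2
named   <- c(a = 1, b = 2)

# is.vector() is TRUE only when the object has no attributes other than names.
is.vector(plain)    # TRUE
is.vector(classed)  # FALSE: has a class attribute
is.vector(dates)    # FALSE: Date is a standard class, but still a class
is.vector(named)    # TRUE: names are the one allowed attribute
```

The Date case is the one behind the objection that the check affects standard classes too, not only non-standard ones.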
revdep wouldn't show anything. revdep is for correctness (insofar as the revdep package tests cover it). But this cost would be a slow-down.
I'm not following what you mean here. That sounds to me like they would have to know and then do something new manually. Currently gforce is an automatic optimization. It's not about the compute speed of having to do
Again - I don't know what this means. What is an entry point for GForce ?
Not if
That's not exactly true. It's much better now that
we could edit this branch to error instead of setting
Yes. I'm fine with this personally (perhaps pending some numeric evidence via the revdep check)
Another case of users having to know and then doing something manually. This time, instead of
Closes #3079
Closes #1876
Closes #3533