-
Notifications
You must be signed in to change notification settings - Fork 993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
forderv handles complex input #3701
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3701 +/- ##
==========================================
+ Coverage 98.3% 98.45% +0.15%
==========================================
Files 69 69
Lines 13267 13284 +17
==========================================
+ Hits 13042 13079 +37
+ Misses 225 205 -20
Continue to review full report at Codecov.
|
Have taken this pretty far but could go for some help to go further. @mattdowle could you weigh in on how to proceed with
Any thoughts? |
@@ -51,14 +51,9 @@ setkeyv = function(x, cols, verbose=getOption("datatable.verbose"), physical=TRU | |||
} | |||
if (identical(cols,"")) stop("cols is the empty string. Use NULL to remove the key.") | |||
if (!all(nzchar(cols))) stop("cols contains some blanks.") | |||
if (!length(cols)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this if
is redundant & contradictory -- just a few lines earlier (L47) has if (!length(cols))
as a warning
. I guess this branch was some leftover or copy/paste from setkey
... revealed by Codecov
Ok so this is a lot of progress! But it's risky. There's a lot of core files changes here, and as we've seen from #3692 (comment), adding support for the new type can have unintended and hard to track memory faults because it's a new type that's bigger too (16 bytes). The new I haven't seen any requests from actual users who want to order, group and join on The other aspect is rather than adding Can we close this PR? We can support complex in value columns now, but not join-on-complex or group-by-complex until a user provides a use case that needs grouping or joining on complex columns. The issue can be left open with this distinction clear. |
I hadn't seen this from @eliocamp before I wrote the comment above. |
@eliocamp The info in your comment linked above was very helpful, thanks! In that |
agree this is the heaviest new piece of code--> should think about potential maintenance burden. also note that #1444 is closed here... perhaps this PR could be scaled back to do forder only? that part doesn't require the by/joins/unique to work, it simply creates a new sorting object based on the real&complex parts. inefficient, but convenient. this commit did so at the R level: 2e8c996 this commit moved that logic to C: 52bcac0 PS also agree it would be helpful as a backup to default to |
Major scaling back would be good. The reorder.c change here would be good to merge as-is though. If |
the reorder.c change was originally another PR but I needed part of it for the bmerge/uniqlist stuff. I'll reopen that one to merge separately. |
for the n_cplx branch, 12 lines or so are comments. a lot of the rest is duplication -- apply to real part, apply to complex part. but the idea is to reconstruct a new DT (in order of the original |
@mattdowle OK, have scaled back by removing the Besides the large-ish part for
Will file (and close) a follow-up PR with the rest as a starting point if needed in the future, so the commits here can be squashed |
Yes, its eminently reasonable. Complex values are probably exclusively used as measured values and not category-like so I don't imagine there are many reasonable workflows that would include a |
src/forder.c
Outdated
} | ||
DT = new_dt; | ||
ascArg = new_asc; | ||
by = new_by; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit nervous about this branch. I looks constrained to only when n_cplx>0 (which is good that it's isolated), and I know what it is doing (splitting complex into real and imaginary columns) but I don't see how it's doing it. I'm going to need more time to review it and make sure it doesn't impact anything unexpected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when i first took a look i decided against that (I think my initial reaction was range
doesn't make sense for complex, but of course it does because they're well-ordered using the lexicographic rule). will take another look.
The n_cplx
branch is a bit messy, and took me several iterations to get there as I kept failing a few more tests each time. I think it can be cleaned up a little by snazzier use of the j
index.
Idea is: receive by
which is say length n
, m
of which are complex columns.
Instead of sorting using the input list DT
, do so using a new list (constructed as new_dt
).
If there are no complex columns, new_dt[[1]]
is DT[[ by[1] ]]
, new_dt[[2]]
is DT[[ by[2] ]]
, and so on.
If there are complex columns, say the first is at the third index of by
, then new_dt[[3]]
is Re( DT[[ by[3] ]] )
and new_dt[[4]]
is Im( DT[[ by[3] ]] )
. Then continuing on as usual. That "bump" of the index by a complex column is what made things complicated.
Similar thinking applies to the re-mapped ascArg
; by
simply gets re-mapped to seq_along(new_dt)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't thinking of doing a single range for complex
. The thought was to trick the loop into doing a complex column twice. The first time doing range for the real part only. The 2nd time calling range for the imaginary part only. There might need to be new range_complex_r and range_complex_i perhaps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other option is to postpone this PR to a later date. Elio's response above confirms that setkey-on-complex and grouping-on-complex isn't high priority. It was complex
value columns that was the main thing; which you did and have been merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the biggest impediment to that is I don't see a way to get a double *
object pointing to the real/imaginary vectors (like REAL(COMPLEX(x))
or COMPLEX(x).r
, hope that makes sense), which is what all of the functions for REAL
columns take as input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would nice for completeness to get this merged and closed #1703, yes. But I'm also an eye that we've both been working on complex number support for 3 days now (elapsed) and continuing. And Elio said going this far (sorting complex) isn't needed. There isn't any demand for this from users as far as I can see.
True... I'm just loath to close the PR given that it's passing tests. Also understand the tradeoff that is the maintenance burden.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the biggest impediment to that is I don't see a way to get a double * object pointing to the real/imaginary vectors (like REAL(COMPLEX(x)) or COMPLEX(x).r, hope that makes sense), which is what all of the functions for REAL columns take as input.
Yes makes sense. That's why I wrote above: "There might need to be new range_complex_r and range_complex_i perhaps." But I didn't look at it closely.
maintenance burden
It's not a burden in the sense that more code is just always generally bad. It's that this particular change could cause hard to trace bugs because of the type of change it is to such a core function. It's risk. And it's time in reviewing it.
it's passing tests
But this shows the tests aren't sufficient: the cast isn't correct but it's still passing tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this shows the tests aren't sufficient: the cast isn't correct but it's still passing tests.
Very true, I meant more that it's passing existing tests (so "no" existing functionality is being affected).
Would adding a lot more tests to the suite for the new code be a good direction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A good direction would be to focus on the most requested issues: #3189
src/uniqlist.c
Outdated
case CPLXSXP: { | ||
// tried to make long long complex * but got a warning that it's a GNU extension | ||
Rcomplex *pz = COMPLEX(jcol); | ||
same = (long long)pz[i].r == (long long)pz[i-1].r && (long long)pz[i].i == (long long)pz[i-1].i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (long long)
is a cast and won't achieve the desired op. This cast will trunc 3.4 to 3 for example. What's needed is a type pun, which is what the REALSXP case just above does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So to clarify in the REALSXP
branch:
long long *ll = (long long *)REAL(jcol);
same = ll[i]==ll[i-1];
What's happening here is REAL
is actually a pointer to double
, but ll
is a long long
pointer to the same address as where jcol
is.
So then ll[i] == ll[i-1]
treats the elements of jcol
(which are actually real) as 8-byte integers & compares those 64 bits.
Where as (long long)pz[i].r
(etc) is actually copying the double
at pz[i].r
and casting it to long long
, leading to truncation, so the operations aren't equivalent
Is that the gist of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes: exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a native 16-byte type to pun to? I'm not seeing one. I guess we could just use pure equality-on-doubles & leave it at that as it's low enough priority.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another way, since all the cases have the same code other than the comparison, is to take away all the cases and just do one memcmp
for all types (just pass the size to memcmp
). Would need benchmarking on large data to see if memcmp
was any slower (overhead might come into it because memcmp
would be called a lot). I didn't look closely to see if there are any other difference between all those cases in this switch in this file, to know whether that's possible. It may be faster to pass a fixed constant to memcmp
in each of the cases (rather than passing a variable size to it) because then gcc optimizer would convert the memcmp to the best method in assembly for that size and avoid a function call.
src/uniqlist.c
Outdated
Rcomplex *pzjcol = COMPLEX(jcol); | ||
for (R_xlen_t i=1; i<nrow; i++) { | ||
bool same = (long long)pzjcol[i].r == (long long)pzjcol[i-1].r && | ||
(long long)pzjcol[i].i == (long long)pzjcol[i-1].i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here (pun needed not cast)
…alves the working ram needed as it reuses the same tmp for the rerun
Closes #1703
and other parts of #3690.
I took a look into adding support "properly" (directly on the complex columns themselves) at the C level but the effort level there seems prohibitive -- would have to modularize the code a bit to allow
CPLXSXP
to delegate toREALSXP
engine on.r
and.i
components. #1703 was filed 3 years ago without much follow-up, so I assume the simple approach here will do just fine. A quick benchmark suggested the current implementation beatsbase::order
and an approach that explicitly delegates likeorder(Re(z), Im(z))
in terms of speed & memory efficiency.Sorting itself is done I think.
Now working on the stuff downstream of
forderv
that this should unlock --key
s, joins, etc.TO DO:
frank
rowid
dcast
by
rleid
setorder
/setkey
unique