-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pooled vcat #18
pooled vcat #18
Conversation
416fc49
to
c3d4b4a
Compare
Current coverage is 84.63% (diff: 100%)@@ master #18 diff @@
==========================================
Files 7 7
Lines 341 358 +17
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 286 303 +17
Misses 55 55
Partials 0 0
|
@@ -212,6 +212,31 @@ end | |||
|
|||
convert($A{T, N, R}, A) | |||
end | |||
|
|||
function Base.vcat{T,N,R}(A1::$A{T, N, R}, An::$A...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import
the function at the top of the file and remove Base.
.
|
||
ordered = A1.pool.ordered | ||
if ordered && levels != A1.pool.levels | ||
warn("Failed to preserve order of levels. Define all levels in the first argument.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea the check this, but we generally avoid in Julia printing warnings: either it should be an error, or it should work. Warnings that are printed when the code works are annoying for users.
I think it's fine to silently make the result unordered if the order cannot be respected. We'll have to document this at some point of course (for now the package has no docstrings).
We should also be able to make this more flexible: it's fine if levels are different as long as their orders are consistent. For example, ["a", "b", "c"]
and ["a", "b"]
/["a", "b", "c", "d"]
/["b", "c", "d"]
are fine. So you just need to check that levels for each input vectors are included (in the same order) in the output levels.
That requires a clever algorithm to compute the levels in the first place: from the levels
computed above, which are sorted according to the rank of the input, I guess you could repeatedly sort it so that elements that are common with the levels of a given input are in the same order. If you do that for each input, you should get the correct order when possible, and a reasonable order else (in which case the check I described in the previous paragraph will fail, and an unordered result will be returned). This strategy could be applied to unordered inputs too.
levels = unique(T[[a.pool.levels for a in As]...;]) | ||
idx = [indexin(a.pool.index, levels) for a in As] | ||
|
||
ordered = A1.pool.ordered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Result should be ordered only if all inputs are. Also, use high-level methods where possible: ordered(A1)
.
|
||
function Base.vcat{T,N,R}(A1::$A{T, N, R}, An::$A...) | ||
As = (A1, An...) | ||
levels = unique(T[[a.pool.levels for a in As]...;]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Warning, the refs
field codes refer to the index
, not to the levels
(which can be in a different order). Also, use high-level methods: index(a.pool)
instead of a.pool.index
(here and below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intent was for idx
(or indexmap
as I should probably have called it) to take care of this and I thought this was covered by one of the test blocks. Could you show me an example that breaks this when index
and levels
are in different orders?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't tested, but in general better work with index
when dealing with refs
, as they refer to it, not to levels
. Worth having a test after reordering levels anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, I understand that we need to preserve the ordering of levels (not of index
). So do anything that works, as long as you use the accessors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, should be good. Actually, the codes in index
depend on the order of appearance in the input array, so even without calling levels!
that test should already be OK (since values are not sorted in the original array for a2
).
function Base.vcat{T,N,R}(A1::$A{T, N, R}, An::$A...) | ||
As = (A1, An...) | ||
levels = unique(T[[a.pool.levels for a in As]...;]) | ||
idx = [indexin(a.pool.index, levels) for a in As] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like indexmap
would be a more explicit name than idx
.
|
||
function Base.vcat{T,N,R}(A1::$A{T, N, R}, An::$A...) | ||
As = (A1, An...) | ||
levels = unique(T[[a.pool.levels for a in As]...;]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
T
for the first input shouldn't win over the other inputs. Use Base.promote_eltype
to compute the best element type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
59a2dd7
to
c73959a
Compare
c73959a
to
610b701
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks pretty good! One last round and it should be ready.
ordered = true | ||
|
||
for a in A | ||
i = indexin(a, m) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i
is usually used for scalars. levelsmap
would be more explicit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that actually made the code more clear
@@ -325,3 +332,31 @@ function getindex(A::CategoricalArray, i::Int) | |||
end | |||
|
|||
levels!(A::CategoricalArray, newlevels::Vector) = _levels!(A, newlevels) | |||
|
|||
|
|||
function sortedmerge(A...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't something like mergelevels(levels...)
be more appropriate?
m = Array{T}(0) | ||
ordered = true | ||
|
||
for a in A |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here a
could be changed to l
. Else I keep thinking that's the actual contents of the array.
ordered &= issorted(i[i.!=0]) | ||
if !ordered | ||
append!(m, a[i.==0]) | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd find it more logical to have a else
black instead of a continue
.
|
||
function sortedmerge(A...) | ||
T = Base.promote_eltype(A...) | ||
m = Array{T}(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m
is usually used for integers, maybe better use e.g. res
?
continue | ||
end | ||
|
||
x = length(m)+1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I would use i
or j
instead of x
.
@@ -522,6 +522,56 @@ for isordered in (false, true) | |||
@test x[1] === x.pool.valindex[3] | |||
@test x[2] === x.pool.valindex[1] | |||
@test levels(x) == ["c", "a", "b"] | |||
|
|||
a1 = 3:200 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment saying what you're testing here? It would be nice to have one general title about testing vcat
(with two line breaks before), and comments for each test explaining what's the difference with the other ones you add.
@@ -212,6 +212,13 @@ end | |||
|
|||
convert($A{T, N, R}, A) | |||
end | |||
|
|||
function vcat(A::$A...) | |||
L, O = sortedmerge(map(levels, A)...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upper case letters generally denote types. Better use newlevels
and isordered
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upper case letters as argument names seems to be a convention here though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right, I think I inherited this from DataArrays, NullableArrays or Base. But I think I would prefer if x
was used everywhere. :-)
@test levels(r) == ["Young", "Middle", "Old"] | ||
@test ordered(r) == true | ||
|
||
# Test concatenation of ambiguous ordering. This drops the ordering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also add a case with conflicting orderings? Or is that already covered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarified a test comment that it was already covered
cca1 = compact(ca1) | ||
cca2 = compact(ca2) | ||
r = vcat(cca1, cca2) | ||
@test r == vcat(a1, a2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, and you need to test levels
and ordered
after each vcat
here and below.
@test isa(r, CategoricalArray{Int,1,CategoricalArrays.DefaultRefType}) | ||
@test isa(vcat(cca1, ca2), CategoricalArray{Int,1,CategoricalArrays.DefaultRefType}) | ||
@test ordered(r) == false | ||
@test sort(levels(r)) == collect(3:300) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling sort
kind of defeats the purpose of testing the order of levels. Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I'm not testing the order of levels. The order of levels doesn't matter if ordered = false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it's less of an issue if the order of levels changes, but we should still return them in a meaningful and stable order. So better test that. In general it's a good idea to test everything, and possibly adapt the tests if we decide to change behaviour at some point.
1 similar comment
Thanks! Would you feel like extending this to |
You're welcome. This is already extended for NullableCategoricalArray What's missing is the corresponding tests. So how do I do the equivalent of |
Cool. |
Ok then Still though, how to I get the value out of a CategoricalNullableArray? :) This feels a bit cumbersome, no?
|
You're not supposed to extract the wrapped value ( |
But then I can't use CategoricalArray if the value of a category means something, such as a string label. |
That's right. We should probably add an accessor function like |
Ok! #21 |
This PR introduces three features:
DefaultRefType
. I figure you'd prefer to keep compact arrays if possible but I don't know how to do that without introducing type instability.Related to JuliaStats/DataArrays.jl#213