Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pooled vcat #18

Merged
merged 6 commits into from
Sep 26, 2016
Merged

pooled vcat #18

merged 6 commits into from
Sep 26, 2016

Conversation

gustafsson
Copy link
Contributor

@gustafsson gustafsson commented Sep 17, 2016

This PR introduces three features:

  1. vcat compares the unique elements in the pools only once instead of comparing each new inserted element to the accumulated pool. This is more efficient for large arrays with small pools. The translation between the individual pools and the total pool is then an O(1) operation.
  2. vcat on compact arrays may overflow. Here I'm always expanding the reftype to DefaultRefType. I figure you'd prefer to keep compact arrays if possible but I don't know how to do that without introducing type instability.
  3. For ordered levels vcat preserves the order based on the first argument. If the other arguments are ordered and doesn't comply, or if they have levels not included in the order defined by the first argument; then the returned array won't have ordered levels and a warning is shown that the ordering is dropped.

Related to JuliaStats/DataArrays.jl#213

@codecov-io
Copy link

codecov-io commented Sep 17, 2016

Current coverage is 84.63% (diff: 100%)

Merging #18 into master will increase coverage by 0.76%

@@             master        #18   diff @@
==========================================
  Files             7          7          
  Lines           341        358    +17   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits            286        303    +17   
  Misses           55         55          
  Partials          0          0          

Powered by Codecov. Last update b176bb4...79e4ce1

@@ -212,6 +212,31 @@ end

convert($A{T, N, R}, A)
end

function Base.vcat{T,N,R}(A1::$A{T, N, R}, An::$A...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import the function at the top of the file and remove Base..


ordered = A1.pool.ordered
if ordered && levels != A1.pool.levels
warn("Failed to preserve order of levels. Define all levels in the first argument.")
Copy link
Member

@nalimilan nalimilan Sep 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea the check this, but we generally avoid in Julia printing warnings: either it should be an error, or it should work. Warnings that are printed when the code works are annoying for users.

I think it's fine to silently make the result unordered if the order cannot be respected. We'll have to document this at some point of course (for now the package has no docstrings).

We should also be able to make this more flexible: it's fine if levels are different as long as their orders are consistent. For example, ["a", "b", "c"] and ["a", "b"]/["a", "b", "c", "d"]/["b", "c", "d"] are fine. So you just need to check that levels for each input vectors are included (in the same order) in the output levels.

That requires a clever algorithm to compute the levels in the first place: from the levels computed above, which are sorted according to the rank of the input, I guess you could repeatedly sort it so that elements that are common with the levels of a given input are in the same order. If you do that for each input, you should get the correct order when possible, and a reasonable order else (in which case the check I described in the previous paragraph will fail, and an unordered result will be returned). This strategy could be applied to unordered inputs too.

levels = unique(T[[a.pool.levels for a in As]...;])
idx = [indexin(a.pool.index, levels) for a in As]

ordered = A1.pool.ordered
Copy link
Member

@nalimilan nalimilan Sep 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Result should be ordered only if all inputs are. Also, use high-level methods where possible: ordered(A1).


function Base.vcat{T,N,R}(A1::$A{T, N, R}, An::$A...)
As = (A1, An...)
levels = unique(T[[a.pool.levels for a in As]...;])
Copy link
Member

@nalimilan nalimilan Sep 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning, the refs field codes refer to the index, not to the levels (which can be in a different order). Also, use high-level methods: index(a.pool) instead of a.pool.index (here and below).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intent was for idx (or indexmap as I should probably have called it) to take care of this and I thought this was covered by one of the test blocks. Could you show me an example that breaks this when index and levels are in different orders?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't tested, but in general better work with index when dealing with refs, as they refer to it, not to levels. Worth having a test after reordering levels anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, I understand that we need to preserve the ordering of levels (not of index). So do anything that works, as long as you use the accessors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, should be good. Actually, the codes in index depend on the order of appearance in the input array, so even without calling levels! that test should already be OK (since values are not sorted in the original array for a2).

function Base.vcat{T,N,R}(A1::$A{T, N, R}, An::$A...)
As = (A1, An...)
levels = unique(T[[a.pool.levels for a in As]...;])
idx = [indexin(a.pool.index, levels) for a in As]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like indexmap would be a more explicit name than idx.


function Base.vcat{T,N,R}(A1::$A{T, N, R}, An::$A...)
As = (A1, An...)
levels = unique(T[[a.pool.levels for a in As]...;])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

T for the first input shouldn't win over the other inputs. Use Base.promote_eltype to compute the best element type.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@gustafsson gustafsson force-pushed the pooledvcat branch 2 times, most recently from 59a2dd7 to c73959a Compare September 26, 2016 10:51
@coveralls
Copy link

coveralls commented Sep 26, 2016

Coverage Status

Coverage increased (+0.8%) to 84.68% when pulling 610b701 on gustafsson:pooledvcat into b176bb4 on nalimilan:master.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks pretty good! One last round and it should be ready.

ordered = true

for a in A
i = indexin(a, m)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i is usually used for scalars. levelsmap would be more explicit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that actually made the code more clear

@@ -325,3 +332,31 @@ function getindex(A::CategoricalArray, i::Int)
end

levels!(A::CategoricalArray, newlevels::Vector) = _levels!(A, newlevels)


function sortedmerge(A...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't something like mergelevels(levels...) be more appropriate?

m = Array{T}(0)
ordered = true

for a in A
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here a could be changed to l. Else I keep thinking that's the actual contents of the array.

ordered &= issorted(i[i.!=0])
if !ordered
append!(m, a[i.==0])
continue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd find it more logical to have a else black instead of a continue.


function sortedmerge(A...)
T = Base.promote_eltype(A...)
m = Array{T}(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m is usually used for integers, maybe better use e.g. res?

continue
end

x = length(m)+1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I would use i or j instead of x.

@@ -522,6 +522,56 @@ for isordered in (false, true)
@test x[1] === x.pool.valindex[3]
@test x[2] === x.pool.valindex[1]
@test levels(x) == ["c", "a", "b"]

a1 = 3:200
Copy link
Member

@nalimilan nalimilan Sep 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment saying what you're testing here? It would be nice to have one general title about testing vcat (with two line breaks before), and comments for each test explaining what's the difference with the other ones you add.

@@ -212,6 +212,13 @@ end

convert($A{T, N, R}, A)
end

function vcat(A::$A...)
L, O = sortedmerge(map(levels, A)...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upper case letters generally denote types. Better use newlevels and isordered here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upper case letters as argument names seems to be a convention here though?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right, I think I inherited this from DataArrays, NullableArrays or Base. But I think I would prefer if x was used everywhere. :-)

@test levels(r) == ["Young", "Middle", "Old"]
@test ordered(r) == true

# Test concatenation of ambiguous ordering. This drops the ordering
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also add a case with conflicting orderings? Or is that already covered?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarified a test comment that it was already covered

cca1 = compact(ca1)
cca2 = compact(ca2)
r = vcat(cca1, cca2)
@test r == vcat(a1, a2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, and you need to test levels and ordered after each vcat here and below.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.8%) to 84.637% when pulling 42ff3bc on gustafsson:pooledvcat into b176bb4 on nalimilan:master.

@test isa(r, CategoricalArray{Int,1,CategoricalArrays.DefaultRefType})
@test isa(vcat(cca1, ca2), CategoricalArray{Int,1,CategoricalArrays.DefaultRefType})
@test ordered(r) == false
@test sort(levels(r)) == collect(3:300)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling sort kind of defeats the purpose of testing the order of levels. Same below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I'm not testing the order of levels. The order of levels doesn't matter if ordered = false

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's less of an issue if the order of levels changes, but we should still return them in a meaningful and stable order. So better test that. In general it's a good idea to test everything, and possibly adapt the tests if we decide to change behaviour at some point.

@coveralls
Copy link

coveralls commented Sep 26, 2016

Coverage Status

Coverage increased (+0.8%) to 84.637% when pulling 79e4ce1 on gustafsson:pooledvcat into b176bb4 on nalimilan:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage increased (+0.8%) to 84.637% when pulling 79e4ce1 on gustafsson:pooledvcat into b176bb4 on nalimilan:master.

@nalimilan nalimilan merged commit 3b687af into JuliaData:master Sep 26, 2016
@nalimilan
Copy link
Member

Thanks! Would you feel like extending this to NullableCategoricalArray in another PR?

@gustafsson gustafsson deleted the pooledvcat branch September 26, 2016 16:02
@gustafsson
Copy link
Contributor Author

You're welcome.

This is already extended for NullableCategoricalArray
https://github.com/nalimilan/CategoricalArrays.jl/blob/master/src/array.jl#L23

What's missing is the corresponding tests. So how do I do the equivalent of r == vcat(a1, a2) ? On a regular Array{Nullable} I can call get to get the value. But how do I get the value out of a NullableCategoricalArray?

@nalimilan
Copy link
Member

Cool. == is aliased to isequal in the nullable array tests, so using it should just work as long as the RHS is wrapped in a NullableArray (see existing examples). Other than that, copying the tests should be OK.

@gustafsson
Copy link
Contributor Author

Ok then

Still though, how to I get the value out of a CategoricalNullableArray? :)

This feels a bit cumbersome, no?

> ca = NullableCategoricalArray(["a","b"])
> value(x::CategoricalArrays.CategoricalValue) = x.pool.levels[x.level]
> value(get(ca[1]))
"a"

@nalimilan
Copy link
Member

You're not supposed to extract the wrapped value ("a" in your example): the CategoricalValue wrapping it should behave like it for most purposes (notably, == and convert). Of course there's also the Nullable wrapper, but this should be made mostly invisible via Query.jl/StructuredQueries.jl (I hope).

@gustafsson
Copy link
Contributor Author

But then I can't use CategoricalArray if the value of a category means something, such as a string label.

@nalimilan
Copy link
Member

That's right. We should probably add an accessor function like value you showed above (it would be identical to convert(T, v::CategoricalValue{T}). Not sure what's the best name: maybe get just like for Nullable? Feel free to make a PR.

@gustafsson
Copy link
Contributor Author

Ok! #21

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants