Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

up DataAPI.jl to 1.7 and CategoricalArrays.jl to 0.10.0 #2807

Merged
merged 2 commits into from
Jun 28, 2021
Merged

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jun 26, 2021

Fixes #2788

@bkamins bkamins added CI ecosystem Issues in DataFrames.jl ecosystem labels Jun 26, 2021
@bkamins bkamins added this to the patch milestone Jun 26, 2021
@@ -355,8 +355,8 @@ end
@test levels(leftjoin(A, B, on=:b).b) == ["d", "c", "b", "a"]
@test levels(leftjoin(B, A, on=:b).b) == ["a", "b", "c"]
@test levels(rightjoin(A, B, on=:b).b) == ["a", "b", "c"]
@test levels(rightjoin(B, A, on=:b).b) == ["d", "c", "b", "a"]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan - these changes follow CategoricalArrays.jl 0.10.0 release, so a check would be welcome.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Why aren't levels the same for inner and outer joins? ["d", "c", "b", "a"] would sound like a good choice since levels(A.b) is a superset of levels(B.b).

That said, since the old behavior wasn't ideal either, the new one is OK -- both are kind of arbitrary.

@bkamins bkamins requested a review from nalimilan June 26, 2021 19:32
Project.toml Show resolved Hide resolved
test/join.jl Outdated Show resolved Hide resolved
@@ -355,8 +355,8 @@ end
@test levels(leftjoin(A, B, on=:b).b) == ["d", "c", "b", "a"]
@test levels(leftjoin(B, A, on=:b).b) == ["a", "b", "c"]
@test levels(rightjoin(A, B, on=:b).b) == ["a", "b", "c"]
@test levels(rightjoin(B, A, on=:b).b) == ["d", "c", "b", "a"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Why aren't levels the same for inner and outer joins? ["d", "c", "b", "a"] would sound like a good choice since levels(A.b) is a superset of levels(B.b).

That said, since the old behavior wasn't ideal either, the new one is OK -- both are kind of arbitrary.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Jun 27, 2021

Why aren't levels the same for inner and outer joins?

The reason is (under CategoricalArrays.jl 0.10):

julia> x = categorical(["a", "b", "c", "d"])
4-element CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"
 "d"

julia> levels!(x, ["d", "c", "b", "a"])
4-element CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"
 "d"

julia> y = categorical(["a", "b", "c"])
3-element CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"

julia> levels(x)
4-element Vector{String}:
 "d"
 "c"
 "b"
 "a"

julia> levels(y)
3-element Vector{String}:
 "a"
 "b"
 "c"

julia> levels([x; y])
4-element Vector{String}:
 "d"
 "c"
 "b"
 "a"

julia> levels([y; x])
4-element Vector{String}:
 "d"
 "a"
 "b"
 "c"

so as you can see the order of vectors matters (in this example in vcat).

And in the innerjoin the order is different because:

julia> innerjoin(B, A, on=:b)
3×3 DataFrame
 Row │ b     c      a
     │ Cat…  Int64  Int64
─────┼────────────────────
   1 │ a         5      1
   2 │ b         6      2
   3 │ c         7      3

and as you can see the "d" level does not get added (in innerjoin we use the left data frame as a source for columns we join-on, in outerjoin we need to use both).

@nalimilan
Copy link
Member

Ah right that's because the orders are not compatible, so we have to make a choice.

@bkamins bkamins merged commit d701c09 into main Jun 28, 2021
@bkamins bkamins deleted the bk/up_dataapi branch June 28, 2021 09:55
@bkamins
Copy link
Member Author

bkamins commented Jun 28, 2021

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI ecosystem Issues in DataFrames.jl ecosystem
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sync with DataAPI.jl 1.7 release
2 participants