DictEncoded doesn't write as DictEncoded #116

Closed
dmbates opened this issue Jan 29, 2021 · 1 comment

dmbates commented Jan 29, 2021

Calling Arrow.toarrowvector on an Arrow.DictEncoded column returns an Arrow.List instead of preserving the dictionary encoding.

julia> atbl = Arrow.Table(fnm);

julia> typeof(atbl.a)
Arrow.DictEncoded{String, UInt8, Arrow.List{String, Int32, Vector{UInt8}}}

julia> Arrow.toarrowvector(atbl.a)
6-element Arrow.List{String, Int32, Arrow.ToList{UInt8, true, String, Int32}}:
 "a"
 "a"
 "a"
 "b"
 "b"
 "b"

Perhaps it doesn't detect that Arrow.DictEncoded is a factor-like (in the R sense) object.

I can look at this if you wish, but I'm not entirely sure where.
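To make the symptom concrete, here is a minimal sketch in plain Julia of how a generic conversion path can flatten a dictionary-encoded column. `ToyDictEncoded` and `toy_convert` are hypothetical names invented for illustration; they are not Arrow.jl's types or internals, and this is only a guess at the shape of the problem.

```julia
# Toy stand-in for a dictionary-encoded column: small integer codes plus a
# pool of unique values. Hypothetical type, not Arrow.jl's implementation.
struct ToyDictEncoded{T} <: AbstractVector{T}
    indices::Vector{UInt8}   # 0-based codes into `pool`
    pool::Vector{T}          # unique values
end

Base.size(d::ToyDictEncoded) = size(d.indices)
Base.IndexStyle(::Type{<:ToyDictEncoded}) = IndexLinear()
Base.getindex(d::ToyDictEncoded, i::Int) = d.pool[d.indices[i] + 1]

# A generic conversion that treats its input as "any vector" copies the
# decoded values one by one, so the codes and the pool are discarded.
toy_convert(x::AbstractVector) = collect(x)

col = ToyDictEncoded{String}(UInt8[0, 0, 0, 1, 1, 1], ["a", "b"])
flat = toy_convert(col)

println(typeof(col))
println(typeof(flat))
```

The values survive the conversion, but the result is an ordinary vector of strings, which mirrors the Arrow.List output shown above.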

quinnj added a commit that referenced this issue Jan 30, 2021
…ng issues

Fixes #117, #116, and #113.

For #116, we just need to special-case a user passing in a DictEncoded themselves: it is passed through to the `toarrowvector` method that no-ops.

For #113, we require the new functionality in PooledArrays that allows passing the `signed` and `compress` keyword arguments, to ensure we get signed refs for our dict encoding.

For #117, we add CategoricalArrays as a test dependency and ensure that if the array contains any `missing` value, we *don't* recode the index values down by 1: the `missing` ref is 0, so the other refs can already be considered "offsets". If there are no `missing` values, we still need to recode down, since refs should always start from 0 in the Arrow format.
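The two ideas in this commit message for #116 and #117 can be sketched in plain Julia. Everything here is hypothetical: `ToyDictEncoded`, `toy_toarrow`, `toy_recode`, and `toy_decode` are invented names, and placing `missing` at the front of the pool is an assumption made for the sketch, not a description of the actual Arrow.jl or CategoricalArrays code.

```julia
# Hypothetical toy type; not Arrow.jl's DictEncoded.
struct ToyDictEncoded{T}
    indices::Vector{Int8}   # 0-based signed codes
    pool::Vector{T}         # values the codes point at
end

# #116: an already-encoded input hits a more specific method that returns it
# unchanged (a no-op); everything else goes through the generic fallback.
toy_toarrow(x::ToyDictEncoded) = x
toy_toarrow(x::AbstractVector) = collect(x)

# #117: turn categorical-style refs into 0-based indices.
# Assumption: refs are 1-based level codes, and ref 0 marks `missing`.
function toy_recode(refs::Vector{<:Integer}, levels::Vector{String})
    if any(==(0), refs)
        # `missing` occupies slot 0 of the pool, so the refs already work
        # as 0-based offsets and are not shifted.
        return ToyDictEncoded(Int8.(refs), [missing; levels])
    else
        # No `missing`: shift down by 1 so the first level maps to index 0.
        return ToyDictEncoded(Int8.(refs .- 1), levels)
    end
end

# Decode by looking each 0-based index up in the pool.
toy_decode(d::ToyDictEncoded) = [d.pool[i + 1] for i in d.indices]

with_missing = toy_recode([1, 0, 2], ["a", "b"])
no_missing = toy_recode([1, 1, 2], ["a", "b"])

println(toy_decode(with_missing))
println(toy_decode(no_missing))
```

In both branches the smallest index is 0, which is the invariant the commit message describes for the Arrow format.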
quinnj added a commit that referenced this issue Jan 31, 2021
#119)

* Rework dict encoding of PooledArray/CategoricalArray to fix outstanding issues

* PooledArrays 1.0 compat

* Update src/arraytypes/dictencoding.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

* Check refpool

* Fix test

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

quinnj commented Jan 31, 2021

Fixed in #119

quinnj closed this as completed Jan 31, 2021
tanmaykm pushed a commit to tanmaykm/arrow-julia that referenced this issue Apr 7, 2021
apache#119)