Errors writing file with missing in categorical #117

Closed
jtrakk opened this issue Jan 29, 2021 · 0 comments · Fixed by #119


jtrakk commented Jan 29, 2021

julia> using Arrow, Base.Filesystem, CategoricalArrays, DataFrames
       open(Filesystem.tempname(), "w") do f
           Arrow.write(f, DataFrame(x=CategoricalArray(["a", "bb", missing])))
       end
ERROR: InexactError: trunc(UInt32, -1)

Sometimes I get a different error on another dataframe but I don't have a reproducible example for this one:

julia> open(path, "w") do f
           Arrow.write(f, df)
       end
ERROR: BoundsError: attempt to access 52-element CategoricalArrays.CategoricalRefPool{Union{Missing, CategoricalValue{String,UInt32}},CategoricalPool{String,UInt32,CategoricalValue{String,UInt32}}} with indices 0:51 at index [52]
Stacktrace:
 [1] throw_boundserror(::CategoricalArrays.CategoricalRefPool{Union{Missing, CategoricalValue{String,UInt32}},CategoricalPool{String,UInt32,CategoricalValue{String,UInt32}}}, ::Tuple{Int64}) at ./abstractarray.jl:541
 [2] checkbounds at ./abstractarray.jl:506 [inlined]
 [3] getindex(::CategoricalArrays.CategoricalRefPool{Union{Missing, CategoricalValue{String,UInt32}},CategoricalPool{String,UInt32,CategoricalValue{String,UInt32}}}, ::Int64)

[69666777] Arrow v1.2.1
Julia 1.5.3

quinnj added a commit that referenced this issue Jan 30, 2021
…ng issues

Fixes #117, #116, and #113. For #116, we just need to special case if user happens to pass in a DictEncoded themselves. We need to pass it through to the `toarrowvector` method that no-ops. For #113, we require the new functionality in PooledArrays that allows passing the `signed` and `compress` keyword arguments to ensure we get signed refs for our dict encoding. For #117, we add CategoricalArrays as a test dependency and ensure that if it contains any `missing` value, we *don't* recode the indices values down by 1, since the `missing` ref is 0, so other refs can already be considered "offsets". If there are no `missing`, then we still need to recode down since refs should always start from 0 in arrow format.
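
The recoding rule described for #117 can be sketched in a few lines of plain Julia. This is only an illustration, not the code from the commit: `arrow_indices` and `arrow_dictionary` are hypothetical names, the `refs` vector merely mimics how CategoricalArrays stores codes (0 for `missing`, 1..n for levels), and placing `missing` in the first dictionary slot is an assumption drawn from the "`missing` ref is 0" remark.

```julia
# Illustrative sketch only. `arrow_indices` and `arrow_dictionary` are
# hypothetical helpers, not functions from Arrow.jl or CategoricalArrays.jl.

# `refs` mimics categorical storage: 0 means `missing`, 1..n index into `levels`.
function arrow_indices(refs::Vector{UInt32})
    if any(iszero, refs)
        # `missing` occupies ref 0, so the refs already act as 0-based offsets
        # into a dictionary whose first slot is `missing` (assumption).
        return Int32.(refs)
    else
        # No `missing`: shift the 1-based refs down to 0-based offsets.
        return Int32.(refs) .- Int32(1)
    end
end

# Build the matching dictionary: prepend `missing` only when some ref is 0.
function arrow_dictionary(levels::Vector{String}, refs::Vector{UInt32})
    any(iszero, refs) ? Union{Missing,String}[missing, levels...] : levels
end

levels = ["a", "bb"]
with_missing = UInt32[1, 2, 0]     # stands for ["a", "bb", missing]
without_missing = UInt32[1, 2, 1]  # stands for ["a", "bb", "a"]

idx1 = arrow_indices(with_missing)
dict1 = arrow_dictionary(levels, with_missing)
idx2 = arrow_indices(without_missing)
dict2 = arrow_dictionary(levels, without_missing)

# Decoding: a 0-based offset i refers to entry i + 1 in Julia's 1-based indexing.
decoded1 = [dict1[i + 1] for i in idx1]
decoded2 = [dict2[i + 1] for i in idx2]

# Subtracting 1 unconditionally and converting back to the unsigned ref type
# fails for a `missing` ref, the same kind of InexactError as in the report.
naive_fails = try
    UInt32.(Int64.(with_missing) .- 1)
    false
catch err
    err isa InexactError
end
```

In this sketch both cases decode back to the original values, while the unconditional "subtract 1" path throws as soon as a 0 ref is present.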
quinnj added a commit that referenced this issue Jan 31, 2021
#119)

* Rework dict encoding of PooledArray/CategoricalArray to fix outstanding issues

* PooledArrays 1.0 compat

* Update src/arraytypes/dictencoding.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

* Check refpool

* Fix test

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
tanmaykm pushed a commit to tanmaykm/arrow-julia that referenced this issue Apr 7, 2021
apache#119)
