Arrow.write in v1.4.2 can create an invalid arrow file #126

Closed
dmbates opened this issue Feb 7, 2021 · 11 comments · Fixed by #149

Comments

dmbates (Contributor) commented Feb 7, 2021

Running https://github.com/crsl4/julia-workshop/blob/main/notebooks/consistency.jmd followed by https://github.com/crsl4/julia-workshop/blob/main/notebooks/Arrow.jmd produces a file, 02.arrow, that cannot be read back. In a Jupyter notebook, an attempt to read the file with Arrow.Table produces

TaskFailedException

    nested task error: TaskFailedException
    Stacktrace:
     [1] wait
       @ ./task.jl:317 [inlined]
     [2] fetch
       @ ./task.jl:332 [inlined]
     [3] macro expansion
       @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:198 [inlined]
     [4] (::Arrow.var"#82#88"{Channel{Task}, Arrow.Table})()
       @ Arrow ./threadingconstructs.jl:169
    
        nested task error: KeyError: key 1 not found
        Stacktrace:
         [1] getindex
           @ ./dict.jl:482 [inlined]
         [2] build(field::Arrow.Flatbuf.Field, batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
           @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:355
         [3] iterate(x::Arrow.VectorIterator, ::Tuple{Int64, Int64, Int64})
           @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:336
         [4] iterate
           @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:333 [inlined]
         [5] copyto!(dest::Vector{Any}, src::Arrow.VectorIterator)
           @ Base ./abstractarray.jl:837
         [6] _collect
           @ ./array.jl:608 [inlined]
         [7] collect
           @ ./array.jl:602 [inlined]
         [8] macro expansion
           @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:254 [inlined]
         [9] (::Arrow.var"#86#92"{Bool, Dict{Int64, Arrow.DictEncoding}, Arrow.Batch})()
           @ Arrow ./threadingconstructs.jl:169

Stacktrace:
 [1] wait
   @ ./task.jl:317 [inlined]
 [2] Arrow.Table(bytes::Vector{UInt8}, off::Int64, tlen::Nothing; convert::Bool)
   @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:261
 [3] Arrow.Table(str::String, pos::Int64, len::Nothing; convert::Bool)
   @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:179
 [4] Table
   @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:179 [inlined]
 [5] top-level scope
   @ In[32]:1
 [6] eval
   @ ./boot.jl:360 [inlined]
 [7] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1090

pyarrow.feather.read_table also produces an error saying "key 1 not found".

dmbates (Contributor, Author) commented Feb 9, 2021

I misidentified the cause of this problem. I thought it was because I had changed some of the columns from Int to Int8, etc. It turns out it was because I wrote the Arrow file from a DataFrame created from an Arrow.Table. In other words, it appears to be a round-tripping issue. I will work on an MWE.
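
For reference, the shape of that round trip, sketched on hypothetical toy data (the toy columns may not be enough to trigger the failure; the real file was):

using Arrow, DataFrames

df = DataFrame(a = 1:3, b = ["x", "y", "z"])    # stand-in for the real table
Arrow.write("first.arrow", df)                  # initial write
df2 = DataFrame(Arrow.Table("first.arrow"))     # read back into a DataFrame
Arrow.write("second.arrow", df2)                # the write that produced the unreadable file for the real data
Arrow.Table("second.arrow")                     # this read failed with KeyError: key 1 not found on v1.4.2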

dmbates (Contributor, Author) commented Feb 9, 2021

So I have been able to reduce it to

julia> Arrow.write("02_validated_output.arrow", CSV.File("/home/bates/git/mb1-analysis-public/processed_data/02_validated_output.csv", missingstrings=["NA"], truestrings=["TRUE"], falsestrings=["FALSE"]))

julia> Arrow.write("02.arrow", DataFrame(Arrow.Table("02_validated_output.arrow")))

julia> tbl = Arrow.Table("02.arrow")
ERROR: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:317 [inlined]
 [2] Arrow.Table(bytes::Vector{UInt8}, off::Int64, tlen::Nothing; convert::Bool)
   @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:261
 [3] Arrow.Table(str::String, pos::Int64, len::Nothing; convert::Bool)
   @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:179
 [4] Table
   @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:179 [inlined]
 [5] top-level scope
   @ REPL[63]:1

    nested task error: TaskFailedException
    Stacktrace:
     [1] wait
       @ ./task.jl:317 [inlined]
     [2] fetch
       @ ./task.jl:332 [inlined]
     [3] macro expansion
       @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:198 [inlined]
     [4] (::Arrow.var"#82#88"{Channel{Task}, Arrow.Table})()
       @ Arrow ./threadingconstructs.jl:169
    
        nested task error: KeyError: key 1 not found
        Stacktrace:
         [1] getindex
           @ ./dict.jl:482 [inlined]
         [2] build(field::Arrow.Flatbuf.Field, batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
           @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:355
         [3] iterate(x::Arrow.VectorIterator, ::Tuple{Int64, Int64, Int64})
           @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:336
         [4] iterate
           @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:333 [inlined]
         [5] copyto!(dest::Vector{Any}, src::Arrow.VectorIterator)
           @ Base ./abstractarray.jl:837
         [6] _collect
           @ ./array.jl:608 [inlined]
         [7] collect
           @ ./array.jl:602 [inlined]
         [8] macro expansion
           @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:254 [inlined]
         [9] (::Arrow.var"#86#92"{Bool, Dict{Int64, Arrow.DictEncoding}, Arrow.Batch})()
           @ Arrow ./threadingconstructs.jl:169

Is it worthwhile for me to work out exactly which columns are causing the problem?

dmbates (Contributor, Author) commented Feb 9, 2021

I thought that it might be because there are two columns in the data frame, bilingual_parenta_nae and bilingual_parentb_nae, that do not have any non-missing data. But that does not seem to be the problem.

dmbates (Contributor, Author) commented Feb 9, 2021

It seems to be related to the size of the file and perhaps the number of threads. Both

julia> Arrow.Table(Arrow.write("02.arrow", select(df, 1:40)))
julia> Arrow.Table(Arrow.write("02.arrow", select(df, 41:80)))

succeed but

julia> Arrow.Table(Arrow.write("02.arrow", select(df, 1:80)))

fails on my system

julia> versioninfo(verbose=true)
Julia Version 1.6.0-rc1
Commit a58bdd9010 (2021-02-06 15:49 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
      Ubuntu 20.10
  uname: Linux 5.8.0-43-generic #49-Ubuntu SMP Fri Feb 5 03:01:28 UTC 2021 x86_64 x86_64
  CPU: Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz: 
              speed         user         nice          sys         idle          irq
       #1  3788 MHz       7211 s        115 s       1158 s     113858 s          0 s
       #2  3767 MHz       5846 s         30 s       1179 s     115156 s          0 s
       #3  3713 MHz       5572 s         11 s       1224 s     115320 s          0 s
       #4  3684 MHz       5141 s       1163 s       1164 s     114840 s          0 s
       
  Memory: 15.516902923583984 GB (5832.85546875 MB free)
  Uptime: 12279.0 sec
  Load Avg:  0.24  0.26  0.28
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  HOME = /home/bates
  TERM = screen
  PATH = /home/bates/bin:/home/bates/.cargo/bin:/home/bates/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

dmbates (Contributor, Author) commented Feb 9, 2021

Actually, writing select(df, 1:80) produces a file that I can read, but, as shown above, Arrow.write("02.arrow", df) doesn't.

quinnj (Member) commented Mar 11, 2021

Sorry for the delay here; I think at first I saw you were continuing to investigate and then it kind of got buried in my emails. I can reproduce and I'll dig in.

quinnj (Member) commented Mar 11, 2021

I think the problem is that the dictionary message is being written after the data record batch message. Let me dig in a bit more to confirm and figure out why the messages are being written out of order. @nilshg, I think this is the same issue you were running into (as discussed today on Slack).

nilshg commented Mar 11, 2021

Yes, it looks like the same issue. I should say that my file is also quite large (100m rows, 9 GB in arrow format) and I'm writing with 8 threads. While I can't share the data, I'm happy to try writing a smaller part of that file and/or writing on one thread if that helps.
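
(For reference, the thread count is fixed when Julia starts, e.g. julia -t 1 or JULIA_NUM_THREADS=1; a quick sanity check before re-running the write, as a sketch:)

using Base.Threads: nthreads

# Confirm the single-threaded case is actually being tested before calling Arrow.write again.
nthreads() == 1 || @warn "restart Julia with `julia -t 1` to test the single-threaded write"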

quinnj added a commit that referenced this issue Mar 12, 2021
Fixes #126. The issue here arose when `Arrow.write` was faced with the
task of serializing an `Arrow.DictEncoded`. For most arrow array types,
if the input array is already an arrow array type, it's a no-op (e.g. if
you're writing out an `Arrow.Table`). The problem comes from
`Arrow.DictEncoded`, where there is still no conversion required, but we
do need to make a note of the dict-encoded column to ensure a dictionary
message is written before the record batch. In addition, we add some
code for handling delta dictionary messages, if required, for multiple
record batches that contain `Arrow.DictEncoded`s; this is a valid use
case where you may have multiple arrow files, with the same schema, that
you wish to serialize as a single arrow file with each file as a
separate record batch.

Slightly unrelated, but there's also a fix here in our use of Lockable.
We had a race condition I ran into once where the locking was on the
Lockable object, but inside the locked region we replaced the entire
Lockable instead of the _contents_ of the Lockable. This meant that
anyone who had started waiting on the Lockable's lock didn't see the
updates once it was unlocked, because the entire Lockable had been
replaced.
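
A schematic of the Lockable race described above, using a hypothetical minimal Lockable type (not Arrow.jl's actual implementation):

# Hypothetical Lockable: a lock paired with the value it guards.
struct Lockable{T}
    lock::ReentrantLock
    value::T
end

const state = Ref(Lockable(ReentrantLock(), Dict{Int, String}()))

# Buggy pattern: the whole Lockable is replaced inside the locked region, so any
# task still waiting on the *old* Lockable's lock never observes the new contents.
function update_buggy!(k, v)
    lock(state[].lock) do
        state[] = Lockable(ReentrantLock(), Dict(k => v))
    end
end

# Fixed pattern: keep the same Lockable and mutate its _contents_ under the lock.
function update_fixed!(k, v)
    lock(state[].lock) do
        state[].value[k] = v
    end
end

update_fixed!(1, "one")   # state[].value == Dict(1 => "one")
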
quinnj (Member) commented Mar 12, 2021

Ok, the fix is up: #149. I put all the gory details in the PR message, but the short of it is that when Arrow.write needed to serialize an Arrow.DictEncoded, it wasn't queuing up an appropriate dictionary message to be written before the record batch.
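
A sketch of the case the fix targets: round-tripping a table that contains an Arrow.DictEncoded column. Arrow.write's dictencode option is assumed here as one way to end up with such a column; the original report got there via a table read from CSV.File.

using Arrow, DataFrames

df = DataFrame(id = 1:4, grp = ["a", "b", "a", "b"])

Arrow.write("first.arrow", df; dictencode = true)   # dictionary-encode the columns on write
tbl = Arrow.Table("first.arrow")                    # tbl.grp comes back as an Arrow.DictEncoded

# Re-serializing tbl is a no-op for most column types, but the DictEncoded column
# needs a dictionary message queued before the record batch; before #149 it wasn't,
# which is what made the resulting file unreadable.
Arrow.write("second.arrow", tbl)
Arrow.Table("second.arrow")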

nilshg commented Mar 13, 2021

I don't think I want to reopen this, just to say that using main I can successfully write out my large file and read it back in. However, on writing I now get a bunch of these messages:

DEBUG: C:\Users\ngudat\.julia\packages\Arrow\8X1NB\src\write.jl:237 writing message: metalen = 2512, bodylen = 0, isrecordbatch = false, headerType = Arrow.Flatbuf.Schema
DEBUG: C:\Users\ngudat\.julia\packages\Arrow\8X1NB\src\write.jl:237 writing message: metalen = 176, bodylen = 40, isrecordbatch = false, headerType = Arrow.Flatbuf.DictionaryBatch

These didn't appear when writing the same file on 1.2.4; I'm not sure whether I should be worried about them. The table has the correct dimensions upon reading it, but I haven't checked for cell-by-cell equality (it's 28 million rows by 34 columns, so a bit arduous to do...).

quinnj (Member) commented Mar 13, 2021

Whoops! Those are just some debug statements I left turned on. There are probably a few `@debug -1 ...` expressions that need to be `@debug 1`. Feel free to PR those changes, or I can get to them later today.
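
The behaviour being described, as a minimal sketch of a level-gated debug macro (hypothetical, not Arrow.jl's actual debug machinery): messages tagged -1 print at the default level, while messages tagged 1 stay silent unless the level is raised.

# Hypothetical level-gated debug logging.
const DEBUG_LEVEL = Ref(0)    # default threshold: only levels <= 0 are printed

macro dbg(level, msg)
    quote
        if $(esc(level)) <= DEBUG_LEVEL[]
            println(stderr, "DEBUG: ", $(esc(msg)))
        end
    end
end

@dbg(-1, "printed at the default level")            # -1 <= 0, so this always shows
@dbg(1, "silent unless DEBUG_LEVEL[] is raised")    # 1 > 0, so this is hidden by default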

tanmaykm pushed a commit to tanmaykm/arrow-julia that referenced this issue Apr 7, 2021
omus added a commit that referenced this issue May 27, 2021
quinnj pushed a commit that referenced this issue Jun 7, 2021
* Test/suppress error log issue #126 test

* Test misc warnings