Arrow.write in v1.4.2 can create an invalid arrow file #126

Closed
dmbates opened this issue Feb 7, 2021 · 11 comments · Fixed by #149

Comments

dmbates (Contributor) commented Feb 7, 2021

Running https://github.com/crsl4/julia-workshop/blob/main/notebooks/consistency.jmd followed by https://github.com/crsl4/julia-workshop/blob/main/notebooks/Arrow.jmd produces a file, 02.arrow, that cannot be read back. In a Jupyter notebook, an attempt to read the file with Arrow.Table produces

TaskFailedException

    nested task error: TaskFailedException
    Stacktrace:
     [1] wait
       @ ./task.jl:317 [inlined]
     [2] fetch
       @ ./task.jl:332 [inlined]
     [3] macro expansion
       @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:198 [inlined]
     [4] (::Arrow.var"#82#88"{Channel{Task}, Arrow.Table})()
       @ Arrow ./threadingconstructs.jl:169
    
        nested task error: KeyError: key 1 not found
        Stacktrace:
         [1] getindex
           @ ./dict.jl:482 [inlined]
         [2] build(field::Arrow.Flatbuf.Field, batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
           @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:355
         [3] iterate(x::Arrow.VectorIterator, ::Tuple{Int64, Int64, Int64})
           @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:336
         [4] iterate
           @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:333 [inlined]
         [5] copyto!(dest::Vector{Any}, src::Arrow.VectorIterator)
           @ Base ./abstractarray.jl:837
         [6] _collect
           @ ./array.jl:608 [inlined]
         [7] collect
           @ ./array.jl:602 [inlined]
         [8] macro expansion
           @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:254 [inlined]
         [9] (::Arrow.var"#86#92"{Bool, Dict{Int64, Arrow.DictEncoding}, Arrow.Batch})()
           @ Arrow ./threadingconstructs.jl:169

Stacktrace:
 [1] wait
   @ ./task.jl:317 [inlined]
 [2] Arrow.Table(bytes::Vector{UInt8}, off::Int64, tlen::Nothing; convert::Bool)
   @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:261
 [3] Arrow.Table(str::String, pos::Int64, len::Nothing; convert::Bool)
   @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:179
 [4] Table
   @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:179 [inlined]
 [5] top-level scope
   @ In[32]:1
 [6] eval
   @ ./boot.jl:360 [inlined]
 [7] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1090

pyarrow.feather.read_table also produces an error saying "key 1 not found".

dmbates (Contributor, Author) commented Feb 9, 2021

I misidentified the cause of this problem. I thought it was because I had changed some of the columns from Int to Int8, etc. It turns out it was because I wrote the Arrow file from a DataFrame created from an Arrow.Table. In other words, it appears to be a round-tripping issue. I will work on an MWE.
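
For reference, the shape of that round trip, sketched on hypothetical toy data (the toy columns may not be enough to trigger the failure; the real file was):

using Arrow, DataFrames

df = DataFrame(a = 1:3, b = ["x", "y", "z"])    # stand-in for the real table
Arrow.write("first.arrow", df)                  # initial write
df2 = DataFrame(Arrow.Table("first.arrow"))     # read back into a DataFrame
Arrow.write("second.arrow", df2)                # the write that produced the unreadable file for the real data
Arrow.Table("second.arrow")                     # this read failed with KeyError: key 1 not found on v1.4.2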

dmbates (Contributor, Author) commented Feb 9, 2021

So I have been able to reduce it to

julia> Arrow.write("02_validated_output.arrow", CSV.File("/home/bates/git/mb1-analysis-public/processed_data/02_validated_output.csv", missingstrings=["NA"], truestrings=["TRUE"], falsestrings=["FALSE"]))

julia> Arrow.write("02.arrow", DataFrame(Arrow.Table("02_validated_output.arrow")))

julia> tbl = Arrow.Table("02.arrow")
ERROR: TaskFailedException
Stacktrace:
 [1] wait
   @ ./task.jl:317 [inlined]
 [2] Arrow.Table(bytes::Vector{UInt8}, off::Int64, tlen::Nothing; convert::Bool)
   @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:261
 [3] Arrow.Table(str::String, pos::Int64, len::Nothing; convert::Bool)
   @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:179
 [4] Table
   @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:179 [inlined]
 [5] top-level scope
   @ REPL[63]:1

    nested task error: TaskFailedException
    Stacktrace:
     [1] wait
       @ ./task.jl:317 [inlined]
     [2] fetch
       @ ./task.jl:332 [inlined]
     [3] macro expansion
       @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:198 [inlined]
     [4] (::Arrow.var"#82#88"{Channel{Task}, Arrow.Table})()
       @ Arrow ./threadingconstructs.jl:169
    
        nested task error: KeyError: key 1 not found
        Stacktrace:
         [1] getindex
           @ ./dict.jl:482 [inlined]
         [2] build(field::Arrow.Flatbuf.Field, batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
           @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:355
         [3] iterate(x::Arrow.VectorIterator, ::Tuple{Int64, Int64, Int64})
           @ Arrow ~/.julia/packages/Arrow/Re9EM/src/table.jl:336
         [4] iterate
           @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:333 [inlined]
         [5] copyto!(dest::Vector{Any}, src::Arrow.VectorIterator)
           @ Base ./abstractarray.jl:837
         [6] _collect
           @ ./array.jl:608 [inlined]
         [7] collect
           @ ./array.jl:602 [inlined]
         [8] macro expansion
           @ ~/.julia/packages/Arrow/Re9EM/src/table.jl:254 [inlined]
         [9] (::Arrow.var"#86#92"{Bool, Dict{Int64, Arrow.DictEncoding}, Arrow.Batch})()
           @ Arrow ./threadingconstructs.jl:169

Is it worthwhile for me to work out exactly which columns are causing the problem?

dmbates (Contributor, Author) commented Feb 9, 2021

I thought that it might be because there are two columns in the data frame, bilingual_parenta_nae and bilingual_parentb_nae, that do not have any non-missing data. But that does not seem to be the problem.

dmbates (Contributor, Author) commented Feb 9, 2021

It seems to be related to the size of the file and perhaps the number of threads. Both

julia> Arrow.Table(Arrow.write("02.arrow", select(df, 1:40)))
julia> Arrow.Table(Arrow.write("02.arrow", select(df, 41:80)))

succeed but

julia> Arrow.Table(Arrow.write("02.arrow", select(df, 1:80)))

fails on my system

julia> versioninfo(verbose=true)
Julia Version 1.6.0-rc1
Commit a58bdd9010 (2021-02-06 15:49 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
      Ubuntu 20.10
  uname: Linux 5.8.0-43-generic #49-Ubuntu SMP Fri Feb 5 03:01:28 UTC 2021 x86_64 x86_64
  CPU: Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz: 
              speed         user         nice          sys         idle          irq
       #1  3788 MHz       7211 s        115 s       1158 s     113858 s          0 s
       #2  3767 MHz       5846 s         30 s       1179 s     115156 s          0 s
       #3  3713 MHz       5572 s         11 s       1224 s     115320 s          0 s
       #4  3684 MHz       5141 s       1163 s       1164 s     114840 s          0 s
       
  Memory: 15.516902923583984 GB (5832.85546875 MB free)
  Uptime: 12279.0 sec
  Load Avg:  0.24  0.26  0.28
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  HOME = /home/bates
  TERM = screen
  PATH = /home/bates/bin:/home/bates/.cargo/bin:/home/bates/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

dmbates (Contributor, Author) commented Feb 9, 2021

Actually, writing select(df, 1:80) produces a file that I can read, but, as shown above, Arrow.write("02.arrow", df) doesn't.

quinnj (Member) commented Mar 11, 2021

Sorry for the delay here; I think at first I saw you were continuing to investigate and then it kind of got buried in my emails. I can reproduce and I'll dig in.

quinnj (Member) commented Mar 11, 2021

I think the problem is that the dictionary message is being written after the data record batch message. Let me dig in a bit more to confirm and figure out why the messages are being written out of order. @nilshg, I think this is the same issue you were running into (as discussed today on Slack).

nilshg commented Mar 11, 2021

Yes, it looks like the same issue. I should say that my file is also quite large (100m rows, 9 GB in arrow format) and I'm writing with 8 threads. While I can't share the data, I'm happy to try writing a smaller part of that file and/or writing on one thread if that helps.
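
(For reference, the thread count is fixed when Julia starts, e.g. julia -t 1 or JULIA_NUM_THREADS=1; a quick sanity check before re-running the write, as a sketch:)

using Base.Threads: nthreads

# Confirm the single-threaded case is actually being tested before calling Arrow.write again.
nthreads() == 1 || @warn "restart Julia with `julia -t 1` to test the single-threaded write"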

quinnj added a commit that referenced this issue Mar 12, 2021
Fixes #126. The issue here arose when `Arrow.write` was faced with the
task of serializing an `Arrow.DictEncoded`. For most arrow array types,
if the input array is already an arrow array type, it's a no-op (e.g. if
you're writing out an `Arrow.Table`). The problem comes from
`Arrow.DictEncoded`, where there is still no conversion required, but we
do need to make a note of the dict-encoded column to ensure a dictionary
message is written before the record batch. In addition, we add some
code for handling delta dictionary messages, if required, for multiple
record batches that contain `Arrow.DictEncoded`s; this is a valid use
case where you may have multiple arrow files, with the same schema, that
you wish to serialize as a single arrow file with each file as a
separate record batch.

Slightly unrelated, but there's also a fix here in our use of Lockable.
We had a race condition I ran into once where the locking was on the
Lockable object, but inside the locked region we replaced the entire
Lockable instead of the _contents_ of the Lockable. This meant that
anyone who had started waiting on the Lockable's lock didn't see the
updates once it was unlocked, because the entire Lockable had been
replaced.
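
A schematic of the Lockable race described above, using a hypothetical minimal Lockable type (not Arrow.jl's actual implementation):

# Hypothetical Lockable: a lock paired with the value it guards.
struct Lockable{T}
    lock::ReentrantLock
    value::T
end

const state = Ref(Lockable(ReentrantLock(), Dict{Int, String}()))

# Buggy pattern: the whole Lockable is replaced inside the locked region, so any
# task still waiting on the *old* Lockable's lock never observes the new contents.
function update_buggy!(k, v)
    lock(state[].lock) do
        state[] = Lockable(ReentrantLock(), Dict(k => v))
    end
end

# Fixed pattern: keep the same Lockable and mutate its _contents_ under the lock.
function update_fixed!(k, v)
    lock(state[].lock) do
        state[].value[k] = v
    end
end

update_fixed!(1, "one")   # state[].value == Dict(1 => "one")
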
quinnj (Member) commented Mar 12, 2021

Ok, the fix is up: #149. I put all the gory details in the PR message, but the short of it is that when Arrow.write needed to serialize an Arrow.DictEncoded, it wasn't queuing up an appropriate dictionary message to be written before the record batch.
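
A sketch of the case the fix targets: round-tripping a table that contains an Arrow.DictEncoded column. Arrow.write's dictencode option is assumed here as one way to end up with such a column; the original report got there via a table read from CSV.File.

using Arrow, DataFrames

df = DataFrame(id = 1:4, grp = ["a", "b", "a", "b"])

Arrow.write("first.arrow", df; dictencode = true)   # dictionary-encode the columns on write
tbl = Arrow.Table("first.arrow")                    # tbl.grp comes back as an Arrow.DictEncoded

# Re-serializing tbl is a no-op for most column types, but the DictEncoded column
# needs a dictionary message queued before the record batch; before #149 it wasn't,
# which is what made the resulting file unreadable.
Arrow.write("second.arrow", tbl)
Arrow.Table("second.arrow")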

nilshg commented Mar 13, 2021

I don't think I want to reopen this, just to say that using main I can successfully write out my large file and read it back in. However, on writing I now get a bunch of these messages:

DEBUG: C:\Users\ngudat\.julia\packages\Arrow\8X1NB\src\write.jl:237 writing message: metalen = 2512, bodylen = 0, isrecordbatch = false, headerType = Arrow.Flatbuf.Schema
DEBUG: C:\Users\ngudat\.julia\packages\Arrow\8X1NB\src\write.jl:237 writing message: metalen = 176, bodylen = 40, isrecordbatch = false, headerType = Arrow.Flatbuf.DictionaryBatch

These didn't appear when writing the same file on 1.2.4; I'm not sure whether I should be worried about them. The table has the correct dimensions upon reading it, but I haven't checked for cell-by-cell equality (it's 28 million rows by 34 columns, so a bit arduous to do...).

quinnj (Member) commented Mar 13, 2021

Whoops! Those are just some debug statements I left turned on. There are probably a few `@debug -1 ...` expressions that need to be `@debug 1`. Feel free to PR those changes, or I can get to them later today.
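
The behaviour being described, as a minimal sketch of a level-gated debug macro (hypothetical, not Arrow.jl's actual debug machinery): messages tagged -1 print at the default level, while messages tagged 1 stay silent unless the level is raised.

# Hypothetical level-gated debug logging.
const DEBUG_LEVEL = Ref(0)    # default threshold: only levels <= 0 are printed

macro dbg(level, msg)
    quote
        if $(esc(level)) <= DEBUG_LEVEL[]
            println(stderr, "DEBUG: ", $(esc(msg)))
        end
    end
end

@dbg(-1, "printed at the default level")            # -1 <= 0, so this always shows
@dbg(1, "silent unless DEBUG_LEVEL[] is raised")    # 1 > 0, so this is hidden by default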

tanmaykm pushed a commit to tanmaykm/arrow-julia that referenced this issue Apr 7, 2021
omus added a commit that referenced this issue May 27, 2021
quinnj pushed a commit that referenced this issue Jun 7, 2021
* Test/suppress error log issue #126 test

* Test misc warnings