Arrow.write in v1.4.2 can create an invalid arrow file #126
I missed the cause of this problem. I thought it was because I had changed some of the columns from …
So I have been able to reduce it to … Is it worthwhile my working out exactly which columns are causing the problem?
I thought that it might be because there are two columns in the data frame, …
It seems to be related to the size of the file and perhaps the number of threads. Both … succeed, but … fails on my system.
Actually writing …
Sorry for the delay here; I think at first I saw you were continuing to investigate and then it kind of got buried in my emails. I can reproduce and I'll dig in.
I think the problem is the dictionary message is being written after the data record batch message. Let me dig in a bit more to confirm and figure out why the messages are being written out of order. @nilshg, I think this is the same issue you were running into (as discussed today on slack).
Yes, looks like the same issue. I should say that my file is also quite large (100m rows, 9GB in arrow format) and I'm writing with 8 threads. While I can't share the data, I'm happy to try writing a smaller part of that file and/or writing on one thread if that helps?
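A minimal sketch of the round trip under discussion (the sizes, column names, and file names below are invented; the failure is the out-of-order dictionary message diagnosed above):

```julia
# Run with multiple threads, e.g. `julia -t 8`, to mirror the setup above.
using Arrow, PooledArrays

# A pooled column becomes a dict-encoded arrow column on write.
tbl = (grp = PooledArray(rand(["a", "b", "c"], 1_000_000)),)
Arrow.write("one.arrow", tbl)

t = Arrow.Table("one.arrow")   # t.grp comes back as an `Arrow.DictEncoded`
Arrow.write("two.arrow", t)    # v1.4.2 could emit the dictionary message after
                               # the record batch here, making the file invalid

Arrow.Table("two.arrow")       # reading the re-written file then fails,
                               # e.g. "key 1 not found"
```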
Fixes #126. The issue here was when `Arrow.write` was faced with the task of serializing an `Arrow.DictEncoded`. For most arrow array types, if the input array is already an arrow array type, it's a no-op (e.g. if you're writing out an `Arrow.Table`). The problem comes from `Arrow.DictEncoded`, where there is still no conversion required, but we do need to make a note of the dict-encoded column to ensure a dictionary message is written before the record batch.

In addition, we also add some code for handling delta dictionary messages if required from multiple record batches that contain `Arrow.DictEncoded`s, which is a valid use-case where you may have multiple arrow files, with the same schema, that you wish to serialize as a single arrow file w/ each file as a separate record batch.

Slightly unrelated, but there's also a fix here in our use of Lockable. We actually had a race condition I ran into once where the locking was on the Lockable object, but inside the locked region, we replaced the entire Lockable instead of the _contents_ of the Lockable. This meant anyone who had started waiting on the Lockable's lock didn't see updates once it was unlocked, because the entire Lockable had been replaced.
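To illustrate the Lockable point, here is a schematic of the buggy pattern versus the fix. This is not Arrow.jl's actual code; it uses the `Base.Lockable` available in Julia 1.11 as a stand-in for the package's own Lockable type, and the names are invented:

```julia
mutable struct Cache
    state::Base.Lockable{Dict{Int,String}, ReentrantLock}
end

cache = Cache(Base.Lockable(Dict{Int,String}()))

# Buggy: the entire Lockable is replaced inside the locked region. A task
# already blocked on the *old* Lockable's lock wakes up holding a stale
# object and never sees the update.
lock(cache.state) do _
    cache.state = Base.Lockable(Dict(1 => "dictionary written"))
end

# Fixed: mutate the *contents* of the Lockable. Everyone waiting on the
# same lock observes the update once it is released.
lock(cache.state) do dict
    dict[1] = "dictionary written"
end
```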
Ok, fix is up: #149. I put all the gory details in the PR message, but the short of it was when `Arrow.write` was handed an already dict-encoded column (`Arrow.DictEncoded`), it treated it as a no-op and never noted that a dictionary message had to be written before the record batch.
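The multiple-record-batch use case described in the PR message can be exercised with Tables.jl partitions; a sketch with invented file names:

```julia
using Arrow, PooledArrays, Tables

# Two same-schema files with dict-encoded columns.
Arrow.write("part1.arrow", (grp = PooledArray(["a", "b", "a"]),))
Arrow.write("part2.arrow", (grp = PooledArray(["b", "c", "c"]),))

t1 = Arrow.Table("part1.arrow")   # grp is an `Arrow.DictEncoded`
t2 = Arrow.Table("part2.arrow")

# Each partition becomes its own record batch in the combined file; new
# dictionary values in later batches are what the delta-dictionary
# handling in #149 covers.
Arrow.write("combined.arrow", Tables.partitioner([t1, t2]))
```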
Don't think I'd want to reopen this, just to say that using … produces some extra output. These lines didn't appear when writing the same file on …
Whoops! Those are just some debug statements I left turned on. There are probably a few …
Running https://github.com/crsl4/julia-workshop/blob/main/notebooks/consistency.jmd followed by https://github.com/crsl4/julia-workshop/blob/main/notebooks/Arrow.jmd produces a file `02.arrow` that throws an error when trying to read it. In a Jupyter notebook, an attempt to read the file with `Arrow.Table` produces … `pyarrow.feather.read_table` also produces an error saying "key 1 not found".
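On the Julia side, the failing read amounts to the following (file name from the report above; the exact error text may differ):

```julia
using Arrow

# Files written by Arrow.jl v1.4.2 with dict-encoded columns could throw
# here, e.g. KeyError: key 1 not found.
t = Arrow.Table("02.arrow")
```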