Handle len of -1 in "compressed" buffers from other languages #442

Merged: 1 commit, May 23, 2023


quinnj
Member

@quinnj quinnj commented May 22, 2023

It's unclear why other language implementations set a compression codec for Arrow data and then write a length of -1 as a sentinel meaning the buffer is actually not compressed. But since they do, we can handle that case pretty easily. I'm basically just adding a test here from @DrChainsaw's original PR (#436).
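For readers following along, here is a minimal sketch of the reader-side handling described above. It is written in Python rather than Julia and is not the code changed in this PR. It assumes the Arrow IPC layout in which each buffer of a compressed record batch starts with an 8-byte little-endian signed integer holding the uncompressed length, with -1 meaning the bytes that follow are stored as-is. The name `read_buffer` is hypothetical, and `zlib` is only a standard-library stand-in for the LZ4 frame and Zstandard codecs that Arrow actually uses.

```python
import struct
import zlib

# Sentinel length meaning "this buffer was written without compression".
UNCOMPRESSED_SENTINEL = -1


def read_buffer(raw: bytes, decompress=zlib.decompress) -> bytes:
    """Decode one buffer from a record batch that declares a compression codec.

    Hypothetical helper: `decompress` stands in for the real codec.
    """
    if not raw:
        # Nothing was written for this buffer, so there is nothing to decode.
        return b""
    if len(raw) < 8:
        raise ValueError("buffer too short to hold the 8-byte length prefix")

    # The first 8 bytes are the uncompressed length (int64, little-endian).
    (uncompressed_len,) = struct.unpack_from("<q", raw, 0)
    payload = raw[8:]

    if uncompressed_len == UNCOMPRESSED_SENTINEL:
        # The writer skipped compression for this buffer; use the bytes directly.
        return payload

    data = decompress(payload)
    if len(data) != uncompressed_len:
        raise ValueError("decompressed size does not match the length prefix")
    return data
```

The point of the sketch is that the sentinel check has to happen per buffer, before the codec is invoked, because the codec declared for the batch says nothing about whether an individual buffer was actually compressed.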

@quinnj quinnj changed the title Handle len of -1 in "compresses" buffers from other languages Handle len of -1 in "compressed" buffers from other languages May 22, 2023
@codecov-commenter

Codecov Report

Merging #442 (b622bec) into main (94749c0) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #442      +/-   ##
==========================================
+ Coverage   87.06%   87.09%   +0.03%     
==========================================
  Files          26       26              
  Lines        3279     3279              
==========================================
+ Hits         2855     2856       +1     
+ Misses        424      423       -1     
| Impacted Files | Coverage Δ |
|---|---|
| src/table.jl | 92.88% <100.00%> (+0.22%) ⬆️ |


@DrChainsaw
Contributor

Thanks a lot for this!

In case it is helpful, here is the Java code which determines whether to set the -1 flag or not: https://github.com/apache/arrow/blob/fbe5f641d327ee81db00ce5f056940a69f4d8603/java/vector/src/main/java/org/apache/arrow/vector/compression/AbstractCompressionCodec.java#L42-L53

The tl;dr is that they check whether the size after compression is larger than the uncompressed data. Since this can be different for different columns you can end up with a table with a mixture of compressed and non-compressed columns.
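The decision summarized in the tl;dr can be sketched roughly as follows. This is a Python illustration of the idea, not a port of the linked Java code. The name `write_buffer` is hypothetical, `zlib` is a standard-library stand-in for the real Arrow codecs, and the 8-byte little-endian length prefix is assumed from the Arrow IPC format.

```python
import struct
import zlib


def write_buffer(data: bytes, compress=zlib.compress) -> bytes:
    """Encode one buffer, falling back to raw bytes when compression does not pay off.

    Hypothetical helper: `compress` stands in for the real codec.
    """
    compressed = compress(data)
    if len(compressed) > len(data):
        # Compression made the buffer larger: store the original bytes
        # and flag that with a length prefix of -1.
        return struct.pack("<q", -1) + data
    # Otherwise store the compressed bytes, prefixed with the uncompressed length.
    return struct.pack("<q", len(data)) + compressed


# Two columns in the same table can end up encoded differently.
tiny_column = write_buffer(b"ab")            # too small to benefit from compression
repetitive_column = write_buffer(b"a" * 1000)  # highly compressible

(tiny_len,) = struct.unpack_from("<q", tiny_column, 0)
(repetitive_len,) = struct.unpack_from("<q", repetitive_column, 0)
```

Here `tiny_len` comes out as the -1 sentinel while `repetitive_len` is the real uncompressed length, which is how a single file ends up with a mixture of compressed and non-compressed columns.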

I suppose this is an optimization that the Julia writer could implement as well given that it seems like it is out there. I have no idea what the potential gains are though.

I have searched the "Specification and Protocols" section of the Arrow docs for rules on how to set the length when applying compression, but I could not find anything. If you happen to know where it is specified, I would be happy to take a look, since it might help with the other issue I have encountered when reading files generated by the Java implementation.

@quinnj
Member Author

quinnj commented May 23, 2023

> The tl;dr is that they check whether the size after compression is larger than the uncompressed data. Since this can be different for different columns you can end up with a table with a mixture of compressed and non-compressed columns.

Wow, what a terrible idea! If the data ends up larger when compressed, it's usually by a very small amount, and you're usually dealing w/ a small amount of data anyway, so the complication of mixing/matching compression w/ length sentinels just seems way overboard.

@quinnj quinnj merged commit 3008e7f into main May 23, 2023
@quinnj quinnj deleted the jq-neg-one-len branch May 23, 2023 15:32