Fix misaligned reference and logic error in crc32 #1906

saethlin · 2022-06-19T00:59:24Z

Which issue does this PR close?

Closes #.

Rationale for this change

Previously, this code tried to turn a &[u8] into a &[u32] without
checking alignment. This means it could and did create misaligned
references, which is UB. This can be detected by running the tests with
-Zbuild-std --target=x86_64-unknown-linux-gnu (or whatever your host
is). This change adopts the approach from the murmurhash implementation.

The previous implementation also ignored the tail bytes. The loop at the
end treats num_bytes as if it is the full length of the slice, but it
isn't, num_bytes number of bytes after the last 4-byte group. This can
be observed for example by changing "hello" to just "hell" in the tests.
Under the old implementation, the test will still pass. Now, the value
that comes out changes, and "hello" and "hell" hash to different values.

What changes are included in this PR?

A soundness and correctness fix for crc32_hash

Are there any user-facing changes?

I don't know. Can users observe the values that come out of crc32_hash?

Previously, this code tried to turn a &[u8] into a &[u32] without checking alignment. This means it could and did create misaligned references, which is UB. This can be detected by running the tests with -Zbuild-std --target=x86_64-unknown-linux-gnu (or whatever your host is). This change adopts the approach from the murmurhash implementation. The previous implementation also ignored the tail bytes. The loop at the end treats num_bytes as if it is the full length of the slice, but it isn't, num_bytes number of bytes after the last 4-byte group. This can be observed for example by changing "hello" to just "hell" in the tests. Under the old implementation, the test will still pass. Now, the value that comes out changes, and "hello" and "hell" hash to different values.

codecov-commenter · 2022-06-19T01:18:08Z

Codecov Report

Merging #1906 (5d6bd7c) into master (535cd20) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 5d6bd7c differs from pull request most recent head cba393e. Consider uploading reports for the commit cba393e to get more accurate results

@@            Coverage Diff             @@
##           master    #1906      +/-   ##
==========================================
- Coverage   83.42%   83.42%   -0.01%     
==========================================
  Files         214      214              
  Lines       57025    57018       -7     
==========================================
- Hits        47574    47567       -7     
  Misses       9451     9451

Impacted Files	Coverage Δ
parquet/src/util/hash_util.rs	`95.16% <100.00%> (-0.50%)`	⬇️
arrow/src/datatypes/datatype.rs	`65.42% <0.00%> (-0.38%)`	⬇️
parquet_derive/src/parquet_field.rs	`65.75% <0.00%> (ø)`
parquet/src/encodings/encoding.rs	`93.65% <0.00%> (+0.19%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 535cd20...cba393e. Read the comment docs.

tustvold

This looks good to me, I am a bit surprised at how incorrect this method was. Is it actually used anywhere, mainly wondering if we have a gap in our test coverage somewhere...

tustvold · 2022-06-19T07:25:26Z

parquet/src/util/hash_util.rs

-    while offset < num_bytes {
-        hash = _mm_crc32_u8(hash, bytes[offset]);
-        offset += 1;
+    let remainder = bytes.len() % 4;


You could use https://doc.rust-lang.org/std/slice/struct.ChunksExact.html#method.remainder

Previously, this code tried to turn a &[u8] into a &[u32] without checking alignment. This means it could and did create misaligned references, which is UB. This can be detected by running the tests with -Zbuild-std --target=x86_64-unknown-linux-gnu (or whatever your host is). This change adopts the approach from the murmurhash implementation. The previous implementation also ignored the tail bytes. The loop at the end treats num_bytes as if it is the full length of the slice, but it isn't, num_bytes number of bytes after the last 4-byte group. This can be observed for example by changing "hello" to just "hell" in the tests. Under the old implementation, the test will still pass. Now, the value that comes out changes, and "hello" and "hell" hash to different values.

Previously, this code tried to turn a &[u8] into a &[u32] without checking alignment. This means it could and did create misaligned references, which is UB. This can be detected by running the tests with -Zbuild-std --target=x86_64-unknown-linux-gnu (or whatever your host is). This change adopts the approach from the murmurhash implementation. The previous implementation also ignored the tail bytes. The loop at the end treats num_bytes as if it is the full length of the slice, but it isn't, num_bytes number of bytes after the last 4-byte group. This can be observed for example by changing "hello" to just "hell" in the tests. Under the old implementation, the test will still pass. Now, the value that comes out changes, and "hello" and "hell" hash to different values. Can drop this after rebase on commit ded6316 "Fix misaligned reference and logic error in crc32 (apache#1906)", first released in 17.0.0

github-actions bot added the parquet Changes to the parquet crate label Jun 19, 2022

tustvold approved these changes Jun 19, 2022

View reviewed changes

tustvold merged commit ded6316 into apache:master Jun 19, 2022

mcheshkov mentioned this pull request Jul 16, 2024

Cherry-pick "Fix misaligned reference and logic error in crc32" cube-js/arrow-rs#36

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix misaligned reference and logic error in crc32 #1906

Fix misaligned reference and logic error in crc32 #1906

saethlin commented Jun 19, 2022 •

edited

Loading

codecov-commenter commented Jun 19, 2022

tustvold left a comment

tustvold Jun 19, 2022

Fix misaligned reference and logic error in crc32 #1906

Fix misaligned reference and logic error in crc32 #1906

Conversation

saethlin commented Jun 19, 2022 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

codecov-commenter commented Jun 19, 2022

Codecov Report

tustvold left a comment

Choose a reason for hiding this comment

tustvold Jun 19, 2022

Choose a reason for hiding this comment

saethlin commented Jun 19, 2022 •

edited

Loading