Check hashes first during probing the aggr hash table #11718
The clickbench result
Nice find!
Thank you @Rachelint
// verify that a group that we are inserting with hash is
// actually the same key value as the group in
// existing_idx (aka group_values @ row)
group_rows.row(row) == group_values.row(*group_idx)
target_hash == *exist_hash
I am somewhat confused about how this makes things faster -- I thought that the check for equal hash was done as part of self.map.get_mut (aka the closure is only called when the hashes are equal, so I would expect this comparison to always be true).
However, if the benchmark results show it is an improvement, wonderful.
I'll run the numbers as well to confirm.
I don't get it either.
I thought the closure (row values equal check) is triggered only if the hash is matched 😕
> I am somewhat confused about how this makes things faster -- I thought that the check for equal hash was done as part of self.map.get_mut (aka the closure is only called when the hashes are equal, so I would expect this comparison to always be true).
> However, if the benchmark results show it is an improvement, wonderful.
> I'll run the numbers as well to confirm.
Here is my guess as to why it is faster.
I read the source code in hashbrown, and the get_mut procedure works like this:
- Use the hash value to find the first bucket.
- If the bucket is filled, it uses the eq function we passed in to check whether it is the target, for example the previous eq function:
  |(exist_hash, group_idx)| {
      // verify that a group that we are inserting with hash is
      // actually the same key value as the group in
      // existing_idx (aka group_values @ row)
      group_rows.row(row) == group_values.row(*group_idx)
  })
- If eq returns true, it is the target; otherwise we need to probe the next bucket and check again.
In the high-cardinality aggregation scenario, the entry often does not actually exist in the hash table.
And after the hash table grows very large (many buckets are filled), the probe is performed many times and finally finds nothing...
In this situation, checking the hash first can reduce the random memory accesses compared to directly checking the group value through the group index.
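The probing behaviour described above can be sketched with a toy table. This is a simplified linear-probing model and not hashbrown's actual implementation (which, among other things, filters candidates by a control byte before calling eq). ToyTable, find, count_row_comparisons, and the chosen hash values are hypothetical names and data used only for illustration. The sketch counts how often the expensive row comparison runs when looking up a group that is absent, with and without the hash check first.

```rust
use std::cell::Cell;

/// Toy open-addressing table: each slot stores (hash, group_idx), like the
/// (exist_hash, group_idx) entries seen by the closure. Hypothetical model.
struct ToyTable {
    slots: Vec<Option<(u64, usize)>>,
}

impl ToyTable {
    fn new(capacity: usize) -> Self {
        Self { slots: vec![None; capacity] }
    }

    /// Insert with linear probing (assumes the table never becomes full).
    fn insert(&mut self, hash: u64, group_idx: usize) {
        let n = self.slots.len();
        let mut pos = (hash as usize) % n;
        while self.slots[pos].is_some() {
            pos = (pos + 1) % n;
        }
        self.slots[pos] = Some((hash, group_idx));
    }

    /// Probe from the home bucket, calling `eq` on every filled bucket
    /// until it returns true or an empty bucket ends the probe sequence.
    fn find(&self, hash: u64, mut eq: impl FnMut(&(u64, usize)) -> bool) -> Option<usize> {
        let n = self.slots.len();
        let mut pos = (hash as usize) % n;
        for _ in 0..n {
            match &self.slots[pos] {
                None => return None,
                Some(entry) => {
                    if eq(entry) {
                        return Some(entry.1);
                    }
                    pos = (pos + 1) % n;
                }
            }
        }
        None
    }
}

/// Returns (row comparisons without the hash check, row comparisons with the
/// hash check first) for a lookup of a group that is not in the table.
fn count_row_comparisons() -> (usize, usize) {
    // Stand-in for the stored group values (rows).
    let group_values: Vec<String> = (0..6).map(|i| format!("group-{i}")).collect();
    let mut table = ToyTable::new(8);
    // Hashes are multiples of 8, so every entry shares home bucket 0
    // and they form a single probe chain.
    for idx in 0..group_values.len() {
        table.insert(8 * (idx as u64 + 1), idx);
    }

    let target_hash: u64 = 8 * 100; // same home bucket, but no stored hash equals it
    let target_row = "group-new";
    let comparisons = Cell::new(0usize);
    let rows_equal = |group_idx: usize| {
        comparisons.set(comparisons.get() + 1); // the "expensive" comparison
        group_values[group_idx] == target_row
    };

    // Before: compare rows for every probed bucket.
    let found = table.find(target_hash, |(_, group_idx)| rows_equal(*group_idx));
    assert!(found.is_none());
    let without_hash_check = comparisons.replace(0);

    // After: cheap u64 comparison first; `&&` short-circuits the row comparison.
    let found = table.find(target_hash, |(exist_hash, group_idx)| {
        target_hash == *exist_hash && rows_equal(*group_idx)
    });
    assert!(found.is_none());
    (without_hash_check, comparisons.get())
}
```

In this toy setup the absent group walks a chain of six filled buckets, so the first variant performs six row comparisons and the second performs none. The real table would call eq less often than this model suggests, so the sketch only shows the direction of the effect, not its size.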
cc @jayzhan211 , can see the guess above.
@Rachelint
My understanding is similar: collisions will happen quite often in get_mut, and an equality check on u64 hashes will be faster than retrieving and comparing rows.
Hash collisions are usually very rare, even for high cardinality, but RawTable::get_mut doesn't check for equality itself; it just finds a first match without guaranteeing the hash values are the same (the equality check should be in the provided equality function). In other implementations we also check for hash values to be equal first.
I was curious -- I think the code @Rachelint is referring to in hashbrown is here
It does appear that collisions could happen (it is doing some sort of abbreviated check by condensing the actual hash value to a byte or something), though I don't fully understand how it works.
I wonder if there are other places we could use this observation 🤔
Good idea! I am trying to find it.
The idea to check hash before expensive arrow row comparison makes sense to me.
> I was curious -- I think the code @Rachelint is referring to in hashbrown is here
> It does appear that collisions could happen (it is doing some sort of abbreviated check by condensing the actual hash value to a byte or something), though I don't fully understand how it works.
> I wonder if there are other places we could use this observation 🤔
This talk explains the details of this one-byte abbreviation trick https://www.youtube.com/watch?v=ncHmEUmJZf4 , I vaguely remember they said when this 1 byte check is done, it's very likely to find the correct slot.
Looks like it's not working well when the hash table size grows over some threshold?
> I was curious -- I think the code @Rachelint is referring to in hashbrown is here
> https://github.com/rust-lang/hashbrown/blob/ac00a0bbef46f02f555e235f57ce263aefa361e0/src/raw/mod.rs#L2183-L2199
> It does appear that collisions could happen (it is doing some sort of abbreviated check by condensing the actual hash value to a byte or something), though I don't fully understand how it works.
> I wonder if there are other places we could use this observation 🤔
>
> This talk explains the details of this one-byte abbreviation trick https://www.youtube.com/watch?v=ncHmEUmJZf4 , I vaguely remember they said when this 1 byte check is done, it's very likely to find the correct slot. Looks like it's not working well when the hash table size grows over some threshold?
That seems to make sense. Maybe we should only add this check when the hash table size is larger than a threshold 🤔.
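The one-byte abbreviation discussed above can be illustrated with a small sketch. It assumes, based on a reading of the linked hashbrown code, that the tag kept per bucket is taken from the top 7 bits of the 64-bit hash; the names tag7 and is_false_positive are hypothetical and not part of hashbrown's API. Two different hashes can share a tag, and in that case a tag-only filter still hands the candidate to the eq closure even though the full hashes differ.

```rust
/// 7-bit tag taken from the top bits of a 64-bit hash.
/// Assumption: this mirrors how a hash is condensed into a control byte.
fn tag7(hash: u64) -> u8 {
    ((hash >> 57) & 0x7f) as u8
}

/// True when a tag-only filter would pass a candidate whose full hash
/// differs, i.e. the eq closure would run on a non-matching entry.
fn is_false_positive(target_hash: u64, exist_hash: u64) -> bool {
    tag7(target_hash) == tag7(exist_hash) && target_hash != exist_hash
}
```

With a 7-bit tag and uniformly distributed hashes, roughly 1 in 128 non-matching candidates would pass the filter by chance, so the number of such calls grows with the number of candidates probed. Comparing the full u64 hash inside the closure rejects those candidates without touching the row data.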
I was thinking this rationale is very non-obvious, so I proposed adding some comments on the rationale here #11750
I am running benchmarks against this PR now
Interestingly my benchmarks against this PR showed more mixed results:
@alamb Possibly because my main branch is not the latest?
Thanks @Rachelint -- I am also trying again to see if I did something wrong -- I also filed #11722 to see if that could be related 🤔
That seems so strange. I ran it after pulling and rebasing, and it still seems faster in the expected case, q32.
Another possibility I can think of is that it is due to cache size. This is the CPU info for my benchmark machine.
I can run the benchmark too
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ check-hash-first ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 0.48ms │ 0.44ms │ +1.09x faster │
│ QQuery 1 │ 42.21ms │ 41.12ms │ no change │
│ QQuery 2 │ 74.19ms │ 81.59ms │ 1.10x slower │
│ QQuery 3 │ 62.76ms │ 72.01ms │ 1.15x slower │
│ QQuery 4 │ 421.24ms │ 475.74ms │ 1.13x slower │
│ QQuery 5 │ 702.16ms │ 683.13ms │ no change │
│ QQuery 6 │ 37.37ms │ 40.15ms │ 1.07x slower │
│ QQuery 7 │ 40.82ms │ 40.23ms │ no change │
│ QQuery 8 │ 765.87ms │ 747.78ms │ no change │
│ QQuery 9 │ 663.02ms │ 643.98ms │ no change │
│ QQuery 10 │ 197.35ms │ 196.56ms │ no change │
│ QQuery 11 │ 220.60ms │ 213.23ms │ no change │
│ QQuery 12 │ 726.93ms │ 703.93ms │ no change │
│ QQuery 13 │ 1419.25ms │ 1464.85ms │ no change │
│ QQuery 14 │ 993.33ms │ 993.17ms │ no change │
│ QQuery 15 │ 496.78ms │ 483.43ms │ no change │
│ QQuery 16 │ 2026.78ms │ 1970.75ms │ no change │
│ QQuery 17 │ 1843.15ms │ 1886.30ms │ no change │
│ QQuery 18 │ 4766.27ms │ 4960.04ms │ no change │
│ QQuery 19 │ 58.23ms │ 56.11ms │ no change │
│ QQuery 20 │ 1518.53ms │ 1480.85ms │ no change │
│ QQuery 21 │ 1736.85ms │ 1750.68ms │ no change │
│ QQuery 22 │ 4067.41ms │ 4129.24ms │ no change │
│ QQuery 23 │ 8258.31ms │ 8458.29ms │ no change │
│ QQuery 24 │ 483.89ms │ 502.11ms │ no change │
│ QQuery 25 │ 492.77ms │ 489.45ms │ no change │
│ QQuery 26 │ 549.74ms │ 560.24ms │ no change │
│ QQuery 27 │ 1317.31ms │ 1382.42ms │ no change │
│ QQuery 28 │ 10165.79ms │ 10165.27ms │ no change │
│ QQuery 29 │ 409.71ms │ 399.79ms │ no change │
│ QQuery 30 │ 857.91ms │ 847.23ms │ no change │
│ QQuery 31 │ 983.91ms │ 947.09ms │ no change │
│ QQuery 32 │ 9653.47ms │ 9369.41ms │ no change │
│ QQuery 33 │ 4134.36ms │ 3266.48ms │ +1.27x faster │
│ QQuery 34 │ 3872.37ms │ 3991.76ms │ no change │
│ QQuery 35 │ 1055.52ms │ 1039.10ms │ no change │
│ QQuery 36 │ 144.10ms │ 148.25ms │ no change │
│ QQuery 37 │ 100.81ms │ 101.38ms │ no change │
│ QQuery 38 │ 105.72ms │ 107.32ms │ no change │
│ QQuery 39 │ 387.52ms │ 389.59ms │ no change │
│ QQuery 40 │ 34.77ms │ 34.62ms │ no change │
│ QQuery 41 │ 32.97ms │ 32.96ms │ no change │
│ QQuery 42 │ 43.08ms │ 42.63ms │ no change │
└──────────────┴────────────┴──────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main) │ 65965.62ms │
│ Total Time (check-hash-first) │ 65390.71ms │
│ Average Time (main) │ 1534.08ms │
│ Average Time (check-hash-first) │ 1520.71ms │
│ Queries Faster │ 2 │
│ Queries Slower │ 4 │
│ Queries with No Change │ 37 │
└─────────────────────────────────┴────────────┘
hmm...
@jayzhan211 interesting... q33 1.27x faster... I run q32 in I am running the whole queries now.
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ check-hash-first ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 0.40ms │ 0.41ms │ no change │
│ QQuery 1 │ 38.61ms │ 37.93ms │ no change │
│ QQuery 2 │ 73.86ms │ 73.49ms │ no change │
│ QQuery 3 │ 63.24ms │ 64.05ms │ no change │
│ QQuery 4 │ 425.21ms │ 390.92ms │ +1.09x faster │
│ QQuery 5 │ 694.98ms │ 650.10ms │ +1.07x faster │
│ QQuery 6 │ 37.20ms │ 38.77ms │ no change │
│ QQuery 7 │ 38.30ms │ 37.54ms │ no change │
│ QQuery 8 │ 757.20ms │ 663.71ms │ +1.14x faster │
│ QQuery 9 │ 654.19ms │ 628.15ms │ no change │
│ QQuery 10 │ 193.07ms │ 184.71ms │ no change │
│ QQuery 11 │ 215.94ms │ 209.10ms │ no change │
│ QQuery 12 │ 728.18ms │ 689.16ms │ +1.06x faster │
│ QQuery 13 │ 1386.26ms │ 1318.11ms │ no change │
│ QQuery 14 │ 975.30ms │ 955.42ms │ no change │
│ QQuery 15 │ 487.11ms │ 472.50ms │ no change │
│ QQuery 16 │ 1799.89ms │ 1582.73ms │ +1.14x faster │
│ QQuery 17 │ 1673.83ms │ 1550.87ms │ +1.08x faster │
│ QQuery 18 │ 4557.05ms │ 3911.50ms │ +1.17x faster │
│ QQuery 19 │ 57.09ms │ 57.37ms │ no change │
│ QQuery 20 │ 1540.04ms │ 1538.60ms │ no change │
│ QQuery 21 │ 1764.33ms │ 1767.64ms │ no change │
│ QQuery 22 │ 4006.80ms │ 4007.95ms │ no change │
│ QQuery 23 │ 8261.65ms │ 7982.00ms │ no change │
│ QQuery 24 │ 492.90ms │ 474.27ms │ no change │
│ QQuery 25 │ 472.54ms │ 480.76ms │ no change │
│ QQuery 26 │ 539.63ms │ 537.50ms │ no change │
│ QQuery 27 │ 1291.92ms │ 1291.64ms │ no change │
│ QQuery 28 │ 10082.99ms │ 9922.57ms │ no change │
│ QQuery 29 │ 403.20ms │ 403.61ms │ no change │
│ QQuery 30 │ 833.31ms │ 816.01ms │ no change │
│ QQuery 31 │ 966.80ms │ 905.63ms │ +1.07x faster │
│ QQuery 32 │ 9874.88ms │ 9189.33ms │ +1.07x faster │
│ QQuery 33 │ 3504.33ms │ 3956.20ms │ 1.13x slower │
│ QQuery 34 │ 3501.75ms │ 3635.95ms │ no change │
│ QQuery 35 │ 1069.16ms │ 957.72ms │ +1.12x faster │
│ QQuery 36 │ 148.00ms │ 140.89ms │ no change │
│ QQuery 37 │ 100.97ms │ 99.02ms │ no change │
│ QQuery 38 │ 103.56ms │ 103.36ms │ no change │
│ QQuery 39 │ 378.87ms │ 373.10ms │ no change │
│ QQuery 40 │ 33.78ms │ 33.72ms │ no change │
│ QQuery 41 │ 32.01ms │ 32.04ms │ no change │
│ QQuery 42 │ 40.22ms │ 40.03ms │ no change │
└──────────────┴────────────┴──────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main) │ 64300.55ms │
│ Total Time (check-hash-first) │ 62206.11ms │
│ Average Time (main) │ 1495.36ms │
│ Average Time (check-hash-first) │ 1446.65ms │
│ Queries Faster │ 10 │
│ Queries Slower │ 1 │
│ Queries with No Change │ 32 │
└─────────────────────────────────┴────────────┘
I guess the change is not really related to those queries; we might need a specialized query that could really benefit from it.
@jayzhan211 Maybe hash table size should be considered, too.
I ran again with release mode (#11722) and it seems to be showing an improvement in Q32 as well and several of the other
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main_base ┃ check-hash-first ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 0.69ms │ 0.71ms │ no change │
│ QQuery 1 │ 91.81ms │ 89.52ms │ no change │
│ QQuery 2 │ 195.10ms │ 190.79ms │ no change │
│ QQuery 3 │ 203.15ms │ 198.03ms │ no change │
│ QQuery 4 │ 2230.05ms │ 2237.94ms │ no change │
│ QQuery 5 │ 1984.18ms │ 2032.57ms │ no change │
│ QQuery 6 │ 78.88ms │ 79.93ms │ no change │
│ QQuery 7 │ 94.95ms │ 91.81ms │ no change │
│ QQuery 8 │ 3227.81ms │ 3011.43ms │ +1.07x faster │
│ QQuery 9 │ 2384.85ms │ 2360.24ms │ no change │
│ QQuery 10 │ 847.94ms │ 847.07ms │ no change │
│ QQuery 11 │ 910.10ms │ 915.36ms │ no change │
│ QQuery 12 │ 2139.05ms │ 2156.18ms │ no change │
│ QQuery 13 │ 4697.20ms │ 4549.50ms │ no change │
│ QQuery 14 │ 2952.47ms │ 2902.65ms │ no change │
│ QQuery 15 │ 2450.43ms │ 2445.44ms │ no change │
│ QQuery 16 │ 5974.80ms │ 5777.67ms │ no change │
│ QQuery 17 │ 5895.89ms │ 5650.60ms │ no change │
│ QQuery 18 │ 12165.50ms │ 11615.76ms │ no change │
│ QQuery 19 │ 168.33ms │ 169.17ms │ no change │
│ QQuery 20 │ 2749.31ms │ 2709.84ms │ no change │
│ QQuery 21 │ 3535.36ms │ 3495.37ms │ no change │
│ QQuery 22 │ 9567.54ms │ 9511.10ms │ no change │
│ QQuery 23 │ 22300.42ms │ 22589.13ms │ no change │
│ QQuery 24 │ 1365.25ms │ 1376.12ms │ no change │
│ QQuery 25 │ 1176.10ms │ 1176.65ms │ no change │
│ QQuery 26 │ 1487.44ms │ 1506.23ms │ no change │
│ QQuery 27 │ 4022.80ms │ 3984.70ms │ no change │
│ QQuery 28 │ 30287.63ms │ 29893.82ms │ no change │
│ QQuery 29 │ 1059.72ms │ 1045.76ms │ no change │
│ QQuery 30 │ 2503.34ms │ 2480.10ms │ no change │
│ QQuery 31 │ 3208.06ms │ 3070.93ms │ no change │
│ QQuery 32 │ 17948.66ms │ 16576.11ms │ +1.08x faster │
│ QQuery 33 │ 9600.79ms │ 9552.98ms │ no change │
│ QQuery 34 │ 9703.76ms │ 9574.08ms │ no change │
│ QQuery 35 │ 3786.65ms │ 3743.23ms │ no change │
│ QQuery 36 │ 345.94ms │ 344.00ms │ no change │
│ QQuery 37 │ 232.55ms │ 229.18ms │ no change │
│ QQuery 38 │ 193.74ms │ 199.95ms │ no change │
│ QQuery 39 │ 1149.55ms │ 1174.42ms │ no change │
│ QQuery 40 │ 87.70ms │ 92.48ms │ 1.05x slower │
│ QQuery 41 │ 79.73ms │ 81.89ms │ no change │
│ QQuery 42 │ 95.45ms │ 98.37ms │ no change │
└──────────────┴────────────┴──────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main_base) │ 175180.68ms │
│ Total Time (check-hash-first) │ 171828.82ms │
│ Average Time (main_base) │ 4073.97ms │
│ Average Time (check-hash-first) │ 3996.02ms │
│ Queries Faster │ 2 │
│ Queries Slower │ 1 │
│ Queries with No Change │ 40 │
└─────────────────────────────────┴─────────────┘
So I think this looks good to me 👍
Thanks a lot @Rachelint and everyone
Here is another run showing improvement.
Thanks again @Rachelint @Dandandan and @2010YOUY01
DuckDB has a similar check (but it uses just a u16 prefix of the hash). And as mentioned by @alamb, the related PR in DuckDB is interesting to see.
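A prefix check of this kind could look roughly like the following. This is a hypothetical sketch and not DuckDB's code: which 16 bits are used, where the prefix is stored, and the names hash_prefix, Entry, and may_match are all assumptions made for illustration.

```rust
/// Keep only a 16-bit prefix of the hash (here, as an assumption, the top 16 bits).
fn hash_prefix(hash: u64) -> u16 {
    (hash >> 48) as u16
}

/// Hypothetical table entry that stores the prefix next to the group index.
struct Entry {
    prefix: u16,
    group_idx: usize,
}

/// Cheap pre-check: only when the prefixes agree is it worth fetching and
/// comparing the full group value at `entry.group_idx`.
fn may_match(entry: &Entry, target_hash: u64) -> bool {
    entry.prefix == hash_prefix(target_hash)
}
```

Compared with storing the full u64 hash, a shorter prefix takes less space per entry but lets more non-matching candidates through to the full comparison.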
Which issue does this PR close?
Closes #11717
Rationale for this change
See #11717
What changes are included in this PR?
See title.
Are these changes tested?
By existing tests.
Are there any user-facing changes?
No.