-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Globset uses inefficient hasher #1717
Comments
Your benchmarks are on hash functions, and not on globset. There is maybe a correlation between the performance of the hash function and the performance of globset matching, but it is far from clear. For example, I would bet that keys less than 25 bytes are actually quite common. I think the next step here would be a globset benchmark. I'm pretty sure I used fnv originally because I noticed an improvement over the default, but that was a long time ago and wouldn't mind seeing it relitigated. But as it stands, this issue lacks actionable data. |
Looking at the benchmark code https://github.com/BurntSushi/ripgrep/blob/master/crates/globset/benches/bench.rs#L78
Are both of these realistic? If so the construction cost of the map will completely dominate and the actual hashing algorithm will matter little. If the map really is only likely to be used for a single lookup it may be worth considering another data structure all together. |
I don't think there is anything presently actionable here at the moment. There are likely many different avenues for improving performance in
As far as I can tell, this is not an accurate description of the benchmark. Here is the code as I see it: #[bench]
fn many_short_regex_set(b: &mut test::Bencher) {
let set = new_reglob_many(MANY_SHORT_GLOBS);
b.iter(|| assert_eq!(2, set.matches(MANY_SHORT_SEARCH).iter().count()));
} That creates a single |
Globset currently is using
fnv
hasher for a HashMap that is keyed with aVec<u8>
.This is very inefficient because the
fnv
algorithm requires processing one byte at a time, and cannot be vectorized.On my benchmarks anything over the 25-32 byte range it actually slower than the default sip-hash implementation.
The text was updated successfully, but these errors were encountered: