-
Notifications
You must be signed in to change notification settings - Fork 838
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster StringDictionaryBuilder (~60% faster) (#1851) #1861
Faster StringDictionaryBuilder (~60% faster) (#1851) #1861
Conversation
7aed1fe
to
14bbd49
Compare
arrow/Cargo.toml
Outdated
serde = { version = "1.0" } | ||
serde_derive = "1.0" | ||
serde_json = { version = "1.0", features = ["preserve_order"] } | ||
hashbrown = "0.12" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is "technically" a new dependency as indexmap depends on hashbrown = 0.11
, I wasn't sure whether to use an out of data dependency or not here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it is time to submit a PR to update index map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There has been an open issue since February - indexmap-rs/indexmap#217
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update here -- I spent a few minutes looking into the use of indexmap
-- I think with a small amount of effort we could remove that dependency. I'll write up a ticket and maybe a PR for it shortly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arrow/Cargo.toml
Outdated
@@ -38,9 +38,11 @@ path = "src/lib.rs" | |||
bench = false | |||
|
|||
[dependencies] | |||
ahash = "0.7" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a new dependency, although it is the default hash function for hashbrown so...
14bbd49
to
c054466
Compare
c054466
to
0bffee1
Compare
0bffee1
to
8032029
Compare
8032029
to
0e893cf
Compare
@@ -107,6 +107,11 @@ where | |||
&mut self.values_builder | |||
} | |||
|
|||
/// Returns the child array builder as an immutable reference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These APIs are effectively part of #1860
0e893cf
to
75f4945
Compare
@@ -38,13 +38,15 @@ path = "src/lib.rs" | |||
bench = false | |||
|
|||
[dependencies] | |||
ahash = { version = "0.7", default-features = false } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding these new dependencies are my only hesitation for this PR.
Any other thoughts from maintainers or users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hashbrown isn't a new dependency as it is used by indexmap (and the standard library). It also by default brings in ahash so that I think isn't new either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean that hashbrown is used by the standard library?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hashmap implementation in the standard library is hashbrown since Rust 1.36 - https://blog.rust-lang.org/2019/07/04/Rust-1.36.0.html#a-new-hashmapk-v-implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The standard library includes its own version though (so it won't pull in the same version / dependency as adding the dependency in a project does)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was being a bit disingenuous. I more meant from the perspective, "can I rely on this to be well-supported going forward", etc... it is pretty safe 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will find time to review this PR in detail in a day or two. I think the dependency issue has been resolved (as in the new dependencies are ok, as they were transitive dependencies anyways)
@alamb if it helps this is a direct port of the string interner I added to IOx in https://github.com/influxdata/influxdb_iox/pull/1273 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice 👍
for (idx, maybe_value) in dictionary_values.iter().enumerate() { | ||
match maybe_value { | ||
Some(value) => { | ||
let hash = compute_hash(&state, value.as_bytes()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own curiosity, I stared at this for a while
I would describie this as using HashMap
as a HashSet
of pointers (indexes) to canonical string values in StringBuilder
. It then uses the low level hash brown APIs to checks if the same string has previously been inserted.
Very nice
I wondered a little why we needed to store any value for key
🤔 at all, but then I realized it is needed so we the code can find the corresponding string value
Which issue does this PR close?
Closes #1851
Relates to #1843
Rationale for this change
StringDictionaryBuilder can be made significantly faster
What changes are included in this PR?
There are two major changes in this PR
The first is ~40% uplift regardless of data shape, the latter adds a further performance improvement ranging from ~10-20% depending on the dictionary size.
Are there any user-facing changes?
No