-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace RwLock<HashMap> and Mutex<HashMap> by using DashMap #4079
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised any of these data structures are contended, and so dashmap is unlikely to yield benefits here? It may even be slower.
Do you have any benchmark results you could share?
I agree I think it would be good to have some more evidence that using a new third-party library adds benefits -- it would be interesting to know more about the experience in Ballista. |
Or maybe put another way, if a different locking implementation improved performance significantly in Ballista, I wonder if we should investigate ways to avoid locking all together (copy-on-write, for example) 🤔 |
Looks like maybe dashmap was added in apache/datafusion-ballista#319 but there aren't any performance measurements that I could see there |
Actually, it is better in most use cases. You can check . It is quite similar to Java's ConcurrentHashMap. |
Those benchmarks are all for high throughput workloads, in this case I would be extremely surprised to see these hashmaps showing up in profiles. In the absence of a compelling benchmark it is hard for me to approve this... It is a non-trivial additional dependency, not to mention one that I've run into API issues with in the past, for an unclear benefit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it'll improve performance, but the code looks cleaner!
My two cents: I think it is more about encapsulating certain low-level operations regarding the data structure within the data structure itself, vs. exposing them in higher-level code. I think this is probably what @xudong963 means by cleaner, and I agree. I also agree with @tustvold that typical workloads will not result in contention, without which performance improvements will not manifest. If this change would merge, the benefits would be:
The main disadvantage that I can see:
I don't think the argument either way is super strong, but it seems to me that pros outweigh the cons (unless I am missing something). |
I agree this makes the code look cleaner and I will defer to your judgement that the pros outweigh the cons. It looks like this just needs to have the datafusion-cli Cargo.lock file updated and it should be good to go |
Until the fix for #4100 is merged, clippy will be failing on this PR as well |
@yahoNanJing, given that #4100 is merged, this can merge if you resolve the conflicts. |
7b1e7ce
to
27bdb88
Compare
Thanks @ozankabak. Just rebased. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @yahoNanJing
Let's plan to merge this tomorrow if we don't hear any additional comments 👍 |
Thanks again everyone! |
Benchmark runs are scheduled for baseline = 4d23cae and contender = a9add0e. a9add0e is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #4077 .
Rationale for this change
DashMap
leverages shard-level lock to achieve good performance for high concurrency, like the ConcurrencyHashMap in Java.What changes are included in this PR?
Are there any user-facing changes?