Replace RwLock<HashMap> and Mutex<HashMap> by using DashMap #4079

Merged: 2 commits, Nov 7, 2022

yahoNanJing
Contributor

@yahoNanJing yahoNanJing commented Nov 2, 2022

Which issue does this PR close?

Closes #4077 .

Rationale for this change

DashMap uses shard-level locking to achieve good performance under high concurrency, similar to Java's ConcurrentHashMap.
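To make the shard-level locking idea concrete, here is a minimal sketch using only the standard library. It is an illustration of the general technique, not DashMap's actual implementation: the `ShardedMap` type, its methods, and the `sharded_demo` helper are hypothetical names introduced for this example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical map that splits its entries across independently locked shards.
/// This only illustrates the idea; DashMap's real internals differ.
pub struct ShardedMap<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
    pub fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "need at least one shard");
        let shards = (0..shard_count)
            .map(|_| RwLock::new(HashMap::new()))
            .collect();
        Self { shards }
    }

    /// Pick the shard for a key from its hash, so a key always maps to the same lock.
    fn shard_for(&self, key: &K) -> &RwLock<HashMap<K, V>> {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let index = (hasher.finish() as usize) % self.shards.len();
        &self.shards[index]
    }

    /// Only the shard that owns the key is write-locked; other shards stay available.
    pub fn insert(&self, key: K, value: V) -> Option<V> {
        self.shard_for(&key).write().unwrap().insert(key, value)
    }

    /// Returns a clone of the value so no lock guard escapes to the caller.
    pub fn get(&self, key: &K) -> Option<V> {
        self.shard_for(key).read().unwrap().get(key).cloned()
    }

    pub fn len(&self) -> usize {
        self.shards.iter().map(|s| s.read().unwrap().len()).sum()
    }
}

/// Four threads insert 100 distinct keys each; returns the final entry count.
pub fn sharded_demo() -> usize {
    let map: Arc<ShardedMap<u32, u32>> = Arc::new(ShardedMap::new(8));
    let handles: Vec<_> = (0..4u32)
        .map(|t| {
            let map = Arc::clone(&map);
            thread::spawn(move || {
                for i in 0..100u32 {
                    map.insert(t * 100 + i, i);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    map.len()
}
```

Writers that touch different shards do not block each other, which is where any gain over a single `RwLock<HashMap>` would come from. If the map is rarely contended, the extra hashing step is pure overhead.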

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Nov 2, 2022
Contributor

@tustvold tustvold left a comment

I'd be surprised if any of these data structures are contended, so DashMap is unlikely to yield benefits here. It may even be slower.

Do you have any benchmark results you could share?

@alamb
Contributor

alamb commented Nov 2, 2022

I agree. I think it would be good to have more evidence that adding a new third-party library brings benefits. It would be interesting to know more about the experience in Ballista.

@alamb
Contributor

alamb commented Nov 2, 2022

Or maybe put another way, if a different locking implementation improved performance significantly in Ballista, I wonder if we should investigate ways to avoid locking altogether (copy-on-write, for example) 🤔
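For readers unfamiliar with the copy-on-write idea mentioned above, here is a rough sketch of one way it could look with only the standard library. It is an assumption-laden illustration, not code from DataFusion or Ballista: `CowMap`, `snapshot`, and `insert` are hypothetical names. Note that this variant still takes a brief lock to swap or clone the pointer; it only removes locking from the read path after the snapshot is taken.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, RwLock};

/// Hypothetical copy-on-write map: readers work on immutable snapshots,
/// writers copy the map when a snapshot is still in use.
pub struct CowMap<K, V> {
    current: RwLock<Arc<HashMap<K, V>>>,
}

impl<K: Clone + Eq + Hash, V: Clone> CowMap<K, V> {
    pub fn new() -> Self {
        Self {
            current: RwLock::new(Arc::new(HashMap::new())),
        }
    }

    /// The read lock is held only long enough to clone the `Arc`.
    /// All lookups afterwards run on the snapshot without any lock.
    pub fn snapshot(&self) -> Arc<HashMap<K, V>> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// `Arc::make_mut` clones the underlying map only if some reader still
    /// holds a snapshot; otherwise it mutates in place.
    pub fn insert(&self, key: K, value: V) {
        let mut guard = self.current.write().unwrap();
        Arc::make_mut(&mut *guard).insert(key, value);
    }
}
```

A snapshot taken before a write keeps seeing the old contents, which suits read-mostly registries where slightly stale reads are acceptable. The cost is a full map copy on writes that race with live snapshots, so this trade-off only makes sense when writes are rare.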

@alamb
Contributor

alamb commented Nov 2, 2022

Looks like maybe dashmap was added in apache/datafusion-ballista#319 but there aren't any performance measurements that I could see there

@metesynnada
Contributor

> I'm surprised any of these data structures are contended, and so dashmap is unlikely to yield benefits here? It may even be slower.
>
> Do you have any benchmark results you could share?

Actually, it is better in most use cases. You can check . It is quite similar to Java's ConcurrentHashMap.

@tustvold
Contributor

tustvold commented Nov 2, 2022

Those benchmarks are all for high-throughput workloads. In this case I would be extremely surprised to see these hashmaps showing up in profiles. In the absence of a compelling benchmark it is hard for me to approve this. It is a non-trivial additional dependency, not to mention one that I've run into API issues with in the past, for an unclear benefit.

@yahoNanJing
Contributor Author

Thanks @alamb and @tustvold, I'll try to add some benchmark testing before this PR is accepted.

Member

@xudong963 xudong963 left a comment

I'm not sure if it'll improve performance, but the code looks cleaner!

@ozankabak
Contributor

ozankabak commented Nov 3, 2022

My two cents: I think it is more about encapsulating certain low-level operations regarding the data structure within the data structure itself, vs. exposing them in higher-level code. I think this is probably what @xudong963 means by cleaner, and I agree.

I also agree with @tustvold that typical workloads will not result in contention, without which performance improvements will not manifest.

If this change were merged, the benefits would be:

  1. Code with better separation of concerns (although the benefit is smaller than in the Ballista change)
  2. Potential performance gains in the unrealistic-looking contention scenarios
  3. Being more in line with the Ballista codebase

The main disadvantage that I can see:

  1. Introducing a new dependency to DataFusion (albeit a fairly commonly used one)

I don't think the argument either way is super strong, but it seems to me that the pros outweigh the cons (unless I am missing something).
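The encapsulation point can be illustrated with a small standard-library sketch. It is hypothetical and not taken from the DataFusion codebase: `TableRegistry` and its methods are made-up names. The idea is that lock acquisition lives inside the data structure, so higher-level code never sees a guard or calls `read()`/`write()` itself, which is roughly what a concurrent map type gives you out of the box.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical registry that keeps its locking private.
pub struct TableRegistry {
    tables: RwLock<HashMap<String, String>>,
}

impl TableRegistry {
    pub fn new() -> Self {
        Self {
            tables: RwLock::new(HashMap::new()),
        }
    }

    /// Callers never touch the lock; the previous entry (if any) is returned.
    pub fn register(&self, name: &str, location: &str) -> Option<String> {
        self.tables
            .write()
            .unwrap()
            .insert(name.to_string(), location.to_string())
    }

    /// Returns an owned copy so no lock guard leaks out of the method.
    pub fn lookup(&self, name: &str) -> Option<String> {
        self.tables.read().unwrap().get(name).cloned()
    }

    pub fn deregister(&self, name: &str) -> Option<String> {
        self.tables.write().unwrap().remove(name)
    }
}
```

With the lock hidden behind methods like these, swapping the internal map for a different concurrent structure later would not change any call site.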

@alamb
Contributor

alamb commented Nov 3, 2022

I agree this makes the code look cleaner and I will defer to your judgement that the pros outweigh the cons. It looks like this just needs to have the datafusion-cli Cargo.lock file updated and it should be good to go

@alamb
Contributor

alamb commented Nov 3, 2022

Until the fix for #4100 is merged, clippy will be failing on this PR as well

@ozankabak
Contributor

@yahoNanJing, given that #4100 is merged, this can merge if you resolve the conflicts.

@yahoNanJing
Contributor Author

Thanks @ozankabak. Just rebased.

Member

@xudong963 xudong963 left a comment

Thanks @yahoNanJing

@xudong963 xudong963 requested a review from tustvold November 6, 2022 07:06
@alamb
Contributor

alamb commented Nov 6, 2022

Let's plan to merge this tomorrow if we don't hear any additional comments 👍

@alamb alamb merged commit a9add0e into apache:master Nov 7, 2022
@alamb
Contributor

alamb commented Nov 7, 2022

Thanks again everyone!

@ursabot

ursabot commented Nov 7, 2022

Benchmark runs are scheduled for baseline = 4d23cae and contender = a9add0e. a9add0e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
