support async reader when reading parquet from hdfs #247

Open
binmahone opened this issue Dec 27, 2022 · 0 comments


Use case

An async reader can speed up queries by prefetching remote data from the HDFS cluster.

Describe the solution you'd like

Our solution takes advantage of the existing Cache feature of Arrow.
For each file, the Arrow Cache submits M*N IO requests to prefetch data, where M is the number of row groups in the file and N is the number of columns to be read.
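As a rough illustration of the M*N count, the sketch below enumerates one request per (row group, column) pair. The function name `plan_prefetch_requests` and the example numbers are hypothetical and are not taken from Arrow or from the actual patch.

```python
from itertools import product

def plan_prefetch_requests(num_row_groups, column_indices):
    """Return one (row_group, column) pair per column chunk to prefetch.

    Hypothetical helper for illustration only; it just counts requests
    and does not perform any IO.
    """
    return list(product(range(num_row_groups), column_indices))

# Example: a file with M = 3 row groups, of which N = 4 columns are read.
requests = plan_prefetch_requests(num_row_groups=3, column_indices=[0, 2, 5, 7])
print(len(requests))  # M * N = 3 * 4 = 12
```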

Two major differences from the Arrow Cache:

  • The Arrow Cache uses a shared thread pool for all IO requests, which starves tasks that are submitted late. We changed it to use a separate thread pool (with one worker thread) for each task.
  • The original Arrow Cache keeps all prefetched data in memory, even the parts that have already been consumed. We made a small modification to discard the consumed parts in order to reduce the memory footprint.
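The two changes above can be sketched in a few lines of Python. This is only a minimal model of the idea, not the Arrow C++ code: the class `PrefetchingReader`, the `fetch_fn` callback, and the `(offset, length)` range tuples are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

class PrefetchingReader:
    """Hypothetical per-task reader: private single-worker pool, evict on consume."""

    def __init__(self, fetch_fn, ranges):
        # One single-threaded pool per task, so this task's requests never
        # wait behind requests submitted earlier by other tasks.
        self._pool = ThreadPoolExecutor(max_workers=1)
        # Submit every range up front; the single worker handles them
        # in submission order.
        self._pending = {r: self._pool.submit(fetch_fn, r) for r in ranges}

    def read(self, byte_range):
        # pop() removes the entry, so the reader no longer holds the
        # buffer once the caller has consumed it.
        future = self._pending.pop(byte_range)
        return future.result()

    def cached_ranges(self):
        # Ranges that are still held (fetched or in flight).
        return list(self._pending)

    def close(self):
        self._pool.shutdown(wait=True)

def fake_fetch(byte_range):
    # Stand-in for a remote HDFS read: returns `length` zero bytes.
    offset, length = byte_range
    return bytes(length)

reader = PrefetchingReader(fake_fetch, [(0, 4), (4, 8)])
first = reader.read((0, 4))
remaining = reader.cached_ranges()
reader.close()
```

After `read((0, 4))` returns, only the `(4, 8)` range is still held by the reader, which models discarding the consumed part.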

Performance improvement in a cluster test (3 EC2 nodes, 48-core Spark, TPC-H 100 data in a separate EMR cluster):

[Image: performance results from the cluster test (origin_img_v2_acd305d2-2fb1-4d9b-85a5-830e121f7f5g)]

Describe alternatives you've considered

The design above performs only data fetching in the async thread.
We also tried doing both data fetching and decoding in the async thread, but the performance was not good.
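A minimal sketch of the fetch-only split, under stated assumptions: the dictionary `storage` stands in for remote HDFS reads and `zlib` stands in for Parquet decoding. All names here are hypothetical. Only `fetch` runs on the background worker; `decode` runs on the calling thread.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def fetch(chunk_id, storage):
    # Stand-in for a remote read; runs on the async worker thread.
    return storage[chunk_id]

def decode(raw):
    # Stand-in for Parquet decoding; runs on the consumer thread.
    return zlib.decompress(raw)

def read_fetch_only(chunk_ids, storage):
    """Prefetch raw bytes asynchronously, decode on the calling thread."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        futures = [pool.submit(fetch, c, storage) for c in chunk_ids]
        # result() blocks only until that chunk's bytes have arrived.
        return [decode(f.result()) for f in futures]

storage = {0: zlib.compress(b"a" * 10), 1: zlib.compress(b"b" * 10)}
decoded = read_fetch_only([0, 1], storage)
```

The alternative described above would move the `decode` call into the submitted task as well.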


