support async reader when reading parquet from hdfs #247

binmahone · 2022-12-27T08:52:39Z


Use case

An async reader can speed up queries by prefetching remote data from the HDFS cluster.

Describe the solution you'd like

Our solution takes advantage of Arrow's existing cache feature.
For each file, the Arrow cache submits M*N IO requests to prefetch data, where M is the number of row groups in the file and N is the number of columns to be read.
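To make the M*N count concrete, here is a minimal sketch of how the prefetch requests could be enumerated: one byte range per (row group, column) pair. The names `ReadRange` and `plan_prefetch`, and the dict-based metadata layout, are hypothetical and chosen for illustration; they are not Arrow's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadRange:
    """One prefetch request: the byte range of a single column chunk."""
    row_group: int
    column: str
    offset: int
    length: int


def plan_prefetch(row_groups, columns):
    """Build one ReadRange per (row group, column) pair.

    row_groups: list of dicts mapping column name -> (offset, length).
                This layout is an assumption made for the sketch.
    columns:    names of the columns the query will read.
    """
    ranges = []
    for rg_index, chunks in enumerate(row_groups):
        for col in columns:
            offset, length = chunks[col]
            ranges.append(ReadRange(rg_index, col, offset, length))
    return ranges
```

With M row groups and N requested columns, `plan_prefetch` returns exactly M*N ranges; columns that the query does not read are never fetched.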

Two major differences from the Arrow cache

The Arrow cache uses a shared thread pool for all IO requests, which starves tasks that are submitted late. We changed it to use a separate thread pool (with 1 worker thread) for each task.
The original Arrow cache keeps all prefetched data in memory, even the parts that have already been consumed. We made a small modification to discard the consumed parts and reduce the memory footprint.
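The two changes above can be illustrated with a small sketch. It is not the actual patch to Arrow's C++ cache, only a Python analogue of the idea: each reader owns a single-worker pool, and a buffer is dropped from the cache as soon as it is consumed. `PrefetchingReader` and the `fetch` callable are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


class PrefetchingReader:
    """Prefetch byte ranges on a dedicated single-worker pool.

    fetch:  hypothetical callable taking an (offset, length) tuple and
            returning bytes, e.g. a wrapper around an HDFS positional read.
    ranges: iterable of (offset, length) tuples, in the order they will be read.
    """

    def __init__(self, fetch, ranges):
        # One pool per reader (per task), so a task submitted late does not
        # queue behind the requests of earlier tasks in a shared pool.
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending = {r: self._executor.submit(fetch, r) for r in ranges}

    def read(self, byte_range):
        # pop() removes the cache entry, so the buffer can be garbage
        # collected once the caller is done with it.
        future = self._pending.pop(byte_range)
        return future.result()

    def pending_count(self):
        """Number of prefetched or in-flight ranges still held."""
        return len(self._pending)

    def close(self):
        self._executor.shutdown(wait=True)
```

Because the pool has one worker, requests run in submission order, so submitting ranges in read order keeps the prefetcher just ahead of the consumer. A range can be read only once in this sketch, which is the trade-off for releasing memory early.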

Performance improvement in a cluster test (3 EC2 nodes, 48 Spark cores, TPC-H 100 data in a separate EMR cluster)

Describe alternatives you've considered

The design above does only data fetching in the async thread.
We also tried doing both data fetching and decoding in the async thread, but the performance was not good.



binmahone mentioned this issue Dec 27, 2022

[CH-247] support async reader when reading parquet from hdfs #248

Open
