Use case
An async reader can speed up query performance by prefetching remote data from an HDFS cluster.
Describe the solution you'd like
Our solution takes advantage of the existing cache feature of Arrow.
For each file, the Arrow cache submits M*N IO requests to prefetch data, where M is the number of row groups in the file and N is the number of columns to be read.
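To make the M*N count concrete, here is a minimal sketch of how the prefetch ranges could be enumerated. It is not the Arrow implementation: `ReadRange`, `plan_prefetch`, and the `chunk_layout` structure are hypothetical names used only for illustration, and the layout is assumed to map each row group and column to an (offset, length) pair.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ReadRange:
    """One IO request: the byte range of a single column chunk (hypothetical type)."""
    row_group: int
    column: int
    offset: int
    length: int


def plan_prefetch(
    chunk_layout: Sequence[Sequence[Tuple[int, int]]],
    columns: Sequence[int],
) -> List[ReadRange]:
    """Build one read range per (row group, selected column) pair.

    chunk_layout[rg][col] is assumed to hold the (offset, length) of that
    column chunk, so M row groups and N selected columns yield M*N ranges.
    """
    return [
        ReadRange(rg, col, *chunks[col])
        for rg, chunks in enumerate(chunk_layout)
        for col in columns
    ]


# Example: 2 row groups, 3 columns each, reading columns 0 and 2.
layout = [
    [(0, 100), (100, 50), (150, 80)],
    [(230, 120), (350, 40), (390, 60)],
]
ranges = plan_prefetch(layout, columns=[0, 2])
```

With 2 row groups and 2 selected columns, `ranges` holds 4 entries, one per IO request.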
Two major differences from the Arrow cache:
The Arrow cache uses a shared thread pool for all IO requests, which can starve tasks that are submitted late. We changed it to use a separate thread pool (worker thread number = 1) for each task.
The original Arrow cache keeps all the prefetched data in memory, even the parts that have already been consumed. We made a small modification to discard the consumed parts in order to reduce the memory footprint.
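The two changes above can be sketched roughly as follows. This is an illustration in Python, not the actual patch to Arrow: `PrefetchTask`, `consume`, and the `fetch` callback are hypothetical names, and the sketch assumes each range is consumed at most once.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Hashable, Iterable


class PrefetchTask:
    """Prefetches ranges for one task on its own single-worker pool."""

    def __init__(self, fetch: Callable[[Hashable], bytes], ranges: Iterable[Hashable]):
        # Change 1: a dedicated pool with one worker per task, so requests
        # from this task never queue behind requests from other tasks.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = {r: self._pool.submit(fetch, r) for r in ranges}

    def consume(self, read_range: Hashable) -> bytes:
        # Change 2: pop the entry so the task no longer holds the buffer
        # once the caller has taken it.
        future = self._pending.pop(read_range)
        return future.result()

    def buffered(self) -> int:
        """Number of ranges still held (fetched or in flight)."""
        return len(self._pending)

    def close(self) -> None:
        self._pool.shutdown(wait=True)


# Example with a stand-in fetch function that returns zero-filled bytes.
def fake_fetch(read_range):
    offset, length = read_range
    return bytes(length)


task = PrefetchTask(fake_fetch, [(0, 4), (4, 2)])
first = task.consume((0, 4))
remaining = task.buffered()
task.close()
```

Popping the future on `consume` means a second read of the same range fails with `KeyError`, which matches the assumption that data is read once and then dropped.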
The performance improvement was measured in a cluster test (3 EC2 nodes, 48-core Spark, TPC-H 100 data in a separate EMR cluster).
Describe alternatives you've considered
The design above performs only data fetching in the async thread.
We also tried doing data fetching plus decoding in the async thread, but the performance was not good.