Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace custom ParquetSstReader to DataFusion's ParquetExec #291

Closed
jiacai2050 opened this issue Oct 11, 2022 · 6 comments
Closed

replace custom ParquetSstReader to DataFusion's ParquetExec #291

jiacai2050 opened this issue Oct 11, 2022 · 6 comments
Assignees
Labels
feature New feature or request

Comments

@jiacai2050
Copy link
Contributor

jiacai2050 commented Oct 11, 2022

Describe This Problem

In current implementation

Proposal

By leverage DataFusion's ParquetExec to implement this, we can remove all those low-level details and it's already feature rich, for example:

  • Support predicate pushdown
  • Support custom ParquetFileReaderFactory, which can be used to implement row-group level cache, just as what we do now

Additional Context

IOx already does this.

@jiacai2050 jiacai2050 added the feature New feature or request label Oct 11, 2022
@jiacai2050 jiacai2050 self-assigned this Oct 11, 2022
@ygf11
Copy link
Contributor

ygf11 commented Oct 11, 2022

Seems this issue is relative to:

In #256, we want to implement filter logics with datafusion's expr.

If we replace ParquetSstReader with ParquetExec, I think the filter logic is also there.

@jiacai2050
Copy link
Contributor Author

@ygf11 yeah, those issue are related, but they work in different levels.

In current implementation, where condition is filtered multiple times,

  • The underlying one is what this issue concerns, predicate works on sst file
  • The next one is Support complex filter before merge procedure #256, predicate works before pass to MergeIterator, which is used to merge multiple source into one deduplicated.
  • The last one is FilterExec, which utilize DataFusion to ensure all rows match where condition

@ygf11
Copy link
Contributor

ygf11 commented Oct 12, 2022

yeah, those issue are related, but they work in different levels.

Thanks, I almost got it. If I have mistake, please take it out.

First, We will read data from two places, Memtables and Sst files. And we will read sst with ParquetExec after this task.

Predicate works before pass to MergeIterator:

Further more, after #256 and this task, we can remove FilterExec, since we have pushed down all filters.

@ShiKaiWi
Copy link
Member

ShiKaiWi commented Oct 12, 2022

There are some mistakes here:

For sst files, we can directly pass filter expr to ParquetExec.

This is true, but one thing worth mentioning that the data is not ensured to meet the requirements by pushed down predicates (actually the filter is implemented by utilizing the min-max index on the row-group level.)

For Memtables, we still need implement filter logics manually, that's what #256 need do.

So the further filter works for not only the memtables but also the data from sst files. In conclusion, both #256 and #291 are necessary.

@ygf11
Copy link
Contributor

ygf11 commented Oct 12, 2022

This is true, but one thing worth mentioning that the data is not ensured to meet the requirements by pushed down predicates (actually the filter is implemented by utilizing the min-max index on the row-group level.)

Thanks for figuring it out. It is clear now :D.

@jiacai2050
Copy link
Contributor Author

Already fixed in main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
3 participants