Issue 343: Reduce metadata memory footprint #335

Merged on Jul 12, 2024 (55 commits)

@cugni cugni commented Jun 14, 2024

Description

This PR aims to improve metadata processing during index building by using the Dataset API from Spark SQL. As metadata size increases, especially the number of blocks, all OTree index operations become increasingly expensive.

The changes introduced here try to avoid materializing metadata when not needed.

Notable changes:

  1. Move io.qbeast.core into io.qbeast.spark.core
  2. Remove blocks from CubeStatus to reduce object size
  3. Remove file from Block to avoid recursive reference
  4. loadIndexFiles returns Dataset[IndexFile]
  5. Replace .toLocalIterator in IndexStatusBuilder.indexCubeStatuses with collect
  6. Improve test executions
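
Changes 2 and 3 can be pictured with a minimal data-model sketch. The case classes and field names below are hypothetical simplifications chosen for illustration, not the actual Qbeast definitions. The point is only that a block carries no reference back to its file, and that a cube status holds aggregates rather than a list of blocks.

```scala
// Hypothetical, simplified model; the field names are illustrative
// assumptions, not the actual Qbeast class definitions.

// A block holds no reference to its owning file, so there is no
// Block -> IndexFile -> Block cycle to keep alive or serialize.
final case class Block(cubeId: String, elementCount: Long)

// The file owns its blocks; navigation goes one way only.
final case class IndexFile(path: String, size: Long, blocks: Seq[Block])

// A cube status keeps only aggregate values instead of a block list.
final case class CubeStatus(cubeId: String, elementCount: Long)

object MetadataModelSketch {

  // Derive per-cube aggregates from the files without storing the
  // blocks inside the resulting CubeStatus objects.
  def cubeStatuses(files: Seq[IndexFile]): Map[String, CubeStatus] =
    files
      .flatMap(_.blocks)
      .groupBy(_.cubeId)
      .map { case (cubeId, blocks) =>
        cubeId -> CubeStatus(cubeId, blocks.map(_.elementCount).sum)
      }
}
```

With this shape, the size of each CubeStatus stays constant no matter how many blocks a cube has.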

Checklist:

Here is the list of things you should do before submitting this pull request:

  • New feature / bug fix has been committed following the Contribution guide.
  • Test in a large deployment.
  • Your branch is updated to the main branch (dependent changes have been merged).

How Has This Been Tested? (Optional)

I ran this query on this dataset:

spark.sql("SELECT count(*) FROM logs_tpc TABLESAMPLE( 0.1 PERCENT)").show()

On a single node, I tested:

  1. The previous version (v0.6.0), which took more than 20 seconds and produced all these tasks:
    image
  2. This PR, which took ~6 seconds and generated only these tasks:
    image

cugni added 4 commits June 14, 2024 22:45
As we begin to handle a significant amount of metadata and transfer
everything to the driver, we start encountering issues.
One such problem stems from an outdated design approach where we
aimed to keep the core of Qbeast independent from Spark. However,
this approach now seems less sensible. If we aim to support different
query engines in the future, it would be more efficient to rewrite
the core classes in the respective languages of those engines.

So what I've done is:
1. Move all classes from the core package to the spark one (only one package)
2. Change some APIs so that they return Dataset instead of IISeq
3. Change the code to rely as much as possible on Spark to do the
   computation
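
The idea in points 2 and 3 can be sketched as follows. This is a minimal illustration, not the code in this PR: FileEntry, CubeCount, and cubeCounts are hypothetical names, and the local SparkSession exists only to make the example runnable. The function returns a Dataset so the aggregation runs in Spark, and only the small per-cube result is collected to the driver.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

// Hypothetical, simplified records; not the actual Qbeast classes.
final case class FileEntry(path: String, cubeId: String, elementCount: Long)
final case class CubeCount(cubeId: String, elementCount: Long)

object DatasetMetadataSketch {

  // Returns a Dataset, so nothing is materialized on the driver here.
  // The grouping and summing are planned and executed by Spark.
  def cubeCounts(files: Dataset[FileEntry]): Dataset[CubeCount] =
    files
      .groupBy(col("cubeId"))
      .agg(sum(col("elementCount")).as("elementCount"))
      .as[CubeCount](Encoders.product[CubeCount])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("dataset-metadata-sketch")
      .getOrCreate()
    import spark.implicits._

    val files = Seq(
      FileEntry("f1.parquet", "root", 10L),
      FileEntry("f2.parquet", "root", 7L),
      FileEntry("f3.parquet", "A", 5L)
    ).toDS()

    // Only the aggregated result, one row per cube, reaches the driver.
    val counts: Array[CubeCount] = cubeCounts(files).collect()
    counts.foreach(println)

    spark.stop()
  }
}
```

Callers that need further filtering or joins can keep chaining on the returned Dataset and defer the collect until the result is small.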
@cugni cugni requested review from Jiaweihu08 and osopardo1 June 14, 2024 23:48
@cugni cugni marked this pull request as ready for review June 18, 2024 08:25
@cugni cugni marked this pull request as draft June 18, 2024 08:25
@Qbeast-io Qbeast-io deleted a comment from cdelfosse Jul 5, 2024
@cdelfosse cdelfosse added the type: enhancement Improvement of existing feature or code label Jul 8, 2024
@osopardo1 osopardo1 changed the title Dataset api Issue 33: Move query filtering to Dataset API Jul 12, 2024
@osopardo1 osopardo1 changed the title Issue 33: Move query filtering to Dataset API Issue 343: Move query filtering to Dataset API Jul 12, 2024
@osopardo1 osopardo1 changed the title Issue 343: Move query filtering to Dataset API Issue 343: Move index building to Dataset API Jul 12, 2024
@Jiaweihu08 Jiaweihu08 changed the title Issue 343: Move index building to Dataset API Issue 343: Parallelize metadata processing and reduce metadata memory footprint Jul 12, 2024
@Jiaweihu08 Jiaweihu08 changed the title Issue 343: Parallelize metadata processing and reduce metadata memory footprint Issue 343: Reduce metadata memory footprint Jul 12, 2024
@Jiaweihu08 Jiaweihu08 merged commit 29cdb9e into Qbeast-io:main Jul 12, 2024
1 check passed
Labels
type: enhancement Improvement of existing feature or code
4 participants