Issue 343: Reduce metadata memory footprint #335

cugni · 2024-06-14T23:48:29Z

Description

This PR aims to improve metadata processing during index building by using the Dataset API from Spark SQL. As metadata size increases, especially the number of blocks, all OTree index operations become increasingly expensive.

The changes introduced here try to avoid materializing metadata when not needed.

Notable changes:

Move io.qbeast.core into io.qbeast.spark.core
Remove blocks from CubeStatus to reduce object size
Remove file from Block to avoid recursive reference
loadIndexFiles returns Dataset[IndexFile]
Replace .toLocalIterator from IndexStatusBuilder.indexCubeStatuses with collect
Improve test executions
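To make the two data-model changes above concrete, here is a minimal sketch in plain Scala. The class shapes and field names are simplified assumptions for illustration only and are not the actual Qbeast classes: Block holds no reference back to its file, and CubeStatus keeps only an aggregate instead of a collection of blocks.

```scala
object MetadataModelSketch {
  // Hypothetical, simplified shapes; not the real Qbeast definitions.

  // A block no longer points back to its file, so there is no File <-> Block cycle.
  final case class Block(cubeId: String, elementCount: Long)

  // The file owns its blocks; the reference goes in one direction only.
  final case class IndexFile(path: String, blocks: Seq[Block])

  // The cube status carries an aggregate only, not the blocks themselves.
  final case class CubeStatus(cubeId: String, elementCount: Long)

  // Derive per-cube statuses from the files when they are needed.
  def cubeStatuses(files: Seq[IndexFile]): Map[String, CubeStatus] =
    files
      .flatMap(_.blocks)
      .groupBy(_.cubeId)
      .map { case (cubeId, blocks) =>
        cubeId -> CubeStatus(cubeId, blocks.map(_.elementCount).sum)
      }
}
```

With this shape, the size of a CubeStatus does not grow with the number of blocks in the cube, which is the point of the second item in the list.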

Checklist:

Here is the list of things you should do before submitting this pull request:

New feature / bug fix has been committed following the Contribution guide.
Test in a large deployment.
Your branch is updated to the main branch (dependent changes have been merged).

How Has This Been Tested? (Optional)

I ran this query on this dataset:

spark.sql("SELECT count(*) FROM logs_tpc TABLESAMPLE( 0.1 PERCENT)").show()

With a single node, I've tested:

The previous version (v0.6.0), which took more than 20 seconds and produced all these tasks.
This PR, which took ~6 seconds and generated only these tasks.

As we begin to handle a significant amount of metadata and transfer everything to the driver, we start encountering issues. One such problem stems from an outdated design decision: keeping the core of Qbeast independent from Spark. This now seems less sensible. If we aim to support different query engines in the future, it would be more efficient to rewrite the core classes in the respective languages of those engines. So what I've done is:

1. Move all classes from the core package to the spark one (only one package).
2. Change some APIs so that they return Dataset instead of IISeq.
3. Change the code to rely on Spark as much as possible to do the computation.
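As a rough illustration of the idea of returning a Dataset and letting Spark do the computation, the sketch below aggregates block metadata on the executors and brings only one row per cube to the driver. BlockMeta, its fields, and elementCountPerCube are hypothetical names invented for this example; they are not part of the Qbeast API.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.sum

// Hypothetical, simplified stand-in for per-block metadata; not Qbeast's real schema.
final case class BlockMeta(cubeId: String, filePath: String, elementCount: Long)

object DatasetMetadataSketch {

  // The grouping and summing run in Spark; only the per-cube totals are collected.
  def elementCountPerCube(blocks: Dataset[BlockMeta]): Map[String, Long] = {
    val spark = blocks.sparkSession
    import spark.implicits._
    blocks
      .groupBy($"cubeId")
      .agg(sum($"elementCount").as("elementCount"))
      .as[(String, Long)]
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("dataset-metadata-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy input: three blocks spread over two files and two cubes.
    val blocks = Seq(
      BlockMeta("root", "a.parquet", 10L),
      BlockMeta("root/0", "a.parquet", 5L),
      BlockMeta("root", "b.parquet", 7L)
    ).toDS()

    println(elementCountPerCube(blocks))
    spark.stop()
  }
}
```

The contrast is with collecting every block into a local sequence first and grouping on the driver, where driver memory grows with the number of blocks rather than the number of cubes.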

.github/workflows/test-artifact.yml

src/main/scala/io/qbeast/spark/delta/IndexStatusBuilder.scala

src/test/scala/io/qbeast/spark/utils/IndexMathOpsTest.scala

src/main/scala/io/qbeast/spark/delta/DeltaQbeastSnapshot.scala

src/main/scala/io/qbeast/spark/index/OTreeDataAnalyzer.scala

src/main/scala/io/qbeast/spark/index/query/QueryExecutor.scala

src/main/scala/org/apache/spark/sql/DataframeUtils.scala

…eStatus there (the data is not indexed).

src/main/scala/io/qbeast/spark/QbeastTable.scala

src/main/scala/io/qbeast/spark/index/query/QueryExecutor.scala

cugni added 4 commits June 14, 2024 22:45

wip

e2efadb

fixing test

3287d9b

Merge branch 'main' into dataset-api

ff93a37

cugni requested review from Jiaweihu08 and osopardo1 June 14, 2024 23:48

cugni added 2 commits June 15, 2024 10:10

removing not used code

04d1444

fixing test in CI

1063c05

cdelfosse assigned cugni Jun 17, 2024

committing moved classes

f315ee3

cugni marked this pull request as ready for review June 18, 2024 08:25

cugni marked this pull request as draft June 18, 2024 08:25

cugni and others added 9 commits June 18, 2024 11:45

trying to clean cached to fix CI

9a5710d

debugging

a78d7ad

ugly debugging

c9bc1b6

Merge branch 'main' into dataset-api

741ca92

wip x jiawei

c82876e

First slow but working version.

9482ff3

Correct average fanout

cede10e

Remove redundant tests

6e0c903

wip

ed020ba

Qbeast-io deleted a comment from cdelfosse Jul 5, 2024

Jiaweihu08 added 3 commits July 5, 2024 10:52

Change datatype

e305ba3

Use collect

18447ba

Correct isLeaf and add tests

8a07d43

cdelfosse added the type: enhancement Improvement of existing feature or code label Jul 8, 2024

cugni commented Jul 9, 2024

.github/workflows/test-artifact.yml

cugni commented Jul 9, 2024

src/main/scala/io/qbeast/spark/delta/IndexStatusBuilder.scala

src/test/scala/io/qbeast/spark/utils/IndexMathOpsTest.scala

Jiaweihu08 added 2 commits July 9, 2024 10:56

Use toLocalIterator instead of collect

f016578

Correct test

afb97b7

cugni added 2 commits July 11, 2024 16:18

no new

28c487e

code cleaning

02a547d

cugni force-pushed the dataset-api branch from c911ad3 to 02a547d Compare July 11, 2024 14:22

cugni added 2 commits July 11, 2024 16:29

no prints

6a5cc9f

removing union, as it fails when one dataframe is empty

27f738d

osopardo1 mentioned this pull request Jul 12, 2024

Metadata time in queries with Qbeast Datasource is higher than expected #320

Closed