
Issue #455: Next in line Rollup (rebased) #514

Open · wants to merge 4 commits into main
Conversation

cugni
Member

@cugni cugni commented Dec 11, 2024

Fixes #454.

As the code base has changed considerably over the last few months, I'm opening a new PR; rebasing PR #455 has proved difficult.

@Jiaweihu08 if you want to work on this issue, you should work on this PR.

@Jiaweihu08 Jiaweihu08 marked this pull request as ready for review December 13, 2024 15:21
@Jiaweihu08 Jiaweihu08 requested a review from osopardo1 December 13, 2024 16:42
@Jiaweihu08 Jiaweihu08 mentioned this pull request Dec 13, 2024
@Jiaweihu08 Jiaweihu08 changed the title Rebasing the changes form #455: Next in line Rollup Issue #455: Next in line Rollup (rebased) Dec 13, 2024
@Jiaweihu08
Member

Here is a simple illustration of how the two rollup implementations differ.

The existing implementation maps a small cube to its parent cube, while the one proposed here first checks the next sibling cube.
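As a toy sketch of the difference (not the Qbeast API; representing a cube as a path of child indices is an assumption made for illustration):

```scala
// Toy model: a cube is addressed by its path of child indices from the root.
case class Cube(path: Vector[Int]) {
  def parent: Cube = Cube(path.dropRight(1))
  // The "next in line" sibling, if this cube is not the last child.
  def nextSibling(maxChildren: Int): Option[Cube] =
    path.lastOption.collect {
      case i if i + 1 < maxChildren => Cube(path.dropRight(1) :+ (i + 1))
    }
}

// Old strategy: a small cube always rolls up to its parent.
def oldRollupTarget(small: Cube): Cube = small.parent

// New strategy: try the next sibling first, falling back to the parent.
def newRollupTarget(small: Cube, maxChildren: Int): Cube =
  small.nextSibling(maxChildren).getOrElse(small.parent)
```

For example, with 8 children per cube, `newRollupTarget(Cube(Vector(0, 3)), 8)` picks the sibling `Cube(Vector(0, 4))`, while the last child `Cube(Vector(0, 7))` has no next sibling and falls back to the parent `Cube(Vector(0))`.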

If the number of indexing columns, d, is three, each inner cube has 2^d = 8 possible child cubes.

The existing implementation can therefore, in theory, put up to 8 * (cubeSize - 1) + cubeSize records in a single file: eight almost-full child cubes, each with (cubeSize - 1) records, all rolled up into their full parent cube.

Conversely, the new implementation reduces this bound to 2 * (cubeSize - 1) + cubeSize: a cube c is merged with at most one sibling cube from the left and one child cube.
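Plugging in d = 3 and the cubeSize used below, the two worst-case bounds work out as follows (illustrative arithmetic only, not library code):

```scala
// Worst-case records per file for d = 3, cubeSize = 10000.
val d = 3
val cubeSize = 10000
val maxChildren = 1 << d                                    // 2^d = 8 possible child cubes
val oldWorstCase = maxChildren * (cubeSize - 1) + cubeSize  // 8 almost-full children + full parent
val newWorstCase = 2 * (cubeSize - 1) + cubeSize            // one sibling + one child + the cube itself
```

With these values, the old bound is 89992 records per file versus 29998 for the new one.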

The example below shows how the old version merges all child cubes with their parent, while the new implementation first checks the sibling and produces more balanced file sizes (element counts).

// Indexing three columns
import org.apache.spark.sql.functions.rand

val df = spark
  .range(85000)
  .toDF("id_1")
  .withColumn("id_2", rand())
  .withColumn("id_3", rand())

df.write
  .mode("overwrite")
  .format("qbeast")
  .option("columnsToIndex", "id_1,id_2,id_3")
  .option("cubeSize", "10000")
  .save(path)

Old Rollup

| Path | Block Element Counts | Element Count | Num Blocks |
| --- | --- | --- | --- |
| file_1 | [9278, 9371, 9483, 9415, 9370, 9300, 9343, 10112, 9328] | 85000 | 9 |

New Rollup

| Path | Block Element Counts | Element Count | Num Blocks |
| --- | --- | --- | --- |
| file_1 | [9365, 9449] | 18814 | 2 |
| file_2 | [9305, 9416] | 18721 | 2 |
| file_3 | [9324, 9342] | 18666 | 2 |
| file_4 | [9409, 9242] | 18651 | 2 |
| file_5 | [10148] | 10148 | 1 |

Successfully merging this pull request may close these issues.

Rollup performance on high dimensional space