
Issue #455: Next in line Rollup (rebased) #514

Open · wants to merge 4 commits into main
Conversation

cugni
Member

@cugni cugni commented Dec 11, 2024

Fixes #454.

As the code base has changed considerably over the last few months, I'm opening a new PR; rebasing PR #455 has proved difficult.

@Jiaweihu08 if you want to work on this issue, you should work on this PR.

@Jiaweihu08 Jiaweihu08 marked this pull request as ready for review December 13, 2024 15:21
@Jiaweihu08 Jiaweihu08 requested a review from osopardo1 December 13, 2024 16:42
@Jiaweihu08 Jiaweihu08 mentioned this pull request Dec 13, 2024
@Jiaweihu08 Jiaweihu08 changed the title Rebasing the changes form #455: Next in line Rollup Issue #455: Next in line Rollup (rebased) Dec 13, 2024
@Jiaweihu08
Member

Here is a simple illustration of how the two rollup implementations differ.

The existing implementation maps a small cube to its parent cube, while the one proposed here first checks the next sibling cube.
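As a toy sketch of the difference (not the Qbeast API; representing a cube as a path of child indices is an assumption made for illustration):

```scala
// Toy model: a cube is addressed by its path of child indices from the root.
case class Cube(path: Vector[Int]) {
  def parent: Cube = Cube(path.dropRight(1))
  // The "next in line" sibling, if this cube is not the last child.
  def nextSibling(maxChildren: Int): Option[Cube] =
    path.lastOption.collect {
      case i if i + 1 < maxChildren => Cube(path.dropRight(1) :+ (i + 1))
    }
}

// Old strategy: a small cube always rolls up to its parent.
def oldRollupTarget(small: Cube): Cube = small.parent

// New strategy: try the next sibling first, falling back to the parent.
def newRollupTarget(small: Cube, maxChildren: Int): Cube =
  small.nextSibling(maxChildren).getOrElse(small.parent)
```

For example, with 8 children per cube, `newRollupTarget(Cube(Vector(0, 3)), 8)` picks the sibling `Cube(Vector(0, 4))`, while the last child `Cube(Vector(0, 7))` has no next sibling and falls back to the parent `Cube(Vector(0))`.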

If the number of indexing columns, d, is three, each inner cube has 2^d = 8 possible child cubes.

The existing implementation can therefore, in theory, put up to 8 * (cubeSize - 1) + cubeSize records in a single file: eight almost-full child cubes, each with (cubeSize - 1) records, all rolled up into their full parent cube.

Conversely, the new implementation reduces this bound to 2 * (cubeSize - 1) + cubeSize: a cube c is merged with at most one sibling cube from the left and one child cube.
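Plugging in d = 3 and the cubeSize used below, the two worst-case bounds work out as follows (illustrative arithmetic only, not library code):

```scala
// Worst-case records per file for d = 3, cubeSize = 10000.
val d = 3
val cubeSize = 10000
val maxChildren = 1 << d                                    // 2^d = 8 possible child cubes
val oldWorstCase = maxChildren * (cubeSize - 1) + cubeSize  // 8 almost-full children + full parent
val newWorstCase = 2 * (cubeSize - 1) + cubeSize            // one sibling + one child + the cube itself
```

With these values, the old bound is 89992 records per file versus 29998 for the new one.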

The example below shows how the old version merges all child cubes with their parent, while the new implementation first checks the sibling and produces more balanced file sizes (element counts).

// Indexing three columns
import org.apache.spark.sql.functions.rand

val df = spark
  .range(85000)
  .toDF("id_1")
  .withColumn("id_2", rand())
  .withColumn("id_3", rand())

df.write
  .mode("overwrite")
  .format("qbeast")
  .option("columnsToIndex", "id_1,id_2,id_3")
  .option("cubeSize", "10000")
  .save(path)

Old Rollup

| Path | Block Element Counts | Element Count | Num Blocks |
| --- | --- | --- | --- |
| file_1 | [9278, 9371, 9483, 9415, 9370, 9300, 9343, 10112, 9328] | 85000 | 9 |

New Rollup

| Path | Block Element Counts | Element Count | Num Blocks |
| --- | --- | --- | --- |
| file_1 | [9365, 9449] | 18814 | 2 |
| file_2 | [9305, 9416] | 18721 | 2 |
| file_3 | [9324, 9342] | 18666 | 2 |
| file_4 | [9409, 9242] | 18651 | 2 |
| file_5 | [10148] | 10148 | 1 |

Successfully merging this pull request may close these issues.

Rollup performance on high dimensional space