Convert to Qbeast #152

Merged
merged 79 commits into from
Jan 27, 2023

Conversation

Jiaweihu08
Member

@Jiaweihu08 Jiaweihu08 commented Jan 25, 2023

Description

This PR adds the ability to read a hybrid qbeast + delta table using qbeast.

It also introduces ConvertToQbeastCommand to qbeast-spark, which allows reading a parquet or delta table with qbeast without indexing and rewriting it. Partitioned tables are not supported by this operation.

These features are achieved by placing the non-qbeast AddFiles in a staging Revision, which is created during the first qbeast write (including overwrites) or when running the conversion command. The non-qbeast files are, for now, characterized by having null tags, and are all placed in the root of the staging Revision.

During reads, these files are only filtered in memory (for now); no filtering at the file level is done, since all of them are in the root.

import io.qbeast.spark.internal.commands.ConvertToQbeastCommand

// The table to convert is identified by its format (parquet or delta) and path
val path = "/pathToTable/"
val tableIdentifier = s"parquet.`$path`"
val columnsToIndex = Seq("col1", "col2", "col3")
val desiredCubeSize = 50000

ConvertToQbeastCommand(tableIdentifier, columnsToIndex, desiredCubeSize).run(spark)

The converted table can be read using either delta or qbeast, and appending to the converted table using delta puts the data into the staging Revision without indexing. Appends that use qbeast follow the usual qbeast indexing procedure.
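
A minimal sketch of this read/append behavior, assuming a SparkSession named spark, the path from the example above, and a DataFrame newDF to append; whether additional write options are needed on append is not covered here:

// Reading the converted table works with either format
val qbeastDF = spark.read.format("qbeast").load(path)
val deltaDF = spark.read.format("delta").load(path)

// Appending with delta places the new files in the staging Revision, without indexing
newDF.write.mode("append").format("delta").save(path)

// Appending with qbeast indexes the new data following the usual procedure
newDF.write.mode("append").format("qbeast").save(path)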

Compaction can be executed on the staging revision to reduce its number of files.
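
A hedged sketch of triggering compaction on the staging Revision, assuming the QbeastTable API accepts a revision ID and that the staging Revision is identified by RevisionID = 0 (see Type of change below):

import io.qbeast.spark.QbeastTable

// Compact the staging Revision (RevisionID = 0) to reduce its number of small files
val qbeastTable = QbeastTable.forPath(spark, path)
qbeastTable.compact(0L)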

Fixes #102, #121, #149

This PR also makes the following changes:

  • Compaction with dataChange=false as default
  • No duplicated RemoveFile in DeltaMetadataWriter

Type of change

  • When first writing to or overwriting a table using qbeast, a staging revision is created in addition to the regular revision. It has EmptyTransformers and EmptyTransformations and, most importantly, RevisionID = 0 (see the sketch after this list)
  • Documentation update to QbeastFormat
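
For illustration only (hypothetical helper, not part of the PR's API), the staging Revision can be recognized by its reserved ID:

// RevisionID = 0 is reserved for the staging Revision
val stagingID: Long = 0L

def isStagingRevision(revisionID: Long): Boolean = revisionID == stagingID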

Checklist:

  • New feature / bug fix has been committed following the Contribution guide.
  • Add comments to the code (make it easier for the community!).
  • Change the documentation.
  • Add tests.
  • Your branch is updated to the main branch (dependent changes have been merged).

How Has This Been Tested? (Optional)

  • The creation of the staging revision during first writes
  • The same when running ConvertToQbeastCommand
  • Test ConvertToQbeastCommand on delta and parquet tables, partitioned and not partitioned
  • Test Compaction on a converted table with revisionID = stagingID
  • Appending to a converted table creates a new non-staging Revision

Test Configuration:

  • Spark Version: 3.2.2
  • Hadoop Version: 3.3.1
  • Cluster or local? Local

osopardo1 and others added 30 commits May 10, 2022 14:47
@Jiaweihu08 Jiaweihu08 added the type: enhancement Improvement of existing feature or code label Jan 25, 2023
@Jiaweihu08 Jiaweihu08 requested a review from osopardo1 January 25, 2023 10:57
@Jiaweihu08 Jiaweihu08 self-assigned this Jan 25, 2023
@codecov

codecov bot commented Jan 25, 2023

Codecov Report

Merging #152 (1332bc9) into main (99a0fb3) will increase coverage by 0.37%.
The diff coverage is 98.73%.

@@            Coverage Diff             @@
##             main     #152      +/-   ##
==========================================
+ Coverage   93.18%   93.56%   +0.37%     
==========================================
  Files          76       80       +4     
  Lines        1775     1879     +104     
  Branches      133      146      +13     
==========================================
+ Hits         1654     1758     +104     
  Misses        121      121              
Impacted Files Coverage Δ
...la/io/qbeast/spark/index/query/QueryExecutor.scala 97.05% <ø> (ø)
...la/io/qbeast/spark/internal/rules/SampleRule.scala 89.47% <ø> (ø)
...n/scala/io/qbeast/core/model/RevisionClasses.scala 77.35% <90.90%> (+3.54%) ⬆️
...o/qbeast/spark/delta/QbeastMetadataOperation.scala 72.85% <95.65%> (+3.01%) ⬆️
...io/qbeast/core/transform/EmptyTransformation.scala 100.00% <100.00%> (ø)
...la/io/qbeast/core/transform/EmptyTransformer.scala 100.00% <100.00%> (ø)
src/main/scala/io/qbeast/spark/QbeastTable.scala 95.34% <100.00%> (+0.28%) ⬆️
...la/io/qbeast/spark/delta/DeltaMetadataWriter.scala 93.75% <100.00%> (+1.02%) ⬆️
...la/io/qbeast/spark/delta/DeltaQbeastSnapshot.scala 93.75% <100.00%> (+4.09%) ⬆️
...ala/io/qbeast/spark/delta/IndexStatusBuilder.scala 100.00% <100.00%> (ø)
... and 9 more


@osopardo1
Member

osopardo1 commented Jan 25, 2023

Hello! I did the first review, feel free to discuss any of my comments in the thread.

One more suggestion: I think it's better to call it Convert To Qbeast (or incremental conversion to Qbeast). Compatibility might be a confusing word (since we are already compatible), and the scope of this PR is actually being able to convert the table; the hybrid support is a necessary part of the final result.

@Jiaweihu08 Jiaweihu08 changed the title Qbeast-delta compatibility Conver to Qbeast Jan 25, 2023
@osopardo1 osopardo1 changed the title Conver to Qbeast Convert to Qbeast Jan 25, 2023