Convert to Qbeast #152

Merged
merged 79 commits into from
Jan 27, 2023

Conversation

Jiaweihu08
Member

@Jiaweihu08 Jiaweihu08 commented Jan 25, 2023

Description

This PR adds the ability to read a hybrid qbeast + delta table using qbeast.

It also introduces ConvertToQbeastCommand to qbeast-spark, which allows reading a parquet or delta table with qbeast without indexing and rewriting it. Partitioned tables are not supported by this operation.

These features are achieved by placing the non-qbeast AddFiles in a staging Revision, which is created during the first qbeast write (including overwrites) or when running the conversion command. The non-qbeast files are, for now, characterized by having null tags, and are all placed in the root of the staging Revision.

During reads, these files are only filtered in memory (for now); no filtering at the file level is done, since all of them are in the root.

import io.qbeast.spark.internal.commands.ConvertToQbeastCommand

// The table to convert is identified by its format (parquet or delta) and path
val path = "/pathToTable/"
val tableIdentifier = s"parquet.`$path`"
val columnsToIndex = Seq("col1", "col2", "col3")
val desiredCubeSize = 50000

ConvertToQbeastCommand(tableIdentifier, columnsToIndex, desiredCubeSize).run(spark)

The converted table can be read using either delta or qbeast, and appending to the converted table using delta puts the data into the staging Revision without indexing. Appends that use qbeast follow the usual qbeast indexing procedure.
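
A minimal sketch of this read/append behavior, assuming a SparkSession named spark, the path from the example above, and a DataFrame newDF to append; whether additional write options are needed on append is not covered here:

// Reading the converted table works with either format
val qbeastDF = spark.read.format("qbeast").load(path)
val deltaDF = spark.read.format("delta").load(path)

// Appending with delta places the new files in the staging Revision, without indexing
newDF.write.mode("append").format("delta").save(path)

// Appending with qbeast indexes the new data following the usual procedure
newDF.write.mode("append").format("qbeast").save(path)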

Compaction can be executed on the staging revision to reduce its number of files.
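
A hedged sketch of triggering compaction on the staging Revision, assuming the QbeastTable API accepts a revision ID and that the staging Revision is identified by RevisionID = 0 (see Type of change below):

import io.qbeast.spark.QbeastTable

// Compact the staging Revision (RevisionID = 0) to reduce its number of small files
val qbeastTable = QbeastTable.forPath(spark, path)
qbeastTable.compact(0L)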

Fixes #102, #121, #149

This PR also makes the following changes:

  • Compaction with dataChange=false as default
  • No duplicated RemoveFile in DeltaMetadataWriter

Type of change

  • When first writing to or overwriting a table using qbeast, a staging revision is created in addition to the regular revision. It has EmptyTransformers and EmptyTransformations and, most importantly, RevisionID = 0 (see the sketch after this list)
  • Documentation update to QbeastFormat
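
For illustration only (hypothetical helper, not part of the PR's API), the staging Revision can be recognized by its reserved ID:

// RevisionID = 0 is reserved for the staging Revision
val stagingID: Long = 0L

def isStagingRevision(revisionID: Long): Boolean = revisionID == stagingID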

Checklist:

  • New feature / bug fix has been committed following the Contribution guide.
  • Add comments to the code (make it easier for the community!).
  • Change the documentation.
  • Add tests.
  • Your branch is updated to the main branch (dependent changes have been merged).

How Has This Been Tested? (Optional)

  • The creation of the staging revision during first writes
  • The same when running ConvertToQbeastCommand
  • Test ConvertToQbeastCommand on delta and parquet tables, partitioned and not partitioned
  • Test Compaction on a converted table with revisionID = stagingID
  • Appending to a converted table creates a new non-staging Revision

Test Configuration:

  • Spark Version: 3.2.2
  • Hadoop Version: 3.3.1
  • Cluster or local? Local

osopardo1 and others added 30 commits May 10, 2022 14:47
@Jiaweihu08 Jiaweihu08 added the type: enhancement Improvement of existing feature or code label Jan 25, 2023
@Jiaweihu08 Jiaweihu08 requested a review from osopardo1 January 25, 2023 10:57
@Jiaweihu08 Jiaweihu08 self-assigned this Jan 25, 2023
@codecov

codecov bot commented Jan 25, 2023

Codecov Report

Merging #152 (1332bc9) into main (99a0fb3) will increase coverage by 0.37%.
The diff coverage is 98.73%.

@@            Coverage Diff             @@
##             main     #152      +/-   ##
==========================================
+ Coverage   93.18%   93.56%   +0.37%     
==========================================
  Files          76       80       +4     
  Lines        1775     1879     +104     
  Branches      133      146      +13     
==========================================
+ Hits         1654     1758     +104     
  Misses        121      121              
Impacted Files Coverage Δ
...la/io/qbeast/spark/index/query/QueryExecutor.scala 97.05% <ø> (ø)
...la/io/qbeast/spark/internal/rules/SampleRule.scala 89.47% <ø> (ø)
...n/scala/io/qbeast/core/model/RevisionClasses.scala 77.35% <90.90%> (+3.54%) ⬆️
...o/qbeast/spark/delta/QbeastMetadataOperation.scala 72.85% <95.65%> (+3.01%) ⬆️
...io/qbeast/core/transform/EmptyTransformation.scala 100.00% <100.00%> (ø)
...la/io/qbeast/core/transform/EmptyTransformer.scala 100.00% <100.00%> (ø)
src/main/scala/io/qbeast/spark/QbeastTable.scala 95.34% <100.00%> (+0.28%) ⬆️
...la/io/qbeast/spark/delta/DeltaMetadataWriter.scala 93.75% <100.00%> (+1.02%) ⬆️
...la/io/qbeast/spark/delta/DeltaQbeastSnapshot.scala 93.75% <100.00%> (+4.09%) ⬆️
...ala/io/qbeast/spark/delta/IndexStatusBuilder.scala 100.00% <100.00%> (ø)
... and 9 more


@osopardo1
Member

osopardo1 commented Jan 25, 2023

Hello! I did the first review, feel free to discuss any of my comments in the thread.

One more suggestion: I think it's better to call it Convert To Qbeast (or incremental conversion to Qbeast). Compatibility might be a confusing word (since we are already compatible), and the scope of this PR is actually being able to convert the table; the hybrid support is a necessary part of the final result.

@Jiaweihu08 Jiaweihu08 changed the title Qbeast-delta compatibility Conver to Qbeast Jan 25, 2023
@osopardo1 osopardo1 changed the title Conver to Qbeast Convert to Qbeast Jan 25, 2023