v0.6.0
WARNING: This release includes breaking changes to the Format. If you have tables written prior to the 0.6.0 version, you can convert them following the documentation.
What's Changed?
1. New Qbeast Metadata to solve small files problem
Fixes the small files problem in incremental appends by adding support for multi-block files. This change reduces the number of files loaded when executing a query, improving overall read performance.
Before 0.6.0, each file contained information about a single cube only. This spread the data across many small files, adding overhead when reading from a specific area.
New AddFile tags schema (>=v0.6.0)
"tags": {
  "revision": "1",
  "blocks": [
    {
      "cube": "w",
      "minWeight": 2,
      "maxWeight": 3,
      "replicated": false,
      "elementCount": 4
    },
    {
      "cube": "wg",
      "minWeight": 5,
      "maxWeight": 6,
      "replicated": false,
      "elementCount": 7
    }
  ]
}
The MultiBlock file approach allows each file to contain multiple Blocks from different Cubes. This means that the Metadata in each AddFile is modified, and that change can break compatibility with old tables.
Make sure to follow the guides to transform an old table (<0.6.0) to the new format.
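To illustrate how the multi-block tags can be read, here is a minimal Python sketch that parses the example tags shown above and sums the element count per cube. The helper names (`parse_blocks`, `elements_per_cube`) are hypothetical and not part of Qbeast. The handling of a string-encoded `blocks` value is an assumption about how the tags may be serialized in an actual Delta log.

```python
import json
from collections import defaultdict

# Example AddFile tags, copied from the fragment in these release notes.
TAGS_JSON = """
{
  "revision": "1",
  "blocks": [
    {"cube": "w", "minWeight": 2, "maxWeight": 3,
     "replicated": false, "elementCount": 4},
    {"cube": "wg", "minWeight": 5, "maxWeight": 6,
     "replicated": false, "elementCount": 7}
  ]
}
"""


def parse_blocks(tags):
    """Return the list of block dicts stored in one file's tags."""
    blocks = tags.get("blocks", [])
    # Assumption: the block list may be stored as a JSON-encoded string.
    if isinstance(blocks, str):
        blocks = json.loads(blocks)
    return blocks


def elements_per_cube(tags):
    """Sum elementCount per cube for a single multi-block file."""
    totals = defaultdict(int)
    for block in parse_blocks(tags):
        totals[block["cube"]] += block["elementCount"]
    return dict(totals)


tags = json.loads(TAGS_JSON)
summary = elements_per_cube(tags)
print(summary)                # {'w': 4, 'wg': 7}
print(sum(summary.values()))  # 11
```

A single file here carries blocks from two cubes (`w` and `wg`), which is the case the pre-0.6.0 format could not represent.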
2. Balanced file layout with Domain-Driven Appends
Incremental appends now use the Cube Domains strategy. The change uses the existing index during partition-level domain estimation, which helps reduce the share of cubes with outdated max weights from 45% to 0.16% and produces a more stable and balanced file layout.
Fixes #226. Full details in #227
3. AutoIndexing Feature
Say goodbye to .option("columnsToIndex", "a,b"). The new AutoIndexing feature chooses the best columns to organize the data automatically.
It is NOT enabled by default. If you want to use it, you should add the necessary configuration.
spark.qbeast.index.columnsToIndex.auto=true
spark.qbeast.index.columnsToIndex.auto.max=10
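One way to pass these settings is as `--conf` arguments to `spark-submit`. The Python sketch below only renders the two properties above into that form. The helper name `to_spark_submit_flags` is hypothetical, and the values are the example values from these notes, not recommendations.

```python
import shlex

# Settings copied from the release notes; the values are examples only.
AUTO_INDEXING_CONF = {
    "spark.qbeast.index.columnsToIndex.auto": "true",
    "spark.qbeast.index.columnsToIndex.auto.max": "10",
}


def to_spark_submit_flags(conf):
    """Render a config mapping as `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    # shlex.join quotes any argument that is not shell-safe.
    return shlex.join(args)


flags = to_spark_submit_flags(AUTO_INDEXING_CONF)
print(flags)
```

The same key/value pairs can also be set on a PySpark session builder with `.config(key, value)` before the session is created.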
4. Support for Spark 3.5.x and Delta 3.1.x
Upgrades the dependencies to their latest versions: Spark 3.5.x and Delta 3.1.x.
For full details, see the Apache Spark and Delta Lake release pages.
Other Features
- Adds #288: More log messages in critical parts of the code, making it easier to debug and to understand what is happening.
- Adds #261: Block filtering during Sampling. Fewer files to read, faster results.
- Adds #253: File Skipping with Delta. Initial results show an improvement of 10x by applying Delta's file skipping on Delta Log's entries.
- Adds #243: txnVersion and txnAppId are included in QbeastOptions to write streaming data.
- Adds #236: Update SBT / scalastyle frameworks.
- Fixed #312: dataChange on Optimization is set to false.
- Fixed #315: solve roll-up cube count.
- Fixed #317: no overhead during optimization.
Bug Fixes
- Fix #246: Create an External Table w/ Location loads the existing configuration instead of throwing errors.
- Fix #281: Schema Merge and Schema Overwrite mimic Delta Lake's behavior.
- Fix #228: Correct implementation of CubeId hash equals.
Contributors
@Jiaweihu08 @fpj @cdelfosse @alexeiakimov @osopardo1
Full Changelog: v0.5.0...v0.6.0