v0.3.2
Bip bip! A new version of qbeast-spark with some awesome algorithm improvements 🥳
What's changed
- Better file sizes! The final size of the cubes now corresponds to the cubeSize used to write the data. You can find more information about the algorithm changes and performance numbers in the merged PR #156.
- Register Operation Metrics and Per-file Statistics [Delta]. Statistical information about the columns (min, max, nullCount) is gathered in order to perform better data skipping.
- Option for specifying min/max values of the indexed columns. This allows a more flexible creation of a Revision, so that it can include values that might not be in the newly indexed DataFrame.

      df.write.format("qbeast")
        .option("columnsToIndex", "a,b")
        .option("columnStats", """{"a_min":0,"a_max":10,"b_min":20.0,"b_max":70.0}""")
        .save("/tmp/table")

  The enforced structure of the JSON is:

      { "columnName_min" : value, "columnName_max" : value }
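
Writing the columnStats JSON by hand gets error-prone once several columns are indexed. The following is a minimal sketch of a helper that assembles the string from per-column bounds. `ColumnStats`, `Bounds` and `toJson` are hypothetical names used for illustration and are not part of qbeast-spark. The sketch assumes numeric bounds only.

```scala
// Hypothetical helper (not part of qbeast-spark): builds the JSON string
// passed to the "columnStats" option from per-column min/max bounds.
object ColumnStats {

  // One indexed column together with its lower and upper bound.
  final case class Bounds(column: String, min: Any, max: Any)

  // Render a numeric bound; anything else is rejected, because this
  // sketch only covers the numeric case shown in the release notes.
  private def render(value: Any): String = value match {
    case n: Int    => n.toString
    case n: Long   => n.toString
    case n: Float  => n.toString
    case n: Double => n.toString
    case other =>
      throw new IllegalArgumentException(s"Unsupported bound: $other")
  }

  // Emit "<column>_min" and "<column>_max" entries, comma-separated,
  // following the structure described above.
  def toJson(bounds: Seq[Bounds]): String =
    bounds
      .flatMap { b =>
        Seq(
          s""""${b.column}_min":${render(b.min)}""",
          s""""${b.column}_max":${render(b.max)}"""
        )
      }
      .mkString("{", ",", "}")
}

// Usage with the writer call from the release notes (requires a Spark
// session and a DataFrame `df`, so it is left as a comment here):
//
//   val stats = ColumnStats.toJson(Seq(
//     ColumnStats.Bounds("a", 0, 10),
//     ColumnStats.Bounds("b", 20.0, 70.0)))
//   df.write.format("qbeast")
//     .option("columnsToIndex", "a,b")
//     .option("columnStats", stats)
//     .save("/tmp/table")
```

Keeping the bounds in a typed sequence makes it harder to forget one half of a min/max pair or to misplace a comma between entries.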
Contributors
Special thanks to @Jiaweihu08, who took the Qbeast Format files to the next level with the Domain-Driven algorithm!
@cugni @osopardo1
Full Changelog: v0.3.1...v0.3.2