v0.3.2

@osopardo1 released this 14 Mar 09:30
9c15d17

Bip bip! A new version of qbeast-spark with some awesome algorithm improvements 🥳

What's changed

  • Better file sizes! The final size of the cubes now corresponds to the cubeSize used to write the data. You can find more information about the algorithm changes and performance numbers in the merged PR #156.

  • Register Operation Metrics and Per-file Statistics [Delta]. Column statistics (min, max, nullCount) are now gathered for each file to enable better data skipping.

  • Option for specifying min/max values of the indexed columns. This allows more flexible creation of a Revision, so that it can cover values that are not present in the newly indexed DataFrame.

    df.write.format("qbeast")
      .option("columnsToIndex", "a,b")
      .option("columnStats", """{"a_min":0,"a_max":10,"b_min":20.0,"b_max":70.0}""")
      .save("/tmp/table")


    The JSON must have the following structure, with one min/max pair per indexed column:

    {
        "columnName_min" : value,
        "columnName_max" : value
    }
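
When the bounds are computed rather than hard-coded, the `columnStats` string can be assembled programmatically. The following is only a minimal sketch: `ColumnBounds` and `ColumnStatsJson` are hypothetical names used for illustration and are not part of qbeast-spark. It assumes numeric bounds and column names that need no JSON escaping.

```scala
// Hypothetical helper, not part of qbeast-spark: renders per-column bounds
// using the "<columnName>_min" / "<columnName>_max" keys described above.
final case class ColumnBounds(column: String, min: BigDecimal, max: BigDecimal)

object ColumnStatsJson {
  // Assumes numeric bounds and column names without characters that need escaping.
  def build(bounds: Seq[ColumnBounds]): String =
    bounds
      .flatMap { b =>
        require(b.min <= b.max, s"min must not exceed max for column ${b.column}")
        Seq(
          "\"" + b.column + "_min\":" + b.min,
          "\"" + b.column + "_max\":" + b.max
        )
      }
      .mkString("{", ",", "}")
}

object ColumnStatsExample extends App {
  val stats = ColumnStatsJson.build(Seq(
    ColumnBounds("a", BigDecimal("0"), BigDecimal("10")),
    ColumnBounds("b", BigDecimal("20.0"), BigDecimal("70.0"))
  ))
  println(stats)
}
```

The resulting string would then be passed as the value of `.option("columnStats", stats)` in the write call shown earlier.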

Minor changes

  • Fixed #165. Create External Table with Qbeast without specifying the schema.
  • Fixed #149. Update metadata through MetadataManager.

Contributors

Special thanks to @Jiaweihu08, who took the Qbeast Format files to the next level with the Domain-Driven algorithm!
@cugni @osopardo1
Full Changelog: v0.3.1...v0.3.2