Upgrade to Spark 3.4.1 and Delta 2.4.0 #211

Merged
7 commits merged on Sep 26, 2023

osopardo1
Member

@osopardo1 osopardo1 commented Aug 30, 2023

This draft PR updates Spark and Delta to the latest available versions (Spark 3.4.1 and Delta 2.4.0)!
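For reference, the version bump could look roughly like the following in an sbt build. This is only a sketch: the two version numbers come from the PR title, while the `sparkVersion` and `deltaVersion` names, the choice of the `spark-sql` module, and the `provided` scope are assumptions and may not match the actual build file.

```scala
// build.sbt (sketch): versions come from this PR's title; the rest is assumed.
val sparkVersion = "3.4.1" // hypothetical variable name
val deltaVersion = "2.4.0" // hypothetical variable name

libraryDependencies ++= Seq(
  // Spark is usually supplied by the cluster at runtime, hence "provided" (assumption).
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  // Delta Lake 2.4.0 is published under the delta-core artifact.
  "io.delta" %% "delta-core" % deltaVersion % "provided"
)
```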

Here's a summary of some interesting changes that are included in the updates.

Spark 3.4.x

Read the full notes here: Apache Spark Version 3.4, Apache Spark Version 3.4.1

  • Python client for Spark Connect.
  • Support for timestamps without time zone (TimestampNTZ type).
  • Support for timestamps in seconds for time travel using DataFrame options.
  • Better Spark UI scalability and driver stability.
  • Catalog API compatible with the three-layer namespace.
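To make the TimestampNTZ item above more concrete, here is a minimal sketch of a DataFrame with a timezone-less timestamp column in Spark 3.4. The object name `TimestampNtzSketch`, the column name `event_time`, and the sample value are made up for illustration; the sketch assumes Spark 3.4.x on the classpath.

```scala
import java.time.LocalDateTime
import java.util.Arrays

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StructField, StructType, TimestampNTZType}

// Hypothetical helper, not part of this PR.
object TimestampNtzSketch {

  // Schema with a single timestamp-without-time-zone column.
  val schema: StructType =
    StructType(Seq(StructField("event_time", TimestampNTZType)))

  // Builds a one-row DataFrame; TimestampNTZ values are represented
  // externally as java.time.LocalDateTime.
  def sample(spark: SparkSession): DataFrame = {
    val rows = Arrays.asList(Row(LocalDateTime.of(2023, 8, 30, 12, 0, 0)))
    spark.createDataFrame(rows, schema)
  }
}
```

Because the value carries no time zone, it should read back as the same wall-clock `LocalDateTime` regardless of the session time zone.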

Delta 2.4.0

Read the full notes here: Delta Lake Version 2.4.0

  • Support for Apache Spark 3.4.
  • Deletion Vectors implementation. Improves the way updates are handled and recovered by adopting a "Merge On Read" strategy.
  • Support for reading and writing with Deletion Vectors, plus PURGE for when we want to deactivate the feature on a table (which triggers a rewrite of some of the files).
  • Support for REPLACE WHERE expressions in SQL to selectively overwrite data.
  • Change in the DeltaLog API: snapshot is no longer guaranteed to return the latest available Snapshot. Use DeltaLog.update() instead.
  • NOTE: This version does not include the Iceberg to Delta converter.
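As an illustration of the DeltaLog change in the list above, the sketch below refreshes the log before reading the snapshot. The `LatestSnapshotSketch` object and its `latestSnapshot` method are hypothetical names, not code from this PR; the sketch assumes Delta 2.4.0 on the classpath and a path-based table.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.delta.{DeltaLog, Snapshot}

// Hypothetical helper, not part of this PR.
object LatestSnapshotSketch {

  // Asks the DeltaLog to refresh its state and returns the resulting
  // snapshot, instead of reading a possibly stale cached one.
  def latestSnapshot(spark: SparkSession, tablePath: String): Snapshot = {
    val deltaLog = DeltaLog.forTable(spark, tablePath)
    deltaLog.update()
  }
}
```

Call sites that previously read the snapshot field directly would go through a helper like this whenever they need the latest table version.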

@osopardo1
Member Author

osopardo1 commented Aug 30, 2023

The current status is the following:

  • Current code adapted to the new Table and DeltaLog APIs. (Lots of files changed in this PR because of small improvements in the dependencies.)
  • Introducing compatibility with TimestampNTZ.
  • Some bugs were detected on CREATE TABLE and INSERT INTO. Looking into it. 👀

@osopardo1
Member Author

Following up on the previous comment, I investigated the problem a bit further and opened this issue.

I will remove from this PR all the code that addresses INSERT INTO with TIMESTAMP_NTZ, so we can decouple the two developments.
Before continuing with the upgrade, we should fix #213.

@codecov

codecov bot commented Sep 1, 2023

Codecov Report

Merging #211 (ab32cc5) into main (62bb000) will decrease coverage by 0.41%.
The diff coverage is 72.34%.

❗ Current head ab32cc5 differs from pull request most recent head 71a60ac. Consider uploading reports for the commit 71a60ac to get more accurate results

@@            Coverage Diff             @@
##             main     #211      +/-   ##
==========================================
- Coverage   92.40%   92.00%   -0.41%     
==========================================
  Files          87       88       +1     
  Lines        2187     2214      +27     
  Branches      177      168       -9     
==========================================
+ Hits         2021     2037      +16     
- Misses        166      177      +11     
Files Changed                                         | Coverage Δ
...o/qbeast/core/transform/LinearTransformation.scala | 88.15% <ø> (ø)
...ast/spark/internal/rules/QbeastAnalysisUtils.scala | 36.20% <23.52%> (-6.35%) ⬇️
src/main/scala/io/qbeast/spark/QbeastTable.scala      | 92.59% <100.00%> (ø)
...n/scala/io/qbeast/spark/delta/CubeDataLoader.scala | 100.00% <100.00%> (ø)
...la/io/qbeast/spark/delta/DeltaMetadataWriter.scala | 93.75% <100.00%> (ø)
.../main/scala/io/qbeast/spark/delta/OTreeIndex.scala | 87.03% <100.00%> (+0.24%) ⬆️
...ala/io/qbeast/spark/delta/StagingDataManager.scala | 96.29% <100.00%> (ø)
...ark/internal/commands/ConvertToQbeastCommand.scala | 100.00% <100.00%> (ø)
.../internal/sources/catalog/DefaultStagedTable.scala | 100.00% <100.00%> (ø)
...spark/internal/sources/catalog/QbeastCatalog.scala | 97.50% <100.00%> (+0.23%) ⬆️
... and 5 more

... and 1 file with indirect coverage changes

@osopardo1 osopardo1 marked this pull request as ready for review September 21, 2023 13:10
@osopardo1 osopardo1 requested a review from cugni September 21, 2023 13:10
@osopardo1
Member Author

osopardo1 commented Sep 22, 2023

Pending TODO: support indexing TIMESTAMP_NTZ columns.
That will be for the next iteration. I will open an issue.

@cugni cugni merged commit 46ebe7d into Qbeast-io:main Sep 26, 2023