Avoid repeated reading of the DeltaLog #65
I don't understand why we now run buildCubesStatuses in memory. What was the need?
Also, we could change some class/method names to improve readability and be more consistent with the documentation.
In relation to this comment, I did the following comparisons using different approaches (collect before, collect after and the code in the
df.write
.format("qbeast")
.option("columnsToIndex", "ss_item_sk,ss_customer_sk")
.option("cubeSize", "300000")
.save(qbeastDir)

The times compared are from two different methods:
val qbeast = spark.read.format("qbeast").load(qbeastDir)
val delta = spark.read.format("delta").load(qbeastDir)
qbeast.count()
delta.count()
val deltaLog = DeltaLog.forTable(spark, qbeastDir)
val tahoeLogFileIndex =
TahoeLogFileIndex(spark, deltaLog, deltaLog.dataPath, deltaLog.snapshot)
val oTreeIndex = OTreeIndex(tahoeLogFileIndex)
tahoeLogFileIndex.listFiles(partitionFilters = Seq.empty, dataFilters = Seq.empty)
oTreeIndex.listFiles(partitionFilters = Seq.empty, dataFilters = Seq.empty)

These are the implementations tested:
The tests were run locally. Feel free to discuss this approach here, and to suggest reverting the changes! Thank you!
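For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of a timing helper. It is not the code used for the numbers in this thread; the `Timing` object and its `timed` and `meanMillis` methods are hypothetical names introduced only for illustration, and the helper measures plain wall-clock time with no JIT or caching control beyond a warm-up run.

```scala
// Hypothetical helper, not part of qbeast-spark: measures wall-clock time of a block.
object Timing {

  /** Runs `block` once and returns its result with the elapsed time in milliseconds. */
  def timed[A](block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }

  /** Runs `block` `warmup` times unmeasured, then `runs` times measured; returns the mean in ms. */
  def meanMillis[A](runs: Int, warmup: Int = 1)(block: => A): Double = {
    require(runs > 0, "runs must be positive")
    (1 to warmup).foreach(_ => block)
    val samples = (1 to runs).map(_ => timed(block)._2)
    samples.sum / samples.size
  }
}

// Usage with the objects defined in the comment above (requires a Spark session):
// val deltaMs  = Timing.meanMillis(runs = 5) { tahoeLogFileIndex.listFiles(Seq.empty, Seq.empty) }
// val qbeastMs = Timing.meanMillis(runs = 5) { oTreeIndex.listFiles(Seq.empty, Seq.empty) }
```

Averaging several runs after a warm-up is a simple way to reduce noise from the first call, which typically pays for class loading and for the initial read of the log.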
Change QbeastFile to QbeastBlock
Make matchingBlocks method protected
Filtering by revision on DeltaQbeastSnapshot
I ran more tests on a distributed cluster, just to have more numbers. Here are the results:
We can see two things:
I will revert to a
This PR fixes #61
The changes made are:
- OTreeIndex now extends FileIndex instead of TahoeLogFileIndex
- previouslyMatchedFiles is not needed anymore
- QbeastFile is now named QbeastBlock
- QbeastBlock has new fields: size and modificationTime
- CubeStatus now contains QbeastBlock objects instead of the path
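
To illustrate the last two changes, here is a minimal sketch of how the data model might look. It is not the actual qbeast-spark code: only the names QbeastBlock, CubeStatus, size and modificationTime come from this PR, while the `path` and `cubeId` fields, their types, and the `CubeStatusExample` helpers are assumptions made for the example.

```scala
// Hypothetical shapes: only the class names and the fields `size` and
// `modificationTime` are mentioned in the PR; everything else is assumed.
final case class QbeastBlock(path: String, size: Long, modificationTime: Long)

// The cube status carries full block metadata rather than bare file paths.
final case class CubeStatus(cubeId: String, blocks: Seq[QbeastBlock])

object CubeStatusExample {

  /** Total bytes of a cube, computed from in-memory block metadata. */
  def totalBytes(status: CubeStatus): Long =
    status.blocks.map(_.size).sum

  /** Most recent modification time among the cube's blocks, if it has any. */
  def latestModification(status: CubeStatus): Option[Long] =
    if (status.blocks.isEmpty) None
    else Some(status.blocks.map(_.modificationTime).max)
}
```

With the metadata attached to each block, values such as a cube's total size can be derived from the CubeStatus alone, without going back to the log to look up each path.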