-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify log replay visitor and avoid materializing Add/Remove actions #494
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #494 +/- ##
==========================================
- Coverage 79.82% 79.70% -0.12%
==========================================
Files 57 57
Lines 12591 12676 +85
Branches 12591 12676 +85
==========================================
+ Hits 10051 10104 +53
- Misses 2006 2045 +39
+ Partials 534 527 -7 ☔ View full report in Codecov by Sentry. |
} | ||
} | ||
|
||
fn visit<'a>(&mut self, row_count: usize, getters: &[&'a dyn GetData<'a>]) -> DeltaResult<()> { | ||
let expected_getters = if self.is_log_batch { 29 } else { 15 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice ~4x improvement from column pruning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome!
self.get_add_transform_expr(), | ||
SCAN_ROW_DATATYPE.clone(), | ||
) | ||
.evaluate(actions)?; | ||
Ok((result, selection_vector)) | ||
} | ||
|
||
// work shared between process_batch and process_scan_batch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: process_batch
was deleted dead code, so we can fold the logic of this method into its single call site
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(reviewed 4 commits 416e7bd..1b76ce5
) LGTM! very excited for this to land, thanks @scovich!!! i asked for docs in quite a few places, hopefully i'm not going overboard on doc requests (please tell me if i am). this has awesome APIs and expect a decent perf boost :) - in that vein I should have some benchmarks spun up in the new few weeks for us!
} | ||
} | ||
|
||
fn visit<'a>(&mut self, row_count: usize, getters: &[&'a dyn GetData<'a>]) -> DeltaResult<()> { | ||
let expected_getters = if self.is_log_batch { 29 } else { 15 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome!
kernel/src/scan/log_replay.rs
Outdated
Some(storage_type) => Some(DeletionVectorDescriptor::unique_id_from_parts( | ||
storage_type, | ||
getters[1].get(i, "deletionVector.pathOrInlineDv")?, | ||
getters[2].get_opt(i, "deletionVector.offset")?, | ||
)), | ||
None => None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could just do a .map(...)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A map
call would intercept those ?
, which is annoying and bulky to deal with:
let dv_unique_id = getters[0]
.get_opt(i, "deletionVector.storageType")?
.map(|storage_type| {
DeletionVectorDescriptor::unique_id_from_parts(
storage_type,
getters[1].get(i, "deletionVector.pathOrInlineDv")?,
getters[2].get_opt(i, "deletionVector.offset")?,
)
})
.transpose()?;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I was thinking of a storage_type.map(DeletionVectorDescriptor...)
but probably has the same ?
issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah...
removes: Vec<Remove>, | ||
selection_vector: Option<Vec<bool>>, | ||
// whether or not we are visiting commit json (=true) or checkpoint (=false) | ||
struct AddRemoveDedupVisitor<'seen> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe quick docs sorry for asking for docs so much lol and probably useful to derive debug? (i think every field already implements Debug?)
struct AddRemoveDedupVisitor<'seen> { | |
#[derive(Debug)] | |
struct AddRemoveDedupVisitor<'seen> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc comment added. Is it normal in rust to derive Debug for non-public classes that don't (yet) need it?
(seems easy enough to add later if/when a need arises?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can leave it without - generally i [derive(Debug)]
on nearly everything just under the assumption that someone will want to debug print it or log in the future etc.
} | ||
|
||
Ok(visitor.adds) | ||
// TODO: Teach expression eval to respect the selection vector we just computed so carefully! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cursory look didn't yield an issue for this - should we create one to make sure it doesn't get dropped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we should. I believe kernel-java already has this support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #494 +/- ##
==========================================
- Coverage 81.04% 80.78% -0.27%
==========================================
Files 67 67
Lines 14099 14074 -25
Branches 14099 14074 -25
==========================================
- Hits 11426 11369 -57
- Misses 2091 2130 +39
+ Partials 582 575 -7 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just small nits/questions :)
is_log_batch, | ||
..Default::default() | ||
if self.seen.contains(&key) { | ||
debug!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's your criteria for putting in debug logging? I want to know when it might be useful to incorporate into my own code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any really concrete criteria... but if I added a println!
for debugging that turned out to be especially useful, I'm a lot more likely to leave it there as a debug!
?
// may have a remove with a path at index 4. | ||
let (path, getters, is_add) = if let Some(path) = getters[0].get_str(i, "add.path")? { | ||
(path, &getters[1..4], true) | ||
} else if !self.is_log_batch { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand is_log_batch
. This is a visitor for log actions, so presumably everything is a log batch?
Might be worth a doc comment or a rename.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yea maybe this could use some more clarity in the code but ill just jump in to say: is_log_batch
is supposed to mean is_commit_batch
(that is, a batch of actions from commits vs checkpoints)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aha makes sense, ty!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is really old terminology -- "log file" in the Delta spec means a nnn.json file -- what we normally call "commit file" today.
kernel/src/scan/log_replay.rs
Outdated
fn process_scan_batch( | ||
&mut self, | ||
expression_handler: &dyn ExpressionHandler, | ||
expression: &dyn ExpressionEvaluator, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: expression
implies this is a predicate. Perhaps expression_evaluator
, but your call :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed it to say (conceptually) what it is: add_transform
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update: renamed the wrong one (or rather, renamed a different good one)
I renamed this one as predicate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and renamed back to add_transform
because it's not a predicate at all
One more thing: If it doesn't block you, I'd love to see tests to make sure that:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
restamp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, just two small naming/comment nits
Existing unit tests fail if the selection vector is wrong (that's how I figured out the vector needs to initialize as all-true rather than all-false like it used to).
You mean, because data skipping eliminated all adds? The code already has coverage from at least one unit test. |
FYI: I noticed the most recent semver check had not found anything, so I removed the label and retriggered the job. It completed successfully without re-adding the label, confirming that this is not a breaking change. Most likely the label came from the refactor PR it was originally stacked on. |
What changes are proposed in this pull request?
The existing log replay logic materialized full Add and Remove actions in order to examine the four fields of each that comprise the file comparison key. There was also a lot of indirection and duplication because of an old (now-unused) flavor of log replay. It also created a new expression evaluator for each batch instead of reusing it for the whole iteration.
Streamline the logic to only visit the columns of interest and generally reduce code bloat.
How was this change tested?
Existing log replay unit tests.