Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify log replay visitor and avoid materializing Add/Remove actions #494

Merged
merged 21 commits into from
Nov 26, 2024

Conversation

scovich
Copy link
Collaborator

@scovich scovich commented Nov 14, 2024

What changes are proposed in this pull request?

The existing log replay logic materialized full Add and Remove actions in order to examine the four fields of each that comprise the file comparison key. There was also a lot of indirection and duplication because of an old (now-unused) flavor of log replay. It also created a new expression evaluator for each batch instead of reusing it for the whole iteration.

Streamline the logic to only visit the columns of interest and generally reduce code bloat.

How was this change tested?

Existing log replay unit tests.

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Nov 14, 2024
Copy link

codecov bot commented Nov 14, 2024

Codecov Report

Attention: Patch coverage is 72.28261% with 102 lines in your changes missing coverage. Please review.

Project coverage is 79.70%. Comparing base (67cc099) to head (1b76ce5).

Files with missing lines Patch % Lines
kernel/src/engine/arrow_data.rs 66.08% 23 Missing and 16 partials ⚠️
kernel/src/actions/visitors.rs 47.05% 34 Missing and 2 partials ⚠️
kernel/src/scan/log_replay.rs 85.32% 6 Missing and 10 partials ⚠️
kernel/src/scan/state.rs 54.54% 4 Missing and 1 partial ⚠️
kernel/src/actions/mod.rs 70.00% 2 Missing and 1 partial ⚠️
kernel/src/actions/deletion_vector.rs 90.00% 1 Missing ⚠️
kernel/src/actions/set_transaction.rs 0.00% 0 Missing and 1 partial ⚠️
kernel/src/scan/data_skipping.rs 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #494      +/-   ##
==========================================
- Coverage   79.82%   79.70%   -0.12%     
==========================================
  Files          57       57              
  Lines       12591    12676      +85     
  Branches    12591    12676      +85     
==========================================
+ Hits        10051    10104      +53     
- Misses       2006     2045      +39     
+ Partials      534      527       -7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

}
}

fn visit<'a>(&mut self, row_count: usize, getters: &[&'a dyn GetData<'a>]) -> DeltaResult<()> {
let expected_getters = if self.is_log_batch { 29 } else { 15 };
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice ~4x improvement from column pruning

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome!

self.get_add_transform_expr(),
SCAN_ROW_DATATYPE.clone(),
)
.evaluate(actions)?;
Ok((result, selection_vector))
}

// work shared between process_batch and process_scan_batch
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: process_batch was deleted dead code, so we can fold the logic of this method into its single call site

@scovich scovich mentioned this pull request Nov 19, 2024
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(reviewed 4 commits 416e7bd..1b76ce5) LGTM! very excited for this to land, thanks @scovich!!! i asked for docs in quite a few places, hopefully i'm not going overboard on doc requests (please tell me if i am). this has awesome APIs and expect a decent perf boost :) - in that vein I should have some benchmarks spun up in the new few weeks for us!

kernel/src/actions/visitors.rs Show resolved Hide resolved
}
}

fn visit<'a>(&mut self, row_count: usize, getters: &[&'a dyn GetData<'a>]) -> DeltaResult<()> {
let expected_getters = if self.is_log_batch { 29 } else { 15 };
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome!

kernel/src/scan/log_replay.rs Outdated Show resolved Hide resolved
kernel/src/scan/log_replay.rs Outdated Show resolved Hide resolved
kernel/src/scan/log_replay.rs Outdated Show resolved Hide resolved
Comment on lines 87 to 92
Some(storage_type) => Some(DeletionVectorDescriptor::unique_id_from_parts(
storage_type,
getters[1].get(i, "deletionVector.pathOrInlineDv")?,
getters[2].get_opt(i, "deletionVector.offset")?,
)),
None => None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could just do a .map(...)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A map call would intercept those ?, which is annoying and bulky to deal with:

        let dv_unique_id = getters[0]
            .get_opt(i, "deletionVector.storageType")?
            .map(|storage_type| {
                DeletionVectorDescriptor::unique_id_from_parts(
                    storage_type,
                    getters[1].get(i, "deletionVector.pathOrInlineDv")?,
                    getters[2].get_opt(i, "deletionVector.offset")?,
                )
            })
            .transpose()?;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I was thinking of a storage_type.map(DeletionVectorDescriptor...) but probably has the same ? issue?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah...

removes: Vec<Remove>,
selection_vector: Option<Vec<bool>>,
// whether or not we are visiting commit json (=true) or checkpoint (=false)
struct AddRemoveDedupVisitor<'seen> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe quick docs sorry for asking for docs so much lol and probably useful to derive debug? (i think every field already implements Debug?)

Suggested change
struct AddRemoveDedupVisitor<'seen> {
#[derive(Debug)]
struct AddRemoveDedupVisitor<'seen> {

Copy link
Collaborator Author

@scovich scovich Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc comment added. Is it normal in rust to derive Debug for non-public classes that don't (yet) need it?
(seems easy enough to add later if/when a need arises?)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can leave it without - generally i [derive(Debug)] on nearly everything just under the assumption that someone will want to debug print it or log in the future etc.

}

Ok(visitor.adds)
// TODO: Teach expression eval to respect the selection vector we just computed so carefully!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cursory look didn't yield an issue for this - should we create one to make sure it doesn't get dropped?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we should. I believe kernel-java already has this support.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kernel/src/scan/log_replay.rs Outdated Show resolved Hide resolved
Copy link

codecov bot commented Nov 22, 2024

Codecov Report

Attention: Patch coverage is 86.91589% with 14 lines in your changes missing coverage. Please review.

Project coverage is 80.78%. Comparing base (a245355) to head (39458f2).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/scan/log_replay.rs 87.50% 3 Missing and 9 partials ⚠️
kernel/src/actions/deletion_vector.rs 90.00% 1 Missing ⚠️
kernel/src/actions/visitors.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #494      +/-   ##
==========================================
- Coverage   81.04%   80.78%   -0.27%     
==========================================
  Files          67       67              
  Lines       14099    14074      -25     
  Branches    14099    14074      -25     
==========================================
- Hits        11426    11369      -57     
- Misses       2091     2130      +39     
+ Partials      582      575       -7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just small nits/questions :)

is_log_batch,
..Default::default()
if self.seen.contains(&key) {
debug!(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's your criteria for putting in debug logging? I want to know when it might be useful to incorporate into my own code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any really concrete criteria... but if I added a println! for debugging that turned out to be especially useful, I'm a lot more likely to leave it there as a debug!?

// may have a remove with a path at index 4.
let (path, getters, is_add) = if let Some(path) = getters[0].get_str(i, "add.path")? {
(path, &getters[1..4], true)
} else if !self.is_log_batch {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand is_log_batch. This is a visitor for log actions, so presumably everything is a log batch?

Might be worth a doc comment or a rename.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yea maybe this could use some more clarity in the code but ill just jump in to say: is_log_batch is supposed to mean is_commit_batch (that is, a batch of actions from commits vs checkpoints)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aha makes sense, ty!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is really old terminology -- "log file" in the Delta spec means a nnn.json file -- what we normally call "commit file" today.

kernel/src/scan/log_replay.rs Outdated Show resolved Hide resolved
kernel/src/scan/log_replay.rs Outdated Show resolved Hide resolved
fn process_scan_batch(
&mut self,
expression_handler: &dyn ExpressionHandler,
expression: &dyn ExpressionEvaluator,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: expression implies this is a predicate. Perhaps expression_evaluator, but your call :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed it to say (conceptually) what it is: add_transform

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: renamed the wrong one (or rather, renamed a different good one)

I renamed this one as predicate

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and renamed back to add_transform because it's not a predicate at all

@OussamaSaoudi-db
Copy link
Collaborator

One more thing: If it doesn't block you, I'd love to see tests to make sure that:

  1. We are constructing correct selection vector
  2. That we do filter batches that have no actions selected

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restamp

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just two small naming/comment nits

kernel/src/scan/log_replay.rs Show resolved Hide resolved
kernel/src/scan/log_replay.rs Outdated Show resolved Hide resolved
@scovich
Copy link
Collaborator Author

scovich commented Nov 26, 2024

I'd love to see tests to make sure that:

1. We are constructing correct selection vector

Existing unit tests fail if the selection vector is wrong (that's how I figured out the vector needs to initialize as all-true rather than all-false like it used to).

2. That we do filter batches that have no actions selected

You mean, because data skipping eliminated all adds? The code already has coverage from at least one unit test.

@scovich scovich removed the breaking-change Change that will require a version bump label Nov 26, 2024
@scovich
Copy link
Collaborator Author

scovich commented Nov 26, 2024

FYI: I noticed the most recent semver check had not found anything, so I removed the label and retriggered the job. It completed successfully without re-adding the label, confirming that this is not a breaking change. Most likely the label came from the refactor PR it was originally stacked on.

@scovich scovich merged commit 485525b into delta-io:main Nov 26, 2024
19 of 20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants