Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add visitor SidecarVisitor and Sidecar action struct #673

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

sebastiantia
Copy link
Collaborator

@sebastiantia sebastiantia commented Feb 1, 2025

What changes are proposed in this pull request?

This PR introduces the Sidecar action and its associated SidecarVisitor.

This action and visitor is used for the V2 checkpoints Reader/Writer table feature.

Edge cases:

resolves #668

Note: PR also simplifies some conditional flags detailed in issue: #672

How was this change tested?

  • Ensure schema projection to Sidecar works
  • Ensure that the visitor correctly extracts Sidecar actions
  • Ensure duplicate paths in sidecar actions are ignored
  • Ensure VisitorError is returned when full file path is passed in path field

#[cfg_attr(test, derive(Serialize, Default), serde(rename_all = "camelCase"))]
struct Sidecar {
/// A path to the sidecar file. Because sidecar files must always reside in the table's own
/// _delta_log/_sidecars directory, implementations are encouraged to store only the file's name.
Copy link
Collaborator Author

@sebastiantia sebastiantia Feb 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is quoted from the protocol linked here.

I am assuming this means we will need to handle both cases where the path is:

  1. just the file name of the sidecar files
  2. the absolute path (including parent directories)

Or can we assume that implementations are not only encouraged, but DO simply store the file name of the sidecar files as the path?

Delta spark assumes that path is just the file name:
https://github.com/delta-io/delta/blob/990c8e8aedb69a1660e7780ab1d1a19f5c95615f/kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/ActionsIterator.java#L274

Can I get a 👍 to do the same?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In think we have to stick to the protocol here and assume it may also be a URL. That said, not sure how many writers out there in the wild already write v2 checkpoints. Maybe we can still update the protocol?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the input @roeap. instead of blocking on updating the protocol, I'm going to go ahead with implementing support for only filenames in this PR along with error handling when a file path is passed. Supporting full-path parsing should be supported like you said so I've opened up an issue for future work #675

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the route we decide to go down, make sure that's clear in the docs.

Copy link

codecov bot commented Feb 1, 2025

Codecov Report

Attention: Patch coverage is 89.23077% with 14 lines in your changes missing coverage. Please review.

Project coverage is 84.17%. Comparing base (06d8dbb) to head (91e550c).

Files with missing lines Patch % Lines
kernel/src/actions/visitors.rs 88.07% 4 Missing and 9 partials ⚠️
ffi/src/error.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #673      +/-   ##
==========================================
+ Coverage   84.14%   84.17%   +0.03%     
==========================================
  Files          77       77              
  Lines       17710    17834     +124     
  Branches    17710    17834     +124     
==========================================
+ Hits        14902    15012     +110     
- Misses       2096     2101       +5     
- Partials      712      721       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sebastiantia sebastiantia marked this pull request as ready for review February 1, 2025 01:04
Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good. just some minor comments.

kernel/src/actions/mod.rs Outdated Show resolved Hide resolved
#[cfg_attr(test, derive(Serialize, Default), serde(rename_all = "camelCase"))]
struct Sidecar {
/// A path to the sidecar file. Because sidecar files must always reside in the table's own
/// _delta_log/_sidecars directory, implementations are encouraged to store only the file's name.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In think we have to stick to the protocol here and assume it may also be a URL. That said, not sure how many writers out there in the wild already write v2 checkpoints. Maybe we can still update the protocol?

kernel/src/actions/mod.rs Outdated Show resolved Hide resolved
kernel/src/actions/mod.rs Outdated Show resolved Hide resolved
kernel/src/actions/visitors.rs Outdated Show resolved Hide resolved
Comment on lines 592 to 593
r#"{"sidecar":{"path":"016ae953-37a9-438e-8683-9a9a4a79a395.parquet","sizeInBytes":9268,"modificationTime":1714496113961,"tags": null}}"#,
r#"{"sidecar":{"path":"3a0d65cd-4056-49b8-937b-95f9e3ee90e5.parquet","sizeInBytes":9268,"modificationTime":1714496113962,"tags": null}}"#,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should just merge this test with test_parse_action_batch_without_sidecar_actions by moving the sidecar actions here into the action_batch. That we we test both that:

  1. We correctly parse sidecar actions
  2. We correctly ignore irrelevant actions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ similar to protocol and cdc

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also make one of the sidecar files have a real tags field with a string-string map.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, I'm not certain "tags": null is something that occurs in tables in the wild. Would tags just be omitted? In any case, we should probably test that we can take a sidecar action with no tags.

Copy link
Collaborator Author

@sebastiantia sebastiantia Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synced with Oussama on tags field, which can be:
not defined (omitted) / null (which does not get omitted from the action & shows up as null) / string: string map

Will be merging the tests and I'll add a sidecar action into the action_batch to make sure it is ignored during parsing for other actions.

kernel/src/actions/visitors.rs Outdated Show resolved Hide resolved
@@ -544,6 +584,36 @@ mod tests {
Ok(())
}

#[test]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: None of these tests check the error case where a required field is missing 🤔 Maybe we should make an issue to add those. Personally I'd also like to see a VisitorError type so that we can make stronger assertions on the failure cases.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea! #674

// Since path column is required, use it to detect presence of a sidecar action
if let Some(path) = getters[0].get_opt(i, "sidecar.path")? {
if self.seen_paths.contains(path) {
warn!("Duplicate sidecar path {} found during visiting", path);
Copy link
Collaborator Author

@sebastiantia sebastiantia Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a batch of actions includes two sidecar actions referencing the same file, we issue a warning. We will ignore the duplicate file path to prevent unnecessary re-reading of the file and not throw an error as we can still construct table state.

This is done as a best-effort. Ideally we could ensure no sidecar files are read more than once by tracking sidecar file paths across all batches that makeup a checkpoint file, but this introduces an unbounded memory dependency on the # of sidecar-paths encountered (as we would need to store them all somehow). I do not think the added performance in cases where sidecar files are referenced more than once (unlikely) warrants more than this change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it's worth doing this extra work just to issue a warning. The vast majority of cases, we won't encounter this. If we do encounter it, AddRemoveDedupVisitor will handle it. I think of it like this: Expensive checks should only be done when it's a matter of correctness. All other checks/warnings should be cheap to do.

I would consider this an expensive check because we use double the memory to maintain both the Vec and the HashSet.

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Feb 3, 2025
#[cfg_attr(test, derive(Serialize, Default), serde(rename_all = "camelCase"))]
struct Sidecar {
/// A path to the sidecar file. Because sidecar files must always reside in the table's own
/// _delta_log/_sidecars directory, implementations are encouraged to store only the file's name.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the route we decide to go down, make sure that's clear in the docs.

// Since path column is required, use it to detect presence of a sidecar action
if let Some(path) = getters[0].get_opt(i, "sidecar.path")? {
if self.seen_paths.contains(path) {
warn!("Duplicate sidecar path {} found during visiting", path);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it's worth doing this extra work just to issue a warning. The vast majority of cases, we won't encounter this. If we do encounter it, AddRemoveDedupVisitor will handle it. I think of it like this: Expensive checks should only be done when it's a matter of correctness. All other checks/warnings should be cheap to do.

I would consider this an expensive check because we use double the memory to maintain both the Vec and the HashSet.


// The visitor should error out when it encounters a full file path
let res = visitor.visit_rows_of(batch.as_ref()).unwrap_err();
assert!(matches!(res, Error::VisitorError(_)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sry I should've made myself clear. VisitorError should definitely be in a separate PR/issue. Rust errors are best represented as enums where every error case is its own enum variant. That's a non-trivial amount of code, so it's best left for later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh all right, I can go ahead and revert this

Comment on lines +652 to +654
// The second sidecar is a duplicate of the first, so one is ignored
assert_eq!(visitor.sidecars.len(), 1);
assert_eq!(visitor.sidecars[0], sidecar1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I mentioned above, I don't think this is a case we want to optimize/test for.

continue;
}

if path.contains('/') {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have consensus that path should only be the file name, add a comment explaining that we only allow filename and not path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Change that will require a version bump
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New visitor SidecarVisitor and Sidecar action struct
3 participants