-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add visitor SidecarVisitor
and Sidecar
action struct
#673
base: main
Are you sure you want to change the base?
feat: add visitor SidecarVisitor
and Sidecar
action struct
#673
Conversation
#[cfg_attr(test, derive(Serialize, Default), serde(rename_all = "camelCase"))] | ||
struct Sidecar { | ||
/// A path to the sidecar file. Because sidecar files must always reside in the table's own | ||
/// _delta_log/_sidecars directory, implementations are encouraged to store only the file's name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is quoted from the protocol linked here.
I am assuming this means we will need to handle both cases where the path is:
- just the file name of the sidecar files
- the absolute path (including parent directories)
Or can we assume that implementations are not only encouraged, but DO simply store the file name of the sidecar files as the path
?
Delta spark assumes that path
is just the file name:
https://github.com/delta-io/delta/blob/990c8e8aedb69a1660e7780ab1d1a19f5c95615f/kernel/kernel-api/src/main/java/io/delta/kernel/internal/replay/ActionsIterator.java#L274
Can I get a 👍 to do the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In think we have to stick to the protocol here and assume it may also be a URL. That said, not sure how many writers out there in the wild already write v2 checkpoints. Maybe we can still update the protocol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the input @roeap. instead of blocking on updating the protocol, I'm going to go ahead with implementing support for only filenames in this PR along with error handling when a file path is passed. Supporting full-path parsing should be supported like you said so I've opened up an issue for future work #675
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the route we decide to go down, make sure that's clear in the docs.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #673 +/- ##
==========================================
+ Coverage 84.14% 84.17% +0.03%
==========================================
Files 77 77
Lines 17710 17834 +124
Branches 17710 17834 +124
==========================================
+ Hits 14902 15012 +110
- Misses 2096 2101 +5
- Partials 712 721 +9 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking good. just some minor comments.
#[cfg_attr(test, derive(Serialize, Default), serde(rename_all = "camelCase"))] | ||
struct Sidecar { | ||
/// A path to the sidecar file. Because sidecar files must always reside in the table's own | ||
/// _delta_log/_sidecars directory, implementations are encouraged to store only the file's name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In think we have to stick to the protocol here and assume it may also be a URL. That said, not sure how many writers out there in the wild already write v2 checkpoints. Maybe we can still update the protocol?
kernel/src/actions/visitors.rs
Outdated
r#"{"sidecar":{"path":"016ae953-37a9-438e-8683-9a9a4a79a395.parquet","sizeInBytes":9268,"modificationTime":1714496113961,"tags": null}}"#, | ||
r#"{"sidecar":{"path":"3a0d65cd-4056-49b8-937b-95f9e3ee90e5.parquet","sizeInBytes":9268,"modificationTime":1714496113962,"tags": null}}"#, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should just merge this test with test_parse_action_batch_without_sidecar_actions
by moving the sidecar actions here into the action_batch
. That we we test both that:
- We correctly parse sidecar actions
- We correctly ignore irrelevant actions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ similar to protocol and cdc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also make one of the sidecar files have a real tags field with a string-string map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally, I'm not certain "tags": null
is something that occurs in tables in the wild. Would tags
just be omitted? In any case, we should probably test that we can take a sidecar
action with no tags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synced with Oussama on tags field, which can be:
not defined (omitted) / null (which does not get omitted from the action & shows up as null) / string: string map
Will be merging the tests and I'll add a sidecar action into the action_batch
to make sure it is ignored during parsing for other actions.
@@ -544,6 +584,36 @@ mod tests { | |||
Ok(()) | |||
} | |||
|
|||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside: None of these tests check the error case where a required field is missing 🤔 Maybe we should make an issue to add those. Personally I'd also like to see a VisitorError
type so that we can make stronger assertions on the failure cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea! #674
// Since path column is required, use it to detect presence of a sidecar action | ||
if let Some(path) = getters[0].get_opt(i, "sidecar.path")? { | ||
if self.seen_paths.contains(path) { | ||
warn!("Duplicate sidecar path {} found during visiting", path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a batch of actions includes two sidecar actions referencing the same file, we issue a warning. We will ignore the duplicate file path to prevent unnecessary re-reading of the file and not throw an error as we can still construct table state.
This is done as a best-effort. Ideally we could ensure no sidecar files are read more than once by tracking sidecar file paths across all batches that makeup a checkpoint file, but this introduces an unbounded memory dependency on the # of sidecar-paths encountered (as we would need to store them all somehow). I do not think the added performance in cases where sidecar files are referenced more than once (unlikely) warrants more than this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it's worth doing this extra work just to issue a warning. The vast majority of cases, we won't encounter this. If we do encounter it, AddRemoveDedupVisitor will handle it. I think of it like this: Expensive checks should only be done when it's a matter of correctness. All other checks/warnings should be cheap to do.
I would consider this an expensive check because we use double the memory to maintain both the Vec and the HashSet.
#[cfg_attr(test, derive(Serialize, Default), serde(rename_all = "camelCase"))] | ||
struct Sidecar { | ||
/// A path to the sidecar file. Because sidecar files must always reside in the table's own | ||
/// _delta_log/_sidecars directory, implementations are encouraged to store only the file's name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the route we decide to go down, make sure that's clear in the docs.
// Since path column is required, use it to detect presence of a sidecar action | ||
if let Some(path) = getters[0].get_opt(i, "sidecar.path")? { | ||
if self.seen_paths.contains(path) { | ||
warn!("Duplicate sidecar path {} found during visiting", path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it's worth doing this extra work just to issue a warning. The vast majority of cases, we won't encounter this. If we do encounter it, AddRemoveDedupVisitor will handle it. I think of it like this: Expensive checks should only be done when it's a matter of correctness. All other checks/warnings should be cheap to do.
I would consider this an expensive check because we use double the memory to maintain both the Vec and the HashSet.
|
||
// The visitor should error out when it encounters a full file path | ||
let res = visitor.visit_rows_of(batch.as_ref()).unwrap_err(); | ||
assert!(matches!(res, Error::VisitorError(_))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sry I should've made myself clear. VisitorError should definitely be in a separate PR/issue. Rust errors are best represented as enums where every error case is its own enum variant. That's a non-trivial amount of code, so it's best left for later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh all right, I can go ahead and revert this
// The second sidecar is a duplicate of the first, so one is ignored | ||
assert_eq!(visitor.sidecars.len(), 1); | ||
assert_eq!(visitor.sidecars[0], sidecar1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like I mentioned above, I don't think this is a case we want to optimize/test for.
continue; | ||
} | ||
|
||
if path.contains('/') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have consensus that path should only be the file name, add a comment explaining that we only allow filename and not path.
What changes are proposed in this pull request?
This PR introduces the
Sidecar
action and its associatedSidecarVisitor
.This action and visitor is used for the V2 checkpoints Reader/Writer table feature.
Edge cases:
path
field is a full file-path instead of just the filename. Ref to issue: Support full paths instead of only file-names in sidecar actions'spath
field #675resolves #668
Note: PR also simplifies some conditional flags detailed in issue: #672
How was this change tested?
Sidecar
worksSidecar
actionspaths
in sidecar actions are ignoredVisitorError
is returned when full file path is passed inpath
field