-
Notifications
You must be signed in to change notification settings - Fork 415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add invariant enforcement support #834
Conversation
8d160d7
to
55a57a2
Compare
I think this is a good idea 👍 |
a00af2c
to
a013681
Compare
BTW I just measured, and adding the |
|
||
let mut violations: Vec<String> = Vec::new(); | ||
|
||
for invariant in self.invariants.iter() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you test whether it would be faster to concatenate all invariants into a single filter clause using OR
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry I missed this. I did it separately so that we can provide the user the exact value that violated an invariant. But if it become a performance problem we can re-evaluate. Probably will care more for Constraits than invariants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
If there are particular fat in datafusion that we don't need, we could send upstream PRs to gate those code by features. Also now that datafusion has become a hard dependency for the python binding, we can start to expose fancy query capabilities in the python libraries. For example, we can now add a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really excited to to see this feature land! Great job!
Left a comment about creating the tokio runtime, but we can figure that out once we integrate this into the rust write path.
rust/src/delta_datafusion.rs
Outdated
Self { | ||
invariants, | ||
ctx: SessionContext::new(), | ||
rt: tokio::runtime::Runtime::new().unwrap(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't fully thought this through yet, but I think if we want to use this on the rust side as well, we may run into issues trying to crate a new runtime within a currently running one.
Would it make sense to make check_batch
and enforce_invariants
async, and create the runtime inside the call of the PyDeltaDataChecker
? Then again, for the python users this would be a non-breaking change and for rust there is nothing to break yet, so I guess we can figure this out, once we integrate in the rust write path.
Yes, agree with Robert that making these methods async and manage the tokio
runtime outside of our rust core is the right thing to do here.
…On Sun, Sep 25, 2022 at 23:02 Robert Pack ***@***.***> wrote:
***@***.**** approved this pull request.
Really excited to to see this feature land! Great job!
Left a comment about creating the tokio runtime, but we can figure that
out once we integrate this into the rust write path.
------------------------------
In rust/src/delta_datafusion.rs
<#834 (comment)>:
> @@ -603,11 +606,84 @@ fn left_larger_than_right(left: ScalarValue, right: ScalarValue) -> Option<bool>
}
}
+/// Responsible for checking batches of data conform to table's invariants.
+pub struct DeltaDataChecker {
+ invariants: Vec<Invariant>,
+ ctx: SessionContext,
+ rt: tokio::runtime::Runtime,
+}
+
+impl DeltaDataChecker {
+ /// Create a new DeltaDataChecker
+ pub fn new(invariants: Vec<Invariant>) -> Self {
+ Self {
+ invariants,
+ ctx: SessionContext::new(),
+ rt: tokio::runtime::Runtime::new().unwrap(),
Haven't fully thought this through yet, but I think if we want to use this
on the rust side as well, we may run into issues trying to crate a new
runtime within a currently running one.
Would it make sense to make check_batch and enforce_invariants async, and
create the runtime inside the call of the PyDeltaDataChecker? Then again,
for the python users this would be a non-breaking change and for rust there
is nothing to break yet, so I guess we can figure this out, once we
integrate in the rust write path.
—
Reply to this email directly, view it on GitHub
<#834 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAFDUXSXS67CZCUD2GMYYGTWAE35ZANCNFSM6AAAAAAQSU4TUU>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
I have been struggling with this for some time, and am currently experimenting with an updated writer implementation that can honor some of the delta settings. We do have a writer trait already, but right now that is simply used to have support for different types of data being written (json, record_batch, ..) Maybe that writer trait should eventually expose functions for checking invariants and constraints as well a |
Description
Adds support to retrieve invariants from the Delta schema and also a struct
DeltaDataChecker
to use DataFusion to check them and report useful errors.This also hooks it up to the Python bindings, allowing
write_deltalake()
to support Writer Protocol V2.I looked briefly at the Rust writer, but then realized we don't want to introduce a dependency on DataFusion. We should discuss how we want to design that API. I suspect we'll turn DeltaDataChecker into a trait, so we can have a DataFusion one available but also allow other engines to implement it themselves if they don't wish to use DataFusion.
Related Issue(s)
enforce_invariant()
function #592Documentation
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#column-invariants