Add invariant enforcement support #834

Merged on Sep 28, 2022 (15 commits)

wjones127
Collaborator

@wjones127 wjones127 commented Sep 22, 2022

Description

Adds support to retrieve invariants from the Delta schema and also a struct DeltaDataChecker to use DataFusion to check them and report useful errors.

This also hooks it up to the Python bindings, allowing write_deltalake() to support Writer Protocol V2.

I looked briefly at the Rust writer, but then realized we don't want to introduce a dependency on DataFusion. We should discuss how we want to design that API. I suspect we'll turn DeltaDataChecker into a trait, so we can have a DataFusion one available but also allow other engines to implement it themselves if they don't wish to use DataFusion.
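
A minimal sketch of what the trait-based design suggested above might look like. Everything here is an assumption for illustration: the trait name `DataChecker`, the `check_batch` signature, and the toy `MinValueChecker` engine (which stands in for a DataFusion-backed implementation) are hypothetical and not the delta-rs API.

```rust
/// A single reported violation (hypothetical shape).
#[derive(Debug, PartialEq)]
pub struct InvariantViolation {
    pub message: String,
}

/// Hypothetical trait: each engine supplies its own way to evaluate invariants,
/// so the core writer never depends on DataFusion directly.
pub trait DataChecker {
    /// The engine's in-memory batch type (e.g. an Arrow RecordBatch in practice).
    type Batch;

    /// Returns Ok(()) if every row satisfies every invariant,
    /// otherwise the list of violations found.
    fn check_batch(&self, batch: &Self::Batch) -> Result<(), Vec<InvariantViolation>>;
}

/// Toy engine for illustration only: one column of i64 values and a
/// single hard-coded invariant of the form `field_name > min_exclusive`.
pub struct MinValueChecker {
    pub field_name: String,
    pub min_exclusive: i64,
}

impl DataChecker for MinValueChecker {
    type Batch = Vec<i64>;

    fn check_batch(&self, batch: &Vec<i64>) -> Result<(), Vec<InvariantViolation>> {
        let violations: Vec<InvariantViolation> = batch
            .iter()
            .filter(|v| **v <= self.min_exclusive)
            .map(|v| InvariantViolation {
                message: format!(
                    "Invariant ({} > {}) violated by value {}",
                    self.field_name, self.min_exclusive, v
                ),
            })
            .collect();
        if violations.is_empty() {
            Ok(())
        } else {
            Err(violations)
        }
    }
}

/// A write path written against the trait works with any engine.
pub fn write_checked<C: DataChecker>(
    checker: &C,
    batch: &C::Batch,
) -> Result<(), Vec<InvariantViolation>> {
    checker.check_batch(batch)
    // ... the actual write would follow here ...
}
```

With this shape, a DataFusion-backed checker could live behind a feature flag while other engines implement the same trait on their own batch types.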

Related Issue(s)

Documentation

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#column-invariants

@houqp
Member

houqp commented Sep 25, 2022

I suspect we'll turn DeltaDataChecker into a trait, so we can have a DataFusion one available but also allow other engines to implement it themselves if they don't wish to use DataFusion.

I think this is a good idea 👍

@wjones127
Collaborator Author

BTW I just measured, and adding the datafusion-ext feature into the Python package increases the (release) wheel size from 11MB to 18MB. I think that's fine, but we might keep an eye on the size and look out for opportunities to slim it down.

@wjones127 wjones127 marked this pull request as ready for review September 25, 2022 19:14

let mut violations: Vec<String> = Vec::new();

for invariant in self.invariants.iter() {
Member

Did you test whether it would be faster to concatenate all invariants into a single filter clause using OR?

Collaborator Author

Oh sorry, I missed this. I did it separately so that we can provide the user the exact value that violated an invariant. But if it becomes a performance problem we can re-evaluate. We'll probably care more about this for constraints than for invariants.
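
To make the trade-off in this thread concrete, here is a rough sketch of the two query shapes being compared. The SQL the PR actually issues is not shown in this conversation, so the query text, the `Invariant` struct, and both function names below are assumptions for illustration only.

```rust
/// Hypothetical invariant: a column name and a SQL predicate that
/// must hold for every row of that column.
pub struct Invariant {
    pub field_name: String,
    pub invariant_sql: String,
}

/// One query per invariant. Each query returns a single offending value
/// from the relevant column, so the error can name both the invariant
/// and the exact value that broke it.
pub fn per_invariant_queries(table: &str, invariants: &[Invariant]) -> Vec<String> {
    invariants
        .iter()
        .map(|inv| {
            format!(
                "SELECT {} FROM {} WHERE NOT ({}) LIMIT 1",
                inv.field_name, table, inv.invariant_sql
            )
        })
        .collect()
}

/// A single query over all invariants, joined with OR. This needs only one
/// scan, but a returned row does not say which invariant it violated.
pub fn combined_query(table: &str, invariants: &[Invariant]) -> Option<String> {
    if invariants.is_empty() {
        return None;
    }
    let clause = invariants
        .iter()
        .map(|inv| format!("NOT ({})", inv.invariant_sql))
        .collect::<Vec<_>>()
        .join(" OR ");
    Some(format!("SELECT * FROM {} WHERE {} LIMIT 1", table, clause))
}
```

The per-invariant form costs one scan per invariant but gives precise error messages; the combined form could serve as a fast pre-check, falling back to the per-invariant queries only when it returns a row.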

houqp
houqp previously approved these changes Sep 26, 2022
Member

@houqp houqp left a comment

Great work!

@houqp
Member

houqp commented Sep 26, 2022

BTW I just measured, and adding the datafusion-ext feature into the Python package increases the (release) wheel size from 11MB to 18MB

If there is particular fat in datafusion that we don't need, we could send upstream PRs to gate that code behind features.

Also, now that datafusion has become a hard dependency for the python binding, we can start to expose fancy query capabilities in the python libraries. For example, we can now add a sql method to the python delta table class to allow users to query delta tables directly by passing a sql query!

roeap
roeap previously approved these changes Sep 26, 2022
Collaborator

@roeap roeap left a comment

Really excited to see this feature land! Great job!

Left a comment about creating the tokio runtime, but we can figure that out once we integrate this into the rust write path.

Self {
invariants,
ctx: SessionContext::new(),
rt: tokio::runtime::Runtime::new().unwrap(),
Collaborator

Haven't fully thought this through yet, but I think if we want to use this on the rust side as well, we may run into issues trying to create a new runtime within a currently running one.

Would it make sense to make check_batch and enforce_invariants async, and create the runtime inside the call of the PyDeltaDataChecker? Then again, for the python users this would be a non-breaking change and for rust there is nothing to break yet, so I guess we can figure this out once we integrate in the rust write path.

@houqp
Member

houqp commented Sep 26, 2022 via email

@roeap
Collaborator

roeap commented Sep 26, 2022

I looked briefly at the Rust writer, but then realized we don't want to introduce a dependency on DataFusion. We should discuss how we want to design that API. I suspect we'll turn DeltaDataChecker into a trait, so we can have a DataFusion one available but also allow other engines to implement it themselves if they don't wish to use DataFusion.

I have been struggling with this for some time, and am currently experimenting with an updated writer implementation that can honor some of the delta settings. We do have a writer trait already, but right now that is simply used to have support for different types of data being written (json, record_batch, ...). Maybe that writer trait should eventually expose functions for checking invariants and constraints as well as a max_supported_version. Then we could hopefully keep datafusion optional and enable higher writer versions with a "reasonable" amount of feature gates on the implementation.
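
One possible reading of this idea, as a rough sketch only. The trait name `InvariantAwareWriter`, the method signatures, the default values, and the two toy writers are all hypothetical; the existing delta-rs writer trait is not reproduced here.

```rust
/// Hypothetical extension of a writer trait (names are illustrative).
pub trait InvariantAwareWriter {
    type Batch;

    /// Highest Delta writer protocol version this writer can honor.
    /// Assumed default: version 1, i.e. no invariant enforcement.
    fn max_supported_version(&self) -> i32 {
        1
    }

    /// Check a batch against the table's invariants.
    /// Assumed default: no checks are performed.
    fn check_invariants(&self, _batch: &Self::Batch) -> Result<(), String> {
        Ok(())
    }
}

/// Refuse to write when the table requires a newer writer protocol
/// than the writer implementation supports.
pub fn ensure_can_write<W: InvariantAwareWriter>(
    writer: &W,
    table_min_writer_version: i32,
) -> Result<(), String> {
    let supported = writer.max_supported_version();
    if table_min_writer_version > supported {
        Err(format!(
            "table requires writer version {}, but writer supports up to {}",
            table_min_writer_version, supported
        ))
    } else {
        Ok(())
    }
}

/// A writer with no query engine: keeps the defaults.
pub struct PlainWriter;

impl InvariantAwareWriter for PlainWriter {
    type Batch = Vec<i64>;
}

/// Stand-in for an engine-backed writer that could sit behind a feature gate.
/// It enforces one toy invariant: every value must exceed `min_exclusive`.
pub struct CheckingWriter {
    pub min_exclusive: i64,
}

impl InvariantAwareWriter for CheckingWriter {
    type Batch = Vec<i64>;

    fn max_supported_version(&self) -> i32 {
        2
    }

    fn check_invariants(&self, batch: &Vec<i64>) -> Result<(), String> {
        match batch.iter().find(|v| **v <= self.min_exclusive) {
            Some(v) => Err(format!("value {} violates invariant", v)),
            None => Ok(()),
        }
    }
}
```

Under these assumptions, a build without datafusion would still compile and write to version 1 tables, while the engine-backed writer unlocks higher writer versions.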

@wjones127 wjones127 dismissed stale reviews from roeap and houqp via 6ef1086 September 28, 2022 02:10
@wjones127 wjones127 enabled auto-merge (squash) September 28, 2022 04:07
@wjones127 wjones127 merged commit e2cbc79 into delta-io:main Sep 28, 2022
@houqp houqp requested a review from roeap September 28, 2022 05:48
Development

Successfully merging this pull request may close these issues.

Write enforce_invariant() function
Python: Support writer protocol V2 in write_deltalake
3 participants