You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
appendOnly as an inter-snapshot validation check that rejects any attempt to update a dataset with changes that modify data that existed in the prior snapshot. If appendOnly has been true for all snapshots of a dataset, the dataset is truly append only.
Why we need it
@ajbouh has requested append-only capacity for datasets, citing append-only as an important rule for Machine Learning use cases. We've encountered many other use cases where append-only is a necessity, so I think it makes sense to do it now.
Why it's wrong and bad
Chucking appendOnly in as a struct field is a shortsighted footgun. We should properly plan this feature out, looking for other edit restrictions & structure this as a configuration instance in a broader category of inter-snapshot validation.
If anyone has time or energy to invest in this better solution, I'm interested. But we have neither time nor energy, and append-only is such a frequently-cited use that I'm a fan of having it as an explicit flag that doesn't require me to read through, like, all the documentation. Deadline to object to us using this footgun is 2018-04-13. Comment on this issue to prevent footgunning.
How we gonna do it
Any dataset writes must pass through dsfs.CreateDataset. CreateDataset rejects when validate.Dataset returns an error. So we need a few things:
a way to signal to validate.Dataset that it should validate that a change is append-only
a validator that actually checks for append-only-ness
a code path that calls that validator in validate.Dataset
a well-written error to send back when things are bad
Thankfully, our jsonschema package comes with support for custom validators. So let's implement this as a custom jsonschema validator, registered under the appendOnly keyword that accepts a bool value. The validator should use a factory function pattern to generate validator instances that check against an existing dataset. Put another way, on each call to dataset.Validate, we need to generate a new appendOnly validator from the dataset we were given, because the check has to happen in context of a snapshot. This is the nasty part of this job, but will be worth it. In the future we can use this technique to do all sorts of inter-snapshot checks, and be able to control performance characteristics on the spot.
So this is also going to require a PR on jsonschema. I'll file that issue later. Basic gist is we need to change this to use some sort of decoder-generator pattern instead of a hardcoded DefaultValidator thingie.
Side Effects
after this PR, any time we validate a dataset, it must call validate.Dataset, because that's where we're loading this custom stuff, and I don't want this logic repeated all over the place
this is going to cause some performance slowdowns, because dataset.Validate is now going to need to be able to load the previous dataset's data to do validation. I'm thinking we'll compensate for this by transforming dataset.PrevPath to be a *dataset.Dataset that can carry an in-memory reference. We should work to make sure these slowdowns only trigger when appendOnly is in play.
As per our Q2 goals, I'm not allowed to ship this issue. So someone else is going to have to step up 😄
The text was updated successfully, but these errors were encountered:
I should point out that performance-wise, my alternative to a qri-specific approach would be pretty fast. It's just: demand that the first N bytes of the current dataset have the same checksum that the previous version had, where N is the length of the previous dataset in bytes. This places a few restrictions on data layout and things like that, but is overall fairly easy to implement, optimize, and understand.
Ah yes, I should be clear that the performance hit I'm referring to here comes from data overfetching our architecture currently implies. I'm hoping to tackle after we land the change in future versions, just wanted to make clear in the issue that I'm fine with a short-term slowdown to land this feature
What is it
appendOnly
as an inter-snapshot validation check that rejects any attempt to update a dataset with changes that modify data that existed in the prior snapshot. IfappendOnly
has been true for all snapshots of a dataset, the dataset is truly append only.Why we need it
@ajbouh has requested append-only capacity for datasets, citing append-only as an important rule for Machine Learning use cases. We've encountered many other use cases where append-only is a necessity, so I think it makes sense to do it now.
Why it's wrong and bad
Chucking
appendOnly
in as a struct field is a shortsighted footgun. We should properly plan this feature out, looking for other edit restrictions & structure this as a configuration instance in a broader category of inter-snapshot validation.If anyone has time or energy to invest in this better solution, I'm interested. But we have neither time nor energy, and append-only is such a frequently-cited use that I'm a fan of having it as an explicit flag that doesn't require me to read through, like, all the documentation. Deadline to object to us using this footgun is 2018-04-13. Comment on this issue to prevent footgunning.
How we gonna do it
Any dataset writes must pass through
dsfs.CreateDataset
. CreateDataset rejects whenvalidate.Dataset
returns an error. So we need a few things:Given that it's validation oriented, the natural place to put the
appendOnly
flag is indataset.Structure.Schema
. The field itself is a jsonschema, and jsonschema doesn't support this kind of validation out of the box. There is some tangentially-related discussion regarding changing jsonschema for off-label uses here:https://github.com/json-schema-org/json-schema-spec/issues/309
Thankfully, our jsonschema package comes with support for custom validators. So let's implement this as a custom jsonschema validator, registered under the
appendOnly
keyword that accepts a bool value. The validator should use a factory function pattern to generate validator instances that check against an existing dataset. Put another way, on each call todataset.Validate
, we need to generate a newappendOnly
validator from the dataset we were given, because the check has to happen in context of a snapshot. This is the nasty part of this job, but will be worth it. In the future we can use this technique to do all sorts of inter-snapshot checks, and be able to control performance characteristics on the spot.So this is also going to require a PR on jsonschema. I'll file that issue later. Basic gist is we need to change this to use some sort of decoder-generator pattern instead of a hardcoded
DefaultValidator
thingie.Side Effects
dataset.PrevPath
to be a*dataset.Dataset
that can carry an in-memory reference. We should work to make sure these slowdowns only trigger whenappendOnly
is in play.As per our Q2 goals, I'm not allowed to ship this issue. So someone else is going to have to step up 😄
The text was updated successfully, but these errors were encountered: