
Evaluating Record-to-Record references #34

Merged Aug 1, 2018 (4 commits from foreign-key-adr into master)
Conversation

@jeancochrane (Contributor) commented Jul 16, 2018:

Overview

This PR adds an ADR evaluating the idea of Record-to-Record references proposed in #32. It also makes a minor update to the README in this library to create a table of contents for existing documentation.

In the ADR, I outline what I believe to be the minimum requirements that a Record-to-Record reference field must fulfill in order to be a feasible feature. Then, I present findings on each of these requirements.

Notes

  • Best viewed on GitHub.

  • An ADR is perhaps an unusual choice for this document, since I'm really just presenting my research on whether the feature is feasible or not. However, considering that the research is tangled up in the question "would it be possible and valuable to implement this feature in the next five weeks," the ADR format felt right to me as I was writing.

Closes #36.

@ddohler (Contributor) left a comment:

Thanks for digging into this! Hopefully the comments I left are helpful -- let me know if anything's not clear!

> Null pointers are indeed a problem with Record-to-Record foreign keys, but this
> problem is already endemic to Grout: if a field is removed from a schema during
> an upgrade to a new version (i.e. a definition of a new RecordSchema), Grout
> silently permits the field to remain but become inaccessible. From the reverse
@ddohler (Contributor):

I believe the deleted field is still accessible when viewing a record, but it won't be usable for filtering.

@jeancochrane (Contributor, Author):

Good point! That's an important distinction. So a more accurate breakdown of "accessibility" is:

| Field type | Accessible | Filterable |
| --- | --- | --- |
| Normal field | X | X |
| Deleted field | X | |
| Deleted Record-to-Record reference | | |

> perspective, if a field is _added_ in the new schema, old records will not store
> values for that field, and so will not appear in the relevant filters.
>
> While not precisely a "null pointer," the issue is functionally the same: some data
@ddohler (Contributor):

I don't quite agree with this, because in the case of schema changes, Records are always generated with a foreign key to the RecordSchema that was active at the time they were created. It's true that the frontend currently just shrugs and assumes that all Records have the same RecordSchema, but the data is still available in the database, and it's not too hard to imagine a smarter frontend that would explicitly allow users to interact with the different versions of the RecordSchema. By squashing the schema versions together you could assemble a set of filters that would be applicable to all records, which would allow older records to be filtered based on the fields that they actually have. In the case of a deleted related Record, that data is actually gone, so I think we'd have to come up with a different approach for handling this case.

@jeancochrane (Contributor, Author):

That's a good point! When I was first investigating the data model, I put some thought into versioning for Records (notes here: azavea/ashlar#80). Reading back on that, it seems to imply that Records and RecordTypes cannot be deleted; they can simply have an `active` value of True or False. I wonder if this is already exposed in the data collector frontend? If it's truly impossible for a user to fully delete a Record (without running a manual SQL update), would that address some of your concern?

@ddohler (Contributor):

Yeah, I think not allowing deletions is potentially a good way to address this. If deletion is not allowed, a Record would always have the data it was created with available, even if that data was marked deleted. Versioning would be even better, because then if a referent gets changed, any records referring to it would still have access to the original data. As an alternative to ON DELETE I kind of prefer this because I think it aligns better with the data model for Schemas, and because it keeps deletions constant-time. With ON DELETE, a deletion could potentially be very costly if the Record being deleted was referred to by many other Records.

> a `RefResolver` class that it uses in the `validate` function for [resolving external
> references](https://python-jsonschema.readthedocs.io). It seems relatively
> straightforward to extend this resolver to check references by pinging the Grout
> API.
@ddohler (Contributor):

One nice thing about the existing schema validation is that it's theoretically vanilla JSONSchema; you could run a Grout Schema and Record through any JSONSchema validator and get the same result. Correct me if I'm misunderstanding, but it looks like this is suggesting adding custom validation logic that a pure JSONSchema validator wouldn't be able to understand, because it requires knowledge of Grout's conventions to follow the foreign key (which would just look like a UUID string to a standard JSONSchema library)?

@jeancochrane (Contributor, Author):

I think that's indeed what I'm proposing here, and your criticism is on point: to move in this direction would be to tightly integrate with the particular implementation of python-jsonschema. Your proposal to use the $ref attribute instead seems like a much better approach.

Follow-up comment:

For what it's worth, one of the modifications needed to the JSON schema parser used for the Android app was to parse URL $ref targets, here.

> - `jsonb_path_ops` indexes, which only support the containment operator `@>`
>   but are much faster (least flexible, most performant)
>
> Since we currently only use the containment operator in Grout anyway, I see no
@ddohler (Contributor):

It looks like djsonb implements other operators, such as `?`, which the default index supports but `jsonb_path_ops` does not. We don't seem to ever use any of those operators, though, so I agree that we could currently use `jsonb_path_ops` for a presumed speed boost. However, we've discussed experimenting with moving away from djsonb to Django's built-in JSONB support, and I don't know what operators that makes use of.

@ddohler (Contributor):

I also dimly remember that we maybe experimented with the two different index types and didn't see a big difference between using jsonb_path_ops and not.

@jeancochrane (Contributor, Author):

That's good to know! I'll flesh out this section a little bit with information about Django's implementation. I think it would make sense to use the default index type, and then consider using jsonb_path_ops if performance becomes a serious concern.
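
For concreteness, here is a minimal sketch of the two index variants being weighed, issued as raw SQL via psycopg2. The `grout_record` table and `data` column names are assumptions for illustration, not Grout's actual schema:

```python
# Sketch only: assumes a "grout_record" table with a jsonb "data" column.
import psycopg2

conn = psycopg2.connect("dbname=grout")
with conn, conn.cursor() as cur:
    # Default jsonb operator class: supports @>, ?, ?|, and ?&.
    cur.execute("CREATE INDEX record_data_gin ON grout_record USING GIN (data)")
    # jsonb_path_ops: supports only containment (@>), but the index is
    # smaller and containment lookups are typically faster.
    cur.execute(
        "CREATE INDEX record_data_pathops ON grout_record "
        "USING GIN (data jsonb_path_ops)"
    )
conn.close()
```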

"title": "RecordType of the related Record",
"type": "string",
"format": "select",
"enumSource": [["Will be dynamically overwritten"]]
@ddohler (Contributor):

An alternative way to do this, which I think might address the issue about custom validation below, would be to create Grout endpoints that provide a list of the IDs of valid referents, which you could then include via $ref. That might look something like this:

"referenceTarget: {
    "allOf": [{ "$ref": "https://my-grout-server.com/api/record-types/<UUID>/referents.json" }]
}

and then the API would return something pretty simple:

{
    "type": "string",
    "enum": [ <list of valid UUIDs> ]
}

The benefit of this would be that a standard JSONSchema validator would be able to follow the reference, retrieve the list of valid UUIDs, and validate the record, without making any changes to library functionality.

The downside (and this is similar to the null-pointer issue, below) is that the schema would become somewhat slippery: deleting Records referenced by other Records could cause those records to no longer validate, even if they were valid when they were created and the base schema hasn't changed. This could be addressed with some additional bookkeeping behind the scenes: when a potential referent is deleted, we also either delete or nullify (depending on whether the schema allows nulls) any records that depend on it.

Figuring out authentication might also be tricky.
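
To make the benefit above concrete, here is a minimal sketch of validating a Record with stock python-jsonschema, assuming the hypothetical referents.json endpoint existed (the URL is the placeholder from the example; nothing here is an existing Grout API):

```python
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "referenceTarget": {
            "allOf": [
                # A standard validator fetches this remote $ref over HTTP
                # on its own; no Grout-specific logic is involved.
                {"$ref": "https://my-grout-server.com/api/record-types/<UUID>/referents.json"}
            ]
        }
    },
}

record = {"referenceTarget": "00000000-0000-0000-0000-000000000000"}

# Raises ValidationError if the UUID is not in the enum the endpoint returns.
jsonschema.validate(record, schema)
```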

@jeancochrane (Contributor, Author):

I like the $ref pattern you sketched out a lot -- really nice that it can express the idea in pure JSONSchema!

It seems like you're suggesting implementing a kind of limited ON DELETE standard, which would make a lot of sense. At the very least, allowing the equivalents of CASCADE and SET NULL seems reasonable to me.

I still don't quite understand the validation issue, however. My understanding is that a Record only gets validated when it enters the database (via POST) or is updated (via PUT). To me, it seems desirable to complain loudly in those two cases if a reference doesn't resolve properly. When else might we expect validation to get triggered?

@ddohler (Contributor):

My initial concern with validation was that we currently have as an invariant that `jsonschema.validate(record.schema.schema, record.data)` (made-up function) is always true. This would make that no longer an invariant, because some parts of the schema would be dynamic based on database state, but I didn't have a more concrete worry than that. Now that you've narrowed things down to creation and update (which I think is correct), my primary concern is about the update case -- it doesn't seem very nice to allow a record to silently get into an invalid state, such that an unsuspecting user could try to make an edit unrelated to the external reference and find themselves receiving validation errors. I think that having ON DELETE equivalents would still solve the issue, however.
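
(As an aside, python-jsonschema's actual argument order is instance first, then schema; a self-contained toy version of that invariant, with stand-in values for the made-up attributes:)

```python
import jsonschema

# Stand-ins for record.schema.schema and record.data:
schema = {"type": "object", "properties": {"title": {"type": "string"}}}
data = {"title": "Sample Poster"}

# Passes silently while the invariant holds; raises ValidationError otherwise.
jsonschema.validate(data, schema)
```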

Follow-up comment:

This may be a case where the JSON schema standard hasn't quite caught up with the usage we want, but it's under consideration for draft 4. I also stumbled across some good discussion here: json-schema-org/json-schema-spec#514 regarding delegation versus inclusion for reference targets.

@ddohler (Contributor):

I think that's probably okay then; it looks like they're up to Draft 7 now: http://json-schema.org/specification-links.html#draft-4. It's very confusing that they just call everything drafts; I'm unclear whether any of these have been finalized.

Follow-up comment:

That is confusing. The json-editor library README claims compatibility with both versions 3 and 4, so it's unclear where they intend to stand on reference behavior then.

As an aside, I just noticed a Vue JSON schema validator exists. It's linked to in the schema docs, here. Maybe that would be helpful to look at for some of the front-end work.

Follow-up comment:

The PR that closed the delegation-vs-inclusion issue was just merged eighteen days ago, so this aspect does seem to be under current development.


> #### Lookups
>
> Since Record-to-Record foreign keys would store UUIDs (as strings) for the
@ddohler (Contributor):

I'd like to see some more detail here; we do some queries in DRIVER that are quite a bit more complex than the API lookups that the base Grout API offers. I don't think we necessarily need to be able to accomplish exactly the same queries with external references as we can accomplish with internal references, but it would be helpful to know where the boundaries are.

In particular, the query I'm thinking of is for generating some of the analysis that DRIVER provides: https://github.com/WorldBank-Transport/DRIVER/blob/master/app/data/views.py#L332

That query is a bit dense, but the gist of it is that it allows you to ask questions like "for each different vehicle type, how many crashes involved vehicles of that type during each of the past 12 months?" Currently, that can be done with a single (complicated) query (well, two, I think, one to get the schema up front). You could imagine wanting to ask a similar question about external references as well, for example, "For each Event genre (assuming you have some sort of classification system for Events), how many posters were recorded over each of the past 12 months?"

The reason I'm asking this is that I think a lot of people reach for a normalized data model by instinct, whereas Grout currently only offers a denormalized data model. If we add normalized data model capabilities, I think one thing that'll be important is to have a clear understanding for when to choose one over the other, and knowing if there are any limitations on the analysis you can do with normalized data will be a big part of developing that understanding.
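
To give a feel for what the external-reference version of that aggregation might look like, here is a hedged sketch that joins Records through a UUID stored in the jsonb blob. Every table, column, and field name here (`grout_record`, `uuid`, `occurred_from`, `data`, `posterDetails`, `eventGenre`, `genreDetails`) is an assumption for illustration -- this is not the DRIVER query:

```python
# Sketch only: "for each Event genre, how many Posters were recorded over
# each of the past 12 months?", with the genre reference stored in jsonb.
import psycopg2

QUERY = """
SELECT genre.data->'genreDetails'->>'title' AS genre,
       date_trunc('month', poster.occurred_from) AS month,
       count(*) AS posters
  FROM grout_record AS poster
  JOIN grout_record AS genre
    ON genre.uuid::text = poster.data->'posterDetails'->>'eventGenre'
 WHERE poster.occurred_from >= now() - interval '12 months'
 GROUP BY genre, month
 ORDER BY genre, month
"""

conn = psycopg2.connect("dbname=grout")
with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for genre, month, n in cur.fetchall():
        print(genre, month, n)
conn.close()
```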

@jeancochrane (Contributor, Author):

I appreciate the thoroughness here -- I breezed past this section sort of assuming it was solved, so it's helpful to have some concrete problems to look at. I don't have much to say about it now since I'm still getting my head around the particular problem, but I agree that knowing the limits of this approach for querying ahead of time is critical.

@jeancochrane (Contributor, Author) left a comment:

Thanks for your detailed review on this, @ddohler! You identified a lot of areas that I hadn't put enough thought into, which was exactly what I was hoping for. I left some replies to your comments, but if writing about this PR is becoming a big time sink, you should feel free to table it for now -- I'd be happy to have a quick meeting at some point this week to clarify our understanding together.

"title": "RecordType of the related Record",
"type": "string",
"format": "select",
"enumSource": [["Will be dynamically overwritten"]]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the $ref pattern you sketched out a lot -- really nice that it can express the idea in pure JSONSchema!

It seems like you're suggesting implementing a kind of limited ON DELETE standard, which would make a lot of sense. At the very least, allowing the equivalents of CASCADE and SET NULL seems reasonable to me.

I still don't quite understand the validation issue, however. My understanding is that a Record only gets validated when it enters the database (via POST) or is updated (via PUT). To me, it seems desirable to complain loudly in those two cases if a reference doesn't resolve properly. When else might we expect validation to get triggered?

Null pointers are indeed a problem with Record-to-Record foreign keys, but this
problem is already endemic to Grout: if a field is removed from a schema during
an upgrade to a new version (i.e. a definition of a new RecordSchema), Grout
silently permits the field to remain but become unaccessible. From the reverse
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! That's an important distinction. So a more accurate breakdown of "accessibility" is:

Field type is accessible is filterable
Normal field X X
Deleted field X
Deleted Record-to-Record reference

perspective, if a field is _added_ in the new schema, old records will not store
values for that field, and so will not appear in the relevant filters.

While not precisely a "null pointer," the issue is functionally the same: some data
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! When I was first investigating the data model, I put some thought into versioning for Records (notes here: azavea/ashlar#80). Reading back on that, it seems to imply that Records and RecordTypes cannot be deleted, they can simply have an active value of True or False. I wonder if this is already exposed in the data collector frontend? If it's truly impossible for a user to fully delete a Record (without calling a manual SQL update), would that address some of your concern?


#### Lookups

Since Record-to-Record foreign keys would store UUIDs (as strings) for the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the thoroughness here -- I breezed past this section sort of assuming it was solved, so it's helpful to have some concrete problems to look at. I don't have much to say about it now since I'm still getting my head around the particular problem, but I agree that knowing the limits of this approach for querying ahead of time is critical.

a `RefResolver` class that it uses in the `validate` function for [resolving external
references](https://python-jsonschema.readthedocs.io). It seems relatively
straightforward to extend this resolver to check references by pinging the Grout
API.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's indeed what I'm proposing here, and your criticism is on point: to move in this direction would be to tightly integrate with the particular implementation of python-jsonschema. Your proposal to use the $ref attribute instead seems like a much better approach.

- `jsonb_path_ops` indexes, which only support the containment operator `@>`
but are much faster (least flexible, most performant)

Since we currently only use the containment operator in Grout anyway, I see no
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good to know! I'll flesh out this section a little bit with information about Django's implementation. I think it would make sense to use the default index type, and then consider using jsonb_path_ops if performance becomes a serious concern.

@jeancochrane (Contributor, Author) commented Jul 18, 2018:

Your comments about normalized data models got me thinking. For reference, here's what you said in the review:

> The reason I'm asking this is that I think a lot of people reach for a normalized data model by instinct, whereas Grout currently only offers a denormalized data model. If we add normalized data model capabilities, I think one thing that'll be important is to have a clear understanding for when to choose one over the other, and knowing if there are any limitations on the analysis you can do with normalized data will be a big part of developing that understanding.

This struck me as very true, and had me wondering: How could I accomplish what I wanted without a normalized data model? Is there an inherent hierarchy to the data that is relevant to the demo app?

I took a moment to sketch the data model in normalized/denormalized fashions. In an ideal world, here's what the normalized data model would look like (apologies for the ad hoc pseudo-data-model syntax):

EventSeries {
    title
}

PosterSeries {
    title
}

Event {
    date/time
    location
    series -> EventSeries
}

Poster {
    date/time
    location
    image
    series -> PosterSeries
    event -> Event
}

To my surprise, the data model is decently hierarchical. It fits nicely in a denormalized model:

EventSeries {
  title
  events [
    Event {
      date/time
      location
      posterseries [
        PosterSeries {
          title
          posters [
            Poster {
              date/time
              location
              image
            }
            Poster {
              date/time
              location
              image
            }
          ]
        }
      ]
    }
  ]
}

Within each of the nested objects, you could imagine many children -- i.e. an EventSeries can contain multiple Events, an Event can contain multiple PosterSeries, and a PosterSeries can contain multiple posters (as it does in the example above).

So, it seems there's a nonrelational way to store the information I care about in this app. Immediately, a few questions jump out at me:

  1. What information do we lose by denormalizing the data model? Aggregate queries will become more expensive, particularly for geospatial questions -- e.g. "How many Posters are within a mile of their Event?" (see the sketch after this list). But in general, I'm surprised by how much of the model is captured by a pure hierarchy.

  2. Is this possible under the current Grout data model? Not really. While we could potentially convert the data model above into a RecordSchema for a single parent RecordType (EventSeries), there are a few problems we would have to address:

    • The schema relies on nested objects, which are not supported in the schema editor
    • The schema stores geospatial information inside nested objects, as opposed to on the top-level of the Record
      • This violates the expectations of the data model, and also isn't supported in the schema editor
  3. If this isn't possible under the current Grout data model, is it more difficult to accommodate this type of work, or to allow foreign keys? I go back and forth on this one. Curious to hear what you think!
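
To make question 1 concrete, here is the "Posters within a mile of their Event" query sketched under the normalized model, where each Poster stores its Event's UUID in jsonb. As above, the table/column/field names and the presence of PostGIS are assumptions:

```python
# Sketch only: counts Posters within a mile (~1609 m) of their referenced Event.
import psycopg2

QUERY = """
SELECT count(*)
  FROM grout_record AS poster
  JOIN grout_record AS event
    ON event.uuid::text = poster.data->'posterDetails'->>'event'
 WHERE ST_DWithin(poster.geom::geography, event.geom::geography, 1609)
"""

conn = psycopg2.connect("dbname=grout")
with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    print(cur.fetchone()[0])
conn.close()
```

Under the denormalized model, the same question would first require unnesting Posters and Events out of the JSON hierarchy, which is part of the trade-off question 3 is weighing.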

@flibbertigibbet left a comment:

Great discussion.

This can be revisited later as our understanding of needs evolves.

@jeancochrane (Contributor, Author) commented Jul 31, 2018:

I integrated many of our discussions into the ADR (and linked to the threads where appropriate), and updated the "Implementation" section with a more robust sketch of the schema that can validate using $refs.

I think I'm feeling comfortable moving forward with this, but I'm also unsure how to prioritize it. It seems like this feature is critical for the demo app, but potentially not a major feature for prospective new users -- my inclination is that the latter is more important than the former in our remaining fellowship time. I'm leaning in the direction of prioritizing other work to improve the schema editor (#21 and #29) before embarking on this journey.

@ddohler (Contributor) left a comment:

This is great! I agree with you that this puts us in a good place to move forward with implementation if we want to. I added a couple more issues that I thought of with this scheme, but I don't think they are dealbreakers.

I also agree that given the amount of time we have left and the fact that the demo app is in a pretty good place, it may make more sense to lock in some concrete (and important!) usability improvements rather than trying to tackle a major new feature that may end up incomplete or rushed.

> all Records for a RecordType regardless of whether or not they're active could
> potentially be confusing for an end user doing data collection work, both
> because it might be hard to tell which Records are active and because the
> endpoint may return overwhelmingly large lists of Records. Records
@ddohler (Contributor):

Another potential issue is that the referents.json endpoint could end up timing out or blowing up the memory of the JSON-Schema validator if the number of records is very large.

This also might require referents.json to be public, which could be problematic in some situations. Ordinarily I wouldn't think that a list of Record UUIDs would be an issue, but depending on the type of information being stored, even the count of active records could potentially be sensitive.

@jeancochrane (Contributor, Author):

Great points! I added two notes to this section based on your comment.

@jeancochrane merged commit eb6b62a into master on Aug 1, 2018, and deleted the foreign-key-adr branch.