
Evaluating Record-to-Record references #34

Merged Aug 1, 2018 (4 commits from foreign-key-adr into master)
Conversation

@jeancochrane (Contributor) commented Jul 16, 2018:

Overview

This PR adds an ADR evaluating the idea of Record-to-Record references proposed in #32. It also makes a minor update to the README in this library to create a table of contents for existing documentation.

In the ADR, I outline what I believe to be the minimum requirements that a Record-to-Record reference field must fulfill in order to be a feasible feature. Then, I present findings on each of these requirements.

Notes

  • Best viewed on GitHub.

  • An ADR is perhaps an unusual choice for this document, since I'm really just presenting my research on whether the feature is feasible or not. However, considering that the research is tangled up in the question "would it be possible and valuable to implement this feature in the next five weeks," the ADR format felt right to me as I was writing.

Closes #36.

@ddohler (Contributor) left a comment:

Thanks for digging into this! Hopefully the comments I left are helpful -- let me know if anything's not clear!

> Null pointers are indeed a problem with Record-to-Record foreign keys, but this
> problem is already endemic to Grout: if a field is removed from a schema during
> an upgrade to a new version (i.e. a definition of a new RecordSchema), Grout
> silently permits the field to remain but become inaccessible. From the reverse
@ddohler (Contributor):

I believe the deleted field is still accessible when viewing a record, but it won't be usable for filtering.

@jeancochrane (Contributor, Author):

Good point! That's an important distinction. So a more accurate breakdown of "accessibility" is:

| Field type | Accessible | Filterable |
| --- | --- | --- |
| Normal field | X | X |
| Deleted field | X | |
| Deleted Record-to-Record reference | | |

> perspective, if a field is _added_ in the new schema, old records will not store
> values for that field, and so will not appear in the relevant filters.
>
> While not precisely a "null pointer," the issue is functionally the same: some data
@ddohler (Contributor):

I don't quite agree with this, because in the case of schema changes, Records are always generated with a foreign key to the RecordSchema that was active at the time they were created. It's true that the frontend currently just shrugs and assumes that all Records have the same RecordSchema, but the data is still available in the database, and it's not too hard to imagine a smarter frontend that would explicitly allow users to interact with the different versions of the RecordSchema. By squashing the schema versions together you could assemble a set of filters that would be applicable to all records, which would allow older records to be filtered based on the fields that they actually have. In the case of a deleted related Record, that data is actually gone, so I think we'd have to come up with a different approach for handling this case.

@jeancochrane (Contributor, Author):

That's a good point! When I was first investigating the data model, I put some thought into versioning for Records (notes here: azavea/ashlar#80). Reading back on that, it seems to imply that Records and RecordTypes cannot be deleted; they can simply have an `active` value of True or False. I wonder if this is already exposed in the data collector frontend? If it's truly impossible for a user to fully delete a Record (without running a manual SQL update), would that address some of your concern?

@ddohler (Contributor):

Yeah, I think not allowing deletions is potentially a good way to address this. If deletion is not allowed, a Record would always have the data it was created with available, even if that data was marked deleted. Versioning would be even better, because then if a referent gets changed, any records referring to it would still have access to the original data. As an alternative to ON DELETE I kind of prefer this because I think it aligns better with the data model for Schemas, and because it keeps deletions constant-time. With ON DELETE, a deletion could potentially be very costly if the Record being deleted was referred to by many other Records.

> a `RefResolver` class that it uses in the `validate` function for [resolving external
> references](https://python-jsonschema.readthedocs.io). It seems relatively
> straightforward to extend this resolver to check references by pinging the Grout
> API.
@ddohler (Contributor):

One nice thing about the existing schema validation is that it's theoretically vanilla JSONSchema; you could run a Grout Schema and Record through any JSONSchema validator and get the same result. Correct me if I'm misunderstanding, but it looks like this is suggesting adding custom validation logic that a pure JSONSchema validator wouldn't be able to understand, because it requires knowledge of Grout's conventions to follow the foreign key (which would just look like a UUID string to a standard JSONSchema library)?

@jeancochrane (Contributor, Author):

I think that's indeed what I'm proposing here, and your criticism is on point: to move in this direction would be to tightly integrate with the particular implementation of python-jsonschema. Your proposal to use the $ref attribute instead seems like a much better approach.

Follow-up comment:

For what it's worth, one of the modifications needed to the JSON schema parser used for the Android app was to parse URL $ref targets, here.

> - `jsonb_path_ops` indexes, which only support the containment operator `@>`
>   but are much faster (least flexible, most performant)
>
> Since we currently only use the containment operator in Grout anyway, I see no
@ddohler (Contributor):

It looks like djsonb implements other operators, such as `?`, which the default index supports but `jsonb_path_ops` does not. We don't seem to ever use any of those operators, though, so I agree that we could currently use `jsonb_path_ops` for a presumed speed boost. However, we've discussed experimenting with moving away from djsonb to Django's built-in JSONB support, and I don't know what operators that makes use of.

@ddohler (Contributor):

I also dimly remember that we maybe experimented with the two different index types and didn't see a big difference between using jsonb_path_ops and not.

@jeancochrane (Contributor, Author):

That's good to know! I'll flesh out this section a little bit with information about Django's implementation. I think it would make sense to use the default index type, and then consider using jsonb_path_ops if performance becomes a serious concern.
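
For concreteness, here is a minimal sketch of the two index variants being weighed, issued as raw SQL via psycopg2. The `grout_record` table and `data` column names are assumptions for illustration, not Grout's actual schema:

```python
# Sketch only: assumes a "grout_record" table with a jsonb "data" column.
import psycopg2

conn = psycopg2.connect("dbname=grout")
with conn, conn.cursor() as cur:
    # Default jsonb operator class: supports @>, ?, ?|, and ?&.
    cur.execute("CREATE INDEX record_data_gin ON grout_record USING GIN (data)")
    # jsonb_path_ops: supports only containment (@>), but the index is
    # smaller and containment lookups are typically faster.
    cur.execute(
        "CREATE INDEX record_data_pathops ON grout_record "
        "USING GIN (data jsonb_path_ops)"
    )
conn.close()
```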

"title": "RecordType of the related Record",
"type": "string",
"format": "select",
"enumSource": [["Will be dynamically overwritten"]]
@ddohler (Contributor):

An alternative way to do this, which I think might address the issue about custom validation below, would be to create Grout endpoints that provide a list of the IDs of valid referents, which you could then include via $ref. That might look something like this:

"referenceTarget: {
    "allOf": [{ "$ref": "https://my-grout-server.com/api/record-types/<UUID>/referents.json" }]
}

and then the API would return something pretty simple:

{
    "type": "string",
    "enum": [ <list of valid UUIDs> ]
}

The benefit of this would be that a standard JSONSchema validator would be able to follow the reference, retrieve the list of valid UUIDs, and validate the record, without making any changes to library functionality.

The downside (and this is similar to the null-pointer issue, below) is that the schema would become somewhat slippery: deleting Records referenced by other Records could cause those records to no longer validate, even if they were valid when they were created and the base schema hasn't changed. This could be addressed with some additional bookkeeping behind the scenes: when a potential referent is deleted, we also either delete or nullify (depending on whether the schema allows nulls) any records that depend on it.

Figuring out authentication might also be tricky.
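
To make the benefit above concrete, here is a minimal sketch of validating a Record with stock python-jsonschema, assuming the hypothetical referents.json endpoint existed (the URL is the placeholder from the example; nothing here is an existing Grout API):

```python
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "referenceTarget": {
            "allOf": [
                # A standard validator fetches this remote $ref over HTTP
                # on its own; no Grout-specific logic is involved.
                {"$ref": "https://my-grout-server.com/api/record-types/<UUID>/referents.json"}
            ]
        }
    },
}

record = {"referenceTarget": "00000000-0000-0000-0000-000000000000"}

# Raises ValidationError if the UUID is not in the enum the endpoint returns.
jsonschema.validate(record, schema)
```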

@jeancochrane (Contributor, Author):

I like the $ref pattern you sketched out a lot -- really nice that it can express the idea in pure JSONSchema!

It seems like you're suggesting implementing a kind of limited ON DELETE standard, which would make a lot of sense. At the very least, allowing the equivalents of CASCADE and SET NULL seems reasonable to me.

I still don't quite understand the validation issue, however. My understanding is that a Record only gets validated when it enters the database (via POST) or is updated (via PUT). To me, it seems desirable to complain loudly in those two cases if a reference doesn't resolve properly. When else might we expect validation to get triggered?

@ddohler (Contributor):

My initial concern with validation was that we currently have as an invariant that `jsonschema.validate(record.schema.schema, record.data)` (made-up function) is always true. This would make that no longer an invariant, because some parts of the schema would be dynamic based on database state, but I didn't have a more concrete worry than that. Now that you've narrowed things down to creation and update (which I think is correct), my primary concern is about the update case -- it doesn't seem very nice to allow a record to silently get into an invalid state, such that an unsuspecting user could try to make an edit unrelated to the external reference and find themselves receiving validation errors. I think that having ON DELETE equivalents would still solve the issue, however.
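
(As an aside, python-jsonschema's actual argument order is instance first, then schema; a self-contained toy version of that invariant, with stand-in values for the made-up attributes:)

```python
import jsonschema

# Stand-ins for record.schema.schema and record.data:
schema = {"type": "object", "properties": {"title": {"type": "string"}}}
data = {"title": "Sample Poster"}

# Passes silently while the invariant holds; raises ValidationError otherwise.
jsonschema.validate(data, schema)
```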

Follow-up comment:

This may be a case where the JSON schema standard hasn't quite caught up with the usage we want, but it's under consideration for draft 4. I also stumbled across some good discussion here: json-schema-org/json-schema-spec#514 regarding delegation versus inclusion for reference targets.

@ddohler (Contributor):

I think that's probably okay then; it looks like they're up to Draft 7 now: http://json-schema.org/specification-links.html#draft-4. It's very confusing that they just call everything drafts; I'm unclear whether any of these have been finalized.

Follow-up comment:

That is confusing. The json-editor library README claims compatibility with both versions 3 and 4, so it's unclear where they intend to stand on reference behavior then.

As an aside, I just noticed a Vue JSON schema validator exists. It's linked to in the schema docs, here. Maybe that would be helpful to look at for some of the front-end work.

Follow-up comment:

The PR that closed the delegation-vs-inclusion issue was just merged eighteen days ago, so this aspect does seem to be under current development.


> #### Lookups
>
> Since Record-to-Record foreign keys would store UUIDs (as strings) for the
@ddohler (Contributor):

I'd like to see some more detail here; we do some queries in DRIVER that are quite a bit more complex than the API lookups that the base Grout API offers. I don't think we necessarily need to be able to accomplish exactly the same queries with external references as we can accomplish with internal references, but it would be helpful to know where the boundaries are.

In particular, the query I'm thinking of is for generating some of the analysis that DRIVER provides: https://github.com/WorldBank-Transport/DRIVER/blob/master/app/data/views.py#L332

That query is a bit dense, but the gist of it is that it allows you to ask questions like "for each different vehicle type, how many crashes involved vehicles of that type during each of the past 12 months?" Currently, that can be done with a single (complicated) query (well, two, I think, one to get the schema up front). You could imagine wanting to ask a similar question about external references as well, for example, "For each Event genre (assuming you have some sort of classification system for Events), how many posters were recorded over each of the past 12 months?"

The reason I'm asking this is that I think a lot of people reach for a normalized data model by instinct, whereas Grout currently only offers a denormalized data model. If we add normalized data model capabilities, I think one thing that'll be important is to have a clear understanding for when to choose one over the other, and knowing if there are any limitations on the analysis you can do with normalized data will be a big part of developing that understanding.
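
To give a feel for what the external-reference version of that aggregation might look like, here is a hedged sketch that joins Records through a UUID stored in the jsonb blob. Every table, column, and field name here (`grout_record`, `uuid`, `occurred_from`, `data`, `posterDetails`, `eventGenre`, `genreDetails`) is an assumption for illustration -- this is not the DRIVER query:

```python
# Sketch only: "for each Event genre, how many Posters were recorded over
# each of the past 12 months?", with the genre reference stored in jsonb.
import psycopg2

QUERY = """
SELECT genre.data->'genreDetails'->>'title' AS genre,
       date_trunc('month', poster.occurred_from) AS month,
       count(*) AS posters
  FROM grout_record AS poster
  JOIN grout_record AS genre
    ON genre.uuid::text = poster.data->'posterDetails'->>'eventGenre'
 WHERE poster.occurred_from >= now() - interval '12 months'
 GROUP BY genre, month
 ORDER BY genre, month
"""

conn = psycopg2.connect("dbname=grout")
with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for genre, month, n in cur.fetchall():
        print(genre, month, n)
conn.close()
```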

@jeancochrane (Contributor, Author):

I appreciate the thoroughness here -- I breezed past this section sort of assuming it was solved, so it's helpful to have some concrete problems to look at. I don't have much to say about it now since I'm still getting my head around the particular problem, but I agree that knowing the limits of this approach for querying ahead of time is critical.

@jeancochrane (Contributor, Author) left a comment:

Thanks for your detailed review on this, @ddohler! You identified a lot of areas that I hadn't put enough thought into, which was exactly what I was hoping for. I left some replies to your comments, but if writing about this PR is becoming a big time sink, you should feel free to table it for now -- I'd be happy to have a quick meeting at some point this week to clarify our understanding together.

"title": "RecordType of the related Record",
"type": "string",
"format": "select",
"enumSource": [["Will be dynamically overwritten"]]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the $ref pattern you sketched out a lot -- really nice that it can express the idea in pure JSONSchema!

It seems like you're suggesting implementing a kind of limited ON DELETE standard, which would make a lot of sense. At the very least, allowing the equivalents of CASCADE and SET NULL seems reasonable to me.

I still don't quite understand the validation issue, however. My understanding is that a Record only gets validated when it enters the database (via POST) or is updated (via PUT). To me, it seems desirable to complain loudly in those two cases if a reference doesn't resolve properly. When else might we expect validation to get triggered?

Null pointers are indeed a problem with Record-to-Record foreign keys, but this
problem is already endemic to Grout: if a field is removed from a schema during
an upgrade to a new version (i.e. a definition of a new RecordSchema), Grout
silently permits the field to remain but become unaccessible. From the reverse
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! That's an important distinction. So a more accurate breakdown of "accessibility" is:

Field type is accessible is filterable
Normal field X X
Deleted field X
Deleted Record-to-Record reference

perspective, if a field is _added_ in the new schema, old records will not store
values for that field, and so will not appear in the relevant filters.

While not precisely a "null pointer," the issue is functionally the same: some data
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point! When I was first investigating the data model, I put some thought into versioning for Records (notes here: azavea/ashlar#80). Reading back on that, it seems to imply that Records and RecordTypes cannot be deleted, they can simply have an active value of True or False. I wonder if this is already exposed in the data collector frontend? If it's truly impossible for a user to fully delete a Record (without calling a manual SQL update), would that address some of your concern?


#### Lookups

Since Record-to-Record foreign keys would store UUIDs (as strings) for the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the thoroughness here -- I breezed past this section sort of assuming it was solved, so it's helpful to have some concrete problems to look at. I don't have much to say about it now since I'm still getting my head around the particular problem, but I agree that knowing the limits of this approach for querying ahead of time is critical.

a `RefResolver` class that it uses in the `validate` function for [resolving external
references](https://python-jsonschema.readthedocs.io). It seems relatively
straightforward to extend this resolver to check references by pinging the Grout
API.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's indeed what I'm proposing here, and your criticism is on point: to move in this direction would be to tightly integrate with the particular implementation of python-jsonschema. Your proposal to use the $ref attribute instead seems like a much better approach.

- `jsonb_path_ops` indexes, which only support the containment operator `@>`
but are much faster (least flexible, most performant)

Since we currently only use the containment operator in Grout anyway, I see no
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good to know! I'll flesh out this section a little bit with information about Django's implementation. I think it would make sense to use the default index type, and then consider using jsonb_path_ops if performance becomes a serious concern.

@jeancochrane (Contributor, Author) commented Jul 18, 2018:

Your comments about normalized data models got me thinking. For reference, here's what you said in the review:

> The reason I'm asking this is that I think a lot of people reach for a normalized data model by instinct, whereas Grout currently only offers a denormalized data model. If we add normalized data model capabilities, I think one thing that'll be important is to have a clear understanding for when to choose one over the other, and knowing if there are any limitations on the analysis you can do with normalized data will be a big part of developing that understanding.

This struck me as very true, and had me wondering: How could I accomplish what I wanted without a normalized data model? Is there an inherent hierarchy to the data that is relevant to the demo app?

I took a moment to sketch the data model in normalized/denormalized fashions. In an ideal world, here's what the normalized data model would look like (apologies for the ad hoc pseudo-data-model syntax):

EventSeries {
    title
}

PosterSeries {
    title
}

Event {
    date/time
    location
    series -> EventSeries
}

Poster {
    date/time
    location
    image
    series -> PosterSeries
    event -> Event
}

To my surprise, the data model is decently hierarchical. It fits nicely in a denormalized model:

EventSeries {
  title
  events [
    Event {
      date/time
      location
      posterseries [
        PosterSeries {
          title
          posters [
            Poster {
              date/time
              location
              image
            }
            Poster {
              date/time
              location
              image
            }
          ]
        }
      ]
    }
  ]
}

Within each of the nested objects, you could imagine many children -- i.e. an EventSeries can contain multiple Events, an Event can contain multiple PosterSeries, and a PosterSeries can contain multiple posters (as it does in the example above).

So, it seems there's a nonrelational way to store the information I care about in this app. Immediately, a few questions jump out at me:

  1. What information do we lose by denormalizing the data model? Aggregate queries will become more expensive, particularly for geospatial questions -- e.g. "How many Posters are within a mile of their Event?" (see the sketch after this list). But in general, I'm surprised by how much of the model is captured by a pure hierarchy.

  2. Is this possible under the current Grout data model? Not really. While we could potentially convert the data model above into a RecordSchema for a single parent RecordType (EventSeries), there are a few problems we would have to address:

    • The schema relies on nested objects, which are not supported in the schema editor
    • The schema stores geospatial information inside nested objects, as opposed to on the top-level of the Record
      • This violates the expectations of the data model, and also isn't supported in the schema editor
  3. If this isn't possible under the current Grout data model, is it more difficult to accommodate this type of work, or to allow foreign keys? I go back and forth on this one. Curious to hear what you think!
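
To make question 1 concrete, here is the "Posters within a mile of their Event" query sketched under the normalized model, where each Poster stores its Event's UUID in jsonb. As above, the table/column/field names and the presence of PostGIS are assumptions:

```python
# Sketch only: counts Posters within a mile (~1609 m) of their referenced Event.
import psycopg2

QUERY = """
SELECT count(*)
  FROM grout_record AS poster
  JOIN grout_record AS event
    ON event.uuid::text = poster.data->'posterDetails'->>'event'
 WHERE ST_DWithin(poster.geom::geography, event.geom::geography, 1609)
"""

conn = psycopg2.connect("dbname=grout")
with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    print(cur.fetchone()[0])
conn.close()
```

Under the denormalized model, the same question would first require unnesting Posters and Events out of the JSON hierarchy, which is part of the trade-off question 3 is weighing.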

@flibbertigibbet left a comment:

Great discussion.

This can be revisited later as our understanding of needs evolves.

@jeancochrane (Contributor, Author) commented Jul 31, 2018:

I integrated many of our discussions into the ADR (and linked to the threads where appropriate), and updated the "Implementation" section with a more robust sketch of the schema that can validate using $refs.

I think I'm feeling comfortable moving forward with this, but I'm also unsure how to prioritize it. It seems like this feature is critical for the demo app, but potentially not a major feature for prospective new users -- my inclination is that the latter is more important than the former in our remaining fellowship time. I'm leaning in the direction of prioritizing other work to improve the schema editor (#21 and #29) before embarking on this journey.

@ddohler (Contributor) left a comment:

This is great! I agree with you that this puts us in a good place to move forward with implementation if we want to. I added a couple more issues that I thought of with this scheme, but I don't think they are dealbreakers.

I also agree that given the amount of time we have left and the fact that the demo app is in a pretty good place, it may make more sense to lock in some concrete (and important!) usability improvements rather than trying to tackle a major new feature that may end up incomplete or rushed.

> all Records for a RecordType regardless of whether or not they're active could
> potentially be confusing for an end user doing data collection work, both
> because it might be hard to tell which Records are active and because the
> endpoint may return overwhelmingly large lists of Records. Records
@ddohler (Contributor):

Another potential issue is that the referents.json endpoint could end up timing out or blowing up the memory of the JSON-Schema validator if the number of records is very large.

This also might require referents.json to be public, which could be problematic in some situations. Ordinarily I wouldn't think that a list of Record UUIDs would be an issue, but depending on the type of information being stored, even the count of active records could potentially be sensitive.

@jeancochrane (Contributor, Author):

Great points! I added two notes to this section based on your comment.

@jeancochrane merged commit eb6b62a into master on Aug 1, 2018, and deleted the foreign-key-adr branch.