Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change taxonomy upsert behavior on server start #1007

Closed
mfbrown opened this issue Aug 23, 2022 · 14 comments · Fixed by #1040
Closed

Change taxonomy upsert behavior on server start #1007

mfbrown opened this issue Aug 23, 2022 · 14 comments · Fixed by #1040
Assignees

Comments

@mfbrown
Copy link

mfbrown commented Aug 23, 2022

As a user, when the Fidesctl server restarts I don't want the taxonomy upsert to update existing default taxonomy elements, so that any field level additions I've made to the taxonomy are preserved.

This means that should we add new properties with default values to the existing taxonomy in future updates, a migration script will be needed so that the new properties are applied alongside the existing, user edited, elements.

Taking a look at this case with regard to the data use element, I should be able to provide values for the properties of name, description, legal_basis, special_category, recipient, etc which by default don't have values. If a new fideslang update is released which adds a property to data_use, then when that version is included in Fidesctl, a migration script is created to add the value and it's default to the data_use element.

Any other time the server restarts and the language is not updated, only new language elements should be added. Existing elements should not be overwritten.

Acceptance Criteria

  • Given that I have a Fidesctl server, when the server restarts and there are NO taxonomy updates, then no changes to the default taxonomy are loaded.
  • Given that I have a Fidesctl server, when the server restarts and there is one or more new taxonomy objects on the same level as data category, data use, etc, then only the new objects are upserted into the taxonomy on the server and any existing extensions created by the user are maintained.
  • Given that I have a Fidesctl server, when the server restarts and there are new properties added to existing taxonomy objects, then a migration script is run to add the necessary default properties and values to the Fidesctl server.

This is meant to address this issue: #858 (comment) which is introduced by the fact that the taxonomy will be updated in the future and the state persisted potentially only in a database and not in YAML manifests which could be re-posted to the server after the default taxonomy is loaded.

@allisonking
Copy link
Contributor

From what I can tell, this will require changes to:

  • load_default_taxonomy in database.py which can no longer just call the upsert_resources function. It will instead have to do a diff of the default fideslang taxonomy with the user's taxonomy.
    • If there are no differences between the default and the user taxonomy, do nothing 🎉
    • If there are more keys in the user taxonomy than the default, don't need to do anything 🎉
    • If there are more keys in the default than the user taxonomy, add the new defaults to the user's taxonomy
    • If there are new properties (i.e., as a recent example, we added an is_default field to all taxonomy objects). I believe an alembic migration would handle this, especially if the field has a default value.
      • However, this gets trickier if the new property needs something different per row.
        • In the is_default example, we would have only wanted the default taxonomy fides keys to get the is_default=True property. A default alembic migration would give all taxonomy fields is_default=False (since the default value is False). One possibility would be to alter the alembic migration to check for the default taxonomy fides keys, and only set those to is_default=True
        • Another example would be if we added a field like nickname to all taxonomy objects. In the past, we could just edit the default taxonomy object and give everyone a relevant nickname. But in a world without the upsert, we'd have to write our own thing, probably keeping a map of fides_key: nickname somewhere, and then doing the update per key.
        • I think a reasonable way to do this would be via the alembic migrations since they already run. These would be data migrations instead of the more common schema migrations. The upside is alembic will keep track for us if the migration needs to run, and it's already part of our workflow to run them when needed. The downside is that I've always found alembic migrations difficult to test.

does this sound like what you were thinking @ThomasLaPiana ? if so, then I suppose this ticket would be about just handling when new default fields are added, since updates would be left to alembic and whatever new feature is being introduced

@ThomasLaPiana
Copy link
Contributor

@allisonking I see this and I'm chewing on it

@allisonking
Copy link
Contributor

Sounds good! I made a PR for handling the new default fields that should be ready to go while we figure out what to do about data migrations #1040

@ThomasLaPiana
Copy link
Contributor

hmmm I see, this is tricky.

I agree that alembic migrations are a logical solution here (fidesops does include data migrations already in theirs), but like you said they can be difficult to test and verify...

I still feel like somehow we should be able to fit this into the existing function, but my brain is still trying to figure out exactly how...definitely through some leveraging of the update_resources function? We can compare certain fields for changes and then run targeted updates? it's not clear to me yet

@allisonking
Copy link
Contributor

one more use case to add to this issue: do we want to be able to delete a user's default taxonomy fields? Users cannot create or delete their own default taxonomy fields (at least via the API).

If so, this could also fall in the data migration category, or we could add some logic to load_default_taxonomy to also compare the user's defaults with our defaults, and delete the ones of the user's that don't exist in ours anymore. Easier than checking for updates!

@ThomasLaPiana
Copy link
Contributor

Is there any way we could solve all of this by doing a per-object comparison? With this string of events...

  1. Server boots up
  2. Grab all data categories from the server
  3. Compare on a per-object basis with the fideslang data categories
  4. targeted field updates as needed
  5. Repeat for the other taxonomy classes

If we can get a clear per-object diff, it seems like we should be able to write the logic in a way that would handle updates

@allisonking
Copy link
Contributor

Let's consider the recent use case where we added an is_default field.

  1. Server boots up
  2. Alembic migrations run and add an is_default. It's nullable though, so this column will be blank
  3. Grab all data categories from the server. These now have is_default fields since the alembic migration already ran, and our pydantic model also now has is_default as a field
  4. Therefore I don't think comparing on a per-object basis would work, since the migration already ran and our fideslang data category already says is_default should exist
  5. Let's say we could get the comparison to work though, and we now realize there is a new is_default field. There needs to be some additional logic that says "all the fields in the default taxonomy get is_default=true and everyone else gets is_default=false." Where should this logic live? This type of logic is per use-case and not really something we can have one script for (similar to alembic data migrations—they live in their own place and have their own logic for when they are applied)
  6. Repeat for the other taxonomy classes

Does that sound right to you? If so, I think this would be more complicated than just getting a diff

@allisonking allisonking reopened this Sep 9, 2022
@allisonking
Copy link
Contributor

Status of this ticket:

  • Creating new default taxonomy fields is handled Alter taxonomy upsert behavior #1040
    • No longer overwrites user edited default taxonomy fields
  • Updating default taxonomy fields still in discussion (see above thread)
  • Deleting default taxonomy fields? Should this be handled? (see this comment for more details: Change taxonomy upsert behavior on server start #1007 (comment)) The previous upsert probably didn't handle this since it was just an upsert? What did we do when we got rid of user.provided and user.derived?

@ThomasLaPiana
Copy link
Contributor

ThomasLaPiana commented Sep 12, 2022

Let's consider the recent use case where we added an is_default field.

  1. Server boots up
  2. Alembic migrations run and add an is_default. It's nullable though, so this column will be blank
  3. Grab all data categories from the server. These now have is_default fields since the alembic migration already ran, and our pydantic model also now has is_default as a field
  4. Therefore I don't think comparing on a per-object basis would work, since the migration already ran and our fideslang data category already says is_default should exist
  5. Let's say we could get the comparison to work though, and we now realize there is a new is_default field. There needs to be some additional logic that says "all the fields in the default taxonomy get is_default=true and everyone else gets is_default=false." Where should this logic live? This type of logic is per use-case and not really something we can have one script for (similar to alembic data migrations—they live in their own place and have their own logic for when they are applied)
  6. Repeat for the other taxonomy classes

Does that sound right to you? If so, I think this would be more complicated than just getting a diff

Thanks for the breakdown! My brain is finally tracking now. Agreed this is as complicated as you thought it was.

As for when we removed those fields, I don't think we did anything actually, which isn't good. Our tests/checks ran on a fresh database, so nothing would've been flagged. Any user that came from an older version would still have those in their database

The right thing would've been to add a section to an alembic migration to check for those values and remove them if they existed (fidesops did implement data migrations, not just schema migrations, for the big fideslang change)

@allisonking
Copy link
Contributor

Here's my proposal for now:

  1. Add some logic here to handle deleting no longer used default taxonomy fields (since we are already doing a comparison with fideslang's default vs. the user's). This will then be able to handle use cases like user.derived being removed (and will also fix us having not done something about it before, I believe!)
  2. For updates to the default taxonomy, use an alembic data migration. Which means we wouldn't be able to put anything in place right now for that, other than perhaps some documentation?

@ThomasLaPiana
Copy link
Contributor

That sounds good to me! Also might be time to refactor that function a bit, as load_default_taxonomy is now more update_default_taxonomy and that function is getting complicated!

@allisonking
Copy link
Contributor

Actually it just occurred to me that perhaps we don't want to delete like this in case the user is using one of the to-be-deleted taxonomy fields. Maybe we can first check if the field to be deleted is being used in any of the user's specs. If it isn't, then maybe it's safe to delete? Otherwise... log out a deprecation warning? Or maybe just change that field to is_default=False and keep it around?

@ThomasLaPiana
Copy link
Contributor

Actually it just occurred to me that perhaps we don't want to delete like this in case the user is using one of the to-be-deleted taxonomy fields. Maybe we can first check if the field to be deleted is being used in any of the user's specs. If it isn't, then maybe it's safe to delete? Otherwise... log out a deprecation warning? Or maybe just change that field to is_default=False and keep it around?

I think its actually ok to break here. If the user bumps fides version, and it causes a break because that default item no longer exists, that seems reasonable to me? Then they would be "forced" to stay up-to-date with the current fideslang spec. Not sure how valuable that is though? Is it more valuable not to break their stuff? idk

@allisonking
Copy link
Contributor

maybe we can let @mfbrown decide :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants