Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling STAC extension version upgrades #448

Closed
duckontheweb opened this issue Jun 16, 2021 · 18 comments
Closed

Handling STAC extension version upgrades #448

duckontheweb opened this issue Jun 16, 2021 · 18 comments
Labels
breaking Represents a potentially breaking change that may warrant a major version release discussion An issue to capture a discussion
Milestone

Comments

@duckontheweb
Copy link
Contributor

duckontheweb commented Jun 16, 2021

Moving here from stac-spec Discussions thread.

Now that STAC extension versions are no longer tied to the core STAC version or to each other, we may run into situations where a user is working with catalogs that implement some combination of a core spec version and extension versions that are not all supported by the same version of PySTAC.

Here are a few ideas on how to handle this, but other suggestions are welcome. It would also be good to get feedback from maintainers of downstream libraries on what would work best.

1. Separate Extension Repos

STAC extension implementations would each have their own repo that would follow its own versioning (e.g. pystac-extension-label, pystac-extension-file, etc.). This would probably also require creating a pystac-extensions-base package that would be a dependency of each extension implementation.

Pros

  • More accurately reflects the structure of the spec with extensions separate from the core spec
  • Developers can mix and match any combination of extension versions they need
  • No need to upgrade PySTAC version with each extension version update
  • Simpler PySTAC core code base

Cons

  • Means a lot more repos in stac-utils org
  • Could lead to dependency conflicts if different extension implementations are using different versions of pystac-extensions-base
  • Developers need to explicitly install each extension they want to use (or maybe we have an all-extensions extra in the pystac install that grabs all of them...)
  • Sounds like a lot of work and may delay a 1.0 release

2. Version-specific Extension Classes

Instead of a single set of extension classes (e.g. EOExtension, ItemEOExtension, AssetEOExtension), we create a set of classes for each extension version (e.g. EOExtension_1_0_0, etc.). We still have a top-level extension class (e.g. EOExtension) with an ext method that would handle parsing the extension schema URI and delegating to the right version-specific class.

Pros

  • Fewer repos to maintain
  • Developers will continue to get extension support within PySTAC without needing to install other extension-specific dependencies
  • May be able to limit the outward-facing changes if the behavior of ext stays mostly the same as it is now
  • Might be able to implement this incrementally (i.e. only for extensions that have actually undergone an upgrade)?

Cons

  • More complex extension code base within PySTAC (may need to create version-specific modules within each extension to keep the files from getting too bloated).
  • Typing may be difficult and/or confusing (e.g. calling EOExtension.ext and getting an instance of EOExtension_1_0_0 back might be non-intuitive).

3. Only Support Latest Version (Status Quo)

We continue to update the existing extension code to support each new extension version as it is released and do not provide support for multiple extension versions within the same version of PySTAC.

Pros

  • Lowest development effort
  • Simpler extension code than other 2 options

Cons

  • Does not solve the issue raised above. It would be on users to find a workaround.

cc @lossyrob @m-mohr @gadomski @jbants @matthewhanson

@duckontheweb duckontheweb added the discussion An issue to capture a discussion label Jun 16, 2021
@duckontheweb duckontheweb added this to the 1.0.0-rc.1 milestone Jun 16, 2021
@gadomski
Copy link
Member

Some questions:

  • Do any of the currently-supported extensions add dependencies to PySTAC (i.e. if we split up to separate repos, would install_requires get smaller)?
  • Right now, are extensions implemented with a plugin and/or namespace structure such that third party extensions can hook into the PySTAC API? Not saying they need to be and I'm not sure if that'd be particularly useful, but just curious.

Speaking from a stactools perspective, the best case scenario would be a PySTAC that came with "batteries included"; it would work with all core extensions without needing to depend on additional repositories. The version thing is a little trickier -- I think it's non-ideal if downstream code has to add logic to deal with different extension versions, that feels like a PySTAC thing to me. I think Option 3 (Only support latest version) is my initial instinct, with maybe some additional super-powers to "bring forward" older versions to the latest version when it makes sense.

@lossyrob
Copy link
Member

Answers:

  • No extensions add dependencies; PySTAC sticks to metadata so there shouldn't be reason to add dependencies for extensions and we've been aggressive in keeping the dependency footprint as small as possible
  • Extensions are implemented as a plugin architecture. They were more so prior to the 1.0 refactor, but now there's a light(er) weight hooks that an extension can register (example hooks and registering) that extensions written inside or outside of the main project can use to hook into PySTAC inner workings in order to accomplish extension logic (e.g. the label extension adds a "source" link which should be treated as a linked STACObject when PySTAC is resolving a catalog).

with maybe some additional super-powers to "bring forward" older versions to the latest version when it makes sense.

PySTAC already migrates things to the latest supported version, including extensions. That's part of the hooks for an extension - it provides the extension the ability to upgrade an object to the latest version.

I agree Option 3 is the way to go; the additional maintenance burden and code spread of Option 1 I think would outweigh the burden to upgrade to new versions of extensions as they are released. PySTAC has gone with an approach of "use the latest version to read any version, but write the latest version. If you need to write older versions, use older versions of PySTAC". I think that also applies here. Option 2 might be a better solve, but would be a lot more complex and potentially an even greater maintenance burden, and would confuse the "always upgrade to latest" approach PySTAC takes now - if we do this for extensions, why not for core objects? Maybe Option 2 is a direction we should explore if there's enough use cases where upgrading to the latest (or needing to upgrade PySTAC to a newly released version of an extension) is a big issue for users.

@duckontheweb
Copy link
Contributor Author

That all makes sense to me. If we're comfortable with Option 3, at least for now, I won't include this in the RC1 milestone. I mostly wanted to make sure we decided on any breaking changes prior to that release.

I've been a little uncomfortable with the fact that we automatically migrate objects when deserializing them (although I also see the benefit of it). I think that deserves a discussion of its own, however, so I'll open a separate issue for that.

@duckontheweb duckontheweb removed this from the 1.0.0-rc.1 milestone Jun 17, 2021
@emmanuelmathot
Copy link
Contributor

emmanuelmathot commented Jun 18, 2021

@duckontheweb to answer your question in previous discussion: In DotNetStac, I implemented Option 3 with a versioning mechanism reading previous versions (if implemented) with upgrading functions to latest version.

@m-mohr
Copy link
Contributor

m-mohr commented Jun 18, 2021

FYI: I basically do the same in STAC Browser, all data is automatically converted to the latest version (although STAC Browser only does read and no write, of course). This is done through stac-migrate, but I have not implemented e.g. updating file extension 1.0 to 2.0 (+ raster 1.0) for example.

@vincentsarago
Copy link
Member

👋 That's an interesting discussion.

From a user POV, right now I'm a bit worried. I wanted to use pystac+extensions but I'm afraid about having to keep an eye on either pystac version and version of the extension within each extension submodule.

I was a bit surprised to find all extensions from stat-extensions available in pystac, and while this is nice it might get messy in the future. (IMO, I would have deferred the management of each extension to each repo but some maintainers might not use python at all).

I'm 👍 on Option 3, because it seems the simplest (😬 I'm not a contributor so you can ignore me).

@duckontheweb
Copy link
Contributor Author

One possible hiccup to option 3 is the case where we cannot fully migrate fields from a previous extension version into the most recent version. The File Info Extension may be an example of this (see discussion in #472). In that case, some of the v1.0.0 fields were dropped in v2.0.0 and moved into the Raster Extension. However, there was no guarantee that those fields were being applied to a raster file originally so it may not be appropriate to create a Raster Band for them.

This does not necessarily break anything because they can still access the fields directly via extra_fields, but it might lead to some confusion.

@gadomski
Copy link
Member

gadomski commented Feb 1, 2022

@duckontheweb resurfacing this issue as something that should probably be addressed as part of a v2.0 release. As the STAC ecosystem matures the domain of extension versions will only grow, and if we want people to use PySTAC's extension API (instead of going directly to properties), we should be able to gracefully handle older/newer versions without throwing. This problem is cropping up now in the wild w.r.t. the raster extension (initially brought to my attention by @Kirill888).

@Kirill888
Copy link

I think that exact version match approach will have significant negative impact. Effectively this makes it really hard to evolve extensions. One would need to upgrade an entire collection every time a new version of an extension your collection uses changes. Even if that change is a pure addition of new optional fields. Even if I could downgrade/upgrade installed extension versions independently per extension, it would not solve the problem of using multiple data sources within the same environment, not all collections will be in sync and sometimes you want data from many data sources.

If my currently installed software supports extension A-1.0 and I encounter data that has A-1.1, I would expect to be able to read A-1.0 fields out of the extension. And similarly A-1.1 code should be able to parse A-1.0 data.

Newer code/older data case is certainly possible, even if you have no constraints at all on format evolution, in the worst case you have older implementations included as renamed copies in the newer code that newer code can dispatch to, and possibly include on-the-fly translations (extra code to support migration will be needed).

Older code/newer data case is harder/more dangerous. For that to work one would need to agree up front on what changes are allowed as you go from A-1.0 to A-1.1. IF one promises to not change existing fields between minor version upgrades and only allow additions of new fields, then older code can safely parse newer data and you are not punished for using latest and greatest version. This would enable easier evolution of extensions as you are not forcing software or data upgrades while still allowing to capture more metadata as new use-cases are discovered. It does require some extra planning and more careful format evolution, but I don't think it will cost a lot in terms of complexity of the implementation. If you want to allow more complex format evolution than just field addition, while still maintaining forward compatibility, then you might choose to add some more complexity in the extension code, like for example being careful at interpreting some enumeration types with an idea that you might encounter newer values in newer versions and handling those gracefully. But that's a choice that can be made per-extension.

At the very least it should be possible to force extension code to interpret whatever data is present even if it's not quite the right version without having to patch stac_extensions property in code or on disk.

@duckontheweb duckontheweb added this to the 2.0.0 milestone Feb 2, 2022
@duckontheweb duckontheweb added the breaking Represents a potentially breaking change that may warrant a major version release label Feb 2, 2022
@duckontheweb
Copy link
Contributor Author

It sounds like there is pretty broad interest in supporting multiple extension versions within a single version of PySTAC, and I tend to agree that we should not force an extension upgrade behind the scenes. Adding support for new extensions and new versions of extensions is often mixed in with other bug fixes and feature enhancements as part of a single release. This means that if someone upgrades their PySTAC version to take advantage of some new feature or fix they may also be railroaded into using a newer version of one of the extensions. Users may have a good reason for sticking with an older version of an extension (e.g. maintaining consistency across an existing API/catalog) and shouldn't have to go to great lengths to do so just because they updated their PySTAC version.

All of that being said, I'm still struggling with how to implement this in a maintainable way within the library. It seems like we would need separate PropertiesExtension classes for each version within an extension. If we assume that each extension is following semantic versioning then we might be able to re-use some code by having minor and patch versions inherit from and extend a major version base (e.g. LabelExtension_1_0_1 inherits from LabelExtension_1_0_0). I could be wrong, but it seems like there could be a potential for the code to get very complicated very quickly. Does anyone have thoughts on a better implementation?

@gadomski
Copy link
Member

gadomski commented Feb 2, 2022

I agree that explicitly enumerating supported extension versions would be both unwieldy, and wouldn't really solve the problem of unsupported extension versions in the wild unexpectedly blowing up in a user's face.

One idea could be to leverage extension prefixes to collect any fields not explicitly supported by PySTAC's supported extension version; this could be coupled with warning to alert the user to the presence of extra fields. Something like:

foo = FooExtension.ext(item)  # warning: unsupported extension field(s): `foo:bar, foo:cheers`
assert foo.additional_attributes == { "foo:bar": "baz", "foo:cheers": True }

...or something like that?

@Kirill888
Copy link

I'm not familiar with PySTAC codebase so can't comment on the implementation options, but from the user perspective I would like to see the following, note that this is a "wish list" and "an ideal scenario for evolution". The ideal ideal scenario is that everything is defined so well at the start that no change is needed at all, but that's not a thing that anyone can deliver. At the very least these are the things to keep in mind when suggesting a format change or designing an implementation.

On the software side of things

  • Properties that were exposed on extension A-1.0 remain available on A-1.1, possibly with deprecation warnings
    • So for example if a field is renamed on the access side, I should still be able to access it by an old name
    • If a field was renamed "on the wire", it should still be exposed with an old name as an alias, at least for a little while. I realize that this means that extension code will reflect not just the current state but will also have to encode some amount of the history. But that change should really be bounded.
  • Code written against A-1.1 should not need to change when reading A-1.0 data.
    • If a field now has a different name, I should be able to access it with the new name even if it was stored with an old one.
    • If a field was optional before but is now compulsory, and my A-1.0 data doesn't have it and there is no sensible default, then error out.
  • Best effort support for reading newer data with older software
    • Ideally, guaranteed "forward compatibility", at least for "particularly important" extensions. This requires constraints on what kind of data format change is allowed when moving from 1.0 to 1.1, and maybe some extra code to gracefully handle some kinds of future changes, like extending enumeration types, or having extra fields in some dictionaries (no blind call_something(**x)). But I feel like those kind of guards should be present anyway to better support corrupt inputs.
    • Not concerned about missing out on the new fields or even knowing they are present: when reading 1.1 data with 1.0 software, all I care is that 1.0 data can be accessed.
    • For people who want better data integrity guarantees one could have a "fail if data version is newer than software version" configuration option of sorts. But I don't think this should be on by default, not for 1.0->1.1 kinda change.

On the format side

  • Extending is fine, but change of existing fields is undesirable
    • Completely new fields (easiest)
    • New values, like new enumeration types or extra keys in dictionaries, requires some future proofing in code, but no much more than dealing with corrupt inputs.
  • Tightening is probably fine: making previously optional fields compulsory, or reducing acceptable input types to a smaller set.
  • Renames not so good as they can't be possibly supported by older software, unless you duplicate the data with an old name as well.
  • Would be nice to have FROZEN set of attributes defined, fields that can be relied to stay stable going forward.
    • if that can be encoded in software, even better.
  • Similarly to previous: some indication that field is still "experimental" and is likely to be refined in the future versions.

@m-mohr
Copy link
Contributor

m-mohr commented Jun 10, 2022

I've also recently surfaced this again when writing stuff for a stactools package where I has to use older (file) and newer (raster) extension versions. I'm also having this issue in the JS ecosystem (stac-migrate). A solution should be discussed in the stac-spec community in general, I think.

What should not happen is that is you read an older version that it throws an error. Then PySTAC should at least handle it as an unknown extension (via stac_extensions and extra_fields directly) instead of using the (incompatible) extension interface. But ideally it would try to migrate somehow, I think. Isn't that already done with core STAC?

Kirill888 added a commit to opendatacube/odc-stac that referenced this issue Jul 24, 2022
newer version of raster extension breaks tests.

We'll need to work around overly strict version
checking in pystac library or just not use extension
classes and look for data in dicts directly, it's
not realistic to expect all data and all software
to always use compatible versions in any given
installation.

stac-utils/pystac#448
Kirill888 added a commit to opendatacube/odc-stac that referenced this issue Jul 24, 2022
newer version of raster extension breaks tests.

We'll need to work around overly strict version
checking in pystac library or just not use extension
classes and look for data in dicts directly, it's
not realistic to expect all data and all software
to always use compatible versions in any given
installation.

stac-utils/pystac#448
@gadomski gadomski mentioned this issue Jan 9, 2023
10 tasks
@jsignell
Copy link
Member

I've been messing around with the interface from objects to extensions (#1051) and I think that if we change the invocation, option 2 becomes a little more attractive. In that world the user really doesn't care what the Extension class is called at least.

With the idea of registering extensions we could theoretically open the door for custom extensions to write their own classes and register them with the pystac objects. I think there isn't a big risk of namespace collision because the prefix probably has to be pretty unique already and that would be the namespace.

@gadomski
Copy link
Member

gadomski commented Apr 5, 2023

If we needed more motivation to fix the extension situation: opendatacube/odc-stac#105.

@gadomski
Copy link
Member

Some thoughts from today's stac-utils working group meeting:

  • Folks are using the extension classes (as opposed to just working with dictionaries) and so there is value in fixing this problems
  • Documentation needs to be improved on how to use/write extensions
  • We proposed the idea of using the extension maturity to keep/split extensions from core PySTAC. Perhaps only stable extensions live in PySTAC, and other extensions live in their own repos or in the actual extension repo. This model will force us to make an extension architecture that is easily used by outsiders, since our own non-stable extensions will be "outsiders"

@gadomski gadomski modified the milestones: 2.0, 1.9 Sep 27, 2023
@gadomski
Copy link
Member

@jsignell now that #1231 is merged, is there anything else that we need to do for this ticket? We'll just update the extension implementations as new versions are released, and between migration (for old versions) and permissively accepting future versions (thanks to #1231) we're hopefully good?

@jsignell
Copy link
Member

yeah I think that's right. We can go ahead and close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Represents a potentially breaking change that may warrant a major version release discussion An issue to capture a discussion
Projects
No open projects
Status: Done
Development

No branches or pull requests

8 participants