Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Latest: by SemVer or by timestamp? #132

Closed
aguschin opened this issue Apr 19, 2022 · 7 comments · Fixed by #244
Closed

Latest: by SemVer or by timestamp? #132

aguschin opened this issue Apr 19, 2022 · 7 comments · Fixed by #244
Assignees
Labels
question Further information is requested

Comments

@aguschin
Copy link
Contributor

gto which rf prod should show the latest version of rf promoted to prod. There are two approaches on selecting the latest one: by SemVer and by timestamp.

E.g. if I promoted v1.0.0 first, and then promoted v0.0.1: latest by SemVer would be v1.0.0, while latest by timestamp would be v0.0.1.

Promoting non-registered versions makes things more interesting. E.g. if I promoted version from commit abc1234 and didn't register a version from that commit - the latest by timestamp would be abc1234 now.

The question is: what should we show by default in Studio and in CLI?

Since we treat SemVer as a number that positively correlates with model quality and stability, the default latest in my mind should be v1.0.0. But then if you have a single registered version of model in stage prod and then for some reason abandoned registering SemVer versions and just promote, you'll always have that SemVer in gto latest and gto show. As a more simple example, a user may be confused after promoting abc1234 because it doesn't appear in gto show by default or because gto latest doesn't return it.

As an option to resolve this we can keep --sort-by-semver default and add flag --sort-by-time that will give the recent promoted version/ref.

WDYT? cc @omesser @dberenbaum @dmpetrov @shcheklein

@aguschin aguschin added the question Further information is requested label Apr 19, 2022
@aguschin aguschin added this to the First GTO release milestone Apr 19, 2022
@aguschin
Copy link
Contributor Author

I assume that for latest version of an artifact, we want to sort by SemVer. Although we can add --sort-by-time and in the example above show v0.0.1 as more recently registered version.

A special question: if user registered 0 SemVer versions of his artifact, do we want to show (in Studio) commits in which the model was promoted to some stages as "versions" (i.e. "implicit" versions)? Or should we show empty fields? Also, do we want to show latest "implicit" version in "latest version" column or it should be empty?

And what happens if he stopped registering SemVer versions at some point? Do we stop updating versions?

Considering my assumptions, it would look like this:

# repo setup
$ gto register rf $REF1 v1.0.0
$ gto register rf $REF2 v0.0.1
$ gto promote rf $REF1 prod
$ gto promote rf $REF2 prod
$ gto promote rf abc1234 prod

# latest output
$ gto latest rf
v1.0.0
$ gto latest rf --sort-by-time
v0.0.1
$ gto latest rf --include-implicit  # "--include-implicit" already means "--sort-by-time"
abc1234

# which output
$ gto which rf prod
v1.0.0
$ gto which rf prod --sort-by-time
v0.0.1
$ gto which rf prod --include-implicit
abc1234

The same flags will apply to gto show.

Trying to make summary: I think we should treat gto show as a entrypoint to the repo. Thus it shouldn't show any non-explicit versions by default I think. But if you promote some hexsha while not registering version, it will still trigger CI and you'll act in that CI. Even though it's not shown in gto show (it will be shown in gto show artifact-name and gto history btw).

@omesser
Copy link
Contributor

omesser commented Apr 20, 2022

My opinions:

  • sort by time or semver - It's important to give both, obviously by semver order is more meaningful in most cases. latest hints at time though. Suggest gto latest rf prod (for time order) and gto greatest rf prod (for semver order)
  • behavior of which - Because stages are supposed to give us kanban-like information, I honestly don't see a lot of value in gto which rf prod just displaying 1 version/ref, if there are more in prod! this is also effecting studio design and is a production decision. I feel like "show me what is in production stage" should show a list, showing the latest puts us back in the stage=env/deployment/setup which we said is not the correct association. So we must not behave as though promoting v1.2.3 into prod stage means that v.1.2.2 is no longer in prod stage. so I think which (can also be ls) should show a list, and latest|greatest are clear enough as commands
  • refs/commits - Unpopular opinion - Why do we need to allow registering commits/ref at all? supporting them meaning we need to decide on an ordering behavior for them for "greatest" (=latest) which is...confusing, and mixing those with semver will always be messy. if you want semver order but have a mix output of semver and refs there is no correct way to show this

@aguschin
Copy link
Contributor Author

aguschin commented Apr 20, 2022

Why do we need to allow registering commits/ref at all?

Basically, because users may want to do that. And for sure they may create rf#prod git tag without creating rf@v0.0.1 in the same commit. Should we then ignore those commits? Don't think so, because it's unintuitive behaviour.

I honestly don't see a lot of value in gto which rf prod just displaying 1 version/ref.

This is what MLflow MR do btw with mlflow.pyfunc.load_model(model_uri=f"models:/{model_name}/{stage}"). I assumed we need to add --all flag to which to show all versions. We can make this default behaviour instead and add some flag to output the first one only.

Also, this could be achieved with gto show rf --stage prod --name-only. Both flags aren't implemented, but they could be. Or even just gto show rf prod --name-only. May be this is a good replacement for which, since the default is to show all versions.

if you want semver order but have a mix output of semver and refs there is no correct way to show this

😐

@aguschin
Copy link
Contributor Author

@dmpetrov AFAIR you were the proponent of allowing to skip mandatory SemVer registration when promoting when we discussed this few months. Do you still think it shouldn't be mandatory?

@omesser
Copy link
Contributor

omesser commented Apr 21, 2022

About why allow registering commit/refs:

"because users may want to do that. And for sure they may create rf#prod git tag without creating rf@v0.0.1 in the same commit."

Consider this from my response in our discussion today - if the user uses gto to manage registry, then, the actual tags are implementation details. User shouldn't care or lose anything if, when promoting a "commit", gto would also be tagging their commit with a semver tag and allocating a "version" for their commit implicitly. On the contrary - adding a tag with a commit sha looks like pure clutter on the repo.

Approahces in order of correctness (IMO):

  1. Opinionated / single flow - The ideal solution would be to not allow promoting an unregistered version (you promote a semver only). If it's not registered it's not important enough to be part of the registry. if it is important enough, on the other hand - it deserves a "name" - a release / semver version from the user. Best be explicit and not guess it as well
  2. Aliased flow - if the aim is to allow gro promote nn production without gto register nn v0.0.1 (save some typing) - it can be done by implicitly registering upon promotion with the next semver patch version.
  3. Allowing "promoting" commits/refs - It makes it vague what "registering" means. Also, having to deal with mixing refs and semvers in output, sorting being incorrect / inconsistent, adding flags to allow users to see only semver/only-commit/both. Risking breaking upstream tools for the user if parsing gto output directly - will probably expect semver library to parse semvers, and suddently a commit sha will break it.

@aguschin
Copy link
Contributor Author

@omesser, getting back to GTO since MLEM release is on the way 😅

What you say makes sense to me, and I would like to agree considering the simplicity. But with each "strict" option we need to understand how GTO behaves if a user breaks it. If we don't allow a user to promote a model in a commit without registering it (explicitly in your option1 / implicitly in option2), then what GTO should do if a user creates that git tag by himself? I see these options:

  1. Ignore - pretend that promotion didn't happen. Don't show the promotion anywhere. Have a $ gto doctor cmd (or $ gto check-registry-correctness or anything else) that will let user know that he needs to register a semver.
  2. Issue warnings each time user calls some command. (Hey, you have a promotion, but don't have a version, go and register it!)
  3. Break each time user calls some command. Ask to register the version to fix this.
  4. Automatically register version.

I think 1st and 2nd are preferable. But still, they could be bad since promoting git tag will trigger CI if you push it. So in fact, a user may be confused: I created a tag (though manually), I pushed it, why am I getting some errors?

Also, another case to consider. Let's suppose I registered version and made a promotion. I've pushed promoting tag, but forgot to push registering tag. Then the other person clones the repo, and he's getting these warnings. 🤔 Not sure what we can do to fix this, but looks confusing. Maybe have some GTO cmd to push git tags and check that everything is ok like @dberenbaum and you suggested, e.g. here (can't find the other one by Dave).

@aguschin
Copy link
Contributor Author

aguschin commented Aug 31, 2022

In https://github.com/iterative/studio/issues/3968#issuecomment-1209073924 we decided it should be a sort by timestamp, and not by SemVer. Now we need to

  • change the default
  • support both sorts as optional args (--latest and --greatest)

@aguschin aguschin moved this to Todo in MLEM + GTO Aug 31, 2022
@aguschin aguschin self-assigned this Sep 5, 2022
@aguschin aguschin moved this from Todo to In review in MLEM + GTO Sep 5, 2022
Repository owner moved this from In review to Done in MLEM + GTO Sep 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants