Define versioning mechanism #91

Closed
pierrot0 opened this issue Jun 28, 2023 · 4 comments

Comments

@pierrot0
Contributor

pierrot0 commented Jun 28, 2023

Versioning includes a few problems:

  • Versioning the Croissant format and referring to the version of the format used in a config.
  • Versioning the Croissant config: where should the config version be specified? How? What should be the guidelines here?
  • Live vs snapshot (reproducible) datasets.

Related discussion: #58
Related documentation to write: https://github.com/mlcommons/croissant/blob/main/docs/howto/versions.md

@pierrot0 added the documentation and enhancement labels on Jun 28, 2023
@PGijsbers
Contributor

PGijsbers commented Jul 5, 2023

Related, but maybe a separate discussion: I think it would also be useful to record some information about the source of the Croissant file, e.g. whether it came from an OpenML converter, a Kaggle converter, manual creation, ..., and/or which version of that software was used to generate it.

@benjelloun changed the title from "Versioning in Croissant" to "Define versioning mechanism" on Jul 12, 2023
@pierrot0
Contributor Author

Some considerations I started thinking about last week, before leaving on vacation. The list is not exhaustive, but it may help start some discussions.

These considerations mostly focus on versioning the Croissant spec.
They do not cover the versioning of dataset definitions or of checkpoint / live data, which still needs to be addressed.

Each consideration is up for discussion.

1- Do we want the same version for spec and validator / loader?
It seems to me that we would logically want separate versions: the loader API could change while the spec version doesn’t change at all. That means one of:

  • multiple repos, but that would split discussions, issues, etc.
      • might make it more difficult to change spec and tools together, although they should arguably be decoupled, since many dataset specs will live outside the repo.
      • might make it more difficult to run tests (although again, it would force decoupling).
  • multiple branches / tagging patterns (one per project within the repo): spec_v1.0.1, ml_croissant_py_v1.2.3.

I think at this point I would favor one repo, many branches / tag patterns. Maybe something like https://nvie.com/posts/a-successful-git-branching-model/ would work fine here.
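
To illustrate the "one repo, many tag patterns" option, here is a minimal Python sketch that splits tags such as spec_v1.0.1 and ml_croissant_py_v1.2.3 into a project name and a version, then picks the latest tag per project. The `<project>_v<MAJOR.MINOR.PATCH>` pattern is inferred from the two example tags above; the function names are hypothetical and nothing here is an agreed convention.

```python
import re

# Assumed tag shape, inferred from the examples: <project>_v<MAJOR>.<MINOR>.<PATCH>
_TAG = re.compile(r"(?P<project>[a-z][a-z0-9_]*)_v(?P<version>\d+\.\d+\.\d+)")


def split_tag(tag):
    """Split a tag into (project, version); raise ValueError if it does not fit."""
    match = _TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"tag does not follow <project>_v<x.y.z>: {tag!r}")
    return match["project"], match["version"]


def latest_per_project(tags):
    """Return the tag with the highest version for each project."""
    latest = {}
    for tag in tags:
        project, version = split_tag(tag)
        # Compare versions numerically, so that 1.0.10 sorts after 1.0.9.
        key = tuple(int(part) for part in version.split("."))
        if project not in latest or key > latest[project][0]:
            latest[project] = (key, tag)
    return {project: tag for project, (_, tag) in latest.items()}


if __name__ == "__main__":
    print(split_tag("ml_croissant_py_v1.2.3"))
    print(latest_per_project(["spec_v1.0.1", "spec_v1.0.10", "ml_croissant_py_v1.2.3"]))
```

With a prefix per project, the spec and the Python library can be released independently from the same repository.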

2- Each dataset config specifies the version of the Croissant format it was written for, e.g.:

{
  ...
  "@type": "sc:Dataset",
  "name": "dataset_name",
  "croissant": {
    "croissant_version": "0.1.0",
    ...
  }
}
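
As a rough sketch of how a loader might read this field, the Python snippet below extracts the declared version from a config. It assumes the nested "croissant" / "croissant_version" layout proposed above, which is still under discussion; `read_croissant_version` is a hypothetical helper, not part of any existing library.

```python
import json


def read_croissant_version(config_text):
    """Return the croissant_version declared in a dataset config (JSON text)."""
    config = json.loads(config_text)
    try:
        return config["croissant"]["croissant_version"]
    except (KeyError, TypeError) as err:
        # The layout is only a proposal, so fail with a clear message.
        raise ValueError("config does not declare croissant.croissant_version") from err


EXAMPLE = """
{
  "@type": "sc:Dataset",
  "name": "dataset_name",
  "croissant": {"croissant_version": "0.1.0"}
}
"""

if __name__ == "__main__":
    print(read_croissant_version(EXAMPLE))
```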

Related discussion: #58

3- croissant_version follows the https://semver.org/ naming rules.
Write CHANGELOG.md following https://keepachangelog.com/en/1.0.0/ guidelines.
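
A minimal Python sketch of what following SemVer could mean for a reader of configs: parse MAJOR.MINOR.PATCH and decide whether a config can be read. The compatibility policy below (same major version, config not newer than the reader, and an identical minor version while still in 0.x.y) is an assumption for illustration, not something decided in this thread. Pre-release and build suffixes are deliberately left out.

```python
import re
from typing import NamedTuple

# Core SemVer only: MAJOR.MINOR.PATCH with no leading zeros.
_SEMVER = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text):
    """Parse 'MAJOR.MINOR.PATCH' into a Version; raise ValueError otherwise."""
    match = _SEMVER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return Version(*(int(part) for part in match.groups()))


def is_compatible(config_version, reader_version):
    """Assumed policy: can a reader at reader_version load this config?"""
    config = parse_version(config_version)
    reader = parse_version(reader_version)
    if config.major != reader.major:
        return False
    if config.major == 0:
        # SemVer makes no stability promise in 0.x.y, so require the same minor.
        return config.minor == reader.minor and config.patch <= reader.patch
    # From 1.0.0 on, the reader only needs to be at least as new as the config.
    return (config.minor, config.patch) <= (reader.minor, reader.patch)


if __name__ == "__main__":
    print(parse_version("0.1.0"))
    print(is_compatible("1.2.0", "1.3.1"))
```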

Keep version 0.x.y until we are confident the spec can be used to describe many dataset types and integration with 1+ tools / platforms is about to be ready.
Release 1.0.0 just before the first tool integration.

Updating mechanism:

  • Mark the feature as deprecated in CHANGELOG.md when applicable; release a minor version.
  • Add support for the new/replacing feature if necessary, add support for such replacing features in the croissant2croissant converter (#109); release a minor version.
  • Update dataset configs using the croissant2croissant converter; release a patch version.
  • Remove support for reading the old feature; release a major version (unless we are still in 0.x.y).

Right before/after a major release (or the equivalent before 1.0.0): bump the croissant_version of the dataset configs hosted on the Croissant GitHub.
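
To make the converter step more concrete, here is a small Python sketch of what a croissant2croissant-style migration might do: move a deprecated key to its replacement and bump croissant_version. The key names old_feature / new_feature, the RENAMES table, and `migrate_config` are all hypothetical; the actual converter is tracked in #109 and may work differently.

```python
import copy

# Hypothetical rename table: deprecated key -> replacement key.
RENAMES = {"old_feature": "new_feature"}


def migrate_config(config, target_version):
    """Return a migrated copy of config; the input dict is left untouched."""
    migrated = copy.deepcopy(config)
    croissant = migrated.setdefault("croissant", {})
    for old_key, new_key in RENAMES.items():
        if old_key in croissant:
            value = croissant.pop(old_key)
            # Keep an explicit new-style value if the config already has one.
            croissant.setdefault(new_key, value)
    croissant["croissant_version"] = target_version
    return migrated


if __name__ == "__main__":
    old = {
        "name": "dataset_name",
        "croissant": {"croissant_version": "0.1.0", "old_feature": 1},
    }
    print(migrate_config(old, "0.2.0"))
```

Returning a copy keeps the original config available, which makes it easy to diff the two versions before committing the bump.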

@ccl-core moved this from Todo to In Progress in Croissant spec 1.0 on Dec 1, 2023
@benjelloun
Contributor

Can we mark this issue as resolved? I believe the documentation in the spec covers versioning of datasets:

https://docs.google.com/document/d/11E1x2rIKo_9C2Hh7pMpHtTE30iizVCWUMQ9rDysBoeA/edit?resourcekey=0-drT2urhsv5QnaBr57G0coQ&tab=t.0#heading=h.mcnm1a3kt8ci

and versioning of the Croissant format itself

https://docs.google.com/document/d/11E1x2rIKo_9C2Hh7pMpHtTE30iizVCWUMQ9rDysBoeA/edit?resourcekey=0-drT2urhsv5QnaBr57G0coQ&tab=t.0#heading=h.5qrizgdt7gey

Versioning the Python library should be handled elsewhere, in my opinion.

@pierrot0
Contributor Author

Yes, resolved.

Development

No branches or pull requests

4 participants