Releases: kubernetes-sigs/jobset
Release v0.7.2
What's Changed
- Update docs for v0.7.0 (release branch) by @danielvegamyhre in #691
- Automated cherry pick of #705: Propagate schedulingGates set on PodTemplate when resuming by @mimowo in #706
Full Changelog: v0.7.0...v0.7.2
Release v0.7.1
What's Changed
- Update docs for v0.7.0 (release branch) by @danielvegamyhre in #691
- Automated cherry pick of #705: Propagate schedulingGates set on PodTemplate when resuming by @mimowo in #706
Full Changelog: v0.7.0...v0.7.1
v0.7.0
Highlights
- Add restart strategy by @nstogner in #686
- Priority-based exclusive placement by @ahg-g in #687
- feat: add component config by @rainfd in #609
What's Changed
- fix: delete active jobs right away when job finishes even when TTLSecondsAfterFinished is set by @CecileRobertMichon in #667
- Bump github.com/onsi/ginkgo/v2 from 2.20.0 to 2.20.1 by @dependabot in #663
- Bump github.com/prometheus/client_golang from 1.20.0 to 1.20.2 by @dependabot in #664
- Bump kubernetes dependencies to v0.31.x. by @mbobrovskyi in #670
- Bump github.com/onsi/ginkgo/v2 from 2.20.1 to 2.20.2 by @dependabot in #668
- Bump github.com/onsi/gomega from 1.34.1 to 1.34.2 by @dependabot in #669
- chore: update README.md e2e test version for v1.31.0 by @googs1025 in #671
- Add test-python-sdk on Makefile test. by @mbobrovskyi in #673
- Bump github.com/prometheus/client_golang from 1.20.2 to 1.20.3 by @dependabot in #674
- feat: add component config by @rainfd in #609
- Bump the kubernetes group with 6 updates by @dependabot in #675
- Add global-job-replicas label/annotation by @GiuseppeTT in #677
- Add examples for three existing failure policy actions. by @jedwins1998 in #601
- Bump github.com/prometheus/client_golang from 1.20.3 to 1.20.4 by @dependabot in #679
- chore: use symbolic link instead of directory by @googs1025 in #630
- Priority-based exclusive placement by @ahg-g in #687
- Bump github.com/prometheus/client_golang from 1.20.4 to 1.20.5 by @dependabot in #688
- Add restart strategy by @nstogner in #686
New Contributors
- @CecileRobertMichon made their first contribution in #667
- @rainfd made their first contribution in #609
- @GiuseppeTT made their first contribution in #677
- @nstogner made their first contribution in #686
Full Changelog: v0.7.0-devel...v0.7.0
v0.6.0
Highlights
- New JobSet Failure Policy API - allows users to configure different behavior for different types of errors, enabling them to use compute resources more efficiently and improve ML training goodput.
- Add Coordinator field to JobSet spec, enabling users to define a global coordinator pod for distributed ML/HPC workloads. The stable network endpoint for this pod will be added as a label and annotation to every Job and Pod in the JobSet for easy use in application code. A common use case for this is TPU Multislice training with multiple different Job templates. See linked issue for details.
- Add global Job index label/annotation to every Job and Pod, which is needed to support TPU Multislice training with multiple different Job templates. See linked issue for details.
- Added new metrics
- Improved test coverage
- Bug fixes
- New examples and documentation
What's Changed
- feat: add e2e test for ttl seconds after finished in jobset by @dejanzele in #511
- add publish not ready headless service to jobset by @kannon92 in #505
- use kube-openapi rather than code generator openapi-gen by @kannon92 in #522
- Allow passing args to ginkgo for integration tests by @danielvegamyhre in #525
- Refactor create jobs by @danielvegamyhre in #516
- Do not default the managedBy field by @mimowo in #528
- feat: add event recorder event by @googs1025 in #507
- use t.Errorf instead of t.Fatalf by @googs1025 in #532
- Fix path for the error when attempting to mutate managedBy by @mimowo in #527
- Fix bug when checking if a JobSet is active during tests. by @jedwins1998 in #531
- Correct typo in configurable failure policy KEP. by @jedwins1998 in #539
- fix: fix ci error caused by typo by @googs1025 in #544
- Bump the kubernetes group with 4 updates by @dependabot in #542
- Bump github.com/onsi/gomega from 1.32.0 to 1.33.0 by @dependabot in #543
- docs: fix site url not found by @googs1025 in #541
- use hugo param to define variables in md language by @googs1025 in #540
- add unit tests for createHeadlessSvcIfNecessary by @dejanzele in #526
- test: add pod controller unit test by @googs1025 in #490
- Add comment explaining why we don't unconditionally compute firstFailedJob by @danielvegamyhre in #549
- Bump github.com/onsi/ginkgo/v2 from 2.17.1 to 2.17.2 by @dependabot in #552
- Track which features in roadmap have been released by @danielvegamyhre in #554
- docs: using kustomize for adjusting resources by @omerap12 in #558
- Bump github.com/onsi/gomega from 1.33.0 to 1.33.1 by @dependabot in #560
- Don't reconcile JobSets with deletion timestamp set by @danielvegamyhre in #562
- Improve the API generated docs for managedBy by @mimowo in #565
- chore: Upgrade e2e local image by @googs1025 in #567
- Bump github.com/onsi/ginkgo/v2 from 2.17.2 to 2.17.3 by @dependabot in #569
- Add support for feature gates by @googs1025 in #557
- Implement configurable failure policy. by @jedwins1998 in #537
- Update the JobSet version to 0.5.1 for installation by @mimowo in #577
- Bump github.com/onsi/ginkgo/v2 from 2.17.3 to 2.19.0 by @dependabot in #581
- Relax validation on ReplicatedJob PodTemplates of suspended JobSets by @danielvegamyhre in #580
- update makefile kind version to v1.30.0 by @googs1025 in #589
- Propagate Job pod template updates to suspended jobs when resuming by @danielvegamyhre in #590
- docs: update to v0.5.2 by @googs1025 in #593
- fix: fix log to avoid panic by @googs1025 in #595
- avoid log panic by @googs1025 in #598
- Add omitempty to annotation of OnJobFailureReasons. by @jedwins1998 in #596
- update readme docs e2e test version to v1.30 by @googs1025 in #602
- Update _index.md `MASTER_ADDR` by @song-william in #604
- Add client-go example by @danielvegamyhre in #606
- Wait for the webhook service to be listening before advertising the Jobset replica as ready. by @mbobrovskyi in #608
- docs: add simple example for network field by @googs1025 in #550
- feat: add terminalState to jobset status by @googs1025 in #594
- Integration test improvement: rename "update" to "step" by @danielvegamyhre in #610
- docs: add argo workflow example for jobset by @googs1025 in #612
- docs: add JobSet API reference by @googs1025 in #611
- docs: fix typo, Github -> GitHub by @highpon in #615
- Allow mutating schedulingGates when the Jobset is suspended by @mimowo in #623
- Add Coordinator field to JobSet spec by @danielvegamyhre in #618
- Validation for Coordinator field by @danielvegamyhre in #627
- Add example for coordinator by @danielvegamyhre in #628
- docs: add prometheus-operator example for jobset by @googs1025 in #629
- Bump github.com/onsi/gomega from 1.33.1 to 1.34.0 by @dependabot in #631
- Bump github.com/onsi/ginkgo/v2 from 2.19.0 to 2.19.1 by @dependabot in #632
- feat: add metrics for jobset by @googs1025 in #614
- docs: update metrics info for site by @googs1025 in #633
- chore: add github issue, pr template by @googs1025 in #634
- Bump github.com/onsi/gomega from 1.34.0 to 1.34.1 by @dependabot in #638
- fix error output by @googs1025 in #636
- Bump k8s dependencies to 1.30 dependencies and modify update-codegen.sh to be compatible with new code-generator by @danielvegamyhre in #641
- Fix bug in replicatedJobByName by @danielvegamyhre in #645
- Allow to update JobSets on suspend by @mimowo in #644
- Refactor jobset webhook by @danielvegamyhre in #646
- add the unparam linter to golangci and fix those issues flagged by @kannon92 in #643
- drop job-name from labels as it is not used by @kannon92 in #642
- Bump github.com/onsi/ginkgo/v2 from 2.19.1 to 2.20.0 by @dependabot in #647
- Add new job-id annotation to assign globally unique job index to each job by @danielvegamyhre in #650
- Bump github.com/prometheus/client_golang from 1.19.1 to 1.20.0 by @dependabot in #653
- update to k8s 0.30.4 by @kannon92 in #654
New Contributors
- @mimowo made their first contribution in #528
- @omerap12 made their first contribution in #558
- @song-william made their first contribution in #604
- @mbobrovskyi made their first contribution in #608
- @highpon made their first contribution in #615
Full Changelog: v0.6.0-devel...v0.6.0
JobSet v0.5.2
What's Changed
- Automated cherry pick of #580: relax validation on replicated jobs by @danielvegamyhre in #584
- Automated cherry pick of #590: propagate job pod template updates to suspended jobs when by @danielvegamyhre in #591
Full Changelog: v0.5.1...v0.5.2
v0.5.1
Highlights
- Fixed bug causing foreground cascading deletion policy to not work properly on JobSets #562
- Fixed field path in error message in validation for ManagedBy field #527
- Test coverage improvements, refactoring, additional documentation
What's Changed
- Update docs for 0.5.0 by @danielvegamyhre in #517
- [Release-0.5] Do not default the managedBy field by @kannon92 in #533
- Automated cherry pick of #527: Fix path for the error when mutating managedBy by @kannon92 in #534
- Automated cherry pick of #562: don't reconcile jobsets with deletion timestamp set by @danielvegamyhre in #564
Full Changelog: v0.6.0-devel...v0.5.1
v0.5.0
What's Changed
Highlights
- JobSet TTL support added in #443
- Docsite is live at https://jobset.sigs.k8s.io/ with updated documentation and examples.
- Include first failed job name in event emitted when JobSet fails, to speed up the debugging process for large complex workloads #477
- Lower default resource request for JobSet controller manager so it fits on default cloud CPU VMs, but keep high limit to support maximum performance #480
- Perform only 1 JobSet status update per reconcile attempt to reduce pressure on k8s apiserver #494
- Introduced ManagedBy field to the JobSet spec to enable Multi-Kueue support
Detailed release notes
- Add info to landing page by @danielvegamyhre in #435
- Validate follower pod owned by same Job as leader pod by @danielvegamyhre in #433
- Bump github.com/stretchr/testify from 1.8.4 to 1.9.0 by @dependabot in #439
- Add descriptions to ReplicatedJobStatus fields by @danielvegamyhre in #442
- Bump github.com/onsi/ginkgo/v2 from 2.15.0 to 2.16.0 by @dependabot in #444
- Add JobSet diagram and other doc updates by @danielvegamyhre in #446
- Update installation version to latest release in public docs by @danielvegamyhre in #450
- add concept image by @moficodes in #454
- Update tasks documentation by @danielvegamyhre in #453
- Emit Job creation failed event by @danielvegamyhre in #448
- Remove Jobset Docs from root by @moficodes in #455
- Fix 404 error when clicking on driver-worker-success-policy.yaml by @kannon92 in #456
- Rename FAQ to troubleshooting on docsite by @danielvegamyhre in #457
- Bump the kubernetes group with 4 updates by @dependabot in #459
- Add features overview to README by @danielvegamyhre in #452
- Update Makefile rules to use more specific paths by @danielvegamyhre in #470
- Fix typo in readme by @danielvegamyhre in #472
- Add jobset roadmap to README by @danielvegamyhre in #468
- Bump github.com/onsi/gomega from 1.31.1 to 1.32.0 by @dependabot in #475
- Bump github.com/onsi/ginkgo/v2 from 2.16.0 to 2.17.1 by @dependabot in #474
- update golang to 1.22 by @kannon92 in #471
- Lower default resource request for controller manager but keep high limit by @danielvegamyhre in #480
- Include first failed job name in event emitted when JobSet fails, as well as the JobSet failure condition by @danielvegamyhre in #477
- Update README.md to correct concepts link by @jtorrex in #486
- Code cleanup and refactoring by @danielvegamyhre in #484
- Move headless service creation outside of createJobs by @danielvegamyhre in #483
- Remove Duplicate Import by @jedwins1998 in #488
- Introduce `managedBy` field and Remove `managed-by` label by @jedwins1998 in #487
- fix some typo error by @googs1025 in #489
- Move JobSet webhook into same webhooks package as pod webhook by @danielvegamyhre in #460
- add unit test for jobset webhook updates by @kannon92 in #464
- feat: add support for ttl cleanup for finished jobsets by @dejanzele in #443
- Add unit tests to jobset success policy functions by @zhifei92 in #501
- fix: add IsNotFoundErr when get headlessSvc by @googs1025 in #503
- Update envtest and add back crd generation when updating the api by @kannon92 in #510
- Call Status.Update once in each reconcile attempt by @danielvegamyhre in #494
- Clean up outdated comments by @danielvegamyhre in #512
- Bump sigs.k8s.io/controller-runtime from 0.17.2 to 0.17.3 in the kubernetes group by @dependabot in #513
- Update docs for 0.5.0 by @danielvegamyhre in #517
New Contributors
- @jtorrex made their first contribution in #486
- @jedwins1998 made their first contribution in #488
- @zhifei92 made their first contribution in #501
Full Changelog: v0.5.0-devel...v0.5.0
v0.4.0
What's Changed
- Update main branch installation docs for release v0.3.0 by @danielvegamyhre in #349
- use kind export logs by @kannon92 in #352
- add suspend to replicated job status by @kannon92 in #250
- Update the installation docs to mention the CPU nodes minimum necessary CPU/memory resources by @danielvegamyhre in #354
- Use jobset-system instead of kind-system for jobset by @kannon92 in #358
- A KEP for StartupPolicy by @kannon92 in #244
- Add patches for Kustomize to add objectSelectors to pod webhook configurations by @danielvegamyhre in #362
- Update installation docs for v0.3.1 [main] by @danielvegamyhre in #368
- Bump k8s.io/apimachinery from 0.28.4 to 0.28.5 by @dependabot in #369
- Bump github.com/open-policy-agent/cert-controller from 0.10.0 to 0.10.1 by @dependabot in #373
- Bump k8s.io/api from 0.28.4 to 0.28.5 by @dependabot in #370
- Bump k8s.io/code-generator from 0.28.3 to 0.28.5 by @dependabot in #371
- Bump k8s.io/client-go from 0.28.4 to 0.28.5 by @dependabot in #372
- Bump github.com/onsi/ginkgo/v2 from 2.13.2 to 2.14.0 by @dependabot in #376
- update kind to 0.20.0 by @kannon92 in #359
- Bump k8s.io/code-generator from 0.28.5 to 0.28.6 by @dependabot in #382
- Bump github.com/onsi/gomega from 1.30.0 to 1.31.1 by @dependabot in #383
- Bump k8s.io/client-go from 0.28.5 to 0.28.6 by @dependabot in #384
- upgrade kubernetes apis to 0.29 by @kannon92 in #387
- Move exclusive placement annotation to ReplicatedJob template by @danielvegamyhre in #389
- add dependabot groups for k8s packages by @kannon92 in #391
- add a message to events by @kannon92 in #390
- Migrate from background to foreground cascading deletion policy by @danielvegamyhre in #393
- Default service name in JobSet controller by @danielvegamyhre in #395
- bumping controller tools to see if this fixes ci by @kannon92 in #403
- add suspend field to printcolumn by @kannon92 in #400
- add jobset docsite by @moficodes in #402
- KEP 262: Configurable Failure Policy API by @danielvegamyhre in #381
- Get subdomain via a func instead of defaulting it on the jobset object by @ahg-g in #404
- Bump the kubernetes group with 1 update by @dependabot in #406
- Startup policy implementation by @kannon92 in #246
- Minor cleanup to ensureConditionOpts by @ahg-g in #410
- Validate longest pod name for jobset will not exceed 63 chars by @danielvegamyhre in #409
- Add managed-by label support. by @trasc in #407
- Improve error messages and logging in webhooks by @danielvegamyhre in #421
- Update installation docs for v0.3.2 by @danielvegamyhre in #424
- typo: Fix some comments by @googs1025 in #426
- Bump the kubernetes group with 5 updates by @dependabot in #431
- Update docsite title and subtitle by @danielvegamyhre in #432
New Contributors
- @moficodes made their first contribution in #402
- @trasc made their first contribution in #407
- @googs1025 made their first contribution in #426
Full Changelog: v0.4.0-devel...v0.4.0
JobSet v0.3.2
What's Changed
- Automated cherry pick of #403: bumping controller tools to see if this fixes ci by @danielvegamyhre in #416
- Automated cherry pick of #393: add backoff for creation by @danielvegamyhre in #417
- Automated cherry pick of #395: default service name in controller by @danielvegamyhre in #418
- Automated cherry pick of #404: Get subdomain via a func instead of defaulting it on the by @danielvegamyhre in #419
- Automated cherry pick of #409: validate longest pod name for jobset will not exceed 63 chars by @danielvegamyhre in #420
- Automated cherry pick of #421: add clearer error message for pod name too long by @danielvegamyhre in #423
Full Changelog: v0.3.1...v0.3.2
JobSet v0.3.1
What's Changed
- Automated cherry pick of #362: add webhook patches by @danielvegamyhre in #365
Full Changelog: v0.3.0...v0.3.1