Releases: awslabs/kubeflow-manifests
v1.7.0-aws-b1.0.3
What's Changed
- Terraform-related changes
  - Terraform infrastructure and application deployments now use aws-eks-blueprints v4.32.1 to mitigate a breaking change caused by the refactoring of the aws-eks-blueprints GitHub repository. (#776)
  - Fixed passing the mysql_engine_version variable to the RDS module by @elanv in #781
  - Increased the default disk size for GPU instances and added a variable to configure the disk size. (#782)
  - Updated the RDS engine version in Terraform deployments. (#785)
  - Migrated to aws-eks-blueprints v5 for Kubernetes add-on installation. For GPU deployments, the NVIDIA operator is installed. (#779)
- Helm/Kustomize-related changes
  - Updated the AWS Load Balancer Controller's ALB policy to include AddTags permissions for Create* operations, in accordance with the new AWS policy change, by @ananth102 in #777
New Contributors
Full Changelog: v1.7.0-aws-b1.0.2...v1.7.0-aws-b1.0.3
v1.7.0-aws-b1.0.2
What's Changed
- Fixed Kubeflow on AWS Terraform deployments failing due to breaking changes in the Terraform aws-vpc module
  - The Terraform AWS provider recently deprecated resources and attributes related to EC2-Classic. This caused breaking changes for aws-vpc module versions < 5.0.0 used with Terraform AWS provider versions > 5.0.0. This release updates the aws-vpc module and pins the Terraform AWS provider version to > 5 and < 6 to prevent such issues in the future.
  - Upgraded the Terraform aws-vpc module by @amitkalawat in #751
  - Pin Terraform AWS provider to 5.x.x by @ananth102 in #755
- Installation script fixes for cognito-rds-s3-static Helm deployments by @sagi-shimoni in #749
New Contributors
- @amitkalawat made their first contribution in #751
- @sagi-shimoni made their first contribution in #749
Full Changelog: v1.7.0-aws-b1.0.0...v1.7.0-aws-b1.0.2
v1.7.0-aws-b1.0.1
What's Changed
- Fixed a manifest issue with Pipelines affecting all S3 deployments that use static credentials, by @jsitu777 in #710, #716.
  - If you are using IRSA as PIPELINE_S3_CREDENTIAL_OPTION, you are not affected by this issue.
- Added support for automated EFS deployment in Terraform deployments and updated the EKS Blueprints version to v4.31.0 by @rrrkharse in #731
- Fixed the load balancer auto-setup script when the load balancer scheme is internal, by @rrrkharse in #732
- Documentation for creating additional profiles when using IRSA as PIPELINE_S3_CREDENTIAL_OPTION by @ryansteakley in #700, #722
- Documentation fixes for the KServe and RDS-S3 guides (#720, #728)
Full Changelog: v1.7.0-aws-b1.0.0...v1.7.0-aws-b1.0.1
v1.7.0-aws-b1.0.0
What’s New
This release offers the following features:
- Added support for Kubeflow v1.7.0.
  - Upstream Kubeflow component versions as listed in the components versions table
- Support for IAM Roles for Service Accounts (IRSA) when using Amazon S3 as the artifact store for Kubeflow Pipelines
  - IRSA can be used to configure Amazon S3 as the artifact store for pipelines. IRSA lets workloads use temporary credentials to make API requests and scopes permissions at the pod level via Kubernetes service accounts. Compared with creating static IAM user credentials to access S3, IRSA follows the security best practices of least privilege and credential isolation. (#571, #601, #613, #680, #685)
  - Starting with this release, we are deprecating the use of IAM user (static) credentials in favor of IRSA for configuring S3 with Kubeflow Pipelines. We highly recommend migrating to IRSA. For more details about this change, refer to GitHub issue #704.
  - Server-side encryption and block public access are configured by default on the S3 bucket used by Kubeflow Pipelines, as a security best practice (#517, #518)
- Support for using IRSA with KServe InferenceServices. Use this feature to pull images from a private ECR repository or to load models directly from an S3 bucket.
- Support for using Amazon S3 as an object store backend for TensorBoard. Users can now visualize TensorBoard-compatible logs that are stored in S3 and published by model servers and training jobs (including TrainingJobs run on SageMaker). This lets users track experiment metrics such as loss and accuracy, visualize the model graph, and more.
- Added the ability to annotate the service account using the AWSIAMforServiceAccount plugin. Use this feature if your organizational policies restrict you from using the profile controller to update IAM policies.
  - Setting annotateOnly to true in the AWSIAMforServiceAccount plugin only annotates the service account in the user profile and skips mutating the IAM policy.
- Support for configuring Amazon S3 as a remote backend for storing Terraform state (#674)
- Support for configuring automatic stopping of idle Jupyter notebook servers
  - Enabled support for notebook culling. Users can save infrastructure costs by configuring notebook servers to stop after they stay idle for a specified period of time. (#470)
- Updated notebook containers with the latest AWS-optimized Deep Learning Containers (DLC) based on TensorFlow 2.12.0 and PyTorch 2.0.0 (#676)
- Updated training and inference containers with the latest AWS-optimized Deep Learning Containers (DLC) based on TensorFlow 2.12 and PyTorch 2.0. Supports CPU- and GPU-based single-node training, distributed training, and inference. For the latest DLC images, refer to the list of DLC images.
- Updated the following drivers to newer versions:
  - FSx CSI Driver to v0.9.0
  - EFS CSI Driver to v1.5.4
  - AWS Load Balancer Controller to v2.4.7
- Updated SageMaker Operators for Kubernetes (ACK) to v1.2.1
  - The Training Job resource now supports managed warm pools, heterogeneous clusters through instance groups, and retry strategies.
  - Added support for SageMaker Pipeline and Pipeline Execution.
  - The Training Job resource now supports update operations.
  - Support for deployment guardrails for the Endpoint resource.
  - Support for serverless endpoints for the Endpoint Config resource.
  - Support for retaining AWS resources after CR deletion.
- Supports the latest versions of Amazon EKS - eks-compatibility
- Support for Kustomize v5.0.1
- Bugfixes and improvements to the automated scripts
Updated documentation available at: https://awslabs.github.io/kubeflow-manifests/release-v1.7.0-aws-b1.0.0/docs/
Known Issues:
Full Changelog: release-v1.6.1-aws-b1.0.2...release-v1.7.0-aws-b1.0.0
v1.6.1-aws-b1.0.2
What's Changed
- Release v1.6.1 aws b1.0.2 by @ryansteakley in #666
- Update secrets-csi-driver to v1.3.2 by @ryansteakley in #651
- Update EKS Blueprints to v4.28.0 by @ryansteakley in #659
Known Issues
- #118 (Workaround documented in issue)
Full Changelog: v1.6.1-aws-b1.0.1...v1.6.1-aws-b1.0.2
v1.6.1-aws-b1.0.1
What's Changed
- Katib S3-only Helm Path fix by @jsitu777 in #507
- Update RDS engine version by @techwithshadab in #584
- Update terraform-aws-blueprints to v4.12.1 by @ghaering in #516
Known Issues
Update Existing Kubeflow Installations with RDS or S3 integrations
On February 6th, the Kubernetes project announced changes to k8s.gcr.io, the existing community-owned image registry that hosts its container images. On the 3rd of April 2023, the old registry k8s.gcr.io will be frozen, and no further images for Kubernetes and related subprojects will be pushed to it. The Kubernetes community recommends switching to the new registry, registry.k8s.io, as soon as possible. For more information, read the community blog.
Only the Secrets Store CSI Driver in the AWS Distribution of Kubeflow is affected. To update the image registry to point to the new registry.k8s.io, please follow the instructions documented in the GitHub issue comment below.
Full Changelog: v1.6.1-aws-b1.0.0...v1.6.1-aws-b1.0.1
v1.6.1-aws-b1.0.0
What’s New
This release offers the following features:
- Added support for Kubeflow v1.6.1.
  - Component versions as listed in the components versions table
- Updated SageMaker Operators for Kubernetes (ACK) to version 0.4.5
- Updated notebook containers with the latest deep learning containers based on TensorFlow 2.10.0 and PyTorch 1.12.1 (#473)
- Includes all the features from v1.6.0-aws-b1.0.0 (Preview)
- Integration of SageMaker with Kubeflow to run hybrid machine learning workflows using SageMaker Operators for Kubernetes (ACK) and SageMaker Components for Kubeflow Pipelines. Documentation
- Added support for Infrastructure as Code (IaaC) 1-click deployment for Kubeflow on AWS using Terraform (preview)
- Added Helm support for all supported deployment options (preview)
- Added integration with Prometheus, Amazon Managed Service for Prometheus, and Amazon Managed Grafana to monitor metrics with Kubeflow on AWS. Documentation
- Automated deployment options have been simplified and made more stable (user-friendly make commands, deterministic install/uninstall, etc.)
- Integration with AWS Deep Learning Containers to run distributed training and inference workloads
- Supports newer versions of EKS - eks-compatibility
- Added NVIDIA GPU support in Terraform (#396)
- Enabled KFP Visualizations and Artifact Store with S3 as source (#456)
- Bugfixes and improvements to the automated scripts
Updated documentation available at: https://awslabs.github.io/kubeflow-manifests/release-v1.6.1-aws-b1.0.0/docs/
Known Issues:
Update Existing Kubeflow Installations with RDS or S3 integrations
On February 6th, the Kubernetes project announced changes to k8s.gcr.io, the existing community-owned image registry that hosts its container images. On the 3rd of April 2023, the old registry k8s.gcr.io will be frozen, and no further images for Kubernetes and related subprojects will be pushed to it. The Kubernetes community recommends switching to the new registry, registry.k8s.io, as soon as possible. For more information, read the community blog.
Only the Secrets Store CSI Driver in the AWS Distribution of Kubeflow is affected. To update the image registry to point to the new registry.k8s.io, please follow the instructions documented in the GitHub issue comment below.
Full Changelog: release-v1.6.0-aws-b1.0.0...release-v1.6.1-aws-b1.0.0
v1.6.0-aws-b1.0.0 (Preview)
What’s New
This is a preview release for Kubeflow v1.6. The Kubeflow working groups have identified some regressions in v1.6.0 that will be addressed in v1.6.1. More details can be found here.
This release offers the following features:
- Added support for Kubeflow v1.6.0. Component versions as listed in the components versions table
- Integration of SageMaker with Kubeflow to run hybrid machine learning workflows using SageMaker Operators for Kubernetes (ACK) and SageMaker Components for Kubeflow Pipelines. Documentation
- Added Helm support for all supported deployment options
- Automated deployment options have been simplified and made more stable
- Added support for Infrastructure as Code (IaaC) 1-click deployment for Kubeflow on AWS using Terraform (preview)
  - Terraform stacks added for all supported deployment options
    - Creates a VPC and EKS cluster
    - Creates S3 buckets, RDS instances, and/or Cognito resources as needed
    - Configures and deploys Kubeflow
    - Configured using EKS Blueprints for improved customizability/extensibility
- Configurable S3 endpoint for the S3 and RDS-S3 deployment options, allowing PrivateLink users and users in non-commercial regions to connect to their respective S3 endpoints
- Added integration with Prometheus, Amazon Managed Service for Prometheus, and Amazon Managed Grafana to monitor metrics with Kubeflow on AWS. Documentation
- Updated notebook containers with the latest deep learning containers based on TensorFlow 2.9.1 and PyTorch 1.12 (#363)
- Integration with AWS Deep Learning Containers to run distributed training and inference workloads
- Enabled use of HTTPS-only S3 buckets (#335)
- Support for EKS 1.22 and 1.23
This release includes the following bug fixes:
- Re-enabled MySQL for the S3-only pipelines deployment (#310)
Updated documentation available at: https://awslabs.github.io/kubeflow-manifests/release-v1.6.0-aws-b1.0.0/docs/
Known Issues
- #117 (Workaround documented in issue)
- #118 (Workaround documented in issue)
- The following known issues will be fixed in the next release:
Full Changelog: release-v1.5.1-aws-b1.0.2...release-v1.6.0-aws-b1.0.0
v1.5.1-aws-b1.0.2
What’s New
This release includes the following bug fixes merged as part of #373:
- Fixed S3 bucket name substitution for all S3-related deployments, i.e., Cognito-RDS-S3, RDS-S3, and S3 (#333)
  - See #336 for more details about the issue. This bug was introduced in v1.5.1-aws-b1.0.1.
- Fixed the missing MySQL resources in the S3-only deployment (#310)
- Enabled use of HTTPS-only S3 buckets (#244)
- Fix for RDS-S3 test (#341)
Updated documentation available at: https://awslabs.github.io/kubeflow-manifests/release-v1.5.1-aws-b1.0.2/docs/
Known Issues
- #117 (Workaround documented in issue)
- #118 (Workaround documented in issue)
- kubeflow/pipelines#7361 (Terminating a pipeline run does not trigger the deletion logic that a component implements in its signal handler. This affects all components in general, including the terminate functionality in SageMaker Components for Kubeflow Pipelines. The workaround is to manually stop the training jobs.)
Full Changelog: v1.5.1-aws-b1.0.1...v1.5.1-aws-b1.0.2
v1.5.1-aws-b1.0.1
What’s New
This release includes the following bug fixes:
- Fixed KServe's ingress gateway (#311)
- Added support for non-root EFS file ownership (#268)
- Fixed the hardcoded S3 endpoint URL in the workflow controller ConfigMap (#257)
- Add CDK created EKS cluster subnet tags for RDS script (#295)
- Doc fixes: #304, #307
Updated documentation available at: https://awslabs.github.io/kubeflow-manifests/release-v1.5.1-aws-b1.0.1/docs/
Known Issues
- #117 (Workaround documented in issue)
- #118 (Workaround documented in issue)
- kubeflow/pipelines#7361 (Terminating a pipeline run does not trigger the deletion logic that a component implements in its signal handler. This affects all components in general, including the terminate functionality in SageMaker Components for Kubeflow Pipelines. The workaround is to manually stop the training jobs.)
Full Changelog: v1.5.1-aws-b1.0.0...v1.5.1-aws-b1.0.1