# Changelog ## v4.9.0 (2025-02-11) ### Features * Add Code Owners file ## v4.8.4 (2025-02-03) ### Bug Fixes and Other Changes * account for possible race condition when creating /opt/ml/code ## v4.8.3 (2024-12-09) ### Bug Fixes and Other Changes * resolve failing unit test * avoid parsing stderr as JSON ## v4.8.2 (2024-12-06) ### Bug Fixes and Other Changes * temporarily hardcode neuron cores for trn2 ## v4.8.1 (2024-09-09) ### Bug Fixes and Other Changes * Added p5 as a supported NCCL instance ## v4.8.0 (2024-08-14) ### Features * Add support for py39 and py310 ### Bug Fixes and Other Changes * typo in the run unit tests command * run unit tests in sequence order for release process as well to prevent coverage conflicting issues * chore: removing unnecessary logging information ## v4.7.4 (2023-10-31) ### Bug Fixes and Other Changes * update the boto deps to use latest boto ## v4.7.3 (2023-10-23) ### Bug Fixes and Other Changes * bypass DNS check for studio local exec ## v4.7.2 (2023-10-19) ### Bug Fixes and Other Changes * use smddprun only if it is installed ## v4.7.1 (2023-10-17) ### Bug Fixes and Other Changes * Add NCCL_PROTO=simple environment variable to handle the out-of-order data delivery from EFA * toolkit build failure ## v4.7.0 (2023-08-08) ### Features * support codeartifact for installing requirements.txt packages ## v4.6.1 (2023-06-19) ### Bug Fixes and Other Changes * removed unused import statment * forgot to run black on torch_distributed.py after updating my comments from last commit * Modified my comment on line 98-103 in torch_distrbuted.py to comply with formatting standard. * Revert "Ran black on entire sagemaker-trianing-toolkit directory" * Ran black on entire sagemaker-trianing-toolkit directory * Ran Black (python formatter) on the files with my code updates (torch_distributed.py and test_torch_distributed.py) * Added test for neuron_parallel_compile in test_torch_distributed.py * Updated comment syntax based on feedback in pull request as well as added full example of the neuron_parallel_compile command as it would appear in the command line * added unit test for neuron_parallel_compile code change * Updated torch_distributed.py ## v4.6.0 (2023-06-15) ### Features * add smddp exception classes in mpi distribution ## v4.5.0 (2023-04-26) ### Features * add NCCL_PROTO, NCCL_ALGO environments for modelparallel jobs ## v4.4.10 (2023-04-10) ### Bug Fixes and Other Changes * unpin sagemaker version as the credential issue fixed ## v4.4.9 (2023-04-05) ### Bug Fixes and Other Changes * increase worker waiting time for ORTE proc ## v4.4.8 (2023-03-09) ### Bug Fixes and Other Changes * upagrade protobuf version for tensorflow 2.12 ## v4.4.7 (2023-03-02) ### Bug Fixes and Other Changes * Revert SMDDP collectives feature from smdataparallel runner ## v4.4.6 (2023-02-22) ## v4.4.5 (2023-01-24) ## v4.4.4 (2023-01-23) ### Bug Fixes and Other Changes * Update libraries for SMDDP collectives validation ## v4.4.3 (2023-01-18) ### Bug Fixes and Other Changes * Upgrade protobuf to prevent conflicts with smdebugger. ## v4.4.2 (2023-01-16) ## v4.4.1 (2022-12-13) ### Bug Fixes and Other Changes * Add support for p4de instances, update when FI_EFA_USE_DEVICE_RDMA flag is set to only p4d{e} instances. ## v4.4.0 (2022-12-06) ### Features * integrate SMDDP collectives into smdataparallel runner ## v4.3.2 (2022-11-29) ### Bug Fixes and Other Changes * add general exception to filter ## v4.3.1 (2022-10-27) ### Bug Fixes and Other Changes * integrate upcoming dataparallel change to modelparallel * add unit tests for torchrun launcher and collections package deprecationWarning ## v4.3.0 (2022-10-20) ### Features * Add torch_distributed support for Trainium instances in SageMaker ## v4.2.10 (2022-10-17) ### Bug Fixes and Other Changes * * feature: Add neuron cores support (#21) ## v4.2.9 (2022-09-26) ### Bug Fixes and Other Changes * Add SageMaker Debugger exceptions ## v4.2.8 (2022-09-12) ## v4.2.7 (2022-09-10) ### Bug Fixes and Other Changes * improve worker node wait logic and update EFA flags ## v4.2.6 (2022-08-18) ### Bug Fixes and Other Changes * Enable PT XLA distributed training on homogeneous clusters ## v4.2.5 (2022-08-17) ### Bug Fixes and Other Changes * relax exception type ## v4.2.4 (2022-08-15) ## v4.2.3 (2022-08-11) ### Bug Fixes and Other Changes * update num_processes_per_host for smdataparallel runner ## v4.2.2 (2022-08-10) ### Bug Fixes and Other Changes * Removed version hardcoding for sagemaker test dependency * update distribution_instance_group for pytorch ddp * specify flake8 config explicitly ## v4.2.1 (2022-07-29) ### Bug Fixes and Other Changes * handle utf-8 decoding exceptions while processing stdout and stderr streams ## v4.2.0 (2022-07-08) ### Features * Heterogeneous cluster changes ## v4.1.6 (2022-06-28) ### Bug Fixes and Other Changes * update: protobuf version to overlap with TF requirements ## v4.1.5 (2022-06-22) ### Bug Fixes and Other Changes * Fix none exception class issue for mpi ## v4.1.4 (2022-06-10) ### Bug Fixes and Other Changes * Use framework provided error class and stack trace as error message ## v4.1.3 (2022-06-03) ## v4.1.2 (2022-05-25) ### Bug Fixes and Other Changes * fix flaky issue with incorrect rc being given ## v4.1.1 (2022-04-27) ### Bug Fixes and Other Changes * missing args when shell script is used ## v4.1.0 (2022-04-05) ### Features * add back FI_EFA_USE_DEVICE_RDMA=1 flag, revert 2936f22 ## v4.0.1 (2022-01-29) ## v4.0.0 (2021-10-08) ### Breaking Changes * Add py38, dropped py36 and py2 support. Bump pypi to 4.0.0 (changes from PR #108) ## v3.9.3 ~ 4.0.0 (2021-10-07) ## Breaking Changes * Added `py38`, Removed `py36` and `py27` support ### Bug Fixes and Other Changes * Use asyncio to read stdout and stderr streams in realtime * Fix delayed logging issues * Convey user informative message if process gets OOM Killed * Filter out stderr to look for error messages and report * Report Exit code on training job failures * Prepend tags to MPI logs to enable easy filtering in CloudWatch * All the changes are from PR #108 ### Documentation Changes * Update SM doc urls * Update Amazon Licensing ### Testing and Release Infrastructure * Install libssl1.1 and openssl packages in Dockerfiles * Added `asyncio` package * Updated tests to use `asyncio` package ## v3.9.2 (2021-04-27) ### Bug Fixes and Other Changes * Reverted -x FI_EFA_USE_DEVICE_RDMA=1 to fix a crash on PyTorch Dataloaders for Distributed training ## v3.9.1 (2021-04-13) ### Bug Fixes and Other Changes * [smdataparallel] better messages to establish the SSH connection between workers ## v3.9.0 (2021-04-07) ### Features * smdataparallel enable EFA RDMA flag ## v3.8.0 (2021-04-05) ### Features * smdataparallel custom mpi options support ## v3.7.5 (2021-03-30) ## v3.7.4 (2021-03-29) ### Bug Fixes and Other Changes * Update Dockerfile to accomomdate Rust dependency. ## v3.7.3 (2021-02-02) ### Bug Fixes and Other Changes * set btl_vader_single_copy_mechanism to none to avoid Read -1 Warning messages ## v3.7.2 (2020-12-18) ### Bug Fixes and Other Changes * set btl_vader_single_copy_mechanism to none ## v3.7.1 (2020-12-17) ### Bug Fixes and Other Changes * decode binary stderr string before dumping it out ## v3.7.0 (2020-12-09) ### Features * add data parallelism support (#3) ### Bug Fixes and Other Changes * update tox to use sagemaker 2.18.0 for tests * use format in place of f-strings and use comment style type annotations ## v3.6.4 (2020-12-08) ### Bug Fixes and Other Changes * workaround to print stderr when capturing ### Testing and Release Infrastructure * use ECR-hosted image for ubuntu:16.04 ## v3.6.3.post0 (2020-11-11) ### Documentation Changes * fix typo in ENVIRONMENT_VARIABLES.md ## v3.6.3 (2020-10-26) ### Bug Fixes and Other Changes * propagate log level to aws services ## v3.6.2 (2020-08-04) ### Bug Fixes and Other Changes * check for script entry point even if setup.py is present ## v3.6.1.post1 (2020-08-03) ### Testing and Release Infrastructure * pin sagemaker<2 in test dependencies ## v3.6.1.post0 (2020-07-23) ### Documentation Changes * remove unofficially-supported environment variable ## v3.6.1 (2020-07-10) ### Bug Fixes and Other Changes * use '-bind-to none' flag to improve performance. ## v3.6.0 (2020-06-29) ### Features * persist env vars in /etc/environment for MPI processes ## v3.5.2.post0 (2020-06-29) ### Testing and Release Infrastructure * clarify feature request issue template ## v3.5.2 (2020-06-03) ### Bug Fixes and Other Changes * run Python script entry point as script and install from requirements.txt ## v3.5.1.post0 (2020-05-14) ### Documentation Changes * clean up README usage examples ## v3.5.1 (2020-05-11) ### Bug Fixes and Other Changes * Remove typing ## v3.5.0.post0 (2020-04-29) ### Testing and Release Infrastructure * Test against Python 3.7 in PR builds ## v3.5.0 (2020-04-27) ### Features * Add Python 3.7 support ## v3.4.2 (2020-04-21) ### Bug Fixes and Other Changes * Remove unused config files ### Documentation Changes * clean up README and other documentation ## v3.4.1 (2020-04-20) ### Bug Fixes and Other Changes * Remove etc directory ### Testing and Release Infrastructure * Add requirements.txt integration test in dummy container ## v3.4.0 (2020-04-16) ### Deprecations and Removals * Remove modules.download_and_install ### Bug Fixes and Other Changes * Refactor env * Refactor entry_point ### Documentation Changes * Update and add docstrings ### Testing and Release Infrastructure * Update GitHub issue and pull request templates ## v3.3.2 (2020-04-08) ### Bug Fixes and Other Changes * Refactor modules and entry_point (first pass) ## v3.3.1 (2020-04-06) ### Bug Fixes and Other Changes * Revert "change: stream stderr even when capture_error is True" * Use shlex.quote to construct bash command * Relax dependencies version requirements * Extract module to correct location in download_and_install * Upgrade psutil ### Testing and Release Infrastructure * Fix cleanup with requirements.txt functional tests * create __init__.py file for Python 2 import of protobuf during tests (#260) * Mark intermediate_output functional tests as xfail if not run on Linux ## v3.3.0 (2020-02-25) ### Deprecations and Removals * Remove serving CLI entry point ### Bug Fixes and Other Changes * Pin inotify-simple version ## v3.2.0 (2020-02-17) ### Deprecations and Removals * Remove legacy serving stack ### Features * Support specifying S3 endpoint URL ### Bug Fixes and Other Changes * Fix memory leak in gethostname and adapt len semantics to Posix ## v3.1.0 (2020-02-13) ### Deprecations and Removals * Remove beta directory ## v3.0.0 (2020-02-11) ### Breaking Changes * rename package from sagemaker_containers to sagemaker_training_toolkit ### Bug Fixes and Other Changes * modify download_and_install to work with local tarball * change scipy version pin to lower bound ## v2.6.2 (2019-12-18) ### Bug fixes and other changes * Add `scipy` to requried packages ## v2.6.1 (2019-11-30) ### Bug fixes and other changes * bug-fix: array_to_recordio_protobuf should return byte buffer instead of Stream * bug-fix: Typo in the execution-parameters routing rule ## v2.6.0 (2019-11-25) ### Features * adding support for execution_parameters endpoint for serving ## v2.5.12 (2019-11-15) ### Bug fixes and other changes * Adding support for encoding to recordio ## v2.5.11 (2019-10-29) ### Bug fixes and other changes * stream stderr even when capture_error is True ## v2.5.10 (2019-10-24) ### Bug fixes and other changes * use built-in csv library in csv encoding/decoding for correct quoted string handling. ## v2.5.9 (2019-09-25) ### Bug fixes and other changes * Patch os.path.exists for sshd ## v2.5.8 (2019-09-24) ### Bug fixes and other changes * Mark gethostname tests as xfail if run locally ## v2.5.7 (2019-09-23) ### Bug fixes and other changes * Add Pylint to development process ## v2.5.6 (2019-09-19) ### Bug fixes and other changes * Use copy when installing user module from local path * Integrate black into development process ## v2.5.5 (2019-07-31) ### Bug fixes and other changes * Update setup.py ## v2.5.4 (2019-07-30) ### Bug fixes and other changes * install user module before GUnicorn starts * include /opt/ml/code to GUnicorn PYTHONPATH ## v2.5.3 (2019-07-22) ### Bug fixes and other changes * ensure exit code is an int ## v2.5.2 (2019-07-18) ### Bug fixes and other changes * pin flake and werkzeug versions * add GPU default for MPI processes per host ### Documentation changes * fix env var in readme ## v2.5.1 (2019-06-27) ### Bug fixes and other changes * Added execution-parameters to nginx.conf.template ## v2.5.0 (2019-06-24) ### Features * entrypoint run waits for hostname resolution ## v2.4.10.post0 (2019-05-29) ### Documentation changes * fix path for training script location ## v2.4.10 (2019-05-20) ### Bug fixes and other changes * Detailed documentation for SageMaker Containers - training * download_and_extract local tar file ## v2.4.9 (2019-05-08) ### Bug fixes and other changes * add test for network isolation mode training * remove unnecessary name argument from download and extract function ## v2.4.8 (2019-05-02) ### Bug fixes and other changes * use mpi4py in MPI command for Python executables ## v2.4.7 (2019-04-30) ### Bug fixes and other changes * allow MPI options to be passed through entry_point.run ## v2.4.6.post0 (2019-04-24) ### Documentation changes * add commit message format to CONTRIBUTING.md and PR template ## v2.4.6 (2019-04-23) ### Bug fixes and other changes * update for automated releases ## v2.4.5 * bug-fix: use specified args, entry point, and env vars when creating a runner ## v2.4.4.post2 * doc-fix: Convert README to RST * doc-fix: Update README with newer frameworks using SageMaker Containers ## v2.4.4.post1 * Specify ``long_description_content_type`` in setup ## v2.4.4 * bug-fix: correctly set NGINX_PROXY_READ_TIMEOUT to match model_sever_timeout. * enhancement: remove numpy version restriction. ## v2.4.3 * bug-fix: Fix recursive directory navigation in intermediate output. ## v2.4.2 * bug-fix: Rename libchangehostname to gethostname to match POSIX function name ## v2.4.1 * feature: C extension reads hostname from resourceconfig instead of env var. ## v2.4.0 * feature: Generic OpenMPI support * bug-fix: Fix response content_type handling ## v2.3.5 * bug-fix: Accept header ANY ('*/*') fallback to default accept * feature: Add intermediate output to S3 during training * bug-fix: reintroduce ``_modules.s3_download`` and ``_modules.download_and_install`` for backward compatibility ## v2.3.4 * feature: add capture_error flag to process.check_error and process.create and to all functions that runs process: modules.run, modules.run_module, and entry_point.run ## v2.3.3 * bug-fix: reintroduce _modules.prepare to import_module ## v2.3.2 * bug-fix: reintroduce _modules.prepare for backwards compatibility ## v2.3.1 * [breaking change] remove ``_modules.prepare`` and ``_modules.download_and_install`` * [breaking change] move ``_modules.s3_download`` to ``_files.s3_download`` * feature: support for Bash commands and Python scripts ## v2.3.0 * feature: Allow for dynamic nginx.conf creation * feature: Provide support for additional environment variables. (http_port, safe_port_range and accept) ## v2.2.7 * feature: Making pip install less noisy * bug-fix: Stream stderr instead of capturing it when running user script ## v2.2.6 * feature: Make it optional for run_module method to wait for the subprocess to exit * feature: Allow additional sagemaker hyperparameters to be stored in TrainingEnv ## v2.2.5 * feature: Transformer: support user-supplied ``transform_fn`` ## v2.2.4 * bug-fix: remove request size limit correctly ## v2.2.3 * enhancement: remove request size limit ## v2.2.2 * bug-fix: Fix choosing region for S3 client ## v2.2.1 * bug-fix: Use regional endpoint for S3 clients ## v2.2.0 * [breaking change] Remove ``status_codes`` module and use ``six.moves.http_client`` instead * [breaking change] Move ``UnsupportedFormatError`` from ``encoders`` module to ``errors`` module * Return 4XX status codes for ``UnsupportedFormatError`` from default input/output handlers ## v2.1.0 * Allow for local modules to work with AWS SageMaker framework containers. * Support for training outside of AWS SageMaker Training. ## v2.0.4 * Fix output_data_dir to reference an existing directory. * Fix error message. * Make pip install verbose. ## v2.0.3 * Fix error class for user script errors. * Adding Readme. ## v2.0.2 * Improve logging * Support for hyperparameters with JSON serialized and non serialized keys altogether * Training Environment transforms to env vars * Created beta framework entrypoint * Filter SageMaker provided hyperparameters and user provided hyperparameters * Script mode * Cache module installation * Support to requirements.txt * Decoder/Encoder support for numpy, JSON, and CSV ## v1.0.4 * bug: Configuration: Change module names to string in __all__ * bug: Environment: handle hyperparameter injected by tuning jobs ## v1.0.3 * bug: Training: Move processing of requirements file out to the specific container. ## v1.0.2 * feature: TrainingEnvironment: read new environment variable for job name ## v1.0.1 * feature: Documentation: add descriptive README ## v1.0.0 * Initial commit