Adds CUDA_MODULE_LOADING=EAGER to core jax container env vars #329

terrykong · 2023-10-23T22:06:39Z

No description provided.

@ashors1

author Yu-Hang Tang <Tang.Maxin@gmail.com> 1698050497 +0000
committer Terry Kong <terryk@nvidia.com> 1701417045 -0800

pip-compile changes

Updated t5-large perf (#342)

Update Pax README and sub file (#345)

- Adds FP8 documentation
- Updates perf table
- Makes some other minor improvements for readability

Adds CUDA_MODULE_LOADING=EAGER to core jax container env vars (#329)

Re-enable NVLS in nightly containers (#331)

NVLS was disabled due to a known issue in NCCL 2.17 that caused intermittent hangs. The issue has been resolved in NCCL 2.18, so we are safe to re-enable NVLS.

---------

Co-authored-by: Terry Kong <terryk@nvidia.com>

Update Pax TE patch to point to rebased branch (#348)

Loosens t5x loss tests relative tolerances (#343)

Relaxes the relative tolerance on the loss tests, since the previous tolerance was producing too many false positives. For reference, deviation in loss for the t5 model can sometimes be up to 15% at the start of training with real data.

Adds rosetta-t5x TE + no-TE tests that enable the correct configs for testing (#332)

- [ ] Add capability to retroactively test with newer test-t5x.sh like in [t5x-wget-test](https://github.com/NVIDIA/JAX-Toolbox/tree/t5x-wget-test)
- [ ] Sets `ENABLE_TE=1` in the Dockerfile.t5x which is identical to the logic from before where it was always enabled in rosetta-t5x

Fix markdown hyperlink for jax package on frontpage readme (#319)

Adds a --seed option to test-t5x.sh to ensure determinism (#344)

To ensure that the test results for a particular container are reproducible between runs, this change introduces a seed argument that sets the jax seed and dataset seed to 42. It remains configurable, but now there shouldn't be variance given the same container.

- Also fixes a typo where --steps-per-epoch wasn't in the usage doc of this script

Co-authored-by: NVIDIA <jax@nvidia.com>
Co-authored-by: Yu-Hang "Maxin" Tang <Tang.Maxin@gmail.com>

Dynamic workflow run names (#356)

This change introduces the dynamic [run name field](https://github.blog/changelog/2022-09-26-github-actions-dynamic-names-for-workflow-runs/#:~:text=GitHub%20Actions%20customers%20can%20now,visit%20the%20GitHub%20Actions%20community.) `run-name`. It's currently difficult on mobile to find the "workflow_run" that corresponds to a particular date, so hopefully this helps identify which builds were nightly vs which builds were manually triggered.

I couldn't find a good way to dynamically look up the `name` field, so for now I copied all of the names. I also wasn't able to find a "created_at" for the scheduled workflows, so those don't have timestamps for now.

__Assumptions__:

* "workflow_run" == nightly since "scheduled" events only happen on `main` and `workflow_run` are only run for concrete workflows and not reusable workflows

- [x] Test the workflow_run codepath
- [x] Test the scheduled codepath

![image](https://github.com/NVIDIA/JAX-Toolbox/assets/7576060/4b916452-334a-4a73-9220-9fbadc70462f)

Fix random failling tests for backend_independent on V100 (#351)

Fixes random failures in the backend-independent section of JAX unit tests:

```
Cannot find a free accelerator to run the test on, exiting with failure
```

Changes: limit the number of concurrent test jobs even for backend-independent tests, which do create GPU contexts. As a clarification, `--jobs` and `--local_test_jobs` do not make a difference for our particular CI pipeline, since JAX is built in a separate CI job anyway.

References (From Reed Wanderman-Milne @ Google):

> 1. In particular, you have to set NB_GPUS, JOBS_PER_ACC, and J correctly or you can get that error (I recently got the same error by not setting those correctly)
> 2. (also I think --jobs should be --local_test_jobs in that code block, no reason to restrict the number of jobs compiling JAX)

Propagate error code in ViT tests (#357)

Merges rosetta unit tests and takes off the marker which spun up another matrix job (#360)

This should simplify the rosetta tests and save some time since another matrix job was started for one test

Propagate build failures (#363)

Always run the `publish-build` step, regardless of whether the rosetta pax/t5x build was attempted. This ensures that badges correctly reflect build failures due to dependent builds failing.

Patch for JAX core container (ARM64) (#367)

Add patch to XLA to be able to build JAX core container for ARM64

Update the doc for USE_FP8 (#349)

This PR provides guidance on how to use the new configuration option, `USE_FP8`, to enable native FP8 support on Hopper GPUs.

Update the native-fp8 guide with cudnn layer norm (#368)

This PR updates the guide to include the new flag to enable the cudnn layer norm.

cc. @ashors1 @terrykong @nouiz

Add WAR for XLA NCCL bug causing OOMs (#362)

A stopgap for #346

fix TE multi-device test
fix lzma build issue
edit TE test name
fix TE arm64 test install error
remove --install option from get-source.sh
fix TE arm64 test install error
disable sandbox
i'm jet-lagged
use Pax image for TE testing
Fix job dependency

…#329)" This reverts commit 1657890.

#842) Prefer to set fewer magic variables. Note that these values were anyway not used inside e.g. the JAX unit test environment, so this was a source of inconsistency. Eager loading cuDNN/cuBLAS during XLA compilation can also be noticeably slow. See documentation here: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-environment-variables Replaces #831. Reverts #329. Co-authored-by: ashors1 <ashors@nvidia.com>
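To make the trade-off described above concrete, here is a minimal Python sketch of choosing the loading mode per process instead of baking it into the container environment. It assumes the three values the CUDA programming guide lists for `CUDA_MODULE_LOADING` (`DEFAULT`, `LAZY`, `EAGER`). The helper names `env_with_module_loading` and `run_with_module_loading` are hypothetical and are not part of JAX-Toolbox.

```python
import os
import subprocess
import sys

# Values assumed from the CUDA programming guide's environment-variable table.
ALLOWED_MODES = {"DEFAULT", "LAZY", "EAGER"}


def env_with_module_loading(mode=None, base_env=None):
    """Return a copy of the environment with CUDA_MODULE_LOADING set to `mode`.

    mode=None removes the variable so the CUDA driver uses its own default,
    which is the behaviour the revert (#842) restores for the container.
    """
    env = dict(os.environ if base_env is None else base_env)
    if mode is None:
        env.pop("CUDA_MODULE_LOADING", None)
        return env
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported CUDA_MODULE_LOADING value: {mode!r}")
    env["CUDA_MODULE_LOADING"] = mode
    return env


def run_with_module_loading(argv, mode=None):
    """Run `argv` as a child process with the chosen loading mode.

    The variable has to be in place before the child initialises CUDA,
    so it is passed through the child's environment, not set afterwards.
    """
    completed = subprocess.run(argv, env=env_with_module_loading(mode), check=False)
    return completed.returncode
```

A job that still wants eager loading could opt in with something like `run_with_module_loading([sys.executable, "train.py"], mode="EAGER")`, where `train.py` is a placeholder script name. Every other process in the container then keeps the driver default.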

Adds CUDA_MODULE_LOADING=EAGER to core jax container env vars

04a4ab4

terrykong requested review from nouiz and yhtang October 23, 2023 22:06

nouiz approved these changes Oct 24, 2023

yhtang approved these changes Oct 25, 2023

terrykong merged commit 1657890 into main Oct 27, 2023
65 of 70 checks passed

terrykong deleted the rosetta-env-vars branch October 27, 2023 06:46

ashors1 pushed a commit that referenced this pull request Nov 16, 2023

Adds CUDA_MODULE_LOADING=EAGER to core jax container env vars (#329)

84d9ef9

olupton mentioned this pull request May 17, 2024

CUDA_MODULE_LOADING: use default lazy loading #831

Closed

olupton added a commit that referenced this pull request May 21, 2024

Revert "Adds CUDA_MODULE_LOADING=EAGER to core jax container env vars (…

d4ca050

…#329)" This reverts commit 1657890.

olupton mentioned this pull request May 21, 2024

Revert "Adds CUDA_MODULE_LOADING=EAGER to core jax container env vars" #842

Merged
