Merges rosetta unit tests and takes off the marker which spun up another matrix job #360

Merged
terrykong merged 2 commits into main from rosetta-merge-tests
Nov 8, 2023

Conversation

terrykong
Contributor

This should simplify the rosetta tests and save some time since another matrix job was started for one test

@terrykong terrykong requested a review from yhtang November 6, 2023 20:23
Collaborator

@yhtang yhtang left a comment

LGTM

@terrykong
Contributor Author

Running manually since CI not stable: https://github.com/NVIDIA/JAX-Toolbox/actions/runs/6787945124

Will merge if no unknown error

@terrykong
Contributor Author

@ashors1 The ViT tests in the manual run look like they hit an OOM error for 8G1N and 8G2N. Is that expected?

@ashors1
Contributor

ashors1 commented Nov 8, 2023

> @ashors1 The ViT tests in the manual run look like they hit an OOM error for 8G1N and 8G2N. Is that expected?

Yes, those OOMs are also caused by the known XLA bug.

@terrykong terrykong merged commit a391095 into main Nov 8, 2023
51 of 54 checks passed
@terrykong terrykong deleted the rosetta-merge-tests branch November 8, 2023 17:07
ashors1 pushed a commit that referenced this pull request Nov 16, 2023
…her matrix job (#360)

This should simplify the rosetta tests and save some time since another
matrix job was started for one test
terrykong pushed a commit that referenced this pull request Dec 1, 2023
author Yu-Hang Tang <Tang.Maxin@gmail.com> 1698050497 +0000
committer Terry Kong <terryk@nvidia.com> 1701417045 -0800

pip-compile changes

Updated t5-large perf (#342)

Update Pax README and sub file (#345)

- Adds FP8 documentation
- Updates perf table
- Makes some other minor improvements for readability

Adds CUDA_MODULE_LOADING=EAGER to core jax container env vars (#329)

Re-enable NVLS in nightly containers (#331)

NVLS was disabled due to a known issue in NCCL 2.17 that caused
intermittent hangs. The issue has been resolved in NCCL 2.18, so we are
safe to re-enable NVLS.

---------

Co-authored-by: Terry Kong <terryk@nvidia.com>

Update Pax TE patch to point to rebased branch (#348)

Loosens t5x loss tests relative tolerances (#343)

Relaxing the relative tolerance on the loss tests since it was leading
to too many false positives. For reference, deviation in loss for the t5
model can sometimes be up to 15% at the start of training with real
data.
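A relative-tolerance check of this kind can be sketched as follows. This is a minimal illustration, not the code in the test scripts: the function name `within_relative_tolerance` and the `rtol` values are hypothetical, and the loss numbers are made up to show how a loose tolerance avoids a false positive.

```python
def within_relative_tolerance(actual: float, expected: float, rtol: float) -> bool:
    """Return True when |actual - expected| <= rtol * |expected|."""
    return abs(actual - expected) <= rtol * abs(expected)


# Hypothetical values: the observed loss is 12% above the baseline.
baseline_loss = 10.0
observed_loss = 11.2

# A tight tolerance flags the run as a failure ...
tight_ok = within_relative_tolerance(observed_loss, baseline_loss, rtol=0.05)
# ... while a looser tolerance accepts the same deviation.
loose_ok = within_relative_tolerance(observed_loss, baseline_loss, rtol=0.15)

print(f"tight tolerance passes: {tight_ok}")
print(f"loose tolerance passes: {loose_ok}")
```

The trade-off is the usual one: a looser tolerance produces fewer spurious failures but also lets smaller genuine regressions through.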

Adds rosetta-t5x TE + no-TE tests that enable the correct configs for testing (#332)

- [ ] Add capability to retroactively test with newer test-t5x.sh like
in
[t5x-wget-test](https://github.com/NVIDIA/JAX-Toolbox/tree/t5x-wget-test)
- [ ] Sets `ENABLE_TE=1` in the Dockerfile.t5x which is identical to the
logic from before where it was always enabled in rosetta-t5x

Fix markdown hyperlink for jax package on frontpage readme (#319)

Adds a --seed option to test-t5x.sh to ensure determinism (#344)

To ensure that the test results for a particular container are
reproducible between runs, this change introduces a seed argument that
sets the jax seed and dataset seed to 42. It remains configurable, but
now there shouldn't be variance given the same container.

- Also fixes a typo where --steps-per-epoch wasn't in the usage doc of
this script
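The idea behind a configurable seed with a fixed default can be sketched in Python. `test-t5x.sh` itself is a shell script, so this is only an illustration of the pattern: `parse_args` and `sample_batch_order` are hypothetical names, and the standard-library `random` module stands in for the jax and dataset seeds.

```python
import argparse
import random


def parse_args(argv):
    """Parse a --seed option that defaults to 42 but stays configurable."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=42)
    return parser.parse_args(argv)


def sample_batch_order(seed: int, num_examples: int = 8):
    """Shuffle example indices with a private RNG seeded from `seed`."""
    rng = random.Random(seed)
    order = list(range(num_examples))
    rng.shuffle(order)
    return order


args = parse_args([])  # no flags given, so the default seed is used
first_run = sample_batch_order(args.seed)
second_run = sample_batch_order(args.seed)
print(f"seed={args.seed}, identical order across runs: {first_run == second_run}")
```

Passing the same seed to every source of randomness is what removes run-to-run variance; overriding `--seed` changes the results but keeps them repeatable.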

Co-authored-by: NVIDIA <jax@nvidia.com>
Co-authored-by: Yu-Hang "Maxin" Tang <Tang.Maxin@gmail.com>

Dynamic workflow run names (#356)

This change introduces the dynamic [run name
field](https://github.blog/changelog/2022-09-26-github-actions-dynamic-names-for-workflow-runs/#:~:text=GitHub%20Actions%20customers%20can%20now,visit%20the%20GitHub%20Actions%20community.)
`run-name`.

It's currently difficult on mobile to find the "workflow_run" that
corresponds to a particular date, so hopefully this helps identify which
builds were nightly vs which builds were manually triggered.

I couldn't find a good way to dynamically look up the `name` field, so
for now I copied all of the names. I also wasn't able to find a "created_at"
for the scheduled workflows, so those don't have timestamps for now.

__Assumptions__:
* "workflow_run" == nightly since "scheduled" events only happen on
`main` and `workflow_run` are only run for concrete workflows and not
reusable workflows

- [x] Test the workflow_run codepath
- [x] Test the scheduled codepath

![image](https://github.com/NVIDIA/JAX-Toolbox/assets/7576060/4b916452-334a-4a73-9220-9fbadc70462f)

Fix random failing tests for backend_independent on V100 (#351)

Fixes random failures in the backend-independent section of JAX unit
tests:
```
Cannot find a free accelerator to run the test  on, exiting with failure
```

Changes: limit the number of concurrent test jobs even for
backend-independent tests, which do create GPU contexts.

As a clarification, `--jobs` and `--local_test_jobs` do not make a
difference for our particular CI pipeline, since JAX is built in a
separate CI job anyway.

References (From Reed Wanderman-Milne @ Google):

> 1. In particular, you have to set NB_GPUS, JOBS_PER_ACC, and J
correctly or you can get that error (I recently got the same error by
not setting those correctly)
> 2. (also I think --jobs should be --local_test_jobs in that code
block, no reason to restrict the number of jobs compiling JAX)
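The arithmetic behind the job limit can be sketched as below. This is a sketch under stated assumptions: the quote only says that `NB_GPUS`, `JOBS_PER_ACC`, and `J` must be set correctly, so the relation `J = NB_GPUS * JOBS_PER_ACC` is an assumption here, and the helper names are hypothetical. `--local_test_jobs` is the Bazel flag mentioned in the quote.

```python
def local_test_jobs(nb_gpus: int, jobs_per_acc: int) -> int:
    """Upper bound on concurrent test jobs: one slot per (GPU, job) pair.

    Assumes J = NB_GPUS * JOBS_PER_ACC, which the source does not state.
    """
    if nb_gpus < 1 or jobs_per_acc < 1:
        raise ValueError("need at least one GPU and one job per GPU")
    return nb_gpus * jobs_per_acc


def bazel_test_flags(nb_gpus: int, jobs_per_acc: int) -> list:
    """Build the flag that caps how many tests run at the same time."""
    j = local_test_jobs(nb_gpus, jobs_per_acc)
    return [f"--local_test_jobs={j}"]


# Hypothetical machine: 8 GPUs with 4 concurrent tests allowed per GPU.
print(bazel_test_flags(nb_gpus=8, jobs_per_acc=4))
```

If more tests start than there are slots, some of them find every accelerator busy, which is consistent with the "Cannot find a free accelerator" failure described above.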

Propagate error code in ViT tests (#357)

Merges rosetta unit tests and takes off the marker which spun up another matrix job (#360)

This should simplify the rosetta tests and save some time since another
matrix job was started for one test

Propagate build failures (#363)

Always run the `publish-build` step, regardless of whether the rosetta
pax/t5x build was attempted. This ensures that badges correctly reflect
build failures due to dependent builds failing.

Patch for JAX core container (ARM64) (#367)

Add patch to XLA to be able to build JAX core container for ARM64

Update the doc for USE_FP8 (#349)

This PR provides guidance on how to use the new configuration option,
`USE_FP8`, to enable native FP8 support on Hopper GPUs.

Update the native-fp8 guide with cudnn layer norm (#368)

This PR updates the guide to include the new flag to enable the cudnn
layer norm.

cc. @ashors1 @terrykong @nouiz

Add WAR for XLA NCCL bug causing OOMs (#362)

A stopgap for #346

fix TE multi-device test

fix lzma build issue

edit TE test name

fix TE arm64 test install error

remove --install option from get-source.sh

fix TE arm64 test install error

disable sandbox

i'm jet-lagged

use Pax image for TE testing

Fix job dependency
terrykong pushed a commit that referenced this pull request Dec 1, 2023
terrykong pushed a commit that referenced this pull request Dec 7, 2023
terrykong pushed a commit that referenced this pull request Dec 8, 2023
terrykong pushed a commit that referenced this pull request Dec 8, 2023