Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugs/1627 bug test vmap fails on multi node runs on hardware accelerators #1738

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

mrfh92
Copy link
Collaborator

@mrfh92 mrfh92 commented Dec 4, 2024

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • benchmarks: created for new functionality
    • benchmarks: performance improved or maintained
    • documentation updated where needed

Description

Issue/s resolved: #1627

Changes proposed:

default accuracy in allclose is to high for one test on GPUs and some CPUs

Type of change

decrease accuracy (to tolerance 1e-4, which is still small enough to exclude passing of the test by lucky chance)

@mrfh92 mrfh92 marked this pull request as ready for review December 4, 2024 15:30
Copy link
Contributor

github-actions bot commented Dec 5, 2024

Thank you for the PR!

@mrfh92
Copy link
Collaborator Author

mrfh92 commented Dec 5, 2024

seems to be even more strange: the GitHub runner also only achieves tol=1e-4 on CPU for the respective test
also, the CPU runners for the first check requires lower tol as well

Copy link

codecov bot commented Dec 5, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.26%. Comparing base (3082dd9) to head (2d42080).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1738   +/-   ##
=======================================
  Coverage   92.26%   92.26%           
=======================================
  Files          84       84           
  Lines       12447    12447           
=======================================
  Hits        11484    11484           
  Misses        963      963           
Flag Coverage Δ
unit 92.26% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

github-actions bot commented Dec 9, 2024

Thank you for the PR!

1 similar comment
Copy link
Contributor

github-actions bot commented Dec 9, 2024

Thank you for the PR!

@mrfh92 mrfh92 self-assigned this Dec 9, 2024
@mrfh92 mrfh92 added bug Something isn't working testing Implementation of tests, or test-related issues manipulations labels Dec 9, 2024
@mrfh92 mrfh92 requested a review from JuanPedroGHM December 9, 2024 16:43
Copy link
Contributor

github-actions bot commented Dec 9, 2024

Thank you for the PR!

@mrfh92 mrfh92 added the PR talk label Dec 16, 2024
Copy link
Contributor

Thank you for the PR!

Copy link
Member

@JuanPedroGHM JuanPedroGHM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some tests, tried with extra configurations on CPU and GPU. I could not find why the error is so large when running on GPU. Test pass with lower tolerance when using float64.

Here is how I tested the multiple configurations, I would recommend rewriting the test like this.

def test_vmap_with_chunks(self):
            x1_splits = [None, 1]
            chunk_sizes = list(range(1,5))
            dtypes = [ht.float32, ht.float64]
            for x1_split in x1_splits:
                for cs in chunk_sizes:
                    for dtype in dtypes: 
                        with self.subTest(x1_split=x1_split, chunk_size=cs, dtype=dtype):
                            # same as before but now with prescribed chunk sizes for the vmap
                            x0 = ht.random.randn(5 * ht.MPI_WORLD.size, 10, 10, split=0, dtype=dtype)
                            x1 = ht.random.randn(10, 5 * ht.MPI_WORLD.size, split=x1_split, dtype=dtype)
                            out_dims = (0, 0)

                            def func(x0, x1, k=2, scale=1e-2):
                                return torch.topk(torch.linalg.svdvals(x0), k)[0] ** 2, scale * x0 @ x1

                            vfunc = ht.vmap(func, out_dims, chunk_size=cs)
                            y0, y1 = vfunc(x0, x1, k=2, scale=-2.2)

                            # compare with torch
                            x0_torch = x0.resplit(None).larray
                            x1_torch = x1.resplit(None).larray
                            vfunc_torch = torch.vmap(func, (0, x1_split), out_dims)
                            y0_torch, y1_torch = vfunc_torch(x0_torch, x1_torch, k=2, scale=-2.2)

                            self.assertTrue(torch.allclose(y0.resplit(None).larray, y0_torch))
                            tol = 1e-12 if dtype == ht.float64 else 1e-4
                            self.assertTrue(torch.allclose(y1.resplit(None).larray, y1_torch, atol=tol))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport release bug Something isn't working core manipulations PR talk testing Implementation of tests, or test-related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: test_vmap fails on multi-node runs on hardware accelerators
2 participants