warning msg/documentation on the tf32 related system flags and usage #6754

Closed
wyli opened this issue Jul 21, 2023 · 9 comments

Comments

@wyli
Contributor

wyli commented Jul 21, 2023

(Follow-up of #6525) My larger concern is that other operations in MONAI will also be affected by the TF32 issue, since every operation that uses cuda.matmul is affected. This may lead to significant reproducibility issues.

My proposal is adding something like
https://github.com/Lightning-AI/lightning/pull/16037/files#diff-909e246d6c36514f952ae5023bd9fbcc3e8f2c6a0837ebf81d7dc96790b5f938R190-R210
to the related classes/functions in MONAI, so that MONAI prints a warning when the flag is True. I'm not sure when it is best to print the warnings, maybe during import? The warnings could perhaps be suppressed when the flag is explicitly set by the user, but that seems technically challenging.
In addition, I propose adding a section to the documentation that explains how to use TF32 properly.

Originally posted by @qingpeng9802 in #6525 (comment)
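
A minimal sketch of the kind of import-time check proposed above, assuming only the two environment variables that come up later in this thread. `detect_tf32_env_flags` and `warn_if_tf32_flags` are hypothetical helpers for illustration, not MONAI's or Lightning's actual implementation. The environment is passed in as an argument so the check can be tried without a GPU.

```python
import os
import warnings


def detect_tf32_env_flags(environ=None):
    """Return the names of TF32-related environment variables that are set to "1".

    Hypothetical helper: the two variable names are the ones discussed in
    this thread; nothing else is inspected.
    """
    environ = os.environ if environ is None else environ
    candidates = ("NVIDIA_TF32_OVERRIDE", "TORCH_ALLOW_TF32_CUBLAS_OVERRIDE")
    return [name for name in candidates if environ.get(name) == "1"]


def warn_if_tf32_flags(environ=None):
    """Emit one UserWarning per detected flag and return the detected names."""
    found = detect_tf32_env_flags(environ)
    for name in found:
        warnings.warn(
            f"Environment variable `{name} = 1` is set. "
            "This may enable TF32 mode and affect precision.",
            UserWarning,
            stacklevel=2,
        )
    return found
```

Calling `warn_if_tf32_flags()` once at import time would cover the "maybe during import?" option; whether that is the right moment is exactly the open question in the comment.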

@qingpeng9802
Contributor

Also, found a related part in the repo, fyi

MONAI/tests/utils.py

Lines 173 to 198 in 2800a76

def is_tf32_env():
    """
    The environment variable NVIDIA_TF32_OVERRIDE=0 will override any defaults
    or programmatic configuration of NVIDIA libraries, and consequently,
    cuBLAS will not accelerate FP32 computations with TF32 tensor cores.
    """
    global _tf32_enabled
    if _tf32_enabled is None:
        _tf32_enabled = False
        if (
            torch.cuda.is_available()
            and not version_leq(f"{torch.version.cuda}", "10.100")
            and os.environ.get("NVIDIA_TF32_OVERRIDE", "1") != "0"
            and torch.cuda.device_count() > 0  # at least 11.0
        ):
            try:
                # with TF32 enabled, the speed is ~8x faster, but the precision has ~2 digits less in the result
                g_gpu = torch.Generator(device="cuda")
                g_gpu.manual_seed(2147483647)
                a_full = torch.randn(1024, 1024, dtype=torch.double, device="cuda", generator=g_gpu)
                b_full = torch.randn(1024, 1024, dtype=torch.double, device="cuda", generator=g_gpu)
                _tf32_enabled = (a_full.float() @ b_full.float() - a_full @ b_full).abs().max().item() > 0.001  # 0.1713
            except BaseException:
                pass
        print(f"tf32 enabled: {_tf32_enabled}")
    return _tf32_enabled
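
To make the precision loss mentioned in the code comment concrete, here is a small CPU-only sketch. TF32 keeps 10 explicit mantissa bits, against 23 for float32, so the sketch zeroes the low 13 mantissa bits of a float32 value. `truncate_to_tf32` is a hypothetical helper, and truncation is a simplifying assumption: real hardware may round instead, and only the matmul inputs are reduced to TF32.

```python
import struct


def truncate_to_tf32(x: float) -> float:
    """Convert x to float32, then zero the low 13 mantissa bits.

    This mimics the 10-bit mantissa of TF32 by truncation (an assumption;
    hardware may round to nearest instead).
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    (y,) = struct.unpack("<f", struct.pack("<I", bits))
    return y


def relative_error(x: float) -> float:
    """Relative error introduced by the TF32-style truncation of x."""
    return abs(truncate_to_tf32(x) - x) / abs(x)
```

With this model, the relative error of a single input stays below `2**-10`, which is roughly three significant decimal digits; a value such as `1 + 2**-11` collapses to `1.0`.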

wyli pushed a commit that referenced this issue Jul 26, 2023
about  #6754 .


### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u
--net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick
--unittests --disttests`.
- [ ] In-line docstrings updated.
- [x] Documentation updated, tested `make html` command in the `docs/`
folder.

---------

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>
@qingpeng9802 qingpeng9802 mentioned this issue Aug 3, 2023
7 tasks
wyli pushed a commit that referenced this issue Aug 7, 2023
about #6754 .

### Description

Show a warning if anything that may enable TF32 is detected.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u
--net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick
--unittests --disttests`.
- [x] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/`
folder.

---------

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>
@wyli wyli closed this as completed Aug 10, 2023
@myron
Collaborator

myron commented Aug 15, 2023

@qingpeng9802 @wyli

Guys, I understand what you're trying to do, but I train on multiple GPUs and the screen fills up with warnings, which is a bit overwhelming.

a) Is there a way to disable these warnings? (I do know that TF32 is enabled.)
b) Does this new check introduce some overhead? It seems like every process in DDP runs it separately.

/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.                
  This environment variable may enable TF32 mode accidentally and affect precision.                                                                             
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating                                                                    
  warnings.warn(                                                                                                                                                
/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.                
  This environment variable may enable TF32 mode accidentally and affect precision.                                                                             
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating                                                                    
  warnings.warn(                                                                                                                                                
/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.                
  This environment variable may enable TF32 mode accidentally and affect precision.                                                                             
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating                                                                    
  warnings.warn(                                                                                                                                                
/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.                
  This environment variable may enable TF32 mode accidentally and affect precision.                                                                             
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating                                                                    
  warnings.warn(                                                                                                                                                
/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.                
  This environment variable may enable TF32 mode accidentally and affect precision.                                                                             
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating                                                                    
  warnings.warn(                                                                                                                                                
/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.                
  This environment variable may enable TF32 mode accidentally and affect precision.                                                                             
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating                                                                    
  warnings.warn(                                                                                                                                                
/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.                
  This environment variable may enable TF32 mode accidentally and affect precision.                                                                             
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating                                                                    
  warnings.warn(                                                                                                                                                
/usr/local/lib/python3.10/dist-packages/monai/utils/tf32.py:76: UserWarning: Environment variable `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE = 1` is set.                
  This environment variable may enable TF32 mode accidentally and affect precision.                                                                             
  See https://docs.monai.io/en/latest/precision_accelerating.html#precision-and-accelerating 

@qingpeng9802
Contributor

a) There is currently no way to disable it. Maybe we can add an environment variable such as MONAI_ALLOW_TF32, like other libraries did.
b) I'm not sure how the code is called. My guess is that each subprocess imports monai once here. If that guess is correct, the overhead should be acceptable.

Could you provide the code snippets related to importing monai and DDP?

@wyli
Contributor Author

wyli commented Aug 15, 2023

I think the main ambiguity from a user's perspective often comes from this particular setting: export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 together with torch.backends.cuda.matmul.allow_tf32=False, which will enable TF32. How about we only warn for this setting?

@qingpeng9802
Contributor

> I think the main ambiguity from a user's perspective is often from this particular setting: export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 and torch.backends.cuda.matmul.allow_tf32=False, which will enable tf32. how about we only warn this setting?

It is actually a bit more complicated.
When TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1, PyTorch sets torch.backends.cuda.matmul.allow_tf32 to True and uses TF32.
When NVIDIA_TF32_OVERRIDE=1, PyTorch does not set torch.backends.cuda.matmul.allow_tf32 to True, but TF32 is still used (internally, by the NVIDIA libraries).
Thus, it is rather hard to infer the user's intention.
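
The two cases can be written down as a small decision function. This is only a model of the behaviour as described in this thread, assuming a TF32-capable GPU; `tf32_matmul_expected` is a hypothetical name and the function does not query PyTorch or the NVIDIA libraries.

```python
def tf32_matmul_expected(environ, allow_tf32_flag=False):
    """Model whether float32 matmul is expected to use TF32.

    Returns (effective_flag, uses_tf32), where effective_flag stands for the
    value of torch.backends.cuda.matmul.allow_tf32 after the environment is
    taken into account. Encodes the statements made in this thread only.
    """
    # Per the thread, this variable makes PyTorch set the flag to True.
    flag = allow_tf32_flag or environ.get("TORCH_ALLOW_TF32_CUBLAS_OVERRIDE") == "1"

    nvidia = environ.get("NVIDIA_TF32_OVERRIDE")
    if nvidia == "0":
        # Per the docstring quoted earlier: overrides everything, TF32 is off.
        return flag, False
    if nvidia == "1":
        # Per the thread: TF32 is used inside the NVIDIA libraries,
        # while the PyTorch flag is left untouched.
        return flag, True
    return flag, flag
```

The ambiguous case is the one where the flag reads `False` but TF32 is still in use, for example `tf32_matmul_expected({"NVIDIA_TF32_OVERRIDE": "1"})`, which returns `(False, True)`.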

@wyli
Contributor Author

wyli commented Aug 15, 2023

> When TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1, PyTorch will set torch.backends.cuda.matmul.allow_tf32 to True, and uses tf32.

OK, it looks like this is consistent in PyTorch 2.0, so I think there's no need to warn in this case?

> When NVIDIA_TF32_OVERRIDE=1, PyTorch will not set torch.backends.cuda.matmul.allow_tf32 to True, and uses tf32 (by NVIDIA lib internally) Thus, it is kind of hard to infer the user's intention.

I don't think NVIDIA_TF32_OVERRIDE should be set in regular use cases, because it potentially changes all the underlying libraries/frameworks; our current code correctly warns in this case.

Since PyTorch's behavior on this topic changed across earlier versions, perhaps we can focus on proper warnings for torch>=2.0 only. What do you think?

@qingpeng9802
Contributor

qingpeng9802 commented Aug 15, 2023

TORCH_ALLOW_TF32_CUBLAS_OVERRIDE affects the precision via https://github.com/pytorch/pytorch/blob/v2.0.1-rc4/aten/src/ATen/Context.h#L294, and this was introduced by the change that the issue Lightning-AI/pytorch-lightning#12997 mentions. So the version boundary should be 1.12? (Not sure.)

> ok, looks like it's consistent in pytorch 2.0, then I think there's no need to warn in this case?

PyTorch's behavior is consistent, but it seems a bit hard for users to troubleshoot, as the issue that led to this one shows. This is essentially a tradeoff between bothering experienced users and helping inexperienced ones.

I would suggest adding an environment variable as a flag to suppress the warnings. There is a similar idea in huggingface/transformers#16588 (comment)
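
A minimal sketch of such a suppression flag, assuming the variable is named `MONAI_ALLOW_TF32` as floated earlier in this thread. Both the name and the helper `should_warn_tf32` are hypothetical; nothing here is confirmed to exist in MONAI.

```python
import os

# Name floated earlier in this thread; treated here as hypothetical.
SUPPRESS_VAR = "MONAI_ALLOW_TF32"
TF32_VARS = ("NVIDIA_TF32_OVERRIDE", "TORCH_ALLOW_TF32_CUBLAS_OVERRIDE")


def should_warn_tf32(environ=None):
    """Decide whether a TF32 warning should be shown.

    Returns False when the user has acknowledged TF32 through SUPPRESS_VAR,
    otherwise True if any TF32-related variable is set to "1".
    """
    environ = os.environ if environ is None else environ
    if environ.get(SUPPRESS_VAR) == "1":
        return False  # user opted in explicitly, stay quiet
    return any(environ.get(name) == "1" for name in TF32_VARS)
```

An experienced user would then export `MONAI_ALLOW_TF32=1` once, while everyone else keeps seeing the warning by default.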

@myron
Collaborator

myron commented Aug 27, 2023

Guys, I'm running AutoRunner() from MONAI on 8 GPUs, and these warnings are overwhelming. They were printed 16 times (probably from DataAnalyzer(), which creates several parallel processes), and then another 8 times when training started.

Can we please disable these warnings, or at least show them just once instead of so many times? Thank you.
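
One way to get "just one time" behaviour is sketched below: warn at most once per process, and only in the process whose `RANK` is 0. It assumes the launcher (for example torchrun) exports `RANK`; `warn_once_on_rank_zero` is a hypothetical helper, not what MONAI does.

```python
import os
import warnings

_already_warned = False  # per-process guard


def warn_once_on_rank_zero(message, environ=None):
    """Warn at most once per process, and only when RANK is "0".

    A process without RANK in its environment is treated as rank 0.
    Returns True if a warning was emitted by this call.
    """
    global _already_warned
    environ = os.environ if environ is None else environ
    if _already_warned or environ.get("RANK", "0") != "0":
        return False
    _already_warned = True
    warnings.warn(message, UserWarning, stacklevel=2)
    return True
```

Worker processes that are spawned without `RANK` in their environment would still each warn once under this scheme, so it reduces the noise from DDP ranks but would not by itself silence every parallel helper process.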

@wyli
Contributor Author

wyli commented Aug 27, 2023

Thanks @myron, I'm creating a feature request and will have a look soon.
