debug_nans error always says the de-optimized function did not produce NaNs #24955
Comments
Thanks, @emilyfertig! I noticed this a few months ago, started a PR to fix it, and then let it languish. This regressed when we did the jit/pjit merge more than a year ago. Let me see if I can revive the PR...
I noticed something else that might be related: the error message with
And here's the same code, run with 0.4.36:
Also, the current version no longer consistently prints "Invalid value encountered in the output of a jit function. Calling the de-optimized version." It sometimes does, but I haven't figured out how to reproduce that reliably. I tried flushing the log buffer, so I don't think buffering is the cause.
@mattjj If you have a start at a PR, I'd be happy to take it over (especially if you think it'd be a good way to learn about this part of the code and wouldn't be too much to bite off as I'm getting ramped up).
The above behavior (printing the call site only and not the line in the function where the NaN occurred) is more recent. 0.4.35 (released 10/22) still prints the exact line.
For now #24989 comments out parts of the docs/error message that aren't consistent with how the code behaves.
Culprit for the second issue appears to be 32bf19a
Description
I'm working on documentation for debug_nans and I wrote the following function, which for certain input values calls jnp.log on a negative number, producing a nan value.
It fails with this error, indicating that a NaN was returned from the compiled function but not fun.call_wrapped. It's the same if I replace log with sqrt, if I remove the jit decorator, or if I just call jnp.log on a negative value without jit.
The error message is misleading because NaNs are returned from the de-optimized functions as well, since it's taking the log of a negative value. I think something is going wrong in the code path taken in _pjit_call_impl_python but I can't tell what.
cc @yashk2810 since it looks like you've worked on this area of the code a fair amount.
System info (python version, jaxlib version, accelerator, etc.)
Reproducible across a few different environments, but e.g.:
jax: 0.4.36
jaxlib: 0.4.36
numpy: 2.1.3
python: 3.11.8 (stable, redacted, redacted) [Clang google3-trunk (f58ce1152703ca753794b8cef36da30bd2668d0f)]
device info: Tesla V100-SXM2-16GB-1, 1 local devices"
process_count: 1
platform: uname_result(system='Linux', node='b6e5614622812f47-3e7e1adbbf9.borgtask.google.com', release='5.10.0-smp-1104.53.0.0', version='#1 [v5.10.0-1104.53.0.0] SMP @1727505643', machine='x86_64')
$ nvidia-smi
Mon Nov 18 12:42:04 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-16GB Off | 00000000:B3:00.0 Off | 0 |
| N/A 41C P0 72W / 300W | 12433MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 829280 C ...fb3717c109/mount/server/ml_notebook 12430MiB |
+---------------------------------------------------------------------------------------+
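The version and device lines above resemble what jax.print_environment_info() reports. A minimal sketch for collecting the same details when comparing behavior across releases such as 0.4.35 and 0.4.36, assuming that helper is available in the installed JAX version:

```python
import jax

# return_string=True returns the report as a string instead of printing it,
# which makes it easy to paste into an issue or diff between environments.
info = jax.print_environment_info(return_string=True)
print(info)
```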