Segfault during CUBLAS logging #1062

Open
maleadt opened this issue Jul 22, 2021 · 6 comments
Labels
bug Something isn't working cuda libraries Stuff about CUDA library wrappers. needs information Further information is requested

Comments

@maleadt
Member

maleadt commented Jul 22, 2021

Apparently there's still some issue with the logger:
[image: screenshot of the segfault backtrace]

As reported by @femtomc, encountered on CUDA.jl 3.3.4 with JULIA_DEBUG=CUDA.

@maleadt maleadt added bug Something isn't working cuda libraries Stuff about CUDA library wrappers. labels Jul 22, 2021
@femtomc

femtomc commented Jul 22, 2021

Device info:

ubuntu in mbecker in ~ on ☁️  (us-east-2)
❯ neofetch
            .-/+oossssoo+/-.               ubuntu@mbecker
        `:+ssssssssssssssssss+:`           --------------
      -+ssssssssssssssssssyyssss+-         OS: Ubuntu 20.04.2 LTS x86_64
    .ossssssssssssssssssdMMMNysssso.       Host: t3.xlarge
   /ssssssssssshdmmNNmmyNMMMMhssssss/      Kernel: 5.8.0-1038-aws
  +ssssssssshmydMMMMMMMNddddyssssssss+     Uptime: 10 days, 1 hour, 9 mins
 /sssssssshNMMMyhhyyyyhmNMMMNhssssssss/    Packages: 772 (dpkg), 6 (snap)
.ssssssssdMMMNhsssssssssshNMMMdssssssss.   Shell: zsh 5.8
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Terminal: /dev/pts/4
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   CPU: Intel Xeon Platinum 8259CL (4) @ 2.499GHz
ossyNMMMNyMMhsssssssssssssshmmmhssssssso   GPU: 00:03.0 Amazon.com, Inc. Device 1111
+sssshhhyNMMNyssssssssssssyNMMMysssssss+   Memory: 1644MiB / 15827MiB
.ssssssssdMMMNhsssssssssshNMMMdssssssss.
 /sssssssshNMMMyhhyyyyhdNMMMNhssssssss/
  +sssssssssdmydMMMMMMMMddddyssssssss+
   /ssssssssssshdmNNNNmyNMMMMhssssss/
    .ossssssssssssssssssdMMMNysssso.
      -+sssssssssssssssssyyyssss+-
        `:+ssssssssssssssssss+:`
            .-/+oossssoo+/-.


ubuntu in mbecker in ~ on ☁️  (us-east-2)
❯ julia
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.6.1 (2021-04-23)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake-avx512)
Environment:
  JULIA_VERSION = 1.6.1

@maleadt
Member Author

maleadt commented Jul 22, 2021

Haven't been able to reproduce; I let the cublas tests run for a couple of hours with JULIA_DEBUG=CUDA set...

@femtomc

femtomc commented Jul 22, 2021

@maleadt At the very least, that convinces me it might not be a CUDA issue -- but rather something in Distributed related to task handling.

I was also able to produce segfaults by trying to log information from the GPU on a task before moving it to the CPU with cpu.

@femtomc

femtomc commented Jul 22, 2021

It is possibly better to change the title of the issue as I continue to investigate.

@ericphanson

@femtomc pointed out this could be related to #1314 (I believe the issue did occur with CUDA in a sysimage)

@maleadt
Member Author

maleadt commented Jan 12, 2022

That's likely, as these callbacks also use @cfunction (ref JuliaLang/julia#43748). On the other hand, the backtrace here points only to Julia code, so the crash likely happened on a Julia thread.
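
For readers unfamiliar with the mechanism mentioned above, here is a minimal sketch of a C-style logging callback built with @cfunction. It is not the CUDA.jl implementation: `log_message`, `simulate_library_call`, and `RECEIVED` are hypothetical names, and the only assumption taken from cuBLAS is that its logger callback receives a single C string. The "library" is simulated by calling the function pointer directly from Julia.

```julia
# Messages collected by the callback. A global is used because a plain
# (non-closure) @cfunction cannot capture local variables.
const RECEIVED = String[]

# Hypothetical logger callback: receives a C string and copies it into
# a Julia String before storing it.
function log_message(msg::Cstring)::Cvoid
    push!(RECEIVED, unsafe_string(msg))
    return
end

# Stand-in for the C library: invoke the callback through its raw pointer,
# the way a library would after the callback has been registered.
function simulate_library_call(callback::Ptr{Cvoid}, text::String)
    ccall(callback, Cvoid, (Cstring,), text)
end

# Turn the Julia function into a C-callable function pointer.
callback = @cfunction(log_message, Cvoid, (Cstring,))

simulate_library_call(callback, "hello from the library")
```

Note that this sketch invokes the callback from a Julia thread. It does not exercise the case where a library calls back from a thread it created itself.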
