P3 - Nvidia decoding sometimes returns CUDA_ERROR_UNKNOWN #239

Open
jailuthra opened this issue Jun 15, 2021 · 3 comments

@jailuthra
Contributor

jailuthra commented Jun 15, 2021

Debug CUDA_ERROR_UNKNOWN errors. Why? We should follow up, but this is hard to debug until the P2s are addressed, and the errors seem to have stopped.

Describe the bug
GPU video decoding fails with CUDA_ERROR_UNKNOWN, and the user has to restart the node before future segments can be decoded. The error is sometimes paired with CUDA_ERROR_OUT_OF_MEMORY or CUDA_ERROR_ILLEGAL_ADDRESS.

To Reproduce
Steps to reproduce the behavior:

  • Unclear as of now.

Expected behavior
Decrease the blast radius of these errors if possible, and figure out the root cause.

Screenshots
ERROR_UNKNOWN
image

ERROR_ILLEGAL_ADDRESS
image

ERROR_OUT_OF_MEMORY
image

Additional context

Stack trace for future reference:

  • LPMS: https://github.com/livepeer/lpms/blob/master/ffmpeg/decoder.c#L250
  • FFmpeg entry point: https://github.com/FFmpeg/FFmpeg/blob/870bfe16a12bf09dca3a4ae27ef6f81a2de80c40/libavutil/hwcontext.c#L610
  • Most probable line causing the error: https://github.com/FFmpeg/FFmpeg/blob/870bfe16a12bf09dca3a4ae27ef6f81a2de80c40/libavutil/hwcontext.c#L629
  • CUDA-specific context creation routine: https://github.com/FFmpeg/FFmpeg/blob/870bfe16a12bf09dca3a4ae27ef6f81a2de80c40/libavutil/hwcontext_cuda.c#L379
  • cuCtxCreate call: https://github.com/FFmpeg/FFmpeg/blob/870bfe16a12bf09dca3a4ae27ef6f81a2de80c40/libavutil/hwcontext_cuda.c#L363

@jailuthra jailuthra added the bug label Jun 15, 2021
@yondonfu yondonfu changed the title from "GPU decoding failure causes node to be stuck in crash loop." to "Nvidia decoding sometimes returns CUDA_ERROR_UNKNOWN" Nov 30, 2022
@yondonfu
Member

Update here:

@Quintendos Quintendos changed the title from "Nvidia decoding sometimes returns CUDA_ERROR_UNKNOWN" to "P3 - Nvidia decoding sometimes returns CUDA_ERROR_UNKNOWN" Dec 7, 2022
@boratuncer

boratuncer commented Jan 8, 2023

Hi,

I'm facing exactly the same issue with livepeer V0.5.35, Nvidia driver 525.78.01, CUDA version 12.0, and power state P0.

It happens on some streams, not all, but I have hit this error 28 times in the last 24 hours.

@boratuncer

Hi @yondonfu,

I have a solution for this. It is not a perfect one, but when the error happens I invoke the transcoder service again. So basically there are two transcoders for a moment, and one is terminated automatically within 3-4 seconds. This prevents me from losing streams because of this CUDA error. Please consider this, or a better implementation, for the next releases :)
