
Question about stuck condition, probably due to constant GPU memory purging #38

Open

Muxas opened this issue Mar 12, 2024 · 4 comments

Labels: question (Further information is requested)

Comments

@Muxas

Muxas commented Mar 12, 2024

Hi again!

As you already know, I like StarPU! This time I ran into a stuck condition. Setting STARPU_WATCHDOG_TIMEOUT=1000000 (1 second) showed that, for some reason, no tasks finish on a server with GPUs for a very long time. I believe the problem is constant loading and purging of memory buffers: a task requires two input buffers, but there is not enough GPU memory, so the memory manager purges buffer 1 to make room for buffer 2, and then purges buffer 2 to make room for buffer 1. In the end, no task completes for several minutes (more than 100 watchdog messages, one second apart, reporting that no task finished in the last second). It starts happening when I increase the problem size (the number of tasks grows while the size of each task stays the same) on the same hardware; the more tasks there are to compute, the more likely the run is to get stuck. Any advice on how to solve this? Have you run into this problem before, and how did you solve it?
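For reference, the submission pattern is roughly the following (a simplified sketch; the codelet and kernel names are placeholders, not my real code):

```c
#include <starpu.h>

/* Placeholder kernel: each task reads two previously registered buffers. */
static void two_input_kernel(void *buffers[], void *cl_arg)
{
    (void)buffers; (void)cl_arg;
    /* ... launch the CUDA kernel on the two inputs ... */
}

static struct starpu_codelet two_input_cl =
{
    .cuda_funcs = {two_input_kernel},
    .nbuffers   = 2,
    .modes      = {STARPU_R, STARPU_R},
};

/* Both inputs are read-only; when the GPU cannot hold both at once,
 * the memory manager appears to evict one input to load the other,
 * then evict the other to reload the first. */
static void submit_one_task(starpu_data_handle_t in1, starpu_data_handle_t in2)
{
    starpu_task_insert(&two_input_cl, STARPU_R, in1, STARPU_R, in2, 0);
}
```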

During the stuck period I see no changes in nvidia-smi: memory utilization stays the same, while GPU utilization is 0 percent (no computing is done).

Thank you!

@sthibaul
Collaborator

To understand what is happening, it would be useful to produce execution traces; see https://files.inria.fr/starpu/doc/html/OfflinePerformanceTools.html#GeneratingTracesWithFxT
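Roughly, assuming StarPU was configured with --with-fxt, something along these lines should produce a paje.trace (this is only a sketch; the environment variables can just as well be exported in the shell before launching the application):

```c
#include <stdlib.h>
#include <starpu.h>

int main(void)
{
    /* Directory where the raw FxT trace (prof_file_<user>_<id>) is written. */
    setenv("STARPU_FXT_PREFIX", "/tmp/", 1);
    /* Ask StarPU to convert the raw trace into paje.trace at shutdown
     * (otherwise run starpu_fxt_tool -i on the prof_file by hand). */
    setenv("STARPU_GENERATE_TRACE", "1", 1);

    if (starpu_init(NULL) != 0)
        return 1;

    /* ... submit the application tasks as usual ... */

    starpu_task_wait_for_all();
    starpu_shutdown();   /* the trace is finalized here */
    return 0;
}
```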

@sthibaul added the question (Further information is requested) label on Mar 12, 2024
@Muxas
Author

Muxas commented Mar 14, 2024

Here is the trace for such a situation (starpu-1.4 branch, commit f1f915c7e622e8ead7feb7c044947c8bf2b29a3a, remote gitlab.inria.fr/starpu/starpu.git):
datarace.paje.trace.tar.gz

@sthibaul added and removed the question (Further information is requested) label on Mar 15, 2024
@sthibaul
Collaborator

Mmm, the trace does not show any long period of idleness, only some cases where 500 ms is apparently spent in a single allocation or free. How did you end the trace? Normally we have a SIGINT handler that writes the end of the trace.

@Muxas
Author

Muxas commented Mar 15, 2024

How did you end the trace?

The environment variable STARPU_WATCHDOG_CRASH=1 did it for me; the watchdog timeout was set to 1 second.
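Roughly, the run was equivalent to the following sketch (application code omitted; the variables were actually set in the environment):

```c
#include <stdlib.h>
#include <starpu.h>

int main(void)
{
    /* Complain if no task completes within 1 second (value in microseconds). */
    setenv("STARPU_WATCHDOG_TIMEOUT", "1000000", 1);
    /* Abort the process when the watchdog fires instead of only warning;
     * this is what ended the run, and hence the trace. */
    setenv("STARPU_WATCHDOG_CRASH", "1", 1);

    if (starpu_init(NULL) != 0)
        return 1;

    /* ... task submission ... */

    starpu_task_wait_for_all();
    starpu_shutdown();
    return 0;
}
```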
