2.6.0 freezes then crashes while rotating model on one system, works fine on another #10968

Open
1 of 2 tasks
DavidCPlatt opened this issue Jul 11, 2023 · 10 comments

@DavidCPlatt

Description of the bug

I've run into a reproducible problem when attempting to run 2.6.0 on one of my Debian systems. The same 2.6.0 appimage works fine on another Debian system, and the Debian distro build of 2.5.0 works fine on both.

The affected system is a 4-core i5-3470, using the integrated IvyBridge video adapter and GPU. The GPU appears to be relevant to the issue.

Symptoms:

When attempting to free-rotate the model before slicing, the system freezes for a few seconds, the rotation arrows appear briefly and sporadically on the screen, and the program crashes with an "aborted" error and no further useful information.

During the freeze, the whole system appears to be frozen... and using the "intel-gpu-top" monitoring program seems to show that it was in 100% "render" state during the freeze.

Performing the same operation under the Debian distro build of 2.5.0 does not exhibit the problem. Neither does the 2.6.0 AppImage when running on my Debian "testing" laptop.

I cloned the GitHub repo and built the 2.6.0 version without difficulty. It exhibits the same crashing behavior, so it's apparently not an AppImage-related issue.

I ran the version I built under gdb, reproduced the error, and generated a backtrace. The abort() is occurring within the "crocus" DRI library, called during a buffer swap by wxGLCanvasX11 during rendering.

Thread 1 "slic3r_main" received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=, signo=signo@entry=6,
no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation
(threadid=, signo=signo@entry=6, no_tid=no_tid@entry=0)
at ./nptl/pthread_kill.c:44
#1 0x00007ffff6b90d2f in __pthread_kill_internal
(signo=6, threadid=) at ./nptl/pthread_kill.c:78
#2 0x00007ffff6b41ef2 in __GI_raise (sig=sig@entry=6)
at ../sysdeps/posix/raise.c:26
#3 0x00007ffff6b2c472 in __GI_abort () at ./stdlib/abort.c:79
#4 0x00007fffda49e18f in () at /usr/lib/x86_64-linux-gnu/dri/crocus_dri.so
#5 0x00007fffdb41ae69 in () at /usr/lib/x86_64-linux-gnu/dri/crocus_dri.so
#6 0x00007fffda576946 in () at /usr/lib/x86_64-linux-gnu/dri/crocus_dri.so
#7 0x00007fffda4b2c2a in () at /usr/lib/x86_64-linux-gnu/dri/crocus_dri.so
#8 0x00007fffef64ec3e in glLabelObjectEXT ()
at /lib/x86_64-linux-gnu/libGLX_mesa.so.0
#9 0x00007fffef640ab1 in () at /lib/x86_64-linux-gnu/libGLX_mesa.so.0
#10 0x00007fffef63023b in () at /lib/x86_64-linux-gnu/libGLX_mesa.so.0
#11 0x000055555759dbe0 in wxGLCanvasX11::SwapBuffers() ()
#12 0x0000555556892947 in Slic3r::GUI::GLCanvas3D::render() ()
#13 0x0000555556892d7b in Slic3r::GUI::GLCanvas3D::_refresh_if_shown_on_screen() ()
#14 0x000055555689393b in Slic3r::GUI::GLCanvas3D::on_mouse(wxMouseEvent&) ()
#15 0x0000555557afe692 in wxEvtHandler::ProcessEventIfMatchesId(wxEventTableEntryBase const&, wxEvtHandler*, wxEvent&) ()
#16 0x0000555557afea37 in wxEvtHandler::SearchDynamicEventTable(wxEvent&) ()
#17 0x0000555557afeb90 in wxEvtHandler::TryHereOnly(wxEvent&) ()
#18 0x0000555557afec3a in wxEvtHandler::ProcessEventLocally(wxEvent&) ()
#19 0x0000555557afece1 in wxEvtHandler::ProcessEvent(wxEvent&) ()
#20 0x0000555557aff5a7 in wxEvtHandler::SafelyProcessEvent(wxEvent&) ()
#21 0x0000555557846107 in gtk_window_enter_callback ()
#22 0x00007ffff7290cb4 in () at /lib/x86_64-linux-gnu/libgtk-3.so.0
#23 0x00007ffff7aad5a9 in () at /lib/x86_64-linux-gnu/libgobject-2.0.so.0
#24 0x00007ffff7ac605e in g_signal_emit_valist ()
at /lib/x86_64-linux-gnu/libgobject-2.0.so.0
#25 0x00007ffff7ac6dbf in g_signal_emit ()
at /lib/x86_64-linux-gnu/libgobject-2.0.so.0
#26 0x00007ffff75697d4 in () at /lib/x86_64-linux-gnu/libgtk-3.so.0
#27 0x00007ffff7409286 in gtk_main_do_event ()
at /lib/x86_64-linux-gnu/libgtk-3.so.0
#28 0x00007ffff7b32815 in () at /lib/x86_64-linux-gnu/libgdk-3.so.0
#29 0x00007ffff7b8c702 in () at /lib/x86_64-linux-gnu/libgdk-3.so.0
#30 0x00007ffff6e1a7a9 in g_main_context_dispatch ()
at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#31 0x00007ffff6e1aa38 in () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#32 0x00007ffff6e1acef in g_main_loop_run ()
at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#33 0x00007ffff7408435 in gtk_main () at /lib/x86_64-linux-gnu/libgtk-3.so.0
#34 0x000055555782a735 in wxGUIEventLoop::DoRun() ()
#35 0x00005555579b1b9d in wxEventLoopBase::Run() ()
#36 0x000055555795280d in wxAppConsoleBase::OnRun() ()
#37 0x0000555557a38964 in wxEntry(int&, wchar_t**) ()
#38 0x000055555640a077 in Slic3r::GUI::GUI_Run(Slic3r::GUI::GUI_InitParams&) ()
#39 0x0000555555a70e49 in Slic3r::CLI::run(int, char**) ()
#40 0x0000555555a346a3 in main ()

After reading up on the "crocus" DRI library and its current state, I tried re-running the test, setting MESA_LOADER_DRIVER_OVERRIDE=i965 in the environment (thus forcing the use of the legacy i965 DRI engine rather than crocus).

The problem went away. The freeze was gone, the rotation arrows and the rotation process operated smoothly, intel-gpu-top showed reasonable render/3D usage, and the program did not crash.

My hunch: some change in the canvas-drawing code in 2.6.0 (or in wxGLCanvasX11) is issuing Mesa operations which are tickling a bug of some sort in the new crocus renderer. Maybe a pathological pattern of objects, maybe something which triggers memory or other resource exhaustion... ??

Project file & How to reproduce

2.6.0-rendering-crash-in-crocus.zip

To reproduce:

  1. Open project.
  2. Rotate model image so you're looking at it "level, eyes on" from the front of the build plate.
  3. Click the "rotate" button.
  4. [Try to] rotate the image clockwise by clicking-and-dragging the arrows.

If the problem is not present, everything will work fine.

If the problem is present, you may notice some or all of the following:

  • In step 3, when you click the "rotate" button or hit R, the screen will freeze for several seconds. If you're running intel-gpu-top, then after the screen un-freezes it will display a very high (80-100%) rendering burden for a second or so.
  • The rotation axis indicators and arrows may or may not appear. If they do appear, they may vanish immediately, before you have a chance to grab them.
  • If you do grab a rotation arrow and try to rotate the model (say, 30-45 degrees) it will not rotate smoothly. It may not rotate at all, or may rotate just a short distance. The application may then crash in the crocus DRI code.

If you do succeed in reproducing the problem, then try

MESA_LOADER_DRIVER_OVERRIDE=i965 /path/to/prusa-slicer

or

MESA_LOADER_DRIVER_OVERRIDE=i965 gdb /path/to/prusa-slicer

and observe that the problem does not occur with the DRI driver override in place.
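
If you'd rather bake the override into a launcher than type it each time, here is a minimal Python sketch. It assumes the same placeholder path as the commands above. The helper names (`env_with_driver_override`, `launch`) are made up for illustration; only the `MESA_LOADER_DRIVER_OVERRIDE` variable comes from the steps above.

```python
import os
import subprocess


def env_with_driver_override(driver, base_env=None):
    """Return a copy of the environment with the Mesa driver override set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MESA_LOADER_DRIVER_OVERRIDE"] = driver
    return env


def launch(binary, driver="i965"):
    """Run the given binary with the override applied and return its exit code."""
    return subprocess.call([binary], env=env_with_driver_override(driver))
```

Calling `launch("/path/to/prusa-slicer")` should behave like the first command above. The parent environment is copied, not modified.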

Checklist of files included above

  • Project file
  • Screenshot

Version of PrusaSlicer

2.6.0 (using github tag)

Operating system

Linux, Debian "bookworm"

Printer model

Original Prusa i3 MK3S & MK3S+

@pellcorp

I also have this issue on 2.6.0, with both the snap and the GTK3 AppImage.

I am using the following version of Ubuntu:

Linux shed 5.19.0-46-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

I am also running a 3rd-gen CPU: Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz

I can confirm that MESA_LOADER_DRIVER_OVERRIDE=i965 also fixes my issue.

@DavidCPlatt
Author

That's useful information, thanks! So, it's not specific to this exact CPU, or to just one kernel version (I'm running Debian 6.1.0-10-amd64), but may be specific to the CPU/GPU family.

I've started doing bisection builds:

2.5.2 - bug not present
2.6.0 beta1 - bug present
2.6.0 alpha3 - bug present
2.6.0 alpha1 - bug present

I'm building 2.6.0 alpha0 overnight, and will narrow things down further from there tomorrow.

@DavidCPlatt
Author

2.6.0 alpha0 - bug not present
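
For reference, the search over releases can be sketched as a plain binary search. This is only an illustration in Python, using the results reported in this thread so far; the `first_bad` helper and the `RESULTS` table are hypothetical names, and in practice each "is it bad?" check is a manual build-and-rotate test.

```python
def first_bad(versions, is_bad):
    """Return the first bad version.

    `versions` is ordered oldest to newest; the first entry is assumed
    good and the last entry is assumed bad.
    """
    lo, hi = 0, len(versions) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return versions[hi]


# Results reported in this thread (True = bug present), oldest first.
RESULTS = {
    "2.5.2": False,
    "2.6.0 alpha0": False,
    "2.6.0 alpha1": True,
    "2.6.0 alpha3": True,
    "2.6.0 beta1": True,
    "2.6.0": True,
}

print(first_bad(list(RESULTS), RESULTS.__getitem__))  # prints "2.6.0 alpha1"
```

`git bisect` automates the same loop at commit granularity once the good and bad endpoints are known.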

@DavidCPlatt
Author

OK, I've managed to narrow down the problem somewhat.

The problem was definitely introduced between 2.6.0 alpha0 and 2.6.0 alpha1.

It appears to be triggered by some code (there were a couple of versions) which tries to make use of a more sophisticated shader program "dashed_thick_lines" when drawing certain on-screen artifacts. The instance which triggers the problem is in GLGizmoRotate::on_render() where it's rendering the "how much has the model been rotated" indicator arc.

The code checks to see if the GL manager believes that the OpenGL implementation supports the core profile. If so, the "dashed_thick_lines" shader is selected. If not (or if core-profile support is disabled at compile time) a simpler "flat" shader is selected.

  • If the "crocus" DRI engine has been loaded, it will either freeze up momentarily, glitch, or crash when the "dashed_thick_lines" shader tries to draw the rotation-indication arc (on this GPU, at least).
  • If I force the use of the old DRI engine (with MESA_LOADER_DRIVER_OVERRIDE=i965) the crash does not occur.
  • If I disable OpenGL acceleration (with MESA_LOADER_DRIVER_OVERRIDE=none) and force the use of software rendering, the crash does not occur.
  • If I modify the code in this one place to force the use of the "flat" shader (e.g. by disabling the "is OpenGL core support enabled?" flag), the crash does not occur.
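
The shader choice described above boils down to a two-flag decision. Here is a minimal Python stand-in for it. This is not the actual C++ in GLGizmoRotate::on_render(); the function and parameter names are hypothetical, and only the two shader names come from the source.

```python
def select_line_shader(core_profile_compiled_in, gl_is_core_profile):
    """Pick the shader name for the rotation-indicator arc.

    "dashed_thick_lines" is used only when core-profile support is both
    enabled at compile time and reported by the GL manager at run time;
    otherwise the simpler "flat" shader is used.
    """
    if core_profile_compiled_in and gl_is_core_profile:
        return "dashed_thick_lines"
    return "flat"
```

Forcing either flag to false, as in the last bullet above, makes this return "flat".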

There are several other places in the code where this shader is used - for example, in Selection, to draw the "selection box corners" around a selected object. I haven't ever seen a crash in this code. In Selection, the lines being drawn are short and straight.

I'd guess that something is going wrong when (1) "dashed_thick_lines" shader is being used, (2) on an old Intel HD GPU, (3) by the crocus DRI engine, (4) to draw a curved arc. I have no good idea whether the shader program is defective (or inefficient) in some way, or whether the crocus engine is at fault in this case.

Vendor: Intel (0x8086)
Device: Mesa Intel(R) HD Graphics 2500 (IVB GT1) (0x152)
Version: 22.3.6
Accelerated: yes
Video memory: 1536MB
Unified memory: yes
Preferred profile: core (0x1)
Max core profile version: 4.2
Max compat profile version: 4.2
Max GLES1 profile version: 1.1
Max GLES[23] profile version: 3.0

I haven't seen any crashes at all with the i965 renderer, with Mesa software rendering, or with the AMDGPU renderer used by the HD 2100 card I ebay'ed a few days ago (a great cheap upgrade for this 10-year-old PC).

So, it seems likely it's a crocus problem, triggered by this one particular shader program, and only in certain instances (e.g. arcs, or arcs with a large number of points?).

Possible workarounds:

  1. Use MESA_LOADER_DRIVER_OVERRIDE=i965 or MESA_LOADER_DRIVER_OVERRIDE=none when launching the app.
  2. Edit src/libslic3r/Technologies.hpp and #define ENABLE_GL_CORE_PROFILE 0 down near the bottom of the file. This will force the use of the flat shader everywhere. Recompile.
  3. Edit src/slic3r/GUI/Gizmos/GLGizmoRotate.cpp, find GLGizmoRotate::on_render(), and change it to always use the "flat" shader even if the core profile is selected. Recompile.

There are some comments elsewhere in the code that the "dashed_thick_lines" shader is too slow to use on Macs... but the 2.6.0 code changes have put it into use in many places in the code. Maybe this is a decision which should be reconsidered?

@DavidCPlatt
Author

The plot thickens.

Looking through the source code for the crocus userland driver code, I found only one abort() call. It occurs if the attempt to send a command to the GPU returns an error. Unfortunately, the code which would print out the specific error code in question is behind an #if DEBUG clause, and is thus disabled in release builds.

On a hunch I did a "journalctl --system | grep i915" and the results are very interesting. On each run of the program I see something like:

Jul 16 17:16:44 worker kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 7:1:85ff9ff8, in slic3r_main [1442467]
Jul 16 17:16:44 worker kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0
Jul 16 17:16:44 worker kernel: i915 0000:00:02.0: [drm] slic3r_main[1442467] context reset due to GPU hang
Jul 16 17:17:04 worker kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 7:1:85ff9ff8, in slic3r_main [1442467]
Jul 16 17:17:04 worker kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0
Jul 16 17:17:04 worker kernel: i915 0000:00:02.0: [drm] slic3r_main[1442467] context reset due to GPU hang
Jul 16 17:17:15 worker kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 7:1:85ff9ff8, in slic3r_main [1442467]
Jul 16 17:17:15 worker kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0
Jul 16 17:17:15 worker kernel: i915 0000:00:02.0: [drm] slic3r_main[1442467] context reset due to GPU hang

The GPU seems to freeze up three times (consistently) on each program run, when I try to click and rotate the model. After the third time, it looks as if the kernel driver rejects the next command, and the crocus code aborts.
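
To check the "three hangs per run" pattern without counting by eye, here is a small Python sketch that tallies GPU HANG lines per process. The regular expression is an assumption based only on the log lines quoted above; other kernel versions may format the message differently.

```python
import re
from collections import Counter

# Pattern inferred from the journal lines quoted above (an assumption).
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9a-fA-F:]+), "
    r"in (?P<proc>\S+) \[(?P<pid>\d+)\]"
)


def count_gpu_hangs(lines):
    """Count GPU HANG messages, keyed by (process name, pid)."""
    hangs = Counter()
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            hangs[(match.group("proc"), int(match.group("pid")))] += 1
    return hangs


# A subset of the lines quoted above: three hangs plus two other messages.
SAMPLE = [
    "Jul 16 17:16:44 worker kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 7:1:85ff9ff8, in slic3r_main [1442467]",
    "Jul 16 17:16:44 worker kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for stopped heartbeat on rcs0",
    "Jul 16 17:17:04 worker kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 7:1:85ff9ff8, in slic3r_main [1442467]",
    "Jul 16 17:17:15 worker kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 7:1:85ff9ff8, in slic3r_main [1442467]",
    "Jul 16 17:17:15 worker kernel: i915 0000:00:02.0: [drm] slic3r_main[1442467] context reset due to GPU hang",
]

print(count_gpu_hangs(SAMPLE)[("slic3r_main", 1442467)])  # prints 3
```

To run it on a live system, pass any iterable of journal lines (for example `sys.stdin`) to `count_gpu_hangs`.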

So, it looks as if something's amiss with the arc-drawing GPU operation and this triggers a timeout and GPU reset. Maybe drawing a many-vertexed arc with the dashed_thick_lines shader is just so slow that it times out? Maybe the fact that this shader program creates a whole bunch of additional vertices on the fly is causing the GPU to barf?

There are a bunch of interesting INTEL_DEBUG outputs in crocus which can be enabled by an environment variable. Dumping out the batches may give a sense for what's happening.

@DavidCPlatt
Author

DavidCPlatt commented Jul 19, 2023

Even more fascinating!

Enabling INTEL_DEBUG=stall (forcing the crocus code to wait for each batch to complete, before starting to work on the next one) does not affect the problem.

Enabling INTEL_DEBUG=no8 or no16 or no32 (restricting use of certain sorts of shader instructions) doesn't affect the problem.

Enabling INTEL_DEBUG=nofc (no fast clear) greatly reduces the severity of the problem. There are still occasional GPU hangs when rotating the STL model on the build plate, but they are much less frequent (maybe once a minute, rather than on every attempt) and I haven't yet seen them result in a crash.

Finally, enabling INTEL_DEBUG=bat (print out an interpretation of the batches) seems to make the problem go away completely. So, we have a Heisenbug on our hands :-)

It's possible that the latter is a timing-related issue - perhaps the delay in doing the debug prints is avoiding a race condition of some sort in the GPU or driver. However, even if I direct the debug output to /dev/null to minimize the delay, the problem remains completely gone.

Interestingly, INTEL_DEBUG=bat has an internal side effect. If you enable it, the crocus code disables the use of the device "shadow copy" feature. It feels to me as if this is likely the source of the Heisenbug effect.

So, it now seems to me as if there's at least one, and perhaps two bugs in the crocus code which supports this particular GPU. Fast clears may not always be working right, and shadow copy may not be working right. I don't see evidence that there's any specific error in the PrusaSlicer code (either the CPU logic, or the specific shader program involved here).

I'll try to pass this along to the Mesa project team. Opened https://gitlab.freedesktop.org/mesa/mesa/-/issues/9392

Best-available workarounds at this point are to force the use of the older Mesa driver for this GPU family, or disable Mesa hardware acceleration completely. Running with INTEL_DEBUG=nofc might also be acceptable, if you want crocus GPU acceleration which mostly works.
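
For convenience, the three workarounds can be collected into ready-to-paste command lines. This is a sketch in Python under stated assumptions: the labels and the `command_line` helper are hypothetical, the binary path is a placeholder, and the environment values are the ones reported in this thread (i965 for the legacy driver as in the original report, `none` for software rendering, `nofc` for crocus without fast clears).

```python
import shlex

# Environment settings reported in this thread; the labels are made up here.
WORKAROUNDS = {
    "legacy-driver": {"MESA_LOADER_DRIVER_OVERRIDE": "i965"},
    "software-rendering": {"MESA_LOADER_DRIVER_OVERRIDE": "none"},
    "crocus-no-fast-clear": {"INTEL_DEBUG": "nofc"},
}


def command_line(workaround, binary):
    """Build a shell command line that applies one workaround to `binary`."""
    env = WORKAROUNDS[workaround]
    assignments = [f"{name}={shlex.quote(value)}" for name, value in sorted(env.items())]
    return " ".join(assignments + [shlex.quote(binary)])


for label in WORKAROUNDS:
    print(command_line(label, "/path/to/prusa-slicer"))
```

Only the first two avoided the hang entirely in the tests above; the `nofc` variant merely made it much rarer.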

@jannekotka

Not sure if it is related or not, but I'm experiencing a lot of system-wide freezing when using PrusaSlicer 2.6.0 on Windows as well. The system periodically (every few seconds) freezes so that even the mouse cursor stops moving. I can reproduce this by opening a dialog to load parts to place and just scrolling down. Once the object is placed, it also happens constantly while just working with the app.

It's quite bad: I have to open the app, then, grinding my teeth, try to slice and quickly close the app to be able to use my computer again. I'm on an AMD 5950X with a 3090 GPU, and it's the only app doing this for me, so I'm fairly confident that it's not a system issue.

@zexcster

Same on Windows 10

@mirlang

mirlang commented Sep 19, 2023

Same on my AMD system; sometimes the GPU freezes completely:

Sep 18 10:19:26 hex kernel: amdgpu 0000:07:00.0: amdgpu: GPU reset(2) failed

@pellcorp

pellcorp commented Oct 9, 2023

This annoyed me so much that I am getting a new computer for my workshop. When I say new, I mean a 6th-gen Intel for $180 AUD; hopefully the problem goes away.

I can confirm the issue does not occur at all on 6th-gen Intel, which is a huge relief. It was so annoying.
