[BUG] Printer + Screen freeze mid print, heaters on #18315

Closed
taragor opened this issue Jun 15, 2020 · 45 comments

Comments

@taragor

taragor commented Jun 15, 2020

Bug Description

I've had the printer freeze randomly during prints since upgrading my mainboard to an SKR mini E3 V2.0 (TMC2209, STM32F1, 24V).
The screen freezes, the printer won't respond to USB commands, and the heaters remain on.
I can't say if the PID is still working, or if the heaters are just going full blast/keeping their temp.
I've had 5 failed prints by now, and all of them failed during curved perimeters, so I think it might be related to that.
However, the issue occurs only once every ~15-20h. It always happened between 1-5h into the print, and retrying the same gcode works.
I can't trigger it deliberately, so trying different configurations is pretty slow.
What didn't fix it for me is:
-disabling Linear Advance
-changing jerk values
-printing from SD/USB
-recompiling/reflashing
-It is not temperature related, I've had one failed print ~1h in, first print that day, others failed after 10+ hours.

The only constants through all my failures were:
-Always (5/5 failed prints) fails during curved perimeters
-Only happens with print speed(cura setting) >=70mm/s (I print most of the time at 70mm/s, however I printed ~20h on 50mm/s and had no fail)

My Configurations

Config.zip

Steps to Reproduce

I can't really reproduce it. Neither printing especially curvy things nor high retraction files (as suggested in #18117) triggers it. However I believe it happens more frequently at higher print speeds.

Expected behavior: [What you expect to happen]

Printer prints file (or online via USB) normally

Actual behavior: [What actually happens]

Printer freezes mid print (both from SD and via USB):
-All movement just stops
-Fans keep spinning
-Screen freezes
-USB becomes unresponsive
-Steppers stay powered
-Heaters remain heated

@qwewer0
Contributor

qwewer0 commented Jun 15, 2020

Try the latest bugfix version.

@XDA-Bam
Contributor

XDA-Bam commented Jun 15, 2020

This issue is probably related to #18226 and a duplicate of #18019

@taragor Please check if I'm correct. If so, please close this and add your info to the existing issue.

@taragor
Author

taragor commented Jun 15, 2020

Thanks, I guess #18226 would at least turn off the heaters. However, I'm unsure if that would fix the root of the issue, since the print won't resume normally after a board reset, will it?
I can't see how this is related to #18019, however.

@XDA-Bam
Contributor

XDA-Bam commented Jun 15, 2020

I can not see how this should be related to #18019 however

It's not. That should have been #18117, but Github switched it 🤷

@taragor
Author

taragor commented Jun 16, 2020

It might be the same issue as #18117, however the suggested steps to reproduce their bug don't work for me.
I've also found #17161 which sounds like being the same issue, except the heaters are probably turned off by Watchdog on LPC176. I've posted a new issue since thinkyhead suggested doing so in that issue.

@minosg

minosg commented Jun 17, 2020

Taragor quick question. What display are you using?

@taragor
Author

taragor commented Jun 17, 2020

I have been using the stock ender 3 display, which is called cr-10 style in Marlin, I think. I just yesterday changed it to BTTs TFT35 E3.
I had no fail with that, but I only printed some small calibration parts which took about 3h.

Do you think it is related to the display?

@minosg

minosg commented Jun 17, 2020

I have been testing for #18117 in the last week. The issue does not appear to be related to the drivers, since they do not assert the DIAG and INDEX pins when the freeze happens.

In order to make it appear more often you need to push your printer by setting considerably higher acceleration at the beginning of your gcode, if your slicer does not do that.

M201 X9000 Y9000 Z500 E5000 ; sets maximum accelerations, mm/sec^2
M203 X500 Y500 Z12 E80 ; sets maximum feedrates, mm/sec
M204 P1000 R1000 T1000 ; sets acceleration (P, T) and retract acceleration (R), mm/sec^2
M205 X10.00 Y10.00 Z0.20 E2.00 ; sets the jerk limits, mm/sec
M205 S0 T0 ; sets the minimum extruding and travel feed rate, mm/sec

With those settings I can consistently crash it in one out of three prints.

Now, onto the issue: I started looking at planner.cpp and something it does called screen throttling. In short, this file is the brains behind a print job. If for some reason it sends an update to the screen and the screen does not respond, it will freeze.

I have yet to verify how and why that may be the case, but I have compiled Marlin without a display, and have been printing from SD using M21 & M24 commands without any issue for two days so far.

That is still a bug that we need to figure out, since 99% of the users do use a display with their printer.

@taragor
Author

taragor commented Jun 19, 2020

Ok, after testing prints with those high acceleration settings I can confirm that they seem to cause freezes more often. Still, all my fails happened on curves, which kind of makes sense, since the times between stepper commands should be way shorter on curves.
I'm now trying prints with display disabled. I've still got the BTT TFT35 connected in BTT mode, yet it acts like a standalone computer connected via UART rather than a standard display in that mode. I'll try printing some things with that configuration and report if any more freezes happen.
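
To put rough numbers on the point about curves, here is a minimal sketch. It assumes the slicer approximates a curve with straight segments of a fixed length; the 0.5 mm used in the note below is purely an illustrative assumption, not a value from this thread, and the function names are hypothetical.

```cpp
#include <cassert>

// Number of straight-line moves per second the firmware has to process
// when a curve is approximated by segments of a fixed length.
// feedrate_mm_s: print speed in mm/s; segment_mm: length of one segment in mm.
double segments_per_second(double feedrate_mm_s, double segment_mm) {
    assert(feedrate_mm_s > 0.0 && segment_mm > 0.0);
    return feedrate_mm_s / segment_mm;
}

// Time available per segment, in milliseconds.
double ms_per_segment(double feedrate_mm_s, double segment_mm) {
    return 1000.0 / segments_per_second(feedrate_mm_s, segment_mm);
}
```

Under that assumption, 70 mm/s works out to 140 segments per second (about 7.1 ms each), while 50 mm/s gives 100 segments per second (10 ms each). A single 20 mm straight move at 70 mm/s, by comparison, lasts roughly 286 ms.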

@boelle
Contributor

boelle commented Jun 21, 2020

@taragor still an issue?

@taragor
Author

taragor commented Jun 21, 2020

Yes, after some further testing it seems @minosg is totally right and this is a UART race condition, since I couldn't print for longer than 2 hours without freezes using the suggested settings (i.e. high acceleration, high speeds etc.). However, after disabling my display and printing headless off the onboard SD I printed without any freezes for 10+ hours using said settings.
I've also found, that printing online using just one serial connection (in addition to the Stepper drivers) seems to be working fine too, which kind of makes sense, since this connection is somewhat synchronized with the printers movement.

@boelle
Contributor

boelle commented Jun 21, 2020

hmmm just guessing but could it be noise and too long display cables?

@minosg

minosg commented Jun 21, 2020

@boelle cables shouldn't matter. In my traces I can see the ISR firing again before it has completed, never returning to thread mode. The NVIC register is also in a constant firing state.

If a cable were responsible, it would also happen when jiggling the cable, even at slower speeds. The correlation between planner overload and the freeze indicates a race condition.

Also, it usually happens in two places: the idle handler, where it tries to update the progress, and the host keep alive transmit. In both cases a USART interrupt lands when the logic is already in exception mode.

@taragor
Author

taragor commented Jun 21, 2020

I also thought about that, obviously the motors are creating some electric/magnetic fields that could cause glitches in the cables. Yet I'm using the cables that came with my Ender 3 Pro and had no issues while using the stock melzi.
Also, I just had a freeze with no display enabled/connected and printing live via USB, which should be relatively tolerant of EM interference.
I've also had a freeze with apparently a thermal runaway. The bed was at 85C (measured externally), despite being set to 60C, when I caught it. This in combination with the disabled Watchdog on STM32F1 (#18226) is a fire hazard.

EDIT: The thermal runaway might be a one-off. I've got a feeling my last firmware.bin might have been corrupted, since I'm seeing all kinds of strange behavior (diagonal movements being executed first x, then y; random "Unknown command G1", etc...). I'll recompile and reflash and report back.

Edit 2: On second thought I might just checkout current bugfix again since I'm now getting strange compiler errors about neopixel library not being found

@taragor
Author

taragor commented Jun 21, 2020

OK so I've just compiled marlin from scratch, using new config files, and now the printer is working again. I haven't had a freeze yet, but I'm still printing with only one serial connection to my BTT TFT35 in BTT mode, which is acting like pronterface or any other live print host in this mode, so I don't really expect to see too many freezes that way.
However, the compiler error when trying to enable #define NEOPIXEL_LED still persists in the newly pulled branch. When disabling the NeoPixel support it builds just fine. I'll do some more testing and open a new issue if that bug persists, since it's totally unrelated to this.

@ellensp
Contributor

ellensp commented Jun 21, 2020

neopixel library is disabled... in platformio.ini for STM32F103RC_btt boards
In [common_stm32f1]

lib_ignore    =
  Adafruit NeoPixel

I don't know why... probably as they are too much of a load on the processor.. but it does compile if you enable it.

@sjasonsmith
Contributor

I don't know why... probably as they are too much of a load on the processor.. but it does compile if you enable it.

It probably caused problems at some point and was disabled. If someone actually sets up NeoPixels and verifies it works, they would be welcome to post a Pull Request to change it.

@minosg

minosg commented Jun 24, 2020

Neopixels are tricky on this board. This board is timer limited, and the neopixel's pin timer is needed. A recent change changed the Servo8 timer to the same one used by the NeoPixel so that can cause the probe to crash. In the past this was used for tone.cpp generation.

It can be made to work, but only in a non-standard way, and users should be aware of the risks, so I suspect this is why it was disabled at some stage.

@taragor
Author

taragor commented Jun 24, 2020

I got it to work using this fork of the neopixel lib: https://github.com/ccccmagicboy/Adafruit_NeoPixel. It's the one used in the STM32F103RC_meeb build config. It requires some personalization of the lib though, since it uses Marlin's delay.h for timing.
The neopixel lib by BTT also works, yet some pixels are getting wrong colors, which is no surprise since their approach uses bit-banging and hardcoded NOP counting.
Adafruit's lib does not build at all for STM32F1 since it's using GPIO LL functions which are not available in the arduino core for this platform. I'll try to figure out what the issue there is, since the implementation looks sound otherwise.

@rmangino

OK so I've just compiled marlin from scratch, using new config files, and now the printer is working again.

I am running into the same exact issue with the same hardware. What version of the code did you use... bug fix, latest release, development, etc.? Thank you

@taragor
Author

taragor commented Jun 25, 2020

@rmangino I'm sorry, I think I was somewhat unclear there. I've got the printer to run again as before so the initial issue persists. I still have crashes with the display enabled in marlin. What I meant was that the constant crashes when even moving only one motor went away. I suspect my firmware.bin was corrupted during copying.
For now I'm using the current bugfix with the following config:
-display support disabled in marlin
-BTT TFT35 connected only using the TFT header, so it's only running in BTT mode (I'm using this for control and as printhost for prints that require pauses/filament changes)
-Octoprint disconnected (I've had freezes with octopi connected via USB, but I guess you could use octoprint or pronterface instead of the TFT35.)

It's been stable so far for me, with no freezes in 30+ hours of printing. However, as soon as I enable display support in marlin I start seeing freezes again. Also, using more than just the 2 serial connections (TMC2209 + TFT35) gives me occasional freezes.

I believe this issue is caused by the serial race condition discussed in #18358.

@minosg

minosg commented Jun 25, 2020

@taragor

Are you seeing freezes using the TFT35 in the EXP2 port, using the thick header cable? Does this screen work without display support in Marlin enabled?

@taragor
Author

taragor commented Jun 25, 2020

@minosg TFT35 is a dual mode display, you can switch between modes by keeping the dial pressed. It has its own MCU (funnily enough it is also an STM32F1) and firmware. The modes are:

  • Touch (or BTT) mode: the screen is connected using a 5-pin cable (+5V; GND; TX; RX; RST) from the RS232 port on the screen to the TFT port on the SKR mini. In this mode the screen behaves like a computer running pronterface or octoprint: you can move axes and change settings, but it's all done by sending gcodes via UART to marlin. You can also start prints off the printer's onboard SD, or use the screen's SD slot or USB stick to host prints via the serial connection.

  • LCD12864 Emulation (or Marlin) mode: here the screen emulates the stock Ender 3 screen (or similar ones). You have two options:
    You can either connect the EXP-3 on the screen (2x5 pin ribbon) to the EXP1 on the SKR mini (you need to enable #define CR10_STOCKDISPLAY)
    or use EXP-1 + EXP-2 on the screen (both 2x5 pin ribbon) with boards like the SKR pro (I don't know what you need to enable for that, since I don't have a board that supports this)

When the screen is connected using the EXP-3 connector (Marlin mode, emulating the stock screen) and #define CR10_STOCKDISPLAY is enabled, I get freezes, no matter what else is connected.
When the screen is connected using only the TFT cable (Touch mode, the screen operates like pronterface), with #define CR10_STOCKDISPLAY disabled and octoprint connected via USB, I get freezes. However, that only happened twice for me; I will do more testing with that setup.
With the screen connected using only the TFT cable, #define CR10_STOCKDISPLAY disabled and USB disconnected, I haven't had any freezes in 30+ hours of printing. It works both when printing off the printer's own SD and when the screen hosts the print from its USB drive or SD card.

@minosg

minosg commented Jun 25, 2020

@taragor what you are describing fits my hypothesis about the cause of the issue being timing related.

When you compile out cr10 display support, the ui.update() logic, which is called by Idle() as the planner moves between blocks, is faster.

Similarly, when you are printing with octoprint there is a timer interrupt firing, which is the host keep alive message.

When using the serial header on the board, I suspect you are printing by injecting gcode commands like SD does, so the host keep alive should not be triggering (usb cdc) and ui.update() is compiled out, so the time between critical section transitions in the block buffer is fixed.

Just to verify we are on the right path you could try one more test. Try disabling host keep alive in the config, disable the cr10 display (meaning you compile without a marlin display) and print from octoprint. It shouldn't freeze.

@taragor
Author

taragor commented Jun 25, 2020

@minosg should I disconnect the screen completely for that or should I leave the TFT cable connected?

@minosg

minosg commented Jun 25, 2020

If it's compiled out it shouldn't matter. This is clearly a software bug.

@taragor
Author

taragor commented Jun 25, 2020

With the TFT cable connected it will run in BTT mode even with CR-10 support disabled, since it's running like any other serial host (i.e. pronterface)

@rmangino

@rmangino I'm sorry, I think I was somewhat unclear there. I've got the printer to run again as before so the initial issue persists. I still have crashes with the display enabled in marlin. What I meant was that the constant crashes when even moving only one motor went away. I suspect my firmware.bin was corrupted during copying.
For now I'm using the current bugfix with the following config:
-display support disabled in marlin
-BTT TFT35 connected only using the TFT header, so it's only running in BTT mode (I'm using this for control and as printhost for prints that require pauses/filament changes)
-Octoprint disconnected (I've had freezes with octopi connected via USB, but I guess you could use octoprint or pronterface instead of the TFT35.)

I believe this issue is caused by the serial race condition discussed in #18358.

@taragor Thank you so much for your response/clarifications. This is the first time I've built Marlin so this is all very new to me. What I know for certain is that my stock Ender 3 Pro never hung in 3 months of very heavy use. With the SKR Mini E3 v2.0 (running the bugfix code) I can't print for more than a few hours without the printer locking up (I'm using Octoprint). I'm still using the stock LCD that came with the printer. The only thing I've added (and enabled in Marlin) is a v3 BLTouch.

@taragor
Author

taragor commented Jun 25, 2020

@rmangino

Thank you so much for your response/clarifications. This is the first time I've built Marlin so this is all very new to me. What I know for certain is that my stock Ender 3 Pro never hung in 3 months of very heavy use. With the SKR Mini E3 v2.0 (running the bugfix code) I can't print for more than a few hours without the printer locking up (I'm using Octoprint). I'm still using the stock LCD that came with the printer. The only thing I've added (and enabled in Marlin) is a v3 BLTouch.

That was exactly my experience too. From what I managed to gather, this issue has come up a few times but then got closed due to a lack of activity. Minosg is the first one who came up with an explanation of what happens, and thanks to his work I think this bug can be fixed soon.
In the meantime the only advice I can offer is to try and disable the display in marlin and either connect to the SKR using the TFT port and a Raspberry's GPIO (if you are using that, since that should work like the TFT35) or try and disable host keep alive, like minosg suggested. I haven't tried to do so, but I will run some tests over the weekend.

@minosg

minosg commented Jun 25, 2020

@taragor

I would not be holding my breath about this bug being fixed soon. It is an extremely nasty bug, which is hard to reproduce. It took us two weeks to figure out exactly how to make it happen reliably. The core reason it went unnoticed is that UART-based drivers were not that common, and even last year's boards used to run the TMC2208 in legacy mode. Now, with the industry moving on and new boards using UART for drivers, you end up with 4 to 6 more interrupts in the system and absolutely no safeguards or guarantees that the time-critical parts of the code are being respected, so you keep seeing a new bug ticket related to it every week.

The planner and stepper code are complicated: thousands of lines of code with a high level of physics and math involved. My knowledge of Marlin is limited to a couple of weeks, and making any changes could affect a lot of existing users. Our best bet at the moment is to isolate the specific conditions that trigger this deadlock and hope to fix it using something already in the code which is non-destructive, like a planner.synchronize()

@rmangino

rmangino commented Jun 26, 2020

@taragor

Just to verify we are on the right path you could try one more test. Try disabling host keep alive in the config, disable the cr10 display (meaning you compile without a marlin display) and print from octoprint. It shouldn't freeze.

Sorry that I am so new to Marlin but I'd like to be sure I'm building the firmware correctly based upon the above suggestions. Btw, I am printing from Octoprint.

In Configuration.h I have:

//
// Host Keepalive
//
// When enabled Marlin will send a busy status message to the host
// every couple of seconds when it can't accept commands.
//
//#define HOST_KEEPALIVE_FEATURE        // Disable this if your host doesn't like keepalive messages
//#define DEFAULT_KEEPALIVE_INTERVAL 2  // Number of seconds between "busy" messages. Set with M113.
//#define BUSY_WHILE_HEATING            // Some hosts require "busy" messages even during heating

and

//
// Factory display for Creality CR-10
// https://www.aliexpress.com/item/32833148327.html
//
// This is RAMPS-compatible using a single 10-pin connector.
// (For CR-10 owners who want to replace the Melzi Creality board but retain the display)
//
//#define CR10_STOCKDISPLAY

Are those changes sufficient? Thank you in advance.

@minosg

minosg commented Jun 26, 2020

Yes that should disable keep alive and display. Try to see if that stops the freezes with octoprint

@taragor
Author

taragor commented Jun 28, 2020

Ok, I've tested live printing from octoprint via USB with #define CR10_STOCKDISPLAY and host keep alive disabled. I haven't had a crash in ~8h of printing. I've tried with the TFT35 both disconnected and connected, and both worked without issues.
I'm also currently testing whether changing the resolution settings to limit the minimum distance of a single move, with the screen enabled, reduces the freezes. I suspect the freezes are related to the "frequency" of moves, since that should be what triggers the interrupts. I know that the bug would still exist, but that might at least reduce the probability of freezes happening. I'll report if it helps at all, once I've run some more tests.

@minosg

minosg commented Jun 28, 2020

@taragor The planner contains very timing-sensitive logic at the end of a move. A move is stored as a chain of micro moves in a ring buffer. At the end of each move there is buffer-reset logic which is a bit sensitive, so if an interrupt fires at this point all hell breaks loose. This is why you see it more often when printing curves (a curve is sliced as a series of very small linear moves).

You got it working with the screen enabled and host keep alive disabled? Was that using the default 115200 baud rate? It could just be a case of not having triggered it yet, since you need the interrupts to fire in a very narrow window. Can you try the same setup with Malyan_LCD which is fixed at 50.000 Baud?
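
As a rough illustration of why an interrupt landing in the middle of a ring-buffer update is a problem, here is a toy model. It is not Marlin's planner or libmaple code: the struct, its fields and the simulated interrupt are all hypothetical, and the "interrupt" is just a function call placed between the two steps of the update.

```cpp
#include <cassert>
#include <cstddef>

// Toy model only: hypothetical names, not Marlin's planner or libmaple.
struct ToyQueue {
    static constexpr std::size_t kSize = 4;
    std::size_t tail = 0;              // index of the oldest queued block
    std::size_t count = 0;             // number of queued blocks
    bool irq_masked = false;           // simulated "interrupts disabled" flag
    bool irq_pending = false;          // request latched while masked
    std::size_t head_seen_by_isr = 0;  // what the handler computed last

    // Simulated interrupt handler: derives the head index from tail + count.
    void isr() { head_seen_by_isr = (tail + count) % kSize; }

    // Simulate an interrupt request arriving right now.
    void raise_irq() {
        if (irq_masked) { irq_pending = true; return; }
        isr();
    }

    void push() { assert(count < kSize); ++count; }

    // Retire the oldest block in two steps. An interrupt landing between
    // them sees tail already advanced but count not yet reduced.
    void discard_unguarded() {
        assert(count > 0);
        tail = (tail + 1) % kSize;
        raise_irq();   // interrupt lands mid-update
        --count;
    }

    // Same update, with the interrupt held off until both steps are done.
    void discard_guarded() {
        assert(count > 0);
        irq_masked = true;
        tail = (tail + 1) % kSize;
        raise_irq();   // request is latched, not serviced yet
        --count;
        irq_masked = false;
        if (irq_pending) { irq_pending = false; isr(); }
    }
};
```

With two blocks queued, the real head index is 2. After `discard_unguarded()` the handler has computed 3, because it ran on a half-updated state; after `discard_guarded()` it computes 2 again. Whether the actual freeze follows this exact pattern is not established in this thread; the sketch only shows the general class of problem being described.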

@rmangino

rmangino commented Jun 28, 2020

I'd also like to add that I was able to complete two, 12-hour prints successfully. These are my first successful prints since upgrading to the SKR Mini e3 v2.0 + bugfix branch. I'm using a build with display and keep_alives disabled (printing via Octoprint USB).

@minosg

minosg commented Jun 28, 2020

Can anyone confirm if they can get away with printing from LCD with disabled host_keep_alive? And if so which LCD driver are you using?

@taragor
Author

taragor commented Jun 28, 2020

@minosg I'm not using any display support from marlin (#define CR10_STOCKDISPLAY and all other displays disabled). However I have my TFT35 connected using just a single Serial connection (connected to the TFT header on my SKR mini, in marlin the port is defined by #define SERIAL_PORT 2), so the screen works in touch mode only. This means that the display runs off its own MCU, and is communicating with the printer using gcodes via said serial connection, like pronterface for example.
I can print live with the TFT35 hosting the print (again it just sends GCODES like pronterface would), and I can print live from octoprint which is connected to the USB port (defined in marlin by #define SERIAL_PORT_2 -1), however I only had crashes when printing from Octoprint with host keep alive enabled. Since I disabled host keep alive I can print live using Octoprint without any crashes.
During all tests I had the TFT35 connected using only the TFT header, so EXP3 is disconnected and #define CR10_STOCKDISPLAY is disabled. When printing live the TFT35 doesn't really do much (it only shows the temperatures and its main menu), since it's just as if you had a second Pronterface connected while printing from the first one.
Both serial connections (Serial 1 and 2) are running at 115200 baud, I don't currently know what speed the serial connection to the steppers is running at, but I didn't change that so I guess it's 115200.
If you need any more info/testing, just let me know.
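
For reference, the setup described above would correspond roughly to the following Configuration.h excerpt. The port numbers and baud rate are the ones quoted in this thread, and the commented-out lines reflect the test runs without keepalive and without Marlin display support. Treat it as a sketch and check it against your own board's configuration before using it.

```cpp
// Configuration.h (excerpt), values as described in this thread
#define SERIAL_PORT 2        // TFT header on the SKR mini (TFT35 in touch mode)
#define SERIAL_PORT_2 -1     // USB port (OctoPrint or Pronterface)
#define BAUDRATE 115200      // speed reported for both host connections

//#define HOST_KEEPALIVE_FEATURE   // left disabled for these test runs
//#define CR10_STOCKDISPLAY        // no Marlin display support compiled in
```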

@rmangino

Can anyone confirm if they can get away with printing from LCD with disabled host_keep_alive? And if so which LCD driver are you using?

My original firmware config had host_keep_alive disabled and CR10_STOCKDISPLAY defined. That was the build that locked up every few hours.

@minosg

minosg commented Jul 1, 2020

To anyone following this thread: I have posted a possible workaround for this issue on #18358. It involves applying a minor patch to your libmaple library. Please test it if you would like to confirm the findings of that solution.

@rmangino

rmangino commented Jul 2, 2020

@minosg You are suggesting that we disable MONITOR_DRIVER_STATUS? There are no other modifications?

@minosg

minosg commented Jul 2, 2020

Nope. Look a bit higher. You need to patch the usart handler in libmaple.

@rmangino

rmangino commented Jul 2, 2020

Thank you. For anyone else - here is a direct link to his suggestion.

@github-actions

github-actions bot commented Aug 2, 2020

This issue is stale because it has been open 30 days with no activity. Remove stale label / comment or this will be closed in 5 days.

@sjasonsmith
Contributor

Duplicate of #18358

@sjasonsmith sjasonsmith marked this as a duplicate of #18358 Aug 8, 2020
@github-actions

github-actions bot commented Oct 8, 2020

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked and limited conversation to collaborators Oct 8, 2020

8 participants