Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nested interrupts corrupt LR ( AKA printer freezes mid print with heaters on ) #18358

Closed
minosg opened this issue Jun 19, 2020 · 150 comments
Closed
Assignees

Comments

@minosg
Copy link

minosg commented Jun 19, 2020

Bug Description

This ticket is aimed to report a correlation on a common but hard to reproduce issue across a variety of unrelated users and configurations. I have been following older tickets and running experiments on a spare machine of mine using the BTT_SKR_MINI board 1.2 and Malyan_LCD

#18117
#18315
#17161

Possibly related
#18356

On STM32F1 platforms the watchdog is disabled which explains why it freezes in that mode, but I do no think the issue is limited to them
#18226

My Configurations

configurations.zip

Steps to Reproduce

  1. Download any latest marlin version 2.5.0X or latest bugfix
  2. Use a board with TMC steppers in serial mode
  3. Enable an LCD controller with UART connection, MaylanLCD is perferct because of the
    high baud-rate 50000 it requires. ( It makes it more likely to occur ). If your display allows you to
    change the baud rate increase it to its maximum
  4. Enable Linear advance, classic jerk and set your extruder Ejerk to something low like 2.0. This is aimed at increasing the stress on timing tolerances of planner.cpp
  5. Print a high retraction model (Gecko) with high acceleration settings (la 02, EJerk 2.0, max extruder acceleration 5000+, retract on layer change & wipe)
  6. Keep on printing until it freezes. It should never reach 5 times, most of the times it will happen in the first three prints.

Expected behavior:

I expect all the prints to complete

Actual behavior:
At random occurrences the printer freezes with heaters on, and if watchdog is enabled it just resets

Additional Information

As discussed before #18177 , I have confirmed that most of the previous reported workarounds, while they make the issue less frequent, will not fix it.
I have confirmed that it freezes when

  • Try older release tags of marlin.
  • Reduce the TMC serial speed to 115200. Mini e3 is using software serial which is not ideal.
  • Enable DISABLE_MULTI_STEPPING
  • Try with a higher watt power supply ( 220W )
  • Work with a higher voltage threshold 24V instead of 12V ( that is what you just confirmed )
  • Disable steathchop and run the drivers in Spreadcycle mode ( less frequent but it happens )
  • Disable Linear Advance, Hybrid threshold altogether
  • Set minimum square pulse to 2, and 4
  • Enable Square Wave Stepping
  • Enabled velocity ramping by increasing the ihold_delay using tmc_adv()
  • Reduce the uart baud rate of the TMC steppers.
  • Move TMC steppers on hardware uart
  • Enabled disable flow control, and crc on SD card
  • Changing the priority of the TMC stepper to 0

The only common occurrence across all tickets submitted with related behavior in the last 6 months are:

  • TMC steppers are running in UART mode
  • Screen is running in UART mode

I have not been able to locate a single report of it happening with steppers on standalone legacy mode.

What I believe is the actually cause of this issue:
Theory 1
An interrupt kicks in when already in exception mode. Lr 0xfffffff9 indicates a EXC_RETURN state. When trying to recover the original LR it has been corrupted and the code lands in an undefined space

Theory 2:
It could also be possible that the UartInterrupt flag for the screen uart is not properly cleared and the system locks in this loop of libmaple's usart_private.h

static inline __always_inline void usart_irq(ring_buffer *rb, ring_buffer *wb, usart_reg_map *regs) {
    /* Handling RXNEIE and TXEIE interrupts. 
     * RXNE signifies availability of a byte in DR.
     *
     * See table 198 (sec 27.4, p809) in STM document RM0008 rev 15.
     * We enable RXNEIE. */
    if ((regs->CR1 & USART_CR1_RXNEIE) && (regs->SR & USART_SR_RXNE)) {
#ifdef USART_SAFE_INSERT
        /* If the buffer is full and the user defines USART_SAFE_INSERT,
         * ignore new bytes. */
        rb_safe_insert(rb, (uint8)regs->DR);
#else
        /* By default, push bytes around in the ring buffer. */
        rb_push_insert(rb, (uint8)regs->DR);
#endif
    }
    /* TXE signifies readiness to send a byte to DR. */
    if ((regs->CR1 & USART_CR1_TXEIE) && (regs->SR & USART_SR_TXE)) {
        if (!rb_is_empty(wb))
            regs->DR=rb_remove(wb);
        else
            regs->CR1 &= ~((uint32)USART_CR1_TXEIE); // disable TXEIE
    }
}

Consecutive triggering of the interrupt could cause stack corruption, resulting in a improperly formatted LR register

Last debug session call stack when the incident occurred

__irq_usart1@0x080093f0 (c:\Users\Minoas\.platformio\packages\framework-arduinoststm32-maple\STM32F1\cores\maple\libmaple\usart_f1.c:204)
<signal handler called>@0xfffffff9 (Unknown Source:0)
ExtUI::getProgress_percent@0x0801cfd0 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\lcd\extui\ui_api.cpp:809)
ExtUI::onIdle@0x0801dabc (c:\Users\Minoas\bigtree\Marlin\Marlin\src\lcd\extui_malyan_lcd.cpp:476)
MarlinUI::update@0x0801d1c8 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\lcd\extui\ui_api.cpp:1064)
idle@0x0800c7d4 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\MarlinCore.cpp:718)
Planner::get_next_free_block@0x08021b2a (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\planner.h:649)
Planner::_buffer_steps@0x08021b2a (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\planner.cpp:1673)
Planner::buffer_segment@0x08021cf2 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\planner.cpp:2709)
Planner::buffer_segment@0x08021d7c (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\planner.h:743)
Planner::buffer_line@0x08021d7c (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\planner.cpp:2781)
Planner::buffer_line@0x08020028 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\planner.h:774)
segmented_line_to_destination@0x08020028 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\motion.cpp:874)
line_to_destination_cartesian@0x08020028 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\motion.cpp:899)
prepare_line_to_destination@0x08020028 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\module\motion.cpp:1087)
GcodeSuite::G0_G1@0x08019e64 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\gcode\motion\G0_G1.cpp:105)
GcodeSuite::process_parsed_command@0x08018cdc (c:\Users\Minoas\bigtree\Marlin\Marlin\src\gcode\gcode.cpp:245)
GcodeSuite::process_next_command@0x0801921c (c:\Users\Minoas\bigtree\Marlin\Marlin\src\gcode\gcode.cpp:935)
GCodeQueue::advance@0x0801b6b2 (c:\Users\Minoas\bigtree\Marlin\Marlin\src\gcode\queue.cpp:649)
loop@0x0800c89c (c:\Users\Minoas\bigtree\Marlin\Marlin\src\MarlinCore.cpp:1226)
main@0x0800ae8a (c:\Users\Minoas\.platformio\packages\framework-arduinoststm32-maple\STM32F1\cores\maple\main.cpp:41)

Registers state

sp = 0x2000be50
pc = 0x080093f0
lr = 0xfffffff9
psp = 0xcc1fd348
mps = 0x2000be50
XpSR = 0x41000035
primask = basepri = faultmask = control = 0x0

Workaround/Patches

The issue has been correctly identified and can be mitigated by patching your local maple libary, usually residing inside the platform folder

In windows the location of the file which is needed to be edited is:
C:\Users\YOURUSERNAME.platformio\packages\framework-arduinoststm32-maple\STM32F1\system\libmaple\usart_private.h

In MacOS open a terminal and use the following command to find your platformio folder
open ~/.platformio

In Linux it should be in the same location
cd~/.platformio

The callback for usart in usart_private.h needs to be replaced by either patch

PATCH 1


static inline __always_inline void usart_irq(ring_buffer *rb, ring_buffer *wb, usart_reg_map *regs) {
    /* Handling RXNEIE and TXEIE interrupts. 
     * RXNE signifies availability of a byte in DR.
     *
     * See table 198 (sec 27.4, p809) in STM document RM0008 rev 15.
     * We enable RXNEIE. */
    volatile uint8_t moufa;
    if ((regs->CR1 & USART_CR1_RXNEIE) && (regs->SR & USART_SR_RXNE)) {
#ifdef USART_SAFE_INSERT
        /* If the buffer is full and the user defines USART_SAFE_INSERT,
         * ignore new bytes. */
        rb_safe_insert(rb, (uint8)regs->DR);
#else
        /* By default, push bytes around in the ring buffer. */
        rb_push_insert(rb, (uint8)regs->DR);
#endif
    }
    /* TXE signifies readiness to send a byte to DR. */
    if ((regs->CR1 & USART_CR1_TXEIE) && (regs->SR & USART_SR_TXE)) {
        if (!rb_is_empty(wb))
            regs->DR=rb_remove(wb);
        else
            regs->CR1 &= ~((uint32)USART_CR1_TXEIE); // disable TXEIE
    }

    if ((regs->CR1 & USART_CR1_RXNEIE) && (regs->SR & USART_SR_ORE)) {
        moufa = (uint8)regs->DR;
    }
}

PATCH 2

static inline __always_inline void usart_irq(ring_buffer *rb, ring_buffer *wb, usart_reg_map *regs) {
    /* Handling RXNEIE and TXEIE interrupts. 
     * RXNE signifies availability of a byte in DR.
     *
     * See table 198 (sec 27.4, p809) in STM document RM0008 rev 15.
     * We enable RXNEIE. */

    if ((regs->CR1 & USART_CR1_RXNEIE) && (regs->SR & USART_SR_RXNE)) {
#ifdef USART_SAFE_INSERT
        /* If the buffer is full and the user defines USART_SAFE_INSERT,
         * ignore new bytes. */
        rb_safe_insert(rb, (uint8)regs->DR);
#else
        /* By default, push bytes around in the ring buffer. */
        rb_push_insert(rb, (uint8)regs->DR);
#endif
    }
    /* TXE signifies readiness to send a byte to DR. */
    if ((regs->CR1 & USART_CR1_TXEIE) && (regs->SR & USART_SR_TXE)) {
        if (!rb_is_empty(wb))
            regs->DR=rb_remove(wb);
        else
            regs->CR1 &= ~((uint32)USART_CR1_TXEIE); // disable TXEIE
    }

    if ((regs->CR1 & USART_CR1_RXNEIE) && (regs->SR & USART_SR_ORE)) {
        #ifdef USART_SAFE_INSERT
                /* If the buffer is full and the user defines USART_SAFE_INSERT,
                 * ignore new bytes. */
                rb_safe_insert(rb, (uint8)regs->DR);
        #else
                /* By default, push bytes around in the ring buffer. */
                rb_push_insert(rb, (uint8)regs->DR);
        #endif
    }

}

How to choose:

  • The first patch has been tested and confirmed to be working by most users. If you are not sure of which to apply choose that
  • The second patch is what I have been using myself recently, since it has a slightly higher probability to catch a new line character than the first patch. Use if you are adventurous
@minosg
Copy link
Author

minosg commented Jun 19, 2020

In addition to that the ExtUI::getProgress_percent() appears to be invoking MarlinUI::progress_t MarlinUI::_get_progress() which when printing from SD will be calling this slow method

        #if ENABLED(SDSUPPORT)
          ?: TERN(HAS_PRINT_PROGRESS_PERMYRIAD, card.permyriadDone(), card.percentDone())
        #endif

I could be wrong since I am not familiar with UI but it appears as if the code is invoking an SD access method which is relatively slow, and hoping that the TMC stepper interrupt will not trigger in between?

@minosg minosg changed the title Extui: Nested interrupts corrupt LR ( AKA printer freezes mid print with heaters on ) Nested interrupts corrupt LR ( AKA printer freezes mid print with heaters on ) Jun 19, 2020
@boelle
Copy link
Contributor

boelle commented Jun 21, 2020

@minosg still an issue?

@minosg
Copy link
Author

minosg commented Jun 21, 2020

Yes that's still an issue.

The core appears to be dying by recursive interrupt uart calls. It's not something that can be fixed by a config change.

I am using this ticket to provide more technical info of the issue, or we will keep on seeing random my printer freezes bugs.

I will be tracing that behaviour next week. If there is a limited time a ticket is allowed to be open feel free to close it and we can continue to chat on it anyway. Thanks

@boelle
Copy link
Contributor

boelle commented Jun 23, 2020

Please test the bugfix-2.0.x branch to see where it stands.

@minosg
Copy link
Author

minosg commented Jun 24, 2020

@boelle I have tested with latest bugfix and the issue is still here.

@minosg
Copy link
Author

minosg commented Jun 24, 2020

Theory 3:

I have been testing this board using a debugger. The original theory that the LR Is corrupted seems less likely and I believe it is more of an interrupt firing issue.

I believe it was introduced by this pr #14030 , but need deeper understanding of the stepper class.

What happens when the printer freezes.

A freeze will become evident when obviously the printer is no longer moving. The logic is NOT stuck in exception mode, and the interrupt will be returning to the main thread. The code snippet that is of interest is this part

https://github.com/MarlinFirmware/Marlin/blob/2.0.x/Marlin/src/module/planner.h

FORCE_INLINE static uint8_t moves_free() { return BLOCK_BUFFER_SIZE -1 - movesplanned(); }
FORCE_INLINE static uint8_t movesplanned() { return BLOCK_MOD(block_buffer_head - block_buffer_tail); }
FORCE_INLINE static block_t* get_next_free_block(uint8_t &next_buffer_head, const uint8_t count=1) {

  // Wait until there are enough slots free
  while (moves_free() < count) { idle(); }

  // Return the first available block
  next_buffer_head = next_block_index(block_buffer_head);
  return &block_buffer[block_buffer_head];
}

Which causes the UI update loop to trigger, and the lcd uart to keep on firing. The problem appears to be that this while loop is not breaking .

Chasing the logic the block_tail is only updated in the following method loop which appears is no longer firing

Stepper::pulse_phase_isr()-> discard_current_block -> release_current_block
Stepper::block_phase_isr()-> discard_current_block -> release_current_block

The interrupt configuration on this platform is the following

TMC_STepper SoftwareSerial Timer4
STEP_TIMER_NUM             Timer5
PULSE_TIMER_NUM            Timer5
TEMP_TIMER_NUM             Timer2
SERVO0_TIMER_NUM           Timer8

NVIC_SetPriorityGrouping(0x3);
#define STEP_TIMER_IRQ_PRIO 1
#define TEMP_TIMER_IRQ_PRIO 2

Which appear to be enabled during the time of the freeze

NVIC-ISER0 0x501001c0
NVIC-ISER1 0x00044160

And the interrupt priorities appear to be in order

IPR7_N0 ->  0x20
IPR12_N2-> 0x10
IPR9_N1 -> 0xd0

I have also checked timer5 and the CC register as well as the count register and the timer appears to be counting as expected.

Further notes

This platform appears to not be using instruction and memory barriers for interrupts. I do not know if there is a design choice behind that, but the pull request I quoted is enabling and disabling timers using CPSID, CPSIE instructions, with no guarantees that they are actually in the state that we want them to.

I am not certain on the priority grouping setting. Considering that I cannot see any use of sub-priorities in Marlin Core, are they being used in Mapple or should it be set to 0x4 anyway? Is that code a snippet of when we were using Autopreload with the timers instead of resetting them?

When the board freezes most interrupts but the lcd and the temperature stop firing. This is also very bizarre behavior which I am trying to track down. This is also a very happy coincidence, since if the temperature interrupts dies the hotend wil cause meltdown/fire so its a considerable safety risk. I have disabled interrupt grouping and set the uasart1 ( lcd) interrupt to have the lowest priority (15) and yet it keeps on firing, which indicates there is no interrupt preemption or starvation happening.

What I suspect happens is because of too many interrupts firing , ( TMC set in UART modes individually and not chained in a single line ) something in the stepper.cpp logic mini scheduler is falling out of its timing bounds. The problem is that when the issue happens is way before it becomes evident in the logic. so a longer trace is needed which is not supported by st-link.

To be continued.

@kind3r
Copy link

kind3r commented Jun 24, 2020

I think the issue is not STM32 only as I have a LPC1768 and seing similar symptoms since I added a BTT TFT 35 with Marlin mode LCD to my MKS SGEN L board with TMC2209s in UART mode.

My setup is as follows:

  • Octoprint running on a PI4 connected to TFT 35 (uart5 on PI to uart3 on TFT)
  • TFT connected to MKS SGEN L board via:
    • RS232 port on the TFT to AUX1 port on SGEN L
    • EXP3-CR10 port on the TFT to EXP1 port on the SGEN L

The TFT proxies the commands from the PI to the board. I have slightly modified the BTT firmware to parse the M117 commands and display them in TFT mode.

What I have discovered so far in my tests (granted not yet exhaustive):

  • Higher acceleration produces a more reliable crash
  • Marlin with LCD support with or without the LCD cable connected crashes (4 prints crashed 1h-2h in)
  • Marlin without LCD support seems to work fine (1 print so far 3h45m)
  • When the crash occurs, the serial port on the board becomes unresponsive but if I use the TFT's disconnect/reconnect option it resumes printing (if you are fast enough and Octoprint does not timeout). I am unable yet to determine what happens during the crash (if the watchdog resets the board or not). I guess if serial communication recovery was better handled in the BTT TFT you would only see the print pausing for a few seconds.

@minosg
Copy link
Author

minosg commented Jun 24, 2020

When the crash occurs, the serial port on the board becomes unresponsive but if I use the TFT's disconnect/reconnect option it resumes printing

That is quite interesting. I wonder what this logic does. If it is restarting the usb_cdc logic, it will definitely be re-enabling interrupts.

But the fact that even slower lcd's cause the issue, could also verify the hypothesis that the added logic on the loop, is what is pushing a timing sensitive task outside of the acceptable boundaries. Compiling out and CLD will significantly decrease the ui.update() loop's time.

Still cannot undestand why acceleration affects it. Are you using Linear advance? High acceleration will triggerthe la_isr() which is right in the heart of the stepper.cpp scheduler

@kind3r
Copy link

kind3r commented Jun 24, 2020

That is quite interesting. I wonder what this logic does. If it is restarting the usb_cdc logic, it will definitely be re-enabling interrupts.

I think this is where the disconnect happens in BTT firmware https://github.com/guruathwal/BIGTREETECH-TouchScreenFirmware/blob/c189809e95c627d1680ef81533b32e8fac56514b/TFT/src/User/Menu/SettingsMenu.c#L53

Still cannot undestand why acceleration affects it. Are you using Linear advance? High acceleration will triggerthe la_isr() which is right in the heart of the stepper.cpp scheduler

LA yes, set to 0.1 (I thought from your tests it did not really matter), acceleration is set to 1000 and speed to 90mm/s.

Starting a second 3h print without LCD support.

@taragor
Copy link

taragor commented Jun 24, 2020

I have a very similar setup: SKR mini E3 V2, BTT TFT35. I had frequent crashes with #define CR10_STOCKDISPLAY enabled. From my testing it doesn't matter whether the stock Ender-3 or the TFT35 is connected.
I've also had crashes with #define CR10_STOCKDISPLAY disabled, TFT35 connected (running in BTT mode) to the TFT header (SERIAL_PORT 2), and octopi connected via USB (SERIAL_PROT_2 -1).
However when using only the TFT35 in BTT mode, with no octoprint I didn't have a single crash no matter how aggressive I set LA, jerk, Accelerations. I've printed 20+ hours with this setup and no issues so far.
So I guess it is somewhat related to the number of serial connections aswell as the kind (stepper, display, or printhost)

I will try looping Octopi through the TFT35, didn't know that was possible. Is any configuration required for that? Also I'm still waiting for my STLink to get into some proper debugging.

@kind3r
Copy link

kind3r commented Jun 24, 2020

I will try looping Octopi through the TFT35, didn't know that was possible. Is any configuration required for that?

No special configuration just use guruathwal fork as it fixes a number of issues with uart forwarding. The only issue that still stands is the fact that because the lines sent by Octoprint start with line numbers (Nxxx) the TFT does not parse the information from them so for instance M117 texts are not shown on the display, fan speed does not update etc. I am working on a patch for this (seems to be working fine so far so I will publish this week).

@kind3r
Copy link

kind3r commented Jun 24, 2020

Second print without LCD support froze after 2h :( I was able to resume using the disconnect method.

@p3p
Copy link
Member

p3p commented Jun 24, 2020

@kind3r I'm not sure your issue is related to this one, as the board does not freeze (being an LPC176x the watchdog would catch that if it did so there would be no risk of the heaters being left on),

From your description you are not using the normal usb cdc serial port (which is a 10Mbit, lossless, flow controlled connection), but bridging standard uart through a tft display board, it almost seems like the gcode commands to Marlin are just stopping at some point until you reset the tfts serial connection?

do you have any way to check that there is data getting to LP176x uart while in the frozen state?

@minosg
Copy link
Author

minosg commented Jun 24, 2020

@p3p THe watchdog is what made this issue visible and this is why it had its own ticket, but inherently we are talking about the same Marlin core cause.

The key element to test for this issue is the following
If your printer crashes ( watchdog reset/ freeze ) when printing with high acceleration, and having an LCD compiled.

M201 X9000 Y9000 Z500 E5000 ; sets maximum accelerations, mm/sec^2 M203 X500 Y500 Z12 E80 ; sets maximum feedrates, mm/sec M204 P1000 R1000 T1000 ; sets acceleration (P, T) and retract acceleration (R), mm/sec^2 M205 X10.00 Y10.00 Z0.20 E2.00 ; sets the jerk limits, mm/sec M205 S0 T0 ; sets the minimum extruding and travel feed rate, mm/sec

But does never freeze if there is display compiled, and you are printing from SD using M21, M24 then you are seeing the same issue.

As discussed above I think that the the planner stops discarding blocks, and the firmware freezes waiting for new free blocks. When that happens some interrupts can fire ( temperature, LCD ) while others don't.

@kind3r
Copy link

kind3r commented Jun 24, 2020

It also hapened while printing from the PI connected via USB (in which case I unplugged the USB cable and reconnected it but it was still frozen). I still need to run more tests to make it repeatable, but it seems similar (something uart related in either case). I'm not sure that heaters are still on as I imediatly reacted every time, but next time it happens I will leave it frozen for longer to check if the heaters are really on or not.

@minosg
Copy link
Author

minosg commented Jun 24, 2020

@kind3r Can you please try to print using SD directly and no usb host? Pronterface has a nice UI element ( SD ) to help with that. If you have a freeze, please let me know.

@kind3r
Copy link

kind3r commented Jun 24, 2020

@minosg Could you clarify a bit the conditions ?

  • Use board's SD Card
  • LCD support enabled or disabled ?
  • Can I use Octoprint's terminal to start the print ?
  • Do I have to disconnect all serials after starting the print or can they remain connected (so I can monitor) ?

@p3p
Copy link
Member

p3p commented Jun 24, 2020

If the issue is happening with the USB CDC connection and the hardware UART along with both STM32 and LPC then I guess I can rule out it being a low level issue (in my LPC Arduino framework) that's getting one of those into a bad state, they are very different peripherals,

@kind3r I'm not sure if your board manufacturer has messed with the default bootloader or not but LPC176x should never boot loop, so if the watchdog triggers the board will be stalled in a safe state until hardware reset, in theory that rules out an actual mcu crash at least.

@minosg
Copy link
Author

minosg commented Jun 24, 2020

@kind3r

  • Enable Linear advance.
  • Compile out LCD support. This will make the ui.update() loop faster on idle.
  • Put the test file on sd card on the printer board.
  • Connect using pronterface and use the SD button to select the file to print or do it manually using M21, M24 and filename (be warned the filesystem many need special characters to the name)
  • You should disconnect all serials , but I personally have not had any freezes even when connected for debugging using this mode.

@p3p For the STM case, I can confirm it happens regardless if you have a usb connected or not. By connecting the serial, or any other action which that delays the logic it causes it to happen way more often. With the ST-Link attached it happens in the first 10 minutes, while normally it takes a couple of prints to trigger.

@minosg
Copy link
Author

minosg commented Jun 24, 2020

Update on the investigation on this issue. I run a tracing session today, and the following race condition was observed.

Looking into the stepper.cpp logic in particular
https://github.com/MarlinFirmware/Marlin/blob/2.0.x/Marlin/src/module/stepper.cpp

When the freeze happens, the interrupts and the timers are enabled and running/firing at the appropriate rate.

The block_phase_isr() called the finalize state right before the printer froze

    .....
    /////
    // If current block is finished, reset pointer and finalize state
    if (step_events_completed >= step_event_count) {
                 discard_current_block();  
    }

discard_current_block() will set the pointer to null

    FORCE_INLINE static void discard_current_block() {
      current_block = nullptr;
      axis_did_move = 0;
      planner.release_current_block();
    }

Then the pulse_base_isr will fire next time and exit

void Stepper::pulse_phase_isr() {
   ......
   ......
  // If there is no current block, do nothing
  if (!current_block) {
      return;
  }

The next time the block_phase_isr() is called it will attempt to retrieve a new block and exit

uint32_t Stepper::block_phase_isr() {
....
  if (!current_block) {

    // Anything in the buffer?
    if ((current_block = planner.get_current_block())) {
    }
    return;
}

Planner uses this logic

block_t* Planner::get_current_block() {
  // Get the number of moves in the planner queue so far
  const uint8_t nr_moves = movesplanned();

  // If there are any moves queued ...
  if (nr_moves) {
   }
   return nullptr;

Moves planned will always return 0 since the head and tail of Planner:block_buffer are the same

  FORCE_INLINE static uint8_t movesplanned() { return BLOCK_MOD(block_buffer_head - block_buffer_tail); }

/* Trace of the public values of block buffer */
      block_buffer:16
      block_buffer_head:15
      block_buffer_nonbusy: 15
      block_buffer_planner: 15
      block_buffer_tail: 15
      cleaning_buffer_counter:0

This deadlocks the stepper logic.

I would assume that step_events_completed >= step_event_count is a normal condition, and should happen in the end of every loop. The question is why it never recovers form it.

@kind3r
Copy link

kind3r commented Jun 25, 2020

@minosg I ran the test like you said and it worked fine (from SD Card, all serials disconnected after starting the print).

I also ran another test from Octoprint via USB and it also worked fine. I think @p3p is right and it seems that my issue is related to the BTT TFT uart handling and not Marlin's since printing resumes after disconnecting and reconnecting from the TFT and I don't think Marlin restarts, it's just the TFT that stops communication.

@minosg
Copy link
Author

minosg commented Jun 28, 2020

A minor update

I have good news and bad news.

The good news. I have identified an actual bug in the planner, which is making this issue worse.
The bad news, fixing this bug does not resolve the deadlock situation, just decreases it's frequency.

Planner is slicing moves as a series of micro-moves made to fit in a ring buffer ( Default size of 16 block ) .Ring buffers and interrupts are a nasty combination since a double increment when moving the head/tail can make an empty buffer appear full or the other way around.

From the code

    /**
     * The move buffer, calculated in stepper steps
     *
     * block_buffer is a ring buffer...
     *
     *             head,tail : indexes for write,read
     *            head==tail : the buffer is empty
     *            head!=tail : blocks are in the buffer
     *   head==(tail-1)%size : the buffer is full

The good news is when this was designed a logic was included to make sure that things are delayed a bit when the buffer is working at its boundaries ( ie 15 out of 16 blocks full ). This is done by delay_before_delivering

In the beginning of each move this logic is triggered which is aimed at giving 100ms to the stepper driver to move the tail before the new line starts

 // If this is the first added movement, reload the delay, otherwise, cancel it.
  if (block_buffer_head == block_buffer_tail) {
    // If it was the first queued block, restart the 1st block delivery delay, to
    // give the planner an opportunity to queue more movements and plan them
    // As there are no queued movements, the Stepper ISR will not touch this
    // variable, so there is no risk setting this here (but it MUST be done
    // before the following line!!)
    delay_before_delivering = BLOCK_DELAY_FOR_1ST_MOVE;
  }

The problem is that the stepper will call get_current_block which is implemented as such

block_t* Planner::get_current_block() {
  // Get the number of moves in the planner queue so far
  const uint8_t nr_moves = movesplanned();

  // If there are any moves queued ...
  if (nr_moves) {

    // If there is still delay of delivery of blocks running, decrement it
    if (delay_before_delivering) {
      --delay_before_delivering;
      // If the number of movements queued is less than 3, and there is still time
      //  to wait, do not deliver anything
      if (nr_moves < 3 && delay_before_delivering) return nullptr;
      delay_before_delivering = 0;
    }

The latter is equivalent to

     if (delay_before_delivering) {
           if (nr_moves >= 3) {
              delay_before_delivering = 0;
           } else {
              return nullptr;
           }
      }

Which is effectively cancelling the delay and wasting a cycle of stepper logic if it returns null pointer

I believe it was aimed to be written as

    if (delay_before_delivering) {
      --delay_before_delivering;
      // If the number of movements queued is less than 3, and there is still time
      //  to wait, do not deliver anything
      if (nr_moves < 3 && delay_before_delivering) {
          delay_before_delivering = 0;
          return nullptr;
      }
    }

It is advised that you apply this change to your code. It will not stop the printer freezing deadlock but will make it more smooth.

@thisiskeithb
Copy link
Member

It is advised that you apply this change to your code. It will not stop the printer freezing deadlock but will make it more smooth.

We’re accepting PRs, so please feel free to contribute your changes.

@minosg
Copy link
Author

minosg commented Jun 28, 2020

This is a minor change, and hardly a critical one. Weather it gets contributed and attributed to me or anyone who reads that comments and adds it to their already pending PR is of little importance compared to fixing the core issue.

I currently am deep down the rabbit hole of investigating this issue which effectively does not allow you to use drivers in uart mode alongside a display or octorpint. When and if that is resolved, all side findings can be merged in a larger PR without blocking valuable maintainers` time.

@kind3r
Copy link

kind3r commented Jun 28, 2020

@minosg I think you are misreading the code a bit, the ideea is to have a delay of BLOCK_DELAY_FOR_1ST_MOVE cycles if there are less than 3 moves available, else, the delay is reset to 0. Your proposed change waits for only one cycle.

@minosg
Copy link
Author

minosg commented Jun 28, 2020

I could really use any help on understanding the planner code so thanks for commenting that.

I could not determine intent of this code to begin with. Why should the delay only apply for when there are less than 3 moves planned?

During a normal operation the HEAD > TAIL and moves planned is the SIZE_OFF_BUFF -1 >= 3 so it will default to zero right after first invocation effectively using no delay and proceed to deliver the block.

If the logic intended is:

  • If delay is set and less than 3 blocks, do not return anything, and wait for it called again to decrement the counter.
  • If delay is set and more or equal 3 blocks planned, wait and then deliver ( guaranteed to give you a block )

Then the above code is not doing that
What it does is:

  • If delay is set and less than 3 blocks, do not return anything, and wait for it called again to decrement the counter.
    -If delay is set and more or equal 3 blocks planned, then do not delay and continue

From the comments in the code about this delay, it seems to indicate it was added to protect you from the ISR messing the
head/tail logic.

    // If it was the first queued block, restart the 1st block delivery delay, to
    // give the planner an opportunity to queue more movements and plan them
    // As there are no queued movements, the Stepper ISR will not touch this
    // variable, so there is no risk setting this here (but it MUST be done
    // before the following line

@thecynic
Copy link

Is there a reason why the interrupt isn't disabled while non-interrupt code manipulates the ring buffer?

@sjasonsmith
Copy link
Contributor

@boelle, @rhapsodyv and I are already heavily involved in this issue, which is entirely about serial low level race conditions. There is no need to test random things to narrow down the problem at this point.

@minosg
Copy link
Author

minosg commented Sep 13, 2020

@sjasonsmith I tried to keep the original hardware identical to the beginning of the thread. I am attaching a file with the configurations and the ox.gcode which is sliced with prusaslicer (so it contains the setttings as comments)

The board is a BTT SKR Mini e3 1.2, the Display is a Malyan LCD and the body of the printer is a MP select mini, which is a standard 3 NEMA17, 120x120cm bed size machine.

Sept_Configs_Marlin.zip

The gcode file has been used thoroughly for months since it could consistently trigger the issue. I was printing from SD card.

@sjasonsmith
Copy link
Contributor

Thanks. I think I can replicate that setup here. I don't have a Mini E3 1.2, but I have an E3 DIP which uses SoftwareSerial like it does.

@minosg
Copy link
Author

minosg commented Sep 13, 2020

@sjasonsmith .Please note that e3 1.2 uses software serial while e3 1.0 and 2.0 use hardware serial. I think that is limited to tmc stepper though, just pointing it out since it slightly adds a timing overhead

@sjasonsmith
Copy link
Contributor

Yeah, that is why I'll use my DIP instead of the e3 1.0. The DIP version uses SoftwareSerial since it has to work with TMC2208 drivers, which are not addressable.

@sjasonsmith
Copy link
Contributor

@minosg, which PlatformIO environment are you building your firmware with?

@minosg
Copy link
Author

minosg commented Sep 13, 2020

STM32F103RC_btt_512K

@sjasonsmith
Copy link
Contributor

I was able to reproduce the hang when using a Malyan LCD. As it turns out, the Malyan LCD code is accessing the Arduino framework's serial classes directly, rather than using something instantiated by Marlin. This bypasses the recent improvements @rhapsodyv made.

This explains why you saw the hang still, but other users reported it was gone. Other users had other serial-connected displays or boards connected to computers or OctoPrint over serial, all of which used the new class.

I have bypassed this so that it will use the MarlinSerial class, and am running a test print right now. Essentially the code in malyan_lcd.cpp needs to use the form MSerial2 rather than Serial2, but a bit more work is needed so that MSerial2 will be instantiated when needed.

I assume that you had to modify malyan_lcd.cpp for the display to work on your SKR Mini E3?

@minosg
Copy link
Author

minosg commented Sep 14, 2020

@sjasonsmith this bug keeps on giving. I suspected it had to do with linking or symbol overriding but didn't think to check the display driver.

To answer your question yes and no. This test was done with stock malyan_lcd as present in the github tracking, since it was aimed at testing the branch. You can use the serial pins in the EXP0 connector and it just works if you pull 3.3 and gnd from the SWD.

But in my stable branch I have heavily modified that library mostly to make it poll less frequently and be more time deterministic when api calls are made. This was because I originally though it would fix the bug.

@sjasonsmith
Copy link
Contributor

Ok, I connected mine to the TFT port, hence the need for the modification.

I’m working on a fix, but it unfortunately has to touch more than a couple lines of code.

As a temporary workaround you can change Serial1 to MSerial1 in the Malyan CPP. You will also need to modify the bottom of MarlinSerial.cpp to make sure MSerial1 is instantiated.

@sjasonsmith
Copy link
Contributor

@minosg could you test again with the changes in the PR above, #19464?

I believe this will resolve the hangs for both Malyan LCD and TMC drivers when using HardwareSerial. I was able to reproduce the hang on my machine, so I am hopeful it will be fixed for you as well.

When updating to that PR, you will see a new option MALYAN_LCD_SERIAL_PORT in the Configuration.h file. You will want to be sure that your customizations to malyan_lcd.cpp do not prevent that port from being used. It is critical that it use the MSerial1 form that my changes implement, rather than the standard Serial1.

@minosg
Copy link
Author

minosg commented Sep 22, 2020

@sjasonsmith I will be testing it as soon as possible. I am not testing with my patched Malyanlcd code, I am always using the vanilla in the PR.

I will let you know how it goes.

@Luke95vl
Copy link

Luke95vl commented Sep 22, 2020

I looked around the web for this issue and came up with this discussion.
I also have this problem, random crash of the printer while it is printing with octoprint, nothing works, not even the Touch screen or dispaly in Marlin mode, all blocked and with the heaters on until I turn the printer off and on again.

SKR E3 1.2 + BTT TFT35 E3 + Marlin 2.0.5.3 and 2.0.6.1

@rhapsodyv
Copy link
Sponsor Member

I looked around the web for this issue and came up with this discussion.
I also have this problem, random crash of the printer while it is printing with octoprint, nothing works, not even the Touch screen or dispaly in Marlin mode, all blocked and with the heaters on until I turn the printer off and on again.

SKR E3 1.2 + BTT TFT35 E3 + Marlin 2.0.5.3 and 2.0.6.1

Try marlin bugfix branch. I think it’s fixed for your setup.

@minosg
Copy link
Author

minosg commented Sep 23, 2020

@sjasonsmith I would like to report that after 24 hours of non stop testing the PR appears to fix the issue. I didn't modify the malyan lcd and just used the one in the branch, copying over the configurations .

Sir I think that this is it.....

@Incarnant
Copy link

Incarnant commented Oct 3, 2020

Hello, I believe I am experiencing the same issue. The printer will freeze at a random point during prints. This usually happened after two hours but before five hours of printing. The freeze does not happen at any consistent point. The last two freezes did happen on the shell rather than internally but I cannot remember if that was consistent previously. The one time the freeze happened in front of me there was no noise or other signs of distress the print head simply stopped moving and slowly built a cocoon of filiment around itself before stopping altogether. The heaters and power remain on after the freeze. This behavior happens regardless of the media used for printing (sd card, usb, octaprint)

After the freeze the menus seem operable at first but the printer is unresponsive to commands and the tft35 display will lose connection (I believe the connection is already lost the display has just not realized it). Rebooting the printer consistently restores functionality but the prints do not seem to resume correctly (there is usually at least a layer shift).

Ender 3 Pro
BTT SKR Mini Ee v2
BTT TFT35
Firmware updated from BugFix on 10/01/20
Display is connected by both the original cable and the BTT cable.

@sjasonsmith
Copy link
Contributor

@Incarnant can you post your configuration files?

Have you modified any files other than the two configuration files?

Is there any chance you have old and new source files mixed?

@sjasonsmith
Copy link
Contributor

@Incarnant you said firmware is from October 1 bugfix, but you mention several multi-hour failures. Did you have several failures in one day after Oct 1st, or were failures with older firmware?

@Incarnant
Copy link

@sjasonsmith I am new to Marlin and Giit so please let me know if you wanted different files. Here are the Configuration.h and nConfiguration_adv.h files.

CurrentConfig.zip

Three files were modified: Platformio.ini was modified for board type; Configuration.h was copied from the configuration examples and modified for bltouch and leveling; Configuration_adv.h was copied from the configuration examples.

I did use the configuration examples from the example configurations but those files and all of the others were freshly retrieved. I deleted the previous files prior to unzipping the freshly downloaded Marlin files so I don't believe there would have been any additional mixing of files.

I have attempted one long print since the firmware was flashed on 10/2/20 (release date of the firmware appeared to be 10/1/20) and it failed as described. The previous failures were with previous versions of the firmware (I have attempted a firmware update twice over the month that I have been aware of the problem).

@rhapsodyv
Copy link
Sponsor Member

@Incarnant do you board have onboard sd? Can do you a long print from onboard sd?

@Incarnant
Copy link

@rhapsodyv Yes, it has an onboard sd card. I will kick one off now.

@Incarnant
Copy link

Okay, some interesting results. The actual print went through to completion. However, the TFT35 display lost connection to the printer at some point. After the print was completed it still showed as active printing with the timer incrementing. The buttons were still responsive and I could click through menus but no observable response from the printer. Also, temperatures were displayed as at printing temperature but the a tual print bed and nozzle had cooled after printing.

I had a thought yesterday and realized I should ask. Do I need to patch my maple library for the bugfix firmware to work?

@sjasonsmith
Copy link
Contributor

@Incarnant no, you do not need to modify Maple. Marlin now overrides the Maple interrupt handler discussed in this issue with a version that doesn’t hand and supports an emergency parser.

Your issue is something different than this bug report. This bug causes the entire printer to hang, it could not continue printing if the error occurred.

I am going to close this issue, since we believe it is resolved and doesn’t match the problem you currently see. Your problem might be an issue with the TFT firmware, or might be another Marlin issue not directly related to this one.

@thisiskeithb thisiskeithb removed the no-locking Prevent Lock-bot from locking the issue. label Mar 30, 2021
@github-actions
Copy link

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked and limited conversation to collaborators May 29, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests