-
-
Notifications
You must be signed in to change notification settings - Fork 7k
Fix for Serial::flush() deadlock with high baud rates #1456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Looking at the code, your observation and fix look correct to me. However, I recently created a series of patches to HardwareSerial, which also change the code related to these registers. Would you care to test that code and report if the problem still occurs there (I think the race condition still exists, would be good to have it confirmed). The code is in #1369. If the problem still exists, I think I know how to fix it in my changed code as well, I'll try to incorporate a fix there, then. |
Hello Matthijs, thank you for the prompt answer. I could not get the above sketch to cause a deadlock (also with different data sizes), however looking at your code, the race condition might still be there... Humm... By the way, your current branch did not compile, because it could not find |
Thinking about this a bit more, I'm not sure if first clearing TXC and then writing to UDR is a sufficient fix. I think the following can happen:
I haven't looked closely if this problem can actually appear in the current HardwareSerial code, but I'm pretty sure it can happen with my HardwareSerial changes. This needs a bit more thinking to get it right, I just wanted to note it down for now, lest I forget. |
Thinking on this a bit more, the fix is simple: Simply clear TXC directly after writing to UDR. This technically still leaves a race condition: If the data byte written to UDR is sent out (causing TXC to be set) before TXC is cleared and then the hardware is done, but TXC is cleared. However, we can assume that sending out a data byte will take longer than the one or two cycles needed to clear the TXC bit, so this race can never occur. |
Ah, and that is already what the patch in my pullrequest does, so that explains why you couldn't reproduce the race condition using my branch. Do you agree with my analysis, or did I perhaps miss something? |
A bit of testing supports the analysis: If I delay for 10μs (startbit, 8 databits, stopbit) between writing UDR and clearing TXC, the Arduino deadlocks. A delay of 9μs works like a charm. |
…hile It turns out there is an additional corner case. The analysis in the previous commit wrt to flush() assumes that the data register is always kept filled by the interrupt handler, so the TXC bit won't get set until all the queued bytes have been transmitted. But, when interrupts are disabled for a longer period (for example when an interrupt handler for another device is running for longer than 1-2 byte times), it could happen that the UART stops transmitting while there are still more bytes queued (but these are in the buffer, not in the UDR register, so the UART can't know about them). In this case, the TXC bit would get set, but the transmission is not complete yet. We can easily detect this case by looking at the head and tail pointers, but it seems easier to instead look at the UDRIE bit (the TX interrupt is enabled if and only if there are bytes in the queue). To fix this corner case, this commit: - Checks the UDRIE bit and only if it is unset, looks at the TXC bit. - Moves the clearing of TXC from write() to the tx interrupt handler. This (still) causes the TXC bit to be cleared whenever a byte is queued when the buffer is empty (in this case the tx interrupt will trigger directly after write() is called). It also causes the TXC bit to be cleared whenever transmission is resumed after it halted because interrupts have been disabled for too long. As a side effect, another race condition is prevented. This could occur at very high bitrates, where the transmission would be completed before the code got time to clear the TXC0 register, making the clear happen _after_ the transmission was already complete. With the new code, the clearing of TXC happens directly after writing to the UDR register, while interrupts are disabled, and we can be certain the data transmission needs more time than one instruction to complete. This fixes arduino#1463 and replaces arduino#1456.
…hile It turns out there is an additional corner case. The analysis in the previous commit wrt to flush() assumes that the data register is always kept filled by the interrupt handler, so the TXC bit won't get set until all the queued bytes have been transmitted. But, when interrupts are disabled for a longer period (for example when an interrupt handler for another device is running for longer than 1-2 byte times), it could happen that the UART stops transmitting while there are still more bytes queued (but these are in the buffer, not in the UDR register, so the UART can't know about them). In this case, the TXC bit would get set, but the transmission is not complete yet. We can easily detect this case by looking at the head and tail pointers, but it seems easier to instead look at the UDRIE bit (the TX interrupt is enabled if and only if there are bytes in the queue). To fix this corner case, this commit: - Checks the UDRIE bit and only if it is unset, looks at the TXC bit. - Moves the clearing of TXC from write() to the tx interrupt handler. This (still) causes the TXC bit to be cleared whenever a byte is queued when the buffer is empty (in this case the tx interrupt will trigger directly after write() is called). It also causes the TXC bit to be cleared whenever transmission is resumed after it halted because interrupts have been disabled for too long. As a side effect, another race condition is prevented. This could occur at very high bitrates, where the transmission would be completed before the code got time to clear the TXC0 register, making the clear happen _after_ the transmission was already complete. With the new code, the clearing of TXC happens directly after writing to the UDR register, while interrupts are disabled, and we can be certain the data transmission needs more time than one instruction to complete. This fixes arduino#1463 and replaces arduino#1456.
Fixed using Matthijs' patch. Release scheduled for 1.5.6. |
I am currently developing a couple of Arduino Pro Mini (Atmega 328 @16mhz) based motor controllers that get their commands via RS-485. For that reason I have to switch the bus transceiver to transmit whenever one of my nodes sends something back.
Everything works nicely, however
HardwareSerial::flush()
occasionally locks up when I am using high baudrates (1.000.000 Bit/s).I am not evening using interrupts or interrupt timers, but have found the problem to be in
HardwareSerial::send()
. When sending bytes, the interrupt forUDRE
is enabled before theTXC0
is cleared (set to one). If the interrupt fires fast enough, consumes and shifts out the byte, theTXC0
bit is cleared after it is set when the last byte has been shifted out. CallingSerial.flush()
in a case like this causes the Arduino to lock up, because it is stuck in itswhile
loop indefinitely.Here is a small sketch that will crash an Atmega 328 @16mhz after a couple of seconds:
Clearing the
TXC0
bit before enabling the interrupt inHardwareSerial::send
makes the problem go away (see commit).