Skip to content

Fix for Serial::flush() deadlock with high baud rates #1456

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

allesblinkt
Copy link

I am currently developing a couple of Arduino Pro Mini (Atmega 328 @16mhz) based motor controllers that get their commands via RS-485. For that reason I have to switch the bus transceiver to transmit whenever one of my nodes sends something back.

Everything works nicely, however HardwareSerial::flush() occasionally locks up when I am using high baudrates (1.000.000 Bit/s).

I am not evening using interrupts or interrupt timers, but have found the problem to be in HardwareSerial::send(). When sending bytes, the interrupt for UDRE is enabled before the TXC0 is cleared (set to one). If the interrupt fires fast enough, consumes and shifts out the byte, the TXC0 bit is cleared after it is set when the last byte has been shifted out. Calling Serial.flush() in a case like this causes the Arduino to lock up, because it is stuck in its while loop indefinitely.

Here is a small sketch that will crash an Atmega 328 @16mhz after a couple of seconds:


#define DIR_PIN 10
#define LED_PIN 13

#define MAX_DATA_LENGTH 6

byte data[] = {
  0xFF, 0xFF, 0xAB, 0xCD, 0xDE, 0x66};

void setup() { 
  Serial.begin(1000000);

  pinMode(DIR_PIN, OUTPUT);
  pinMode(LED_PIN, OUTPUT);
  digitalWrite(DIR_PIN, LOW); // receive
}


void loop() {
  digitalWrite(LED_PIN, HIGH);

  digitalWrite(DIR_PIN, HIGH); // send
  Serial.write(data, MAX_DATA_LENGTH);
  Serial.flush();
  digitalWrite(DIR_PIN, LOW); // receive

  delay(100);
  digitalWrite(LED_PIN, LOW);
  delay(100);
}

Clearing the TXC0 bit before enabling the interrupt in HardwareSerial::send makes the problem go away (see commit).

@matthijskooijman
Copy link
Collaborator

Looking at the code, your observation and fix look correct to me.

However, I recently created a series of patches to HardwareSerial, which also change the code related to these registers. Would you care to test that code and report if the problem still occurs there (I think the race condition still exists, would be good to have it confirmed). The code is in #1369.

If the problem still exists, I think I know how to fix it in my changed code as well, I'll try to incorporate a fix there, then.

@allesblinkt
Copy link
Author

Hello Matthijs,

thank you for the prompt answer. I could not get the above sketch to cause a deadlock (also with different data sizes), however looking at your code, the race condition might still be there... Humm...

By the way, your current branch did not compile, because it could not find avr/cpufunc.h. For testing I fixed this by including compat/ina90.h which also defines the _NOP macro.

@matthijskooijman
Copy link
Collaborator

Thinking about this a bit more, I'm not sure if first clearing TXC and then writing to UDR is a sufficient fix. I think the following can happen:

  • The hardware is writing out a byte (which is no longer in UDR)

  • TXC is cleared

  • The hardware finishes writing and sets TXC

  • UDR is written

    Now, the hardware is still writing a byte, but TXC indicates transmission is complete.

I haven't looked closely if this problem can actually appear in the current HardwareSerial code, but I'm pretty sure it can happen with my HardwareSerial changes. This needs a bit more thinking to get it right, I just wanted to note it down for now, lest I forget.

@matthijskooijman
Copy link
Collaborator

Thinking on this a bit more, the fix is simple: Simply clear TXC directly after writing to UDR. This technically still leaves a race condition: If the data byte written to UDR is sent out (causing TXC to be set) before TXC is cleared and then the hardware is done, but TXC is cleared. However, we can assume that sending out a data byte will take longer than the one or two cycles needed to clear the TXC bit, so this race can never occur.

@matthijskooijman
Copy link
Collaborator

Ah, and that is already what the patch in my pullrequest does, so that explains why you couldn't reproduce the race condition using my branch.

Do you agree with my analysis, or did I perhaps miss something?

@matthijskooijman
Copy link
Collaborator

A bit of testing supports the analysis: If I delay for 10μs (startbit, 8 databits, stopbit) between writing UDR and clearing TXC, the Arduino deadlocks. A delay of 9μs works like a charm.

matthijskooijman added a commit to matthijskooijman/Arduino that referenced this pull request Dec 18, 2013
…hile

It turns out there is an additional corner case. The analysis in the
previous commit wrt to flush() assumes that the data register is always
kept filled by the interrupt handler, so the TXC bit won't get set until
all the queued bytes have been transmitted. But, when interrupts are
disabled for a longer period (for example when an interrupt handler for
another device is running for longer than 1-2 byte times), it could
happen that the UART stops transmitting while there are still more bytes
queued (but these are in the buffer, not in the UDR register, so the
UART can't know about them).

In this case, the TXC bit would get set, but the transmission is not
complete yet. We can easily detect this case by looking at the head and
tail pointers, but it seems easier to instead look at the UDRIE bit
(the TX interrupt is enabled if and only if there are bytes in the
queue). To fix this corner case, this commit:
 - Checks the UDRIE bit and only if it is unset, looks at the TXC bit.
 - Moves the clearing of TXC from write() to the tx interrupt handler.
   This (still) causes the TXC bit to be cleared whenever a byte is
   queued when the buffer is empty (in this case the tx interrupt will
   trigger directly after write() is called). It also causes the TXC bit
   to be cleared whenever transmission is resumed after it halted
   because interrupts have been disabled for too long.

As a side effect, another race condition is prevented. This could occur
at very high bitrates, where the transmission would be completed before
the code got time to clear the TXC0 register, making the clear happen
_after_ the transmission was already complete. With the new code, the
clearing of TXC happens directly after writing to the UDR register,
while interrupts are disabled, and we can be certain the data
transmission needs more time than one instruction to complete. This
fixes arduino#1463 and replaces arduino#1456.
cmaglie pushed a commit to cmaglie/Arduino that referenced this pull request Jan 22, 2014
…hile

It turns out there is an additional corner case. The analysis in the
previous commit wrt to flush() assumes that the data register is always
kept filled by the interrupt handler, so the TXC bit won't get set until
all the queued bytes have been transmitted. But, when interrupts are
disabled for a longer period (for example when an interrupt handler for
another device is running for longer than 1-2 byte times), it could
happen that the UART stops transmitting while there are still more bytes
queued (but these are in the buffer, not in the UDR register, so the
UART can't know about them).

In this case, the TXC bit would get set, but the transmission is not
complete yet. We can easily detect this case by looking at the head and
tail pointers, but it seems easier to instead look at the UDRIE bit
(the TX interrupt is enabled if and only if there are bytes in the
queue). To fix this corner case, this commit:
 - Checks the UDRIE bit and only if it is unset, looks at the TXC bit.
 - Moves the clearing of TXC from write() to the tx interrupt handler.
   This (still) causes the TXC bit to be cleared whenever a byte is
   queued when the buffer is empty (in this case the tx interrupt will
   trigger directly after write() is called). It also causes the TXC bit
   to be cleared whenever transmission is resumed after it halted
   because interrupts have been disabled for too long.

As a side effect, another race condition is prevented. This could occur
at very high bitrates, where the transmission would be completed before
the code got time to clear the TXC0 register, making the clear happen
_after_ the transmission was already complete. With the new code, the
clearing of TXC happens directly after writing to the UDR register,
while interrupts are disabled, and we can be certain the data
transmission needs more time than one instruction to complete. This
fixes arduino#1463 and replaces arduino#1456.
@cmaglie
Copy link
Member

cmaglie commented Jan 28, 2014

Fixed using Matthijs' patch. Release scheduled for 1.5.6.

@cmaglie cmaglie closed this Jan 28, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants