
Conversation

@tpambor
Contributor

@tpambor tpambor commented Aug 13, 2025

Add support for the interrupt-driven API. Interrupts are emulated using a polling thread.

A previous version of this was merged as #93957 but caused problems (see #94425) and was reverted in #94426.

The issue stemmed from false RX interrupts being triggered when select indicated that data was available to read, but the subsequent read operation failed. In the updated implementation, a dedicated polling thread now feeds incoming data into a ring buffer. Interrupts are no longer triggered directly by select; instead, the ring buffer is used to signal valid RX interrupts, ensuring more reliable and accurate handling.
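A minimal sketch of that pattern (illustrative only: names such as `pty_rx_thread` and `raise_rx_irq_if_enabled` are made up, and native_sim's split between host-side and Zephyr-side code is glossed over):

```c
#include <stdint.h>
#include <unistd.h>
#include <zephyr/kernel.h>
#include <zephyr/sys/ring_buffer.h>

RING_BUF_DECLARE(rx_rb, 256); /* holds data already read from the host fd */

static void raise_rx_irq_if_enabled(void); /* hypothetical IRQ-emulation hook */

static void pty_rx_thread(void *p1, void *p2, void *p3)
{
	int fd = (int)(intptr_t)p1; /* non-blocking pty/stdin file descriptor */
	uint8_t chunk[64];

	while (true) {
		ssize_t n = read(fd, chunk, sizeof(chunk));

		if (n > 0) {
			/* Only bytes that were actually read end up in the ring
			 * buffer, so an RX interrupt is only ever signalled for
			 * data that really exists. */
			ring_buf_put(&rx_rb, chunk, n);
			raise_rx_irq_if_enabled();
		} else {
			/* EOF (0) or nothing available (-1): back off briefly */
			k_sleep(K_MSEC(1));
		}
	}
}
```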

Locally, this now passes samples/sensor/sensor_shell and tests/drivers/can/host, which previously failed (see #94425).

@zephyrbot zephyrbot added area: native port Host native arch port (native_sim) area: UART Universal Asynchronous Receiver-Transmitter area: CAN labels Aug 13, 2025
@tpambor tpambor force-pushed the pty-irq-2 branch 3 times, most recently from 1aa7fbf to ff8994d Compare August 13, 2025 21:10
@dcpleung
Member

I wonder why CI didn't catch it last time.

@tpambor
Contributor Author

tpambor commented Aug 13, 2025

> I wonder why CI didn't catch it last time.

tests/drivers/can/host isn't run in CI.

For samples/sensor/sensor_shell, there was also 7c0e4ae, which helped with the symptoms and was sufficient for the test to pass. With this PR, the sensor shell would actually pass even with 7c0e4ae reverted, but I think that commit does no harm, so there is no reason to revert it.

Member

@henrikbrixandersen henrikbrixandersen left a comment


This change makes tests using the native_sim UART with pytest and the twister_harness Shell much slower than before.

As an example, running tests/drivers/can/host without this change:

INFO - 1/1 native_sim/native/64 drivers.can.host PASSED (native 1.865s <host>)

But with this change:

INFO - 1/1 native_sim/native/64 drivers.can.host PASSED (native 40.506s <host>)

@tpambor
Contributor Author

tpambor commented Aug 14, 2025

> This change makes tests using the native_sim UART with pytest and the twister_harness Shell much slower than before.
>
> As an example, running tests/drivers/can/host without this change:
>
> INFO - 1/1 native_sim/native/64 drivers.can.host PASSED (native 1.865s <host>)
>
> But with this change:
>
> INFO - 1/1 native_sim/native/64 drivers.can.host PASSED (native 40.506s <host>)

This performance degradation is due to a locking issue in the shell code. Enabling CONFIG_LOG=y (and, by default, CONFIG_SHELL_LOG_BACKEND=y), which changes how locking in the shell works, results in performance comparable to the polling implementation.
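For reference, a prj.conf fragment for that workaround might look like this (the second option is the default described above, shown explicitly for clarity):

```
# Enabling logging changes how the shell's locking behaves
CONFIG_LOG=y
# Defaults to y once CONFIG_LOG is enabled
CONFIG_SHELL_LOG_BACKEND=y
```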

@tpambor tpambor requested a review from jakub-uC August 14, 2025 08:45
Member

@aescolar aescolar left a comment


I think this approach is too complicated and heavy (i.e. no need for ring buffers).
I think something like this would work instead:
When we want to check if there is something ready in stdin, we read 1 character.

  • If read() == 0, we know we reached EOF. We flag it and never try to read again.
  • If read() == -1, we know there was nothing there yet.
  • If read() == 1, we save the character we just read ahead in data (data->char_read_ahead or whatever) and set a flag (data->stdin_ready_char or whatever).

Next time we want to check whether there is data, we first check this flag (if the flag is set we already answer yes; otherwise we do the one-character check above).

When there is an attempt to actually read data by the user (len > 0):

  • If the flag (data->stdin_ready_char) was set, we first copy that character to the user buffer, clear data->stdin_ready_char, and continue trying to read len-1.
  • If the flag was not set, we attempt to read len. If we get read() == 0 => we flag EOF.

That should solve the issue and greatly reduce the number of syscalls.
We should probably check if we need to do the select() at all, or if we can just use read() to detect whether data is available in stdin. (For a real pty we don't need the select.)
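A rough C sketch of that read-ahead idea (field names follow the placeholders above; the fd is assumed to be non-blocking):

```c
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

struct pty_data {
	int fd;
	bool stdin_eof;         /* read() returned 0 once; never read again */
	bool stdin_ready_char;  /* char_read_ahead holds one buffered byte */
	uint8_t char_read_ahead;
};

/* "Is there data?" check: reads at most one byte ahead. */
static bool stdin_has_data(struct pty_data *d)
{
	if (d->stdin_eof) {
		return false;
	}
	if (d->stdin_ready_char) {
		return true;
	}

	ssize_t rc = read(d->fd, &d->char_read_ahead, 1);

	if (rc == 1) {
		d->stdin_ready_char = true;
		return true;
	}
	if (rc == 0) {
		d->stdin_eof = true; /* EOF: flag it and never read again */
	}
	return false; /* rc < 0: nothing there yet */
}

/* User read: drain the read-ahead byte first, then read the rest. */
static size_t pty_read(struct pty_data *d, uint8_t *buf, size_t len)
{
	size_t n = 0;

	if (len > 0 && d->stdin_ready_char) {
		buf[n++] = d->char_read_ahead;
		d->stdin_ready_char = false;
	}
	if (n < len && !d->stdin_eof) {
		ssize_t rc = read(d->fd, buf + n, len - n);

		if (rc > 0) {
			n += (size_t)rc;
		} else if (rc == 0) {
			d->stdin_eof = true;
		}
	}
	return n;
}
```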

@ghost

ghost commented Aug 14, 2025

> I think this approach is too complicated and heavy (i.e. no need for ring buffers). I think something like this would work instead: When we want to check if there is something ready in stdin, we read 1 character. If read() == 0, we know we reached EOF. We flag it and never try to read again. If read() == -1, we know there was nothing there yet. If read() == 1, we save the character we just read ahead in data (data->char_read_ahead or whatever) and set a flag (data->stdin_ready_char or whatever). Next time we want to check whether there is data, we first check this flag (if the flag is set we already answer yes; otherwise we do the one-character check above).
>
> When there is an attempt to actually read data by the user (len > 0): If the flag (data->stdin_ready_char) was set, we first copy that character to the user buffer, clear data->stdin_ready_char, and continue trying to read len-1. If the flag was not set, we attempt to read len. If we get read() == 0 => we flag EOF.
>
> That should solve the issue and greatly reduce the number of syscalls. We should probably check if we need to do the select() at all, or if we can just use read() to detect whether data is available in stdin. (For a real pty we don't need the select.)

How does removing the ring buffer help? Would we not expect better performance with the ring buffer (e.g., fewer system calls)? How would your approach help?

The performance issue may be unrelated to the driver.

Some thoughts:

  1. Maybe someone could make an analysis using tracing
  2. can_shell stores the const struct shell * and uses it later from the work queue to implement "async" shell commands. This seems a bit unique to me; I am not sure that way of doing "async" shell commands is supported/recommended. An attempt to make them synchronous was rejected.
  3. There is probably a deadlock (or livelock) situation in can_shell that can be worked around with CONFIG_SHELL_TX_TIMEOUT_MS=0, but it needs actual fixing, since that workaround degrades performance when CONFIG_LOG is disabled. Without CONFIG_SHELL_TX_TIMEOUT_MS=0 we have the deadlock with CONFIG_LOG disabled.

@tpambor
Contributor Author

tpambor commented Aug 15, 2025

I identified the performance problem. The sleep in the polling thread caused high latency for the emulated TX interrupt used to write data. Combined with the short writes issued by the shell backend (max. 8 bytes), this throttled TX throughput to under 400 B/s.

This caused the shell to be unable to keep up, which manifested, among other things, in failures to obtain the shell lock and dropped messages.

The issue is now resolved by waiting for either the TX IRQ to be activated or a short timeout, after which we poll for new input data.

In addition, I simplified the poll_in functions and dropped the select() call, so this now works using read() only.
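Sketched out, the new wait looks roughly like this (names are assumptions, not the literal patch):

```c
#include <stdbool.h>
#include <zephyr/kernel.h>
#include <zephyr/drivers/uart.h>

struct pty_data {                      /* illustrative subset of driver state */
	const struct device *dev;
	uart_irq_callback_user_data_t cb;  /* registered ISR callback */
	void *cb_user_data;
	struct k_sem tx_enable_sem;        /* given by the irq_tx_enable() path */
	bool tx_irq_enabled;
};

static bool rx_data_available(struct pty_data *d); /* hypothetical helper */

static void irq_emul_loop(struct pty_data *d)
{
	while (true) {
		/* Wake immediately when TX is enabled; otherwise time out
		 * shortly so incoming RX data is still polled. */
		(void)k_sem_take(&d->tx_enable_sem, K_MSEC(1));

		if (d->tx_irq_enabled || rx_data_available(d)) {
			d->cb(d->dev, d->cb_user_data); /* run the UART ISR callback */
		}
	}
}
```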

I tried the following tests with this PR

  • tests/drivers/can/host - Tests pass, performance improved over polling
  • samples/sensor/sensor_shell - Tests pass
  • samples/sensor/sensor_shell with 7c0e4ae reverted - Tests pass
  • (sleep 0.01; echo -e "help\n") | ./build/zephyr/zephyr.exe -uart_stdinout | head -n 80 - works as expected
  • samples/subsys/shell/shell_module + echo -e "help\n" | ./build/zephyr/zephyr.exe -uart_stdinout | head -n 80 - no output, same as before, see #94436 (Shell: Race during init w CONFIG_SHELL_BACKEND_SERIAL_API_INTERRUPT_DRIVEN=y)
  • samples/subsys/shell/shell_module + ./build/zephyr/zephyr.exe -uart_stdinout - shell works over stdio as expected
  • samples/subsys/shell/shell_module + ./build/zephyr/zephyr.exe - shell works over pty as expected

@henrikbrixandersen I would appreciate it if you could verify the performance of tests/drivers/can/host from your side.

@henrikbrixandersen
Member

The updated patch greatly improves the performance, but it is still significantly slower than when using polling (which seems odd from a user perspective):

As an example, running tests/drivers/can/host without this change:

INFO - 1/1 native_sim/native/64 drivers.can.host PASSED (native 1.865s <host>)

But with this updated patch:

INFO - 1/1 native_sim/native/64 drivers.can.host PASSED (native 3.797s <host>)

@tpambor
Contributor Author

tpambor commented Aug 15, 2025

> The updated patch greatly improves the performance, but it is still significantly slower than when using polling (which seems odd from a user perspective):
>
> As an example, running tests/drivers/can/host without this change:
>
> INFO - 1/1 native_sim/native/64 drivers.can.host PASSED (native 1.865s <host>)
>
> But with this updated patch:
>
> INFO - 1/1 native_sim/native/64 drivers.can.host PASSED (native 3.797s <host>)

This behavior isn't too surprising: under the hood, the IRQ emulation also relies on polling. However, it introduces an additional abstraction layer, which leads to significantly more context switches. Data is transferred in blocks of up to 8 bytes between the shell task and the IRQ emulation task, which then polls the data out.

That said, this PR is not focused on performance improvements. Its goal is to enhance feature completeness, enabling testing of device drivers or other functionality that only supports IRQ mode.

If shell performance is a concern for this sample, one can set SHELL_BACKEND_SERIAL_INTERRUPT_DRIVEN=n. Alternatively, we could consider making this the default for native_sim, so there would be no performance regression by default, and users could opt in to IRQ mode when needed.
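A sketch of what such a default could look like in the serial shell backend's Kconfig (the conditional default is an illustration of the idea, not the merged change):

```
config SHELL_BACKEND_SERIAL_INTERRUPT_DRIVEN
	bool "Interrupt driven"
	depends on SERIAL_SUPPORT_INTERRUPT
	# Prefer polling on native_sim's PTY UART to avoid the emulation cost
	default n if UART_NATIVE_PTY
	default y
```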

@henrikbrixandersen
Member

> That said, this PR is not focused on performance improvements. Its goal is to enhance feature completeness, enabling testing of device drivers or other functionality that only supports IRQ mode.

Sure, but if introducing a new feature negatively impacts performance of an existing feature, we cannot just pull it in and say we'll address the introduced performance issue at a later time.

> If shell performance is a concern for this sample, one can set SHELL_BACKEND_SERIAL_INTERRUPT_DRIVEN=n. Alternatively, we could consider making this the default for native_sim, so there would be no performance regression by default, and users could opt in to IRQ mode when needed.

I reckon most developers will be using the native_sim UART for logging and shell functionality, so defaulting to polling there might not be a bad idea. I'll leave that up to @aescolar.

@aescolar
Member

@tpambor thanks. I was out yesterday; I will take a look at this next week and probably ping you on Discord to ask about the motivation for some of the choices.

@aescolar
Member

aescolar commented Aug 18, 2025

Thanks @tpambor. I tried to optimize the code a bit here: https://github.com/aescolar/zephyr/tree/pr_94478
and fix a couple of issues. The changes are:

  • Do not change stdin and stdout to non-blocking.
  • Do not start/stop the thread when there is no work; instead let it sleep forever and wake it (starting/stopping threads has quite a big overhead). Removed the semaphore and just use k_sleep and k_wakeup instead (see the sketch below).
  • The IRQ thread loop tries to be a bit leaner and hopefully waits more when possible.
  • The wait time is now 10 ms instead of 1 ms when there is no data (20 ms or even higher would also be fine: if there was no data, the chance of having data within a couple of ms is very low, and anyhow the default tick period is 10 ms).
  • Removed the ring buffer; instead we try to read only one char ahead to detect data/EOF. The great majority (~99%) of the attempts will find nothing to read, and when there is data, it is better to let the user read as much as it can on its own.
  • The WARNing on a missing callback is changed to an error (otherwise I'd expect an infinite loop in that case).
  • The FIFO write now writes n bytes at once instead of looping per byte.
  • Once stdin disconnects, it stops trying to do any RX work on it altogether (incl. scheduling the thread).
  • (Compared to the original code: select is replaced with poll, so checking for data takes 1/3 of the cycles.)
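A sketch of the park/unpark pattern from the second bullet (assumed names, not the branch code):

```c
#include <stdbool.h>
#include <zephyr/kernel.h>

struct pty_data {            /* illustrative subset of the driver state */
	k_tid_t irq_thread;
	bool tx_irq_enabled;
	bool rx_irq_enabled;
};

static void irq_emul_loop(struct pty_data *d)
{
	while (true) {
		if (!d->tx_irq_enabled && !d->rx_irq_enabled) {
			k_sleep(K_FOREVER); /* parked until k_wakeup() */
			continue;
		}
		/* ... emulate TX/RX interrupts here ... */
		k_sleep(K_MSEC(10)); /* back off when there was no data */
	}
}

static void pty_irq_tx_enable(struct pty_data *d)
{
	d->tx_irq_enabled = true;
	k_wakeup(d->irq_thread); /* unpark; no thread start/stop overhead */
}
```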

With these changes that CAN test is "only" ~75% slower in real time (with the native_sim code using >2x the instructions).
The way the shell runs in IRQ mode is quite heavy overall.

@aescolar
Member

If you want to profile it, you can do:

diff --git a/scripts/pylib/pytest-twister-harness/src/twister_harness/device/binary_adapter.py b/scripts/pylib/pytest-twister-harness/src/twister_harness/device/binary_adapter.py
index a6fe30da9ec..00ad7304878 100755
--- a/scripts/pylib/pytest-twister-harness/src/twister_harness/device/binary_adapter.py
+++ b/scripts/pylib/pytest-twister-harness/src/twister_harness/device/binary_adapter.py
@@ -43,7 +43,9 @@ class BinaryAdapterBase(DeviceAdapter, abc.ABC):
             msg = 'Run command is empty, please verify if it was generated properly.'
             logger.error(msg)
             raise TwisterHarnessException(msg)
+        self.command= ["valgrind"] + ["--tool=callgrind"] + self.command
         log_command(logger, 'Running command', self.command, level=logging.DEBUG)
+
         try:
             self._process = subprocess.Popen(self.command, **self.process_kwargs)
         except subprocess.SubprocessError as exc:

Run the test normally (after setting things up as described in https://github.com/zephyrproject-rtos/zephyr/blob/main/tests/drivers/can/host/README.rst#running-on-native_sim) with:

twister -v -p native_sim/native/64 -X can:zcan0 -T tests/drivers/can/host/ -n

Find the latest callgrind output:

find twister-out -name "callgrind*"

and open it with kcachegrind:

kcachegrind ./twister-out/native_sim_native_64/host/tests/drivers/can/host/drivers.can.host/callgrind.out.xxxxxx &

@aescolar
Member

aescolar commented Aug 18, 2025

> Alternatively, we could consider making this the default for native_sim, so there would be no performance regression by default, and users could opt in to IRQ mode when needed.

I think that unless we can improve the IRQ mode performance to be practically the same as the polling mode (which I doubt), we will want to do this.

@aescolar
Member

aescolar commented Aug 18, 2025

Here is profiling of tests/drivers/can/host/ showing the thread loop; note the number of calls from the shell into the driver for TX and so forth:
[screenshot: kcachegrind call graph of the IRQ emulation thread loop]

@ghost

ghost commented Aug 19, 2025

Thank you @aescolar @henrikbrixandersen @tpambor for working on this.
So the approach would be to merge your developments of the IRQ mode, @aescolar, but make it default to polling mode on native_sim so as not to impact performance for most users, who do not want to explicitly test a driver that requires IRQ mode?

@aescolar
Member

> So the approach would be to merge your developments of the IRQ mode, @aescolar, but make it default to polling mode on native_sim so as not to impact performance for most users, who do not want to explicitly test a driver that requires IRQ mode?

I think so, yes.

@zephyrbot zephyrbot added the area: Shell Shell subsystem label Aug 19, 2025
@zephyrbot zephyrbot requested a review from carlescufi August 19, 2025 09:46
@tpambor
Contributor Author

tpambor commented Aug 19, 2025

Thanks for the optimizations @aescolar. I pulled your changes and added another commit on top that changes the shell default to polling mode for uart_native_pty.

Add support for the interrupt-driven API. Interrupts are
emulated using a polling thread.

Signed-off-by: Tim Pambor <tim.pambor@codewrights.de>
Signed-off-by: Alberto Escolar Piedras <alberto.escolar.piedras@nordicsemi.no>

The interrupt-driven UART API is emulated via polling on native_sim,
which introduces additional overhead. Defaulting to poll mode improves
performance by avoiding this emulation cost.

Signed-off-by: Tim Pambor <tim.pambor@codewrights.de>

@aescolar
Member

aescolar commented Aug 19, 2025

@dcpleung @jakub-uC given how the shell interacts with this driver in IRQ mode, I wonder if there is something misunderstood about the driver API on either side.

@dcpleung
Member

> @dcpleung @jakub-uC given how the shell interacts with this driver in IRQ mode, I wonder if there is something misunderstood about the driver API on either side.

Come to think of it, maybe using the UART async API would be better for this than the pure interrupt API?

@kartben kartben merged commit e3a99bd into zephyrproject-rtos:main Aug 19, 2025
34 of 36 checks passed