
Improve interrupt latency (especially on Xtensa) #1162

Closed, 3 of 5 tasks
bjoernQ opened this issue Feb 13, 2024 · 19 comments
Labels: performance (Performance seems to be not as good as it could be)
@bjoernQ (Contributor) commented Feb 13, 2024

Actionable items from this investigation:

Original issue:

With these two non-scientific tests:

Vectored

#![no_std]
#![no_main]

use core::cell::RefCell;

use critical_section::Mutex;
use esp32s3_hal::{
    clock::ClockControl,
    interrupt::{self},
    peripherals::{self, Peripherals},
    prelude::*,
    system::{SoftwareInterrupt, SoftwareInterruptControl},
    Delay,
};
use esp_backtrace as _;

static SWINT: Mutex<RefCell<Option<SoftwareInterruptControl>>> = Mutex::new(RefCell::new(None));

#[entry]
fn main() -> ! {
    let peripherals = Peripherals::take();
    let system = peripherals.SYSTEM.split();
    let sw_int = system.software_interrupt_control;
    let clocks = ClockControl::boot_defaults(system.clock_control).freeze();

    critical_section::with(|cs| SWINT.borrow_ref_mut(cs).replace(sw_int));

    // Enable the software interrupt through the vectored interrupt API.
    interrupt::enable(
        peripherals::Interrupt::FROM_CPU_INTR0,
        interrupt::Priority::Priority3,
    )
    .unwrap();

    let mut delay = Delay::new(&clocks);
    loop {
        delay.delay_ms(500u32);
        critical_section::with(|cs| {
            // Record the cycle count just before raising the interrupt.
            unsafe {
                CLCK = esp32s3_hal::xtensa_lx::timer::get_cycle_count();
            }
            SWINT
                .borrow_ref_mut(cs)
                .as_mut()
                .unwrap()
                .raise(SoftwareInterrupt::SoftwareInterrupt0);
        });
    }
}

static mut CLCK: u32 = 0u32;

#[interrupt]
fn FROM_CPU_INTR0() {
    let diff = unsafe { esp32s3_hal::xtensa_lx::timer::get_cycle_count() - CLCK };
    esp_println::println!("diff = {}", diff);
    critical_section::with(|cs| {
        SWINT
            .borrow_ref_mut(cs)
            .as_mut()
            .unwrap()
            .reset(SoftwareInterrupt::SoftwareInterrupt0);
    });
}

Non-vectored

#![no_std]
#![no_main]

use core::cell::RefCell;

use critical_section::Mutex;
use esp32s3_hal::{
    clock::ClockControl,
    interrupt::{self},
    peripherals::{self, Peripherals},
    prelude::*,
    system::{SoftwareInterrupt, SoftwareInterruptControl},
    Delay,
};
use esp_backtrace as _;

static SWINT: Mutex<RefCell<Option<SoftwareInterruptControl>>> = Mutex::new(RefCell::new(None));

#[entry]
fn main() -> ! {
    let peripherals = Peripherals::take();
    let system = peripherals.SYSTEM.split();
    let sw_int = system.software_interrupt_control;
    let clocks = ClockControl::boot_defaults(system.clock_control).freeze();

    critical_section::with(|cs| SWINT.borrow_ref_mut(cs).replace(sw_int));

    unsafe {
        interrupt::disable(esp32s3_hal::get_core(), peripherals::Interrupt::SDIO_HOST);

        // Map FROM_CPU_INTR0 to a CPU interrupt manually and enable it,
        // bypassing the vectored dispatch.
        interrupt::map(
            esp32s3_hal::get_core(),
            peripherals::Interrupt::FROM_CPU_INTR0,
            interrupt::CpuInterrupt::Interrupt23LevelPriority3,
        );
        esp32s3_hal::xtensa_lx::interrupt::enable_mask(
            esp32s3_hal::xtensa_lx::interrupt::get_mask()
                | 1 << interrupt::CpuInterrupt::Interrupt23LevelPriority3 as u32,
        );
    }

    let mut delay = Delay::new(&clocks);
    loop {
        delay.delay_ms(500u32);
        critical_section::with(|cs| {
            unsafe {
                CLCK = esp32s3_hal::xtensa_lx::timer::get_cycle_count();
            }
            SWINT
                .borrow_ref_mut(cs)
                .as_mut()
                .unwrap()
                .raise(SoftwareInterrupt::SoftwareInterrupt0);
        });
    }
}

static mut CLCK: u32 = 0u32;

#[no_mangle]
extern "C" fn level3_interrupt() {
    let diff = unsafe { esp32s3_hal::xtensa_lx::timer::get_cycle_count() - CLCK };
    esp_println::println!("diff = {}", diff);
    critical_section::with(|cs| {
        SWINT
            .borrow_ref_mut(cs)
            .as_mut()
            .unwrap()
            .reset(SoftwareInterrupt::SoftwareInterrupt0);
    });
}

I get these results

with saving floats
vectored = 1111 cycles
non-vectored = 214

w/o saving floats
vectored = 1096
non-vectored = 199

w/o saving floats, w/o spilling registers
vectored = 1043
non-vectored = 146

Latency is more than five times higher with vectoring enabled.

While it's more or less possible to use vectored and non-vectored interrupts together on RISC-V, that is currently not possible in a sane way on Xtensa. This hurts esp-wifi a lot, but it also hurts async performance.

@bjoernQ (Contributor, Author) commented Feb 13, 2024

While it's probably still worth optimizing things, regarding esp-wifi on Xtensa this might be a red herring.

Replacing/adding this at the top of handle_interrupts should be the most naive and fastest way of dispatching to the handlers:

        if level == 1 {
            if interrupt::get() & (1 << 6) != 0 {
                interrupt::clear(1 << 6);
                xtensa_lx_rt::Timer0(1, save_frame);
            }
        }

        if level == 3 {
            if interrupt::get() & (1 << 29) != 0 {
                interrupt::clear(1 << 29);
                xtensa_lx_rt::Software1(3, save_frame);
            }
        }

        let status = get_status(crate::get_core());
        if status == 0 {
            return;
        }

        if level == 1 {
            if status & (1 << Interrupt::WIFI_MAC as u128) != 0 {
                handle_interrupt(1, Interrupt::WIFI_MAC, save_frame);
            }
            if status & (1 << Interrupt::WIFI_PWR as u128) != 0 {
                handle_interrupt(1, Interrupt::WIFI_PWR, save_frame);
            }
        }

        if level == 2 {
            if status & (1 << Interrupt::TG1_T0_LEVEL as u128) != 0 {
                handle_interrupt(2, Interrupt::TG1_T0_LEVEL, save_frame);
            }
        }

        if level == 3 {
            if status & (1 << Interrupt::FROM_CPU_INTR0 as u128) != 0 {
                handle_interrupt(3, Interrupt::FROM_CPU_INTR0, save_frame);
            }

            if status & (1 << Interrupt::TG0_T0_LEVEL as u128) != 0 {
                handle_interrupt(3, Interrupt::TG0_T0_LEVEL, save_frame);
            }
        }

        return;

but using this with esp-wifi doesn't improve the benchmark results (it even seems to get worse)

@MabezDev (Member)

Do you see the same results on the ESP32 or ESP32-S2? In my testing I see much worse results on the S3, which makes me think we have both a general Xtensa issue and an ESP32-S3-specific issue. Maybe this might help the other chips?

@bjoernQ (Contributor, Author) commented Feb 13, 2024

Tried it on the ESP32 and it doesn't seem to make a difference there either. Unfortunately, the results vary a lot between runs even if I don't change anything, so even if there are minor differences in the numbers they might not be caused by any code change.

Currently I cannot get any benchmark example to work on the S2 (even without my changes)

@bjoernQ (Contributor, Author) commented Feb 14, 2024

Maybe my results weren't correct yesterday - testing again today using my Android phone as the bench server (https://github.com/bjoernQ/android-tcp-bench-server) shows a clear indication that at least the download rates get much better (tested on the S3 again)

Original

connecting to Address([192, 168, 12, 44]):4321...
connected, testing...
download: 676 kB/s
Testing upload...
connecting to Address([192, 168, 12, 44]):4322...
connected, testing...
upload: 649 kB/s
Testing upload+download...
connecting to Address([192, 168, 12, 44]):4323...
connected, testing...
upload+download: 533 kB/s
Testing download...
connecting to Address([192, 168, 12, 44]):4321...
connected, testing...
download: 667 kB/s
Testing upload...
connecting to Address([192, 168, 12, 44]):4322...
connected, testing...
upload: 645 kB/s
Testing upload+download...
connecting to Address([192, 168, 12, 44]):4323...
connected, testing...

With the optimized/hacked interrupt handling

Testing download...
connecting to Address([192, 168, 12, 44]):4321...
connected, testing...
download: 931 kB/s
Testing upload...
connecting to Address([192, 168, 12, 44]):4322...
connected, testing...
upload: 651 kB/s
Testing upload+download...
connecting to Address([192, 168, 12, 44]):4323...
connected, testing...
upload+download: 519 kB/s
Testing download...
connecting to Address([192, 168, 12, 44]):4321...
connected, testing...
download: 916 kB/s
Testing upload...
connecting to Address([192, 168, 12, 44]):4322...
connected, testing...
upload: 653 kB/s
Testing upload+download...
connecting to Address([192, 168, 12, 44]):4323...
connected, testing...
upload+download: 529 kB/s
Testing download...
connecting to Address([192, 168, 12, 44]):4321...
connected, testing...
download: 919 kB/s
Testing upload...
connecting to Address([192, 168, 12, 44]):4322...
connected, testing...

@MabezDev (Member)

Just for reference (on my network), current main with an esp32:

Got IP: 192.168.0.106/24
Testing download...
connecting to Address([192, 168, 0, 24]):4321...
connected, testing...
download: 943 kB/s
Testing upload...
connecting to Address([192, 168, 0, 24]):4322...
connected, testing...
upload: 771 kB/s
Testing upload+download...
connecting to Address([192, 168, 0, 24]):4323...
connected, testing...
upload+download: 616 kB/s

ESP32S3 (powered through uart, not USB serial jtag):

Testing download...
connecting to Address([192, 168, 0, 24]):4321...
connected, testing...
download: 562 kB/s
Testing upload...
connecting to Address([192, 168, 0, 24]):4322...
connected, testing...
upload: 387 kB/s
Testing upload+download...
connecting to Address([192, 168, 0, 24]):4323...
connected, testing...
upload+download: 311 kB/s

Both with the following configuration:

[esp-wifi]
rx_queue_size = 32
tx_queue_size = 6
static_rx_buf_num = 16
dynamic_rx_buf_num = 32
ampdu_rx_enable = 0
ampdu_tx_enable = 0
rx_ba_win = 32
max_burst_size = 6
heap_size = 100000

I feel like there must be some kind of CPU/flash/cache issue with the S3 to see this much of a difference, and maybe the more optimized interrupt handling you've been working on masks this issue a bit because it's less CPU-bound 🤔.

@ProfFan (Contributor) commented Feb 16, 2024

Is it possible that one of the critical code paths is not in IRAM for the ESP32-S3? I pretty much discovered the same thing: interrupt handling (and thus Embassy) is very slow

@bjoernQ (Contributor, Author) commented Feb 19, 2024

Is it possible that one of the critical code paths is not in IRAM for the ESP32-S3? I pretty much discovered the same thing: interrupt handling (and thus Embassy) is very slow

Cache misses would definitely increase latency a lot - #1169 should make a huge difference

@jessebraham jessebraham added the bug Something isn't working label Feb 29, 2024
@bjoernQ (Contributor, Author) commented Mar 7, 2024

I decided to look at this more in general and less in the direction of esp-wifi again.

My test setup is as follows: a GPIO interrupt triggers on the falling edge of the boot button, and the handler just toggles GPIO2. I attached a logic analyzer and configured it accordingly (400MHz sample rate, the right trigger) to measure the time from the falling edge of the boot button to the toggle of GPIO2. This measures a lot more than just the interrupt latency, but it should be good enough.

To get reasonable results, everything called by the user handler is moved to RAM - for critical-section this was done by cloning the repo locally. (But it could and should be done via the linker script, as outlined at the end of this text.)

I again checked how much the Xtensa register spilling changes things. In my experiments it's 0.5us. That sounds like a lot, but compared to the overall latency it isn't much.

First tested ESP32-S3

The first triggering gives 62.5us (!!!). From the second triggering on, it's down to 14.5us. Since besides the interrupt it's just running an empty loop, it seems the first triggering ran into cache misses. In a real application things will be much worse (i.e. every interrupt invocation could see such latency)

Let's verify: set the flash frequency to 20 MHz and ... the first triggering is 110us; after that it's 14.5us again. So it is flash access, definitely.

I changed the HAL code to just call the user handler directly in __level_2_interrupt and latency went down to 5.1us, also for the first invocation. So the code dispatching the interrupt is accessing flash. (I changed it back for the next tests.)

We assumed RISC-V is fine, let's check ESP32-C3

First invocation is 58.3us, then it stays at 18us. Seems like the same problem, then. Double checking with 20MHz flash frequency shows 98.5us for the first invocation, then 18us.

Looking at the disassembly of the vectoring code revealed it's also calling code in flash:

  • core::panicking::panic
  • core::panicking::panic_bounds_check
  • memset
  • memcpy

We can get rid of the panicking stuff via panic-immediate-abort. In theory memset and memcpy should be weak symbols. I tried to provide these functions in ROM but it just didn't work. 🤷‍♂️

What worked for ESP32-C3 was omitting compiler-builtins from .text ( *(EXCLUDE_FILE (*compiler_builtins*.o) .literal .text .literal.* .text.*) ) and include it in .rwtext ( *compiler_builtins*.o(.literal .text .literal.* .text.*) )

That together with panic-immediate-abort gets the first triggering on C3 down to 22.9us. Still more than the next invocations but from the disassembly there shouldn't be calls to code in flash. 🤔

On the ESP32-S3, GNU ld ignored the request to not place compiler-builtins in .text - what seems to work there is

    EXCLUDE_FILE (*compiler_builtins*.o) 
    *(.literal .text .literal.* .text.*)

(but that in turn wasn't working for lld)
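Putting the two variants together, a linker-script sketch of the arrangement could look like this (a sketch only: output-section and file-glob names follow the esp-hal linker scripts and may differ per chip and linker):

```ld
/* Keep compiler-builtins code out of flash-mapped .text... */
.text :
{
  /* Variant that worked with lld (ESP32-C3): */
  *(EXCLUDE_FILE (*compiler_builtins*.o) .literal .text .literal.* .text.*)

  /* Variant that worked with GNU ld (ESP32-S3), instead of the above:
  EXCLUDE_FILE (*compiler_builtins*.o)
  *(.literal .text .literal.* .text.*)
  */
}

/* ...and pull those objects into RAM-mapped .rwtext instead. */
.rwtext :
{
  *compiler_builtins*.o(.literal .text .literal.* .text.*)
}
```

As noted above, neither input-section syntax is portable between GNU ld and lld, so a real script would need to pick the form matching the toolchain.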

On S3 it gets down to 25.8us for the first invocation, then down to 14.5

So, there is still some flash access! Probably not code, but data. Looking at the assembly again showed some access to DROM addresses.

Probably having an option to place compiler-builtins in RAM would be a good thing no matter what. We should also have an option to place the code for critical-section in RAM.

We need to investigate what is actually accessed in DROM and what we can do about it.

We could also rewrite the vectored interrupt dispatching in assembly but that's quite some effort and harder to maintain

@bjoernQ bjoernQ added performance Performance seems to be not as good as it could be and removed bug Something isn't working labels Mar 7, 2024
@bjoernQ (Contributor, Author) commented Mar 8, 2024

So, I looked a bit into the DROM access.

One easy thing is CPU_INTERRUPT_LEVELS - the const array is placed in DROM, and apparently nothing can be done about that as long as it is a const (the link_section attribute is not applicable to a const).

What worked was making it a static and specifying #[link_section = ".rwtext"]

But that is not enough.

It seems that switch tables, at least, are always placed in .rodata.

I don't see a way to completely avoid hitting flash currently. (Me not seeing it doesn't mean there isn't a way)

Falling back to (inline) assembly would give us full control and is the only option I see right now. Besides the downside of less maintainable code, it's quite probable that user code will hit flash when we call it. But it might still be worth it just for the async handlers we include in the HAL (if we can make them flash-access free).

However, it's at least possible to cut down the worst-case latency with the (moderate) changes described in the comment above

UPDATE: I changed the order of text+rodata and rwtext+rwdata in esp32s3.x and let .data include *(.rodata..Lswitch.table.*), which gets me down to 18.55us (on the ESP32-S3) when triggering the interrupt for the first time - there is still flash data access, but it got slightly better (at the cost of sacrificing a chunk of RAM)

@MabezDev (Member) commented Mar 8, 2024

Thank you for looking into this @bjoernQ! Really helpful!

I think we want to avoid writing all of this in assembly. Later we should open up the option for "high"-level interrupts written in asm, but in general I think we want to keep our normal interrupt handling in Rust. I looked around the ESP forums for interrupt latency topics and saw values from 2us to 15us; at least we're heading that way now, and going from 62us to 15us is a big win!

I think some actionable items from your findings would be:

Moved the list to the top level issue.

@bjoernQ, if possible, would you be able to share your benchmarking code, either in a gist or wherever is more suitable, so we can have a reference when we begin to implement these things?

cc: @sethp - I know it's been a while since you've poked your head into esp-land, but I noticed you made this post: https://discourse.llvm.org/t/moving-jump-tables-sections-on-embedded-systems/70226 and I was wondering if you made any headway in your project.

@bjoernQ (Contributor, Author) commented Mar 8, 2024

Here is the code I used to benchmark: https://gist.github.com/bjoernQ/4bc8236b926803e6aa22960880af70cd

It's good to use a logic analyzer which supports a reasonably high sample rate (I used a DSLogic Pro at 400MHz)

@bjoernQ (Contributor, Author) commented Mar 11, 2024

For not emitting jump tables there is

"-Z", "no-jump-tables" - see https://doc.rust-lang.org/beta/unstable-book/compiler-flags/no-jump-tables.html

Currently it seems there is no support for doing that at a function level: https://github.com/rust-lang/rust/blob/master/compiler/rustc_middle/src/middle/codegen_fn_attrs.rs
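For reference, a sketch of how a binary crate would enable the flag globally (assuming a nightly toolchain; a target-specific `[target.*]` table could be used instead of `[build]`):

```toml
# .cargo/config.toml of the binary crate; the flag is not inherited from
# dependencies such as esp-hal, and there is no per-function control.
[build]
rustflags = ["-Z", "no-jump-tables"]
```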

@sethp (Contributor) commented Mar 11, 2024

Wow, awesome work @bjoernQ ! We never got as precise as measuring latency in seconds, just by proxy (i.e. whether or not our output was "flickering" when we were blowing our real-time budget on cache misses). It's very cool to see these numbers, and thank you for sharing the benchmark! Your analysis seems like it matches more or less what I found working on #534 : flash is cache thrash, and rustc loves to sneak in various accesses.

Currently it seems there is no support for doing that on a function level: https://github.com/rust-lang/rust/blob/master/compiler/rustc_middle/src/middle/codegen_fn_attrs.rs

Yeah, that's an LLVM limitation, I believe: I'm not quite sure which optimization pass is producing those jump tables, but I couldn't find a way to configure it at a more granular level than globally. The unfortunate part is that setting that in the build flags for esp-hal does nothing for any consumers of the crate, IIRC: the build flags aren't carried over.

There is -C llvm-args=--max-jump-table-size=0 you could play with (or, possibly, --min-jump-table-entries= but that was ignored when I last tried it). A half-baked idea I had was to set that threshold just below whatever size start_trap_rust_hal would emit, since it's a fairly large number, but that just would've moved the problem for us to our handler.

There's also a very suspiciously named "jump tables in function section".... thing, down at the LLVM level; I didn't fully trace that through to identify where/how to turn it on, but this signature suggests it's also configurable per-function: https://github.com/llvm/llvm-project/blob/9a9aa41dea83039154601082b1aa2c56e35a5a17/llvm/include/llvm/CodeGen/TargetLoweringObjectFileImpl.h#L83-L84 (though maybe not at the rust source level).

I don't see a way to completely avoid hitting flash currently. (Me not seeing it doesn't mean there isn't a way)

The results I found were similar to yours; I thought there might be a path through LLVM, which leads me to:

cc: @sethp, I know its been a while since you've poked your head into esp-land, but I noticed you made this post: https://discourse.llvm.org/t/moving-jump-tables-sections-on-embedded-systems/70226 and I was wondering if you made any headway in your project.

Not the ELF post-processor or LLVM change [1], but I did find a way to route around it. I tried a few different rust source-level constructs to avoid the jump table, but the optimizer was too clever. That did lead me to the realization that we already have implemented a jump table at the rust source level, though: peripherals::__EXTERNAL_INTERRUPTS! And, since that's a static, #[link_section = "..."] works just fine for section placement.

So changing the troublesome match from:

match code {
    1 => interrupt1(...),
    ...
}

to something like:

HANDLER_VECTOR[code](...) // given HANDLER_VECTOR[1] == interrupt1 , etc.

ought to avoid the whole jump table placement question entirely. [2] Populating HANDLER_VECTOR is another story, as is bounds-checking, but that seemed the most viable path forward to me at the time.

Footnotes

  1. I didn't pursue it because realizing that change in our downstream project would have been a lot of work, but perhaps the situation is different for Xtensa? Teaching the TargetLoweringObjectFileELF to emit its jump tables into .rodata.jtable or something that can be targeted with a linker script should be straightforward, it's this one line that needs to change.

  2. Keen-eyed readers may note that looks an awful lot like the hardware's interrupt vector, modulo some trampolining. That's what feat: riscv-interrupt-{m,s} calling conventions rust-lang/rust#111891 was all about: avoiding start_trap_rust_hal entirely by slotting extern "riscv-interrupt-m" fn interruptN into the hardware vector directly. I have a prototype lying around on a disk somewhere that shows that with some weak symbols and a little more finessing with #[inline] it's possible to avoid saving more registers than the entire handling path uses, which can dramatically lower not just interrupt-to-entry latency but end-to-end interrupt handling latency overall for simple handlers (like something that queues async work for later?). If you'd like, I can dig up the details: unfortunately, I don't think I wrote down the path from #111891 through to the HAL anywhere public, but I should have some notes on it lying around.

@MabezDev MabezDev changed the title Interrupt latency using the vectored feature (especially Xtensa) Improve interrupt latency (especially on Xtensa) Jun 5, 2024
@MabezDev MabezDev self-assigned this Jun 12, 2024
@MabezDev MabezDev added this to the 0.19.0 milestone Jun 12, 2024
@MabezDev MabezDev assigned bjoernQ and unassigned MabezDev Jun 12, 2024
@bjoernQ (Contributor, Author) commented Jun 14, 2024

Had another look at this.

"Replace consts with statics if they're referenced in interrupt handling code" doesn't have a huge impact but increases RAM usage - we probably don't want to do this by default (it would mean a new feature, or we need the build-time configuration)

Not emitting jump tables has some impact on the initial invocation but hurts performance overall. Also, it's something only the binary crate can configure.

Moving jump tables etc. to RAM might be a good option but, again, it increases RAM usage, which should be opt-in/opt-out - we shouldn't just do it unconditionally.

@sethp (Contributor) commented Jun 14, 2024

It's a maze of tradeoffs, to be sure; for us the question was "how much RAM can we trade for latency & reduced variability," which made the approach feel a lot more crisp. A small note, though—the micro-benchmark you're using doesn't look like it'll exert any cache pressure, which is probably why your results differed from ours on the jump tables:

Not emitting jump tables has some impact on the initial invocation but impacts performance in a negative way overall.

Binary-searching through 32 possibilities takes 5 branches, whereas the table approach costs only a single load-dependent jump, which is cheaper if (and only if) the load is cached in SRAM. The trick is that a more involved program is likely to clobber the cache between interrupts, so every invocation looks a lot more like the initial one in the benchmark than the subsequent ones.

@bjoernQ (Contributor, Author) commented Jun 14, 2024

I will definitely get back to this once we have a sane way to do fine grained configuration by the user (i.e. not hundreds of Cargo features 😄)

@ProfFan (Contributor) commented Jun 28, 2024

I wonder how hard it would be to add a feature that puts the jump table in RAM? I am trying to build a Rust-based flight controller and keep bumping into unexpected latencies because I am running the two cores on the ESP32-S3 almost at full load, so as @sethp said I have huge cache pressure. Since we have 4 bytes per interrupt and at most 128 interrupts, a 512-byte RAM price is a great deal for deterministic latency.


@bjoernQ (Contributor, Author) commented Jun 29, 2024

Such functionality wouldn't be too hard to do via the linker script. It's just that we currently try to avoid introducing new Cargo features and want the build-time config system instead.

@bjoernQ (Contributor, Author) commented Jul 4, 2024

Going to close this in favor of #1747

@bjoernQ bjoernQ closed this as completed Jul 4, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in esp-rs Jul 4, 2024