feat(riscv+esp32c3): demonstrate gpio_interrupt example entirely in SRAM #534
Previously, the trap vector itself and its immediate callees (`_start_trap` and `_start_trap_rust_hal`) were located in the mapped instruction flash range `0x420..`, increasing cache pressure and adding variable latency to the very beginning of the interrupt/exception service flow. This change places those routines in IRAM directly:

```
  Num:    Value  Size Type    Bind   Vis      Ndx Name
48177: 40380280  2428 FUNC    GLOBAL DEFAULT    6 _start_trap_rust_hal
48197: 40380bfc    54 FUNC    GLOBAL DEFAULT    6 _start_trap_rust
48265: 40380200     0 FUNC    GLOBAL DEFAULT    6 _vector_table
48349: 40380100     0 NOTYPE  GLOBAL DEFAULT    6 default_start_trap
48350: 40380100     0 NOTYPE  GLOBAL DEFAULT    6 _start_trap
```

As seen via `readelf -W -s -C ./target/riscv32imc-unknown-none-elf/debug/examples/gpio_interrupt | grep -E _start_trap\|_vector\|Ndx`.
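For illustration, a minimal linker-script sketch of the kind of placement involved. The output-section and memory-region names (`.trap`, `RWTEXT`) are assumptions based on esp-hal's conventions; the scripts in the actual commit are the authoritative version:

```ld
/* Sketch: collect the trap machinery into a RAM-backed output section.
 * `RWTEXT` stands in for whichever SRAM-resident, executable region the
 * chip's memory.x defines. */
SECTIONS
{
  .trap : ALIGN(4)
  {
    KEEP(*(.trap));
    *(.trap.*);
  } > RWTEXT
}
```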
Previously, the `INTERRUPT_TO_PRIORITY` constant would compile down to a runtime lookup table called `anon.HASH.2.llvm.NUMBERS`, like so:

```asm
status & configured_interrupts[INTERRUPT_TO_PRIORITY[cpu_intr as usize - 1]];
...
40380fbc: 3c0095b7   lui    a1,0x3c009
40380fc0: 2d858593   addi   a1,a1,728 # 3c0092d8 <anon.73967c570b981c3046610ed0622f71c3.2.llvm.13737670072012269642>
40380fc4: 952e       add    a0,a0,a1
40380fc6: 4108       lw     a0,0(a0)
```

This means that in order to load the `a1`th slot from the LUT, we'll have to go out to flash to do it. One alternative is an `#[inline]`'d function, which is nice in the 'c3's case because it's the identity function and disappears at runtime. Alternatively, instead of using a `const`, `INTERRUPT_TO_PRIORITY` could be made a `static`, which would allow us to place it into the `.data` section via a linker directive.
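As a sketch of those two alternatives (the names, sizes, and values below are illustrative, not the actual HAL items):

```rust
// Option 1: an #[inline]'d function; on the 'c3 the mapping is the
// identity, so this disappears entirely at runtime.
#[inline(always)]
fn interrupt_to_priority(cpu_intr: usize) -> usize {
    cpu_intr
}

// Option 2: keep the table, but as a `static` whose section a linker
// directive can steer into SRAM (the section name here is hypothetical).
#[link_section = ".data.interrupt_to_priority"]
static INTERRUPT_TO_PRIORITY: [u8; 31] = [0; 31]; // placeholder values
```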
Prior, the code for `intrinsics::cttz` (e.g. `u128::trailing_zeros()`) was being inlined into the call site, but that code was referencing a lookup table from the `.rodata` section (placed in flash):

```
40380xxx: 3c0066b7   lui    a3,0x3c006
40380xxx: ed168293   addi   t0,a3,-303 # 3c005ed1 <str.1+0x81>
40380xxx: 03810813   addi   a6,sp,56
...
40380xxx: 41e00633   neg    a2,t5
40380xxx: 00cf7633   and    a2,t5,a2
40380xxx: 03160633   mul    a2,a2,a7
40380xxx: 826d       srli   a2,a2,0x1b
40380xxx: 9616       add    a2,a2,t0
40380xxx: 00064603   lbu    a2,0(a2)
```

Caution: the `str.1` symbol name there is a red herring. `str.1` has a length of just 0x21 bytes; there's seemingly no symbolic name associated with the jump table, and `str.1` happens to be the closest preceding label.

I can't claim credit for the `ctz32` algorithm here: all I've done is copy it from Wikipedia, translate the operations to the Rust equivalents, and then apply it four times to extend it to a `u128`. I have tested it both for `0b0..010..0` (single bit set) and `0xff...100` (all non-trailing bits set) for every bit position, which did catch at least two mistakes in the translation.
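For reference, here is a table-free `ctz32` in the spirit of the commit; this is the branch-free variant from Wikipedia's "Find first set" material translated to Rust, and not necessarily the exact variant the commit uses:

```rust
/// Count trailing zeros without touching a lookup table.
fn ctz32(v: u32) -> u32 {
    let v = v & v.wrapping_neg(); // isolate the lowest set bit
    let mut c: u32 = 32;
    if v != 0 {
        c -= 1;
    }
    if v & 0x0000_ffff != 0 {
        c -= 16;
    }
    if v & 0x00ff_00ff != 0 {
        c -= 8;
    }
    if v & 0x0f0f_0f0f != 0 {
        c -= 4;
    }
    if v & 0x3333_3333 != 0 {
        c -= 2;
    }
    if v & 0x5555_5555 != 0 {
        c -= 1;
    }
    c
}
```

Extending this to `u128` amounts to applying it to each 32-bit limb from the bottom up, adding 32 for every all-zero low limb.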
Previously, as RODATA this was getting placed in flash, so this lookup via `interrupt::riscv::vectored::handle_interrupt`:

```rust
let handler = peripherals::__EXTERNAL_INTERRUPTS[interrupt as usize]._handler;
```

was compiling to:

```
40380f1e: 3c006537   lui    a0,0x3c006
40380f22: 63c50c93   addi   s9,a0,1596 # 3c00663c <__EXTERNAL_INTERRUPTS>
...
40381112: 9566       add    a0,a0,s9
40381114: 410c       lw     a1,0(a0)
```

which dereferences a pointer into the flash segment `0x3c0...`. After this change to apply the linker directive from the corresponding [commit to the pac crate][pac-commit], the object now lives in RAM:

```
  Num:    Value  Size Type    Bind   Vis      Ndx Name
48188: 3fc80608   248 OBJECT  GLOBAL DEFAULT    9 __EXTERNAL_INTERRUPTS
```

As seen by `readelf -W -s -C ./esp32c3-hal/target/riscv32imc-unknown-none-elf/debug/examples/gpio_interrupt | grep -E __EXT\|Ndx`.

[pac-commit]: rustbox/esp-pacs@c8e7cbb
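The pac-side directive looks roughly like the sketch below. The `Vector` union follows svd2rust's usual shape, but the section name and array length here are assumptions, not the literal esp-pacs output:

```rust
#[derive(Clone, Copy)]
pub union Vector {
    pub _handler: unsafe extern "C" fn(),
    pub _reserved: usize,
}

// The added attribute steers the table out of `.rodata` (flash) and
// into a RAM-backed section that the HAL linker scripts can place.
#[link_section = ".trap.rodata"]
#[no_mangle]
pub static __EXTERNAL_INTERRUPTS: [Vector; 62] =
    [Vector { _reserved: 0 }; 62]; // the real entries point at handlers
```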
This change gets a fair bit less clear: removing `#[ram]` from `handle_interrupt` does indeed change the resulting binary (because it implicitly removes `#[inline(never)]`), but `get_core` is so small a function that LLVM will always inline it at every optimization level I've tried, whether it's marked `#[inline]` or not. So rather than a specific solution to an actual problem, this patch is more of an example and a warning for what's to come: lacking a precise mechanism to trace and identify properties of all callees along a particular path, everything from here on out is highly dependent on relatively distant effects. In attempting to assert whether "the trap handling flow is entirely located in SRAM," we necessarily have to qualify the answer with "given these optimization settings" (and, soon, much worse). If correctness here is "not touching flash," then, in the usual language of Rust, we're about to leave attempts at demonstrating soundness behind. Instead, we now shift our focus to showing safety under a particular set of (somewhat fragile) circumstances.
As with the prior change, the problem we're solving here is shrinking the state space: there are many branch points within the optimization machinery, and these settings align with what I've been looking at most thoroughly. NB: I have no idea what effect, if any, the Cargo.toml settings have on downstream crates. Forcing frame pointers in `.cargo/config.toml` definitely has no effect.
Removes some flash accesses that occur as the result of an optimization when mapping a sequence of integers to an action, such as calling handler `interrupt7` for interrupt code 7. Previously, this would compile to something similar to:

```asm
40380538: 34202573   csrr   a0,mcause
4038053c: 0506       slli   a0,a0,0x1
4038053e: 8105       srli   a0,a0,0x1
          match code {
40380540: 157d       addi   a0,a0,-1
40380542: 45f9       li     a1,30
40380544: 2ca5e163   bltu   a1,a0,40380806 <.LBB2_7+0x2ae>
40380548: 050a       slli   a0,a0,0x2
4038054a: 3c0075b7   lui    a1,0x3c007
4038054e: 1dc58593   addi   a1,a1,476 # 3c0071dc <.LJTI2_0>
40380552: 952e       add    a0,a0,a1
40380554: 4108       lw     a0,0(a0)
40380556: 8502       jr     a0
```

followed by a bunch of entries corresponding to various `interruptN` calls. `.LJTI2_0` there is stored in flash, and so this requires some amount of flash access and/or induces cache pressure. Unfortunately, LLVM's ELF lowering is fairly determined to emit these jump tables as part of the [`.rodata` section][elf-lowering-jt-sec], which makes them difficult to relocate to SRAM on their own while preserving the optimization. So this instead disables the optimization globally (for all crates and all paths), but only for examples built directly from the HAL (downstream crates would need to set this setting themselves, as with force-frame-pointers).

[elf-lowering-jt-sec]: https://llvm.org/doxygen/TargetLoweringObjectFileImpl_8cpp_source.html#l00942
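Concretely, the knob is a rustflag; a sketch assuming rustc's unstable `-Zno-jump-tables` flag (nightly-only) is the setting in question, which downstream crates would likewise need to add to their own `.cargo/config.toml`:

```toml
# .cargo/config.toml
[target.riscv32imc-unknown-none-elf]
rustflags = [
    # Ask LLVM not to lower dense matches into .rodata jump tables.
    "-Z", "no-jump-tables",
]
```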
Addresses two classes of icache thrash present in the interrupt service path, e.g.:

```asm
          let mut prios = [0u128; 16];
40380d44: ec840513   addi   a0,s0,-312
40380d48: 10000613   li     a2,256
40380d4c: ec840b93   addi   s7,s0,-312
40380d50: 4581       li     a1,0
40380d52: 01c85097   auipc  ra,0x1c85
40380d56: 11e080e7   jalr   286(ra) # 42005e70 <memset>
```

and

```asm
          prios
40380f9c: dc840513   addi   a0,s0,-568
40380fa0: ec840593   addi   a1,s0,-312
40380fa4: 10000613   li     a2,256
40380fa8: dc840493   addi   s1,s0,-568
40380fac: 01c85097   auipc  ra,0x1c85
40380fb0: eae080e7   jalr   -338(ra) # 42005e5a <memcpy>
```

As an added bonus, performance of the whole program improves dramatically with these routines 1) reimplemented for the esp32 RISC-V µarch and 2) in SRAM: `rustc` is quite happy to emit lots of implicit calls to these functions, and the versions that ship with compiler-builtins are [highly tuned] for other platforms. It seems like the expectation is that the compiler-builtins versions are "reasonable defaults," and they are [weakly linked] specifically to allow the kind of domain-specific overrides applied here. In the context of the 'c3, the default ends up producing a fairly large implementation that adds frequent cache pressure for minimal wins:

```readelf
  Num:    Value  Size Type    Bind   Vis      Ndx Name
27071: 42005f72    22 FUNC    LOCAL  HIDDEN     3 memcpy
27072: 42005f88    22 FUNC    LOCAL  HIDDEN     3 memset
28853: 42005f9e   186 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memcpy
28854: 42006058   110 FUNC    LOCAL  HIDDEN     3 compiler_builtins::mem::memset
```

NB: these implementations are broken when targeting unaligned loads/stores across the instruction bus; at least in my testing this hasn't been a problem, because they are simply never invoked in that context. Additionally, these are just about the simplest possible implementations, with word-sized copies being the only concession made to runtime performance. Even a small amount of additional effort would probably yield fairly massive wins, as three- or four-instruction hot loops like these are basically pathological for the 'c3's pipeline implementation, which seems to predict all branches as "never taken." However, there is a real danger in overtraining on the microbenchmarks here, too, as I would expect almost no one has a program whose runtime is dominated by these functions. Making these functions larger and more complex to eke out wins from architectural niches makes LLVM much less willing to inline them, costing additional function calls and preventing e.g. dead code elimination for always-aligned addresses, automatic loop unrolling, etc.

[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)
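For a sense of scale, the whole override can be as small as the sketch below: word-sized copies only, non-overlapping `dest`/`src` assumed (as `memcpy` requires), and placed in SRAM via esp-hal's `#[ram]` attribute. This is an illustration of the approach, not the exact code in this change, and the import path for `#[ram]` is an assumption:

```rust
use esp_hal_common::ram; // assumed import path for esp-hal's #[ram] macro

// Word-at-a-time memcpy, overriding the weakly-linked compiler-builtins
// symbol so that the linker picks this SRAM-resident version instead.
#[ram]
#[no_mangle]
unsafe extern "C" fn memcpy(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    let mut i = 0;
    // The only concession to performance: copy four bytes at a time
    // when both pointers are word-aligned.
    if (dest as usize | src as usize) & 3 == 0 {
        while i + 4 <= n {
            (dest.add(i) as *mut u32).write((src.add(i) as *const u32).read());
            i += 4;
        }
    }
    while i < n {
        dest.add(i).write(src.add(i).read());
        i += 1;
    }
    dest
}
```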
The culmination of this entire patch series: we now place the interrupt handling code itself into SRAM. The `#[ram]` attribute is for the `GPIO` function itself (and `#[inline(never)]` works fine here because LLVM can't "see" through the ~indirect reference to this function), but that leaves the matter of the function body itself. Unfortunately, this is necessarily work specific to the details of each handler's implementation. Elsewhere, I got started down this road because we needed SPI to start up within a certain time bound, and so we needed to add attributes to much of the HAL's SPI implementation. And, in addition to `memset` and `memcpy`, we had to override `memmove` as well, because something about the data structures used for DMA made `rustc` use that in a few places.

As far as this particular example goes, touching `core::fmt` pulls in a lot of code from fairly far outside our known methods of control:

```
          write(&mut self, args)
40380f00: 3c006537   lui    a0,0x3c006
40380f04: 7a850513   addi   a0,a0,1960 # 3c0067a8 <.Lanon.6ccc2de621f5491d2d69154e15b03ba0.16>
40380f08: fca42c23   sw     a0,-40(s0)
40380f0c: 4505       li     a0,1
40380f0e: fca42e23   sw     a0,-36(s0)
40380f12: 3c006537   lui    a0,0x3c006
40380f16: 6a450513   addi   a0,a0,1700 # 3c0066a4 <anon.f0eba740dd2cb5e3ab050cbaac366ec5.84.llvm.5894916478550538780>
40380f1a: fea42023   sw     a0,-32(s0)
40380f1e: fe042223   sw     zero,-28(s0)
40380f22: fe042423   sw     zero,-24(s0)
40380f26: 3c006537   lui    a0,0x3c006
40380f2a: 5fc50593   addi   a1,a0,1532 # 3c0065fc <anon.7f39784d2cd27331f7ef17a88d595274.0.llvm.10122694886931673864>
40380f2e: fd440513   addi   a0,s0,-44
40380f32: fd840613   addi   a2,s0,-40
40380f36: 01c82097   auipc  ra,0x1c82
40380f3a: 92a080e7   jalr   -1750(ra) # 42002860 <core::fmt::write>
```

So, this change circumvents the convenience machinery to call `write_str` directly, thereby avoiding touching `core::fmt::write` and its anonymous helper data. Curiously, with the combination of optimization settings in use, doing so actually erases the `"GPIO Interrupt\n"` string entirely in favor of an inlined, unrolled loop emitting each byte in succession:

```asm
40380efe: 40000537   lui    a0,0x40000
40380f02: 06850493   addi   s1,a0,104 # 40000068 <ets_delay_us+0x18>
40380f06: 04700513   li     a0,71
40380f0a: 9482       jalr   s1
40380f0c: 05000513   li     a0,80
40380f10: 9482       jalr   s1
...
```

(For those, like me, who have not memorized the decimal→ASCII lookup table: 71 is 'G' and 80 is 'P'.)

NB: LTO is _required_ at this stage: turning it off produces calls to things like `_critical_section_1_0_{acquire,release}` and some `Mutex` interactions. The latter might be fixable, but I didn't look into it because the former is seemingly not: the `extern "Rust"` seems to form a boundary that currently can only be pierced by applying LTO.
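The circumvention itself is small; a sketch of the shape, written against any `core::fmt::Write` implementor rather than the example's concrete writer type (which isn't named above):

```rust
use core::fmt::Write;

// Sketch: emit the message via `write_str` on a concrete `Write`
// implementor, instead of `writeln!`, which would route through
// `core::fmt::write` and its flash-resident anonymous helper data.
fn report<W: Write>(out: &mut W) {
    let _ = out.write_str("GPIO Interrupt\n");
}
```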
relates to #43
@sethp thanks for looking into this! This is really useful experimentation. I won't address the whole discussion at once as there is a lot to digest.

Commits 1 & 3 seem like no-brainers, and I don't think they will have any issues being merged. Commit 2 looks okay, but the commit seems to have an undocumented change in that it is no longer subtracting one from the interrupt number.

@onsdagens has also been looking into improving interrupt performance for the C3 for the RTIC port. We found that we're not actually using CPU vectoring, which could eliminate this lookup & function call completely; have you looked into that at all?
Sure! We can start via Matrix if that works for you; feel free to DM me or @bjoernQ. If we feel the need to jump on a call later, we certainly can.
Ohhh yes, this is a very excellent point: a much better strategy than turning off jump tables is to use the existing hardware jump table (at least for the HAL entry bits; IIRC we still had more jump tables to deal with inside the ISR's callees). I have a prototype of this somewhere that I'll try to dust off this week.
Oh, good thought: I'll look into this too. It's a little trickier to override since it's not in
I wonder if it would be possible to place intrinsics code (if not inlined) and data into SRAM via some black magic in linker scripts?

Also, as a side note: probably all this should be behind a feature flag (maybe a default feature). The thing is that e.g. in the matter-rs experiment we are already hitting RAM limits (increasing some network buffers by just a few bytes makes it stop working currently). For sure there is a lot of potential for optimization there, but for things like that, interrupt latency isn't much of a concern while RAM usage is.
I'm not sure about the code, but it's inlined here anyway: the trouble is the lookup table that code uses. That table is seemingly a totally unnamed symbol, and there doesn't appear to be any linker invocation of the shape "referents from this named root" (which would be handy for making basically everything in this PR unnecessary!). The closest I could see from linker-land was "take everything in the `.rodata` section of the libcore object and put it in this output section," which I didn't really consider because I just assumed that would use too much RAM to be useful.

Speaking of, your addendum is well taken: I agree that putting these allocations behind some kind of conditional would make sense (especially as these chips seem quite capable of executing interrupt handlers entirely from flash). I'm a little hesitant to use a cargo feature for it since the granularity there is fairly coarse. I wonder if we could defer the question? It looks like the amount allocated here isn't very much in a relative sense (about 1% each of DRAM and IRAM):
vs the baseline from the hal's current main:
(these were generated with the very rough, very 'c3-specific tool I've been using for analyzing allocations; I threw it up here if you want to check it out: https://github.com/rustbox/supreme-train/blob/main/src/bin/mem.rs)
Happy Friday! I've got a brief update for you on two items here:

### Hardware Vectoring

A draft of the hardware vectoring PoC is up, which at this point is almost entirely in esp-riscv-rt: esp-rs/esp-riscv-rt#11. It works, but I was hoping to spend a little more time carrying the thought through to the esp-hal crate before merging, since there's a whole release thing involved. I'm fairly excited about how that's working out, though: I think it'll offer a nice way to sidestep some of the thornier issues in this PR, at least for the moment, if we do carry the vectoring all the way "through" instead of fanning back in.

### svd2rust

Also, I'm feeling particularly clever about moving a little away from svd2rust: I got the ball rolling on integrating the linker directive for `__EXTERNAL_INTERRUPTS`.

I'm also on Matrix now.
Jumping back into this thread after the initial merges + some travelling on my end, to address the other parts you brought up :).

I am on board with adding optimized memcpy routines (b511568); I think we'd see a net benefit in performance across all applications for little cost. Perhaps one experiment you may wish to try would be to use the built-in ROM memcpy routines: the CPU should have one-cycle access to ROM code (though I've asked internally for some clarification on this). I'm also not sure if ROM code is executed in place or is pulled into the cache. If it avoids hitting the cache whilst having the same execution performance as SRAM, that could be a win.
This pains me too, but sadly it seems we need language-level help to overcome this. I've seen a couple of folks poke around the rustc formatting machinery, and subsequently burn out pretty fast. I think the only person with recent success there was @m-ou-se, though I forget the exact PR they submitted.

You are very well versed in Rust and all its intricacies, so I presume you have tried this, but how does
I've confirmed that ROM code is executed in place and won't touch the cache.
@sethp any thoughts on my comments above?
Closing this due to inactivity. Please feel free to re-open this PR, or to open a new one, if you wish to return to this work in the future.
As mentioned over on #448, the current RISC-V HAL interrupt flow isn't memory-resident, with large portions of it living in flash.
This change set is not intended to be merged as-is, or probably at all: it's more of a "spike" or "steel thread" demonstrating the areas of work I've been able to identify in order to meet the goal of "[an] interrupt request is serviced without touching flash memory." My friend & I needed to do this because we have a hard-real-time interrupt that fires at ~480Hz (60Hz * 8), which made for some 🌶️ contention at the cache layer.
As usual with these things, a lot of the wins can be had for relatively little effort. I've tried to structure these changes in increasing order from least implementation/maintenance burden to most, with each commit message giving a detailed explanation of what problem I believed I was solving at each step. Happily, they're also largely in sync with the size of the wins (with the exception of the `mem*` friends, discussed later).

What I'm hoping for from this discussion is mostly just to share my findings, and to identify where we might already have consensus on what the next steps ought to be (if any). In an effort to apply Cunningham's Law, my current beliefs about the disposition and direction of these changes are below.
### ready for (early) review

- fa671bb is something I intend to replicate in all the other RISC-V linker scripts, at which point that change will be ready to be merged into the mainline HAL: feat(riscv): relocate `.trap` machinery to RAM #541
- Converting the `const` slices to inline-able functions. I don't yet have any 'c6s to try out the PLIC side of things, though I did order some for that purpose.
- The `__EXTERNAL_INTERRUPTS` relocation depends on the `esp-pacs` generator relocating that object to either `.trap.rodata` (my preference) or `.data`. If it's `.trap.rodata`, that means making a change at this point to the linker scripts that live here in `esp-hal` to emit objects from that section; if it's `.data`, the existing scripts already handle it and it's definitely not breaking.

### here be dragons
- The remaining changes belong, at most, in an `esp32c3/src/examples` directory. They do demonstrate just how tricky it is to 1) predict and 2) validate whether a particular call graph is memory resident or not. Initially, I believed inspecting the Rust functions and throwing a few `#[inline(always)]`s around would be enough to ensure that property under a wide range of conditions, but what I found was... not that.
- Replacing the `memcpy` implementation is significantly more difficult than it first appears. Regrettably, it's just so frickin' good: `memcpy` (and to a lesser degree, `memset`, `memmove`, and `memcmp`) show up a lot. Even at the highest optimization levels, there are 9 `memcpy` call sites in the `gpio_interrupt` example alone. In my downstream project, we've got ~50 `memcpy` calls that span the gamut from "one time, at startup" to "hottest of the hot path." Implementing that family of functions ourselves was a tremendous performance win across the board, and is well worth the high price of maintenance, at least for us.
- The `esp-hal` side of the IRQ flow only uses memset/memcpy when dealing with the `[u128; 16]` array that represents all configured interrupts, so it seems plausible that finding a different way to represent that data structure, or fetching the priorities a few at a time instead of all at once, might dodge the use of those functions, which here would allow us to meet the "nothing in flash" requirement without taking ownership of them.
- `core`. The first and most immediate place that arises is, as here, in string formatting. However, it's not actually possible to work around in that way, because formatting an error message is a huge part of calls to `panic!`, most of which get emitted automatically for e.g. bounds checks on arrays, `unwrap` calls, arithmetic overflow (when enabled), etc. Mostly this is 🔥 fine 🔥 because `panic!`s are generally non-recoverable, so taking extra time to load a string from flash right before the CPU goes into a sleep-until-reset loop isn't really a big deal. That said, it feels to me like a potentially large area of future work, with an uncertain scope and level of risk. I'm very curious what y'all's thoughts are on both how important it is to get the panic handling bits fully in RAM and whether there are any strategies that come to mind for how to do it.

### caveats
- Not addressed here at all: `EspDefaultHandler`, `ExceptionHandler`, and `DefaultInterruptHandler`. Those'd be separate PRs against esp-riscv-rt and esp-backtrace (probably), but I haven't put much thought yet into helping out downstream crates that might provide their own implementations.

Phew, that was a whole lot! Let me know if there's a time y'all want to hop on a video call or if there's a semi-synchronous chat I should join to go through this together! I don't expect this to be our last opportunity to connect, though, so no pressure from my end either way.
cc @MabezDev
Footnotes
1. I can't for the life of me identify the object in the ELF, but from looking at the hex dump of `.rodata` it's tucked in between two fairly obvious strings that at least provide an upper bound on the size. ↩