bad double for loop optimization #8261
Comments
Slightly modified test code to digest / try to understand all of functions mentioned above: Is the issue with |
@mcspr This is interesting, but I'm not sure the repo is the right place to pose the question. Given that it seems you can isolate the issue to a single test routine, wouldn't the GCC bugs mailing list be more appropriate? https://gcc.gnu.org/bugs/ While we do have simple machine definition patches for the xtensa, I'm really not sure how they would even be tangentially related to the code you're showing. https://github.com/earlephilhower/esp-quick-toolchain/tree/master/patches/gcc10.3 Maybe a clean gcc10.3 build from scratch with the ESP8266 xtensa architecture headers, then a |
@mcspr Thanks for reviewing. The problem I had with creating a simple example of the original problem was that a certain amount of complexity appeared to be needed for a failure to occur. From empirical observations, *p16 and the 2-byte offset instead of 4 bytes result in too simple an example. Your example shows the correct answer for these and a failing result from pgm_read_word. If we increase the array size to 4 elements and align it to 4 to correct the out-of-bounds technicality, and keep the C function with 32-bit word access, this change can still demonstrate the failure without introducing Extended ASM. Extended ASM is also used in PROGMEM; I wanted to show that the issue was not with Extended ASM.
|
Oops, my bad. Sorry, my previous comment #8261 (comment) should have been directed to @mhightower83 . Rough Monday morning here! |
@mhightower83 true, the first two cases just try to 'fix' the access through the pointer and are not really supposed to fail (plus "memory" to circumvent the gcc caching, or whatever it does with memory accesses there; I don't have a great explanation for asm, as I mostly try to avoid it rather than use it :) I am leaning toward the idea that the optimization is correct in its assumptions, and that to actually read the values the code should change its approach. As for failing examples, consider the order of operations here (which is sort of similar to what happens in the pgm and plain-C variants):
uint16_t a[2] {1,2};
Serial.printf("unaligned 0x%08X\n", reinterpret_cast<uintptr_t>(&a[1]));
uintptr_t aligned = reinterpret_cast<uintptr_t>(&a[1]) & static_cast<uintptr_t>(~3);
Serial.printf("aligned 0x%08X\n", aligned);
a[0] = 3;
a[1] = 4;
uint32_t b = *reinterpret_cast<const uint32_t*>(aligned);
Serial.printf("%08X\n", b);
Serial.printf("a[0](%p)=%hu\n", &a[0], a[0]);
Serial.printf("a[1](%p)=%hu\n", &a[1], a[1]);
Placed into the setup:
And I would also agree with the gcc mailing list advice: https://gcc.gnu.org/lists.html |
Working through the "Before reporting that GCC compiles your code incorrectly, compile it with ..." at https://gcc.gnu.org/bugs/, I found that the As for the new issue with Extended ASM and the 10.3 compiler's read-caching optimization: the Extended ASM references memory through a pointer the compiler doesn't know about, and the data read may be stale because the compiler is still holding the current value in a register. My solution: don't use memory pointers in Extended ASM for reading or writing memory. For me, this answer on StackOverflow was useful: "What is the strict aliasing rule?" https://stackoverflow.com/a/99010
Unless I misunderstood something, it appears I have gone from bug to feature. |
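To make the stale-read scenario concrete, here is a minimal sketch of reading a 32-bit word that overlaps two uint16_t elements without breaking the strict-aliasing rule. The names load_u32 and store_then_load are hypothetical and exist only for this illustration; this is not code from the ESP8266 core.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of the ESP8266 core): read 32 bits from an
// object of another type. std::memcpy may copy the bytes of any object, so
// the compiler has to treat earlier stores to that object as visible.
static inline uint32_t load_u32(const void* p) {
    uint32_t out;
    std::memcpy(&out, p, sizeof(out));
    return out;
}

// Store through the uint16_t lvalues, then read both elements back as one word.
// With `*reinterpret_cast<const uint32_t*>(a)` in place of load_u32(), the
// compiler may assume the uint16_t stores cannot affect a uint32_t read.
uint32_t store_then_load(uint16_t (&a)[2], uint16_t first, uint16_t second) {
    a[0] = first;
    a[1] = second;
    return load_u32(a);
}
```

The memcpy form costs nothing in practice when the compiler can see the size is a constant 4 bytes, which is why it is the usual replacement for a cast-and-dereference.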
Huh. From the same page, there is a suggestion about unions, which are a special case for GCC:
uint16_t my_pgm_read_word(const void* p16) {
union AsU16 {
uint16_t value;
uint32_t aligned;
};
const auto* ptr = reinterpret_cast<const AsU16*>(p16);
return (*ptr).value;
}
Numbers are the same for the test array, and it also reads a PROGMEM pointer, as the name suggests. I still don't get how PROGMEM is involved though: if the issue is with changing values through different pointer types, and unless it's done like the test above, how is it changing the data on the flash / at that memory location? |
@mcspr Right, it occurred to me, while clearing my head, that PROGMEM functions, when used against flash, would not be an issue. It is when they are used against IRAM that trouble could follow, which is what my test simulated (however, it was really DRAM). So with that thought, I think the Also, I think it is clear that in the past many programs would have worked fine w/o following the My mind is numb and wants to stop, I'll close with this quick search result: |
Another very good read (so far... still reading it) What is the Strict Aliasing Rule and Why do we care? (a better title could have been "pun intended" by the way ;) ) |
Hmm, I've been reading a lot the last 2 days on this subject and I'm not sure if I ever dare to compile without the Can the pre-compiled pieces of code we link, like the WiFi code (and others?) perhaps also be compiled with this flag set? |
@TD-er only related to the code generation happening with the gcc10.3, which does not include SDK blobs. Specifically, the mmu_{get,set}... group of functions: Arduino/cores/esp8266/mmu_iram.h, lines 122 to 198 in 5f04fbb
It is always reading from a valid object though, so the pgm side of things, which only reads, seems to be fine. Writing and then reading is a problem though: unlike the pgm_... variants, writes to the pointer cause the optimizer to cut out code it deems unnecessary, as it treats the pointer-to-object-through-reinterpret-cast and the object itself separately. BTW, looking at the objdump of the union code above once again, it becomes just a single
static inline uint16_t __attribute__((always_inline))
get_uint16_plain_c(const void *p16) {
auto* ptr = reinterpret_cast<const void*>((uintptr_t)(p16) & (uintptr_t)(~3));
uint32_t out;
std::memcpy(&out, ptr, sizeof(uint32_t));
out >>= ((uintptr_t)(p16) & (uintptr_t)(3)) * 8;
return out;
}
(and, despite the pgmspace.h comment, memcpy is optimized into a 32-bit wide read, e.g. adding an external func to avoid the inline marker, and also with
00000000 <_Z19my_pgm_read_wordPKv>:
0: c37c movi.n a3, -4
2: 103230 and a3, a2, a3
5: 0338 l32i.n a3, a3, 0
7: 142020 extui a2, a2, 0, 2
a: 402200 ssa8l a2
d: 912030 srl a2, a3
10: f42020 extui a2, a2, 0, 16
13: f00d ret.n |
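The aligned-word technique above can be checked on a host machine. The sketch below is a hypothetical host-side copy (the name read_u16_via_word is not from the core). It assumes a little-endian machine, as on the ESP8266, and a 16-bit value that does not straddle a 4-byte boundary, since an offset of 3 would need a second word read.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical host-side copy of the aligned-read technique, for illustration.
// Assumes little-endian byte order and that (address & 3) is 0, 1, or 2.
static inline uint16_t read_u16_via_word(const void* p16) {
    const uintptr_t addr = reinterpret_cast<uintptr_t>(p16);
    const void* word = reinterpret_cast<const void*>(addr & ~static_cast<uintptr_t>(3));
    uint32_t out;
    std::memcpy(&out, word, sizeof(out));  // the listing above shows this as one l32i on Xtensa
    out >>= (addr & 3u) * 8u;              // move the requested bytes to the bottom
    return static_cast<uint16_t>(out);     // keep the low 16 bits
}
```

Given the four bytes 0x11, 0x22, 0x33, 0x44 at an aligned address, the word read is 0x44332211, so offsets 0, 1, and 2 yield 0x2211, 0x3322, and 0x4433 respectively.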
For this specific issue you found, sure, but the reason |
@TD-er It is all there is though Plus, future gcc11 does even more stuff related to aliasing, I'd wager to change assumptions related to object access is a better solution than trying to fight compiler authors. And optimizations could be enabled / disabled on the object file level by using optimize pragmas, if there are some definitive issues with the gcc implementation |
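The optimize pragmas mentioned above can be sketched as follows. This is a GCC-specific illustration, and the function name legacy_word_read is hypothetical. The read through the cast pointer is only well behaved because GCC is told not to apply strict-aliasing assumptions inside the region; other compilers may ignore the pragma.

```cpp
#include <cstdint>

// GCC-specific: everything between push_options and pop_options is compiled
// as if -fno-strict-aliasing had been given; the rest of the file is unchanged.
#pragma GCC push_options
#pragma GCC optimize ("no-strict-aliasing")

// Hypothetical legacy routine that still reads a uint16_t array through a
// uint32_t pointer. Assumes `a` is 4-byte aligned.
uint32_t legacy_word_read(uint16_t* a, uint16_t first, uint16_t second) {
    a[0] = first;
    a[1] = second;
    return *reinterpret_cast<const uint32_t*>(a);
}

#pragma GCC pop_options
```

GCC's documentation describes the related optimize function attribute as intended for debugging, so this is probably best treated as a stopgap for code that cannot be fixed yet, with a memcpy or union rewrite as the longer-term answer.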
Sure, if we know of specific code that does not work well with these kinds of optimizations, then excluding some optimizations per file is the best way to go. |
Since @TD-er and @mcspr thank you for those links and discussions. It was very helpful. To comply with strict-aliasing rules, I will mostly use An example code fragment complying with strict-aliasing rules using a
const uint32_t *icache_flash = (const uint32_t *)0x40200000u;
union { // to comply with strict-aliasing rules
image_header_t hdr; // total size 8 bytes
uint32_t u32; // we only need the 1st 4 bytes
} imghdr_4bytes;
// read first 4 bytes (magic byte + flash config)
imghdr_4bytes.u32 = *icache_flash;
/*
ICACHE memory read requires aligned word transfers. Because
imghdr_4bytes.hdr.flash_size_freq is a byte value, the GCC 10.3 compiler
tends to optimize out our 32-bit access for 8-bit access. If we reference
the 32-bit word from Extended ASM, this persuades the compiler to keep the
32-bit register load and extract the 8-bit value later.
*/
asm volatile ("# imghdr_4bytes.u32 => %0" ::"r"(imghdr_4bytes.u32));
flashchip->chip_size = esp_c_magic_flash_chip_size(imghdr_4bytes.hdr.flash_size_freq >> 4);
An example complying with strict-aliasing rules using
static inline __attribute__((always_inline))
uint8_t mmu_get_uint8(const void *p8) {
void *v32 = (void *)((uintptr_t)p8 & ~(uintptr_t)3u);
uint32_t val;
__builtin_memcpy(&val, v32, sizeof(uint32_t));
asm volatile ("" ::"r"(val)); // helps ensure we keep the 32-bit load operation
uint32_t pos = ((uint32_t)p8 & 3u) * 8u;
val >>= pos;
return (uint8_t)val;
}
And a thank you to @jjsuwa-sys3175 for introducing me to the concept of "injecting dependency" by way of unused input variables in Extended ASM. |
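A write-side counterpart to the read helper above might look like the following read-modify-write sketch. It is hypothetical (the name sketch_set_uint8 is made up here, and this is not the code from mmu_iram.h). It assumes a little-endian target and a GCC-compatible compiler for the empty Extended ASM statement.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical sketch, not the code from mmu_iram.h. Assumes little-endian
// byte order and a GCC-compatible compiler (for the empty asm statement).
static inline uint8_t sketch_set_uint8(void* p8, uint8_t val) {
    const uintptr_t addr = reinterpret_cast<uintptr_t>(p8);
    void* v32 = reinterpret_cast<void*>(addr & ~static_cast<uintptr_t>(3));
    const uint32_t pos = (addr & 3u) * 8u;           // bit offset of the byte in its word

    uint32_t word;
    std::memcpy(&word, v32, sizeof(word));           // aliasing-safe 32-bit read
    asm volatile ("" : "+r"(word));                  // inject a dependency on the full word

    word &= ~(static_cast<uint32_t>(0xFFu) << pos);  // clear the target byte
    word |= static_cast<uint32_t>(val) << pos;       // insert the new value
    std::memcpy(v32, &word, sizeof(word));           // aliasing-safe 32-bit write
    return val;
}
```

Whether the compiler keeps both transfers 32 bits wide on the real target would still need to be confirmed in the disassembly, as was done for the read path earlier in this thread.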
These changes are needed to address bugs that can emerge with the improved optimization from the GCC 10.3 compiler.
Updated performance inline functions `mmu_get_uint8()`, ... and `mmu_set_uint8()`, ... to comply with strict-aliasing rules. Without this change, stale data may be referenced. This issue was revealed in discussions on #8261 (comment)
Changes to avoid over-optimization of 32-bit wide transfers from IRAM, turning into 8-bit or 16-bit transfers by the new GCC 10.3 compiler. This has been a reoccurring/tricky problem for me with the new compiler. So far referencing the 32-bit value loaded by way of an Extended ASM R/W output register has stopped the compiler from optimizing down to an 8-bit or 16-bit transfer. Example:
```cpp
uint32_t val;
__builtin_memcpy(&val, v32, sizeof(uint32_t));
asm volatile ("" :"+r"(val)); // inject 32-bit dependency
...
```
Updated example `irammem.ino`
* do a simple test of compliance to strict-aliasing rules
* For `mmu_get_uint8()`, added tests to evaluate if 32-bit wide transfers were converted to an 8-bit transfer.
Basic Infos
Platform
Settings in IDE
Problem Description
With the newer "xtensa-lx106-elf-gcc (GCC) 10.3.0" I have had problems from time to time that I dismissed as errors in my use of Extended ASM. It now appears that may not have been completely true. The MCVE below builds and executes correctly with -O0, or with the old "xtensa-lx106-elf-gcc (GCC) 4.8.2" compiler with -O0 or -Os.
However, for the case of "xtensa-lx106-elf-gcc (GCC) 10.3.0" and -Os, it generates bad code that does not properly update the array as it executes a double for loop.
This MCVE is an extreme minimization of the original bearssl experiment that was failing for me. There is always a chance I minimized out the failure case and added my own bug. However, I think I got everything cleaned up.
MCVE Sketch
To keep the compiler from defeating our test by optimizing straight to an answer, place this code with the sketch in a separate .c file:
Debug Messages
Sketch output when working will look something like this:
Sketch failing output will look something like this (-Os):
Working code generated under Arduino ESP8266 Core 2.7.4
xtensa-lx106-elf-gcc (GCC) 4.8.2 with -Os option
Disassembly of section .text.test:
Failing code generated under Arduino ESP8266 Core current master
xtensa-lx106-elf-gcc (GCC) 10.3.0 with -Os option
Disassembly of section .text.test: