Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Aliasing] Prevent aliasing issues + remove C-style casts #3765

Merged
merged 5 commits into from
Aug 15, 2021

Conversation

TD-er
Copy link
Member

@TD-er TD-er commented Aug 14, 2021

  • [Build] Add -fno-strict-aliasing as build flag make builds more stable
  • [Casts] Change C-style (uint8_t*) casts into reinterpret_cast
  • [Casts] Change C-style casts (int) and (float) into static_casts

TD-er added 4 commits August 14, 2021 12:07
It may prevent optimisations where the compiler may cause issues when allocated memory is not aligned for 4-byte access when storing/reading int from a `char array` allocated memory block for example.
@TD-er TD-er changed the title [Aliasing [Aliasing] Prevent aliasing issues + remove C-style casts Aug 14, 2021
@TD-er
Copy link
Member Author

TD-er commented Aug 14, 2021

Would really appreciate if one or two people could check the commits here to see if I made obvious errors in replacing the C-style casts.

@tonhuisman
Copy link
Contributor

Besides that 1 remark, no obvious mistakes that I can detect 👍

@TD-er TD-er merged commit e60b20d into letscontrolit:mega Aug 15, 2021
@TD-er TD-er deleted the build/Aliasing branch August 15, 2021 21:48
@mcspr
Copy link
Contributor

mcspr commented Aug 16, 2021

re: esp8266/Arduino#8261 (comment)
does this -fno-strict-aliasing patch actually... help with anything? with String, crashes, incorrect results in sensor drivers, etc.

@TD-er
Copy link
Member Author

TD-er commented Aug 16, 2021

@mcspr The only thing we can do is actually try it and see if we still encounter "strange behavior".
Like I said, every now and then (with the core 3.0.x a lot more frequent) there is a build which does crash seemingly random.
Sadly I don't know any other way to test the effectiveness of this flag by concluding after some time that we have not seen any of such crashes in a long time.
When developing, I always flash directly via serial, so the in another issue mentioned stability issues after OTA flashing will not play a part in my testing.
Last few days I did build a number of test builds with my core 3.0.x PR and none of those resulted in those strange crashes.
So the only thing I can conclude right now is that it doesn't harm my builds and it is not yet proven to "not work".

The main problem I'm having with those hard to reproduce crashes is that I truly have no clue why they happen.
I did report quite a few decoded stacks, but quite a few were dismissed by the esp8266/Arduino folks as unusable, or inconclusive. So I'm not even sure whether I can trust those decoded stacks.
But they do seem to point to crashes in either de destructor of String or trying to access the wrong kind of memory.
Both can be caused by alignment issues, but also if a String object is not constructed well. (e.g. try to "emplace_back" an object with String members to a std::list where memory allocation may be failing)

@mcspr
Copy link
Contributor

mcspr commented Aug 16, 2021

The main problem I'm having with those hard to reproduce crashes is that I truly have no clue why they happen.
I did report quite a few decoded stacks, but quite a few were dismissed by the esp8266/Arduino folks as unusable, or inconclusive. So I'm not even sure whether I can trust those decoded stacks.

Memory errors are inconclusive though, and if it's something like that it is pretty hard to track only by the trace itself. btw, I'd
still strongly suggest using -DUMM_POISON_CHECK -DDEBUG_ESP_OOM -DDEBUG_SERIAL_PORT=Serial if serial debug is possible, at least that might catch some oob writes / reads issues at a much earlier point. Only possible locally, though

Originally, I thought of reinterpret_casts of F(...) -> string, but those seem to be correct b/c conversion happens const char* -> flashstringhelper* -> const char* again, so it does not seem to violate any rules regarding the object(s) lifetime.

...try to "emplace_back" an object with String members to a std::list where memory allocation may be failing)

Could only find a single entry of std::list, were the traces pointing there?
https://github.com/letscontrolit/ESPEasy/blob/54bf060/src/src/DataStructs/EventQueue.h

btw, I would not mind at-ping if the issue happens again (since I am not subscribed to the repo). I do suspect some linker issues with pio builds, since I happened on some weird problem where F(...) / PSTR(...) macros returned nullptr, which broke strlen_P by dereferencing nullptr. Hard to detect as well, and probably happens when it decides to keep the existing build directory....

@TD-er
Copy link
Member Author

TD-er commented Aug 16, 2021

Hmm nullptr when using F()/PSTR() is an interesting one.
If strlen_P breaks on a nullptr, then that should be fixed IMHO.
Are those crashes nullptr crashes, or an uninitialized pointer?

I have suspected PIO linker issues in the past, but was never able to prove it.
It would explain why a clean rebuild does work, but there are many explanations possible for it.

And the emplace_back is no longer present in my code, because of those crashes when running low on memory.
It was just an example of things going horribly wrong leading to crashes in seemingly unrelated parts of the code.
The code (and number of checks in my code) has been improved quite a lot, with lots of checks on available memory or low memory usage in mind.

@TD-er
Copy link
Member Author

TD-er commented Aug 16, 2021

Oh almost forgot...
Not sure if that's also related to "bad linking" or "incomplete rebuild", but I do have the feeling functions with the same name, only differing in the signature, are not always linked as intended.

For example:

void foo(const char* text);
void foo(const __FlashStringHelper *  text);

// Call the function:
foo(F("bar"));

If you use _P functions in the one with the flash string helper and the regular ones with the simple char pointer, you would expect things to work as intended.
However you may still get crashes related to using non-_P functions on flash strings.
As if the wrong function was linked.

@mcspr
Copy link
Contributor

mcspr commented Aug 16, 2021

Hmm nullptr when using F()/PSTR() is an interesting one.
If strlen_P breaks on a nullptr, then that should be fixed IMHO.
Are those crashes nullptr crashes, or an uninitialized pointer?

something like that. under unknown conditions, this piece of code triggered an error on strlen_P(...) breaking on PSTR(...) result (i.e. debugSend_P(PSTR("blah blah blah\"))). did not really make much sense at a time, and was not really reproducible after a couple of tries changing build settings. plus strlen / strlen_P are intentionally not checking for nullptr, it's not really a bug from the function pov
https://github.com/xoseperez/espurna/blob/52b35cf35a9bec2c9d92db9b1dced44f20a2b0db/code/espurna/debug.cpp#L124-L140
(thankfully, I do know now that stack var-len-arrays are bad and strlen_P is also useless in that context, since there is vsnprintf_P. will see if that changes anything)

and the most fun one to find out this appended nullptr pointer to commands list:
https://github.com/xoseperez/espurna/blob/d9662bd66ae9f902707f393a607a07ba713e1199/code/espurna/terminal.cpp#L773-L775
which could've been some memory corruption... but I do remember grep'ing by method of exclusion for the missing string and it wasn't there

Still not being able to track that down 🤷
But, I do have some more guesses. Like class FlashStringHelper; declaration here:
https://github.com/esp8266/Arduino/blob/5f04fbbf5fa77c5e71172f4032df095af3766e25/cores/esp8266/WString.h#L38
Without actually declaring it as an empty struct / or { const char* const ptr; } (maybe? per aliasing things, and etc. will need to re-read the documents once more)
https://en.cppreference.com/w/cpp/language/type#Incomplete_type

Wouldn't explain the missing PSTR(...) though.

@TD-er
Copy link
Member Author

TD-er commented Aug 16, 2021

But, I do have some more guesses. Like class FlashStringHelper; declaration here:
https://github.com/esp8266/Arduino/blob/5f04fbbf5fa77c5e71172f4032df095af3766e25/cores/esp8266/WString.h#L38
Without actually declaring it as an empty struct / or { const char* const ptr; } (maybe? per aliasing things, and etc. will need to re-read the documents once more)

Isn't that just like a forward declaration so you don't need to include the header file where that class is defined in the .h file of WString ? Just a forward declaration, without actual definition, is perfectly fine as long as you don't need to instantiate an object of that type.
For example you can define a class with some member being a pointer to some other class without the need to include its .h file in the .h file of your class.
But when you need to know about this other class's internals (e.g. addressing a member or instantiate an instance of it) you need to know its definition.

Only difference here is that it is never actually defined, so you can never create an instance of it, only have a pointer to it.
It is used here more like a typed void pointer.

@TD-er
Copy link
Member Author

TD-er commented Aug 16, 2021

something like that. under unknown conditions, this piece of code triggered an error on strlen_P(...) breaking on PSTR(...) result (i.e. debugSend_P(PSTR("blah blah blah\"))). did not really make much sense at a time, and was not really reproducible after a couple of tries changing build settings. plus strlen / strlen_P are intentionally not checking for nullptr, it's not really a bug from the function pov
https://github.com/xoseperez/espurna/blob/52b35cf35a9bec2c9d92db9b1dced44f20a2b0db/code/espurna/debug.cpp#L124-L140
(thankfully, I do know now that stack var-len-arrays are bad and strlen_P is also useless in that context, since there is vsnprintf_P. will see if that changes anything)

Hmm that one is tricky.
I would store the length of the string to a size_t first and only if it is small enough I would create a stack-allocated array from it.
Otherwise allocate it on the heap, since you're also calling _debugSend with it.
Sounds like a good receipt for stack overflow as it is implemented now :)

@mcspr
Copy link
Contributor

mcspr commented Aug 17, 2021

Only difference here is that it is never actually defined, so you can never create an instance of it, only have a pointer to it.
It is used here more like a typed void pointer.

Yeah, looks like it. It would've triggered incomplete type error if it is ever dereferenced, I just wondered if never declaring it somehow confused the optimizer and made it seem unused.

Another angle is string de-duplication introduced in the 2.6.x, since the linker is supposed to track __pstr__ just as it does with normal inline C-strings [1]. Not a lot of changes in the actual header implementing flash-strings, so compiler might've done something different. I dug up the specific .elf + .bin pair I had failing re. above, with terminal command registration of F("MQTT.INFO") present and F("MQTT.RESET") missing, that were created and used in the same _mqttInitCommands() function. But, I don't think it provides much info without the rest of the build directory, and maybe the single-source build combining .cpp into one messed things up somehow. Since then I'm generating .map file to maybe catch this happening again:
https://gitter.im/esp8266/Arduino?at=606c9a5c0147fb05c5de1d59
https://github.com/xoseperez/espurna/blob/07c04429506059b65f2a64f8833882beb43ee9e2/code/scripts/pio_main.py#L29-L30


[1] https://github.com/earlephilhower/newlib-xtensa/blob/xtensa-4_0_0-lock-arduino/newlib/libc/sys/xtensa/sys/pgmspace.h
(comma after the section injects inline assembly with "aSM", @progbits that are arguments to the .section, ref. https://sourceware.org/binutils/docs/as/Section.html#index-Section-Stack-3)

plus, it's inside a gcc extension block, which has some weird rules about lifetimes
https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html#Statement-Exprs

@mcspr
Copy link
Contributor

mcspr commented Aug 19, 2021

Another angle is string de-duplication introduced in the 2.6.x, since the linker is supposed to track pstr just as it does with normal inline C-strings

Or, maybe not exactly. I dug up at least two .elf files with 'broken' flash string pointer loading, which both seem to try to do this optimization by encoding the value directly in the instruction argument (movi, limited between -2047 and 2047) and then shifting to calculate the actual address (slli). Interestingly enough, it decided to optimize on the first function and the 2nd one still loads the literal pointer address. (l32r pointer to pointer, and then read as char string). Using xtensa-lx106-elf-gdb.exe from the toolchain package.

edit: and my memory had been fuzzy around the actual pointer values, it was not a null but some semi-random number

void _mqttInitCommands() {
    terminalRegisterCommand(F("MQTT.RESET"), [](const terminal::CommandContext&) { ... });
    terminalRegisterCommand(F("MQTT.INFO"), [](const terminal::CommandContext&) { ... });
}
(gdb) disassemble '_mqttInitCommands()'
Dump of assembler code for function _mqttInitCommands():
   0x402083a4 <+0>:     l32r    a3, 0x40208398
   0x402083a7 <+3>:     movi    a2, 0x7a7
   0x402083aa <+6>:     addi    a1, a1, -16
   0x402083ad <+9>:     slli    a2, a2, 17
   0x402083b0 <+12>:    s32i    a0, a1, 12
   0x402083b3 <+15>:    call0   0x40213488 <terminalRegisterCommand(__FlashStringHelper const*, void (*)(terminal::CommandContext const&))>
   0x402083b6 <+18>:    l32r    a3, 0x4020839c
   0x402083b9 <+21>:    l32r    a2, 0x402083a0
   0x402083bc <+24>:    call0   0x40213488 <terminalRegisterCommand(__FlashStringHelper const*, void (*)(terminal::CommandContext const&))>
   0x402083bf <+27>:    l32i    a0, a1, 12
   0x402083c2 <+30>:    addi    a1, a1, 16
   0x402083c5 <+33>:    ret.n
End of assembler dump.
(gdb) x (*0x4020839c)
0x40208380 <_FUN(terminal::CommandContext const&)>:     0x02f0c112
(gdb) p (const char*)(*0x402083a0)
$1 = 0x4026fa70 <_mqttInitCommands()::__pstr__> "MQTT.INFO"
(gdb) x (*0x40208398)
0x40209c70 <_FUN(terminal::CommandContext const&)>:     0x02f0c112

0x7a7 << 17 is 0xf4e0000 though and not some flash location pointer (unless I misunderstood the math around it)
https://github.com/earlephilhower/esp-quick-toolchain/blob/master/patches/gcc10.2/gcc-xtensa-rearrange-DI-mode-constant-loading.patch is in both 10.2 and 10.3 branches, but I wonder if the actual reason is some weird bug in gcc10.2 that was used so far for the esp8266 3.0.x Cores? Not after the #3766 though, but the patch is still there

@mcspr
Copy link
Contributor

mcspr commented Aug 27, 2021

btw, I think this got tracked down to the mingw-w64 aka windows toolchain, but not sure how to proceed just yet:
earlephilhower/newlib-xtensa#19 (comment)
also happens with the current gcc12.0 dev branch. but, note that it's also using the same build env as previous versions, I wonder if mingw config / version / gcc version used by mingw may have something to do with it. could be just a gcc bug, too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants