-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stack usage during serialization is significantly higher than v6 #2046
Comments
I tried using ghidra to inspect the stack frames. It looked to me that the difference compared to v6 seems to be a combination of both larger sizes for JsonObject/JsonArray, and a plethora of FlashString objects left on the stack. It seems gcc isn't able to recognize that the FlashStrings are temporary and the stack can be recovered. Passing -fstack-reuse=all did not help. |
As a workaround on ESP8266es, increasing the stack size with |
This reduces stack consumption and code size. See #2046
Hi @willmmiles, Thank you very much for reporting this issue. I turn your repro code into the following program: #include <Arduino.h>
#include <ArduinoJson.h>
#include <cont.h>
void serializeConfig() {
DynamicJsonDocument doc(1024);
JsonArray rev = doc[F("rev")].to<JsonArray>();
rev.add(1);
rev.add(0);
doc[F("vid")] = ARDUINOJSON_VERSION;
JsonObject id = doc[F("id")].to<JsonObject>();
id[F("mdns")] = "cmDNS";
id[F("name")] = "serverDescription";
id[F("sui")] = "simplifiedUI";
serializeJson(doc, Serial);
}
void setup() {
Serial.begin(9600);
while (!Serial)
continue;
cont_check(g_pcont); // Validate we haven't already overflowed
cont_repaint_stack(g_pcont); // Fill unused stack with guard value
auto unused_stack_before =
cont_get_free_stack(g_pcont); // Counts stack filled with guard value
serializeConfig();
cont_check(g_pcont); // Validate we haven't overflowed
auto unused_stack_after = cont_get_free_stack(g_pcont);
Serial.printf("Stack used by serializeConfig: %d\n",
unused_stack_before - unused_stack_after);
}
void loop() {} With it, I got the following results:
We can see an increase, but it's not that big. I also did some tests by modifying the library to print the stack pointer address and got the following results for six levels of nesting:
We can see a significant increase, but only in debug mode. c98b05e corresponds to the current head of branch Best regards, |
Thanks for taking such a detailed look, I really appreciate it. I'm using release builds, and unfortunately c98b05e made no significant difference for the whole config. I finally instrumented my project properly using the logic from above, with the full config, and got the following results:
Possibly there's some kind of phase change in the compiler output if the function gets too big (eg. the optimizer times out and gives up partway, or something). Here is a copy of the whole cfg, for reference: cfg-tmp.json, and a link to the generating code, in case you can spot some obvious usage error: https://github.com/willmmiles/WLED/blob/c2d2000f9b91fcb02fd37b7f255911ff4fe5b5ac/wled00/cfg.cpp#L640 |
Here are the variables scoped to JsonObject root;
JsonArray rev;
JsonObject id;
JsonObject nw JsonArray nw_ins;
JsonObject nw_ins_0;
JsonArray nw_ins_0_ip;
JsonArray nw_ins_0_gw;
JsonArray nw_ins_0_sn;
JsonObject ap;
JsonArray ap_ip;
JsonObject wifi;
JsonObject ethernet;
JsonObject hw;
JsonObject hw_led;
JsonArray hw_led_ins;
JsonArray hw_com;
const ColorOrderMap &com;
JsonObject hw_btn;
JsonArray hw_btn_ins;
JsonObject hw_ir;
JsonObject hw_relay;
JsonObject hw_if;
JsonArray hw_if_i2c;
JsonArray hw_if_spi;
JsonObject light;
JsonObject light_gc;
JsonObject light_tr;
JsonObject light_nl;
JsonObject def;
JsonObject interfaces;
JsonObject if_sync;
JsonObject if_sync_recv;
JsonObject if_sync_send;
JsonObject if_nodes;
JsonObject if_live;
JsonObject if_live_dmx;
JsonObject if_va;
JsonArray if_va_macros;
JsonObject if_mqtt;
JsonObject if_mqtt_topics;
JsonObject if_hue;
JsonObject if_hue_recv;
JsonArray if_hue_ip;
JsonObject if_ntp;
JsonObject ol;
JsonObject timers;
JsonObject cntdwn;
JsonArray goal;
JsonArray timers_ins;
JsonObject ota;
JsonObject dmx;
JsonArray dmx_fixmap;
JsonObject usermods_settings;
File f;
Unfortunately, this doesn't explain the increase between v6 and v7, but you could easily save these 468 by splitting this function or calling |
The disassembly analysis suggested that the compiler is also storing all the temporaries at unique stack addresses, too -- every object looked up with I can definitely work around the issue by breaking up the large configuration to smaller functions (assuming the compiler doesn't undo it all by inlining them!). I'm not quite ready to give up yet, though - I'd hate to be the guy debugging some project that suddenly got unstable when I added one more entry to the configuration file, or to have to document "you must break up large serializations in to small functions because otherwise it might crash on some systems". |
You were on the right track with looking at inlining. It looks like gcc (at least the version in this toolchain) isn't able to share space in the stack frame for inlined functions -- every inlined copy has to be allocated unique stack space for their own arguments. This is the cause of the additional stack usage. As an experiment, I tried globally disabling FORCE_INLINE in ArduinoJson, in favor of letting the compiler decide when to inline. The results was a large reduction in stack usage (down to 1572 bytes!!), a 14528 byte reduction in program size, and no noticable performance change for my use case. If it's helpful, the compiled |
That's the opposite of my experience with |
Sure. I won't get to it today, but tomorrow I'll see about scripting up a systematic search. I'm pretty sure we can use -Wframe-larger-than to get gcc to print out the stack frame size without even running the code. |
I did some scripting to find which I compiled in the following table the impact of each on the size of ArduinoJson's examples:
In this table, a negative number means the size decreases (good) and a positive number means that the size increases (bad). I already remove the 103 |
I asked the ArduinoJson Assistant to write the program to generate the same content as the cfg-tmp.json you uploaded earlier. Then, I ran my script again to see the effect of each
We can see that some of them have a very negative impact on code size. |
Yes, I'm thinking there's a matter of scale here -- if we inline a function a few times, it doesn't matter much, but if we have to inline it hundreds of times, the extra code can really add up. To make things even more complex, large projects like WLED use ArduinoJson in several different contexts (and more relevantly, translation units). Being able to share common code can be a huge savings. Is your script knocking out FORCE_INLINE or adding it only on the one line? I'm running a data collection now over the whole WLED project, selectively removing FORCE_INLINE in each location and collecting program size and stack usage metrics. I might try going the other way (selectively adding it) for comparison. |
My script removes one Before making your tests, do you want me to push a new version without the seven |
Data collection is done, I'm processing the results now. I'll post them up shortly. We can re-run with the changes after I've done the initial analysis. |
Apologies for the monster table. This is the initial result set:
|
Script code in context is here: https://github.com/willmmiles/WLED/blob/arduinjson-v7-dev/arduinojson_test.py |
Thank you for this fantastic work! ❤️ I thought you wanted to compile with only one I initially thought we could beat the compiler by adding judiciously placed After seeing your results, I think I'd better remove all the Let me know what you think. |
Ah, sorry for the confusion - I wanted to try the knock-out approach, like you did, first; I was thinking about trying "start with none, add one at a time" as a follow up. Given the test build framework, I had a half-baked idea to run a simulated annealing type search. Although I fear a deeper search might lend itself to overfitting for the test sample. It's definitely interesting how removing even one FORCE_INLINE on the "critical path" can make such a difference to the output on a large project. I think there may not be a one-size-fits-all solution. It could be the right answer is some level of configurability, as much as I hate the complexity of tunables. For example, in WLED, we have many translation units doing JSON, so there's a big size advantage to sharing code. A smaller project might have just one translation unit, where sharable functions just waste space. Alas, I don't think link-time optimization is available for many of these embedded platforms. :( Bottom line: I think letting the compiler make the call is probably the best for unknown arbitrary use cases. Given that you've already got the markup in the code, though, I think having a |
I finally decided to keep twelve I force-pushed this commit because I made a mistake in the previous one, so you may need to do a hard reset. Thank you again for your indispensable contribution! |
The fix is available in ArduinoJson 7.0.3. |
Thank you! It was a pleasure to work with you. :) |
Describe the bug
Serialization using ArduinoJSON v7 on ESP8266 seems to be using substantially more stack than v6 for common serialization tasks. With a large configuration structure, this can result in a stack overflow, causing difficult to diagnose crashes.
Troubleshooter report
Here is the report generated by the ArduinoJson Troubleshooter:
PROGMEM
Environment
Here is the environment that I used:
Reproduction
[TODO] Working on it - I should be able to provide a better (complete and buildable) example in a few days. Serialization code is of the form:
The text was updated successfully, but these errors were encountered: