guru meditation #337
This has been observed since the first versions. Most probably you are right in assuming that it runs out of memory. In any case, a warning at the top of the README says that extensive web access can crash the system. Usually the reboot is quite fast, and with prevalue turned on there is almost no noticeable effect for the end user. |
But what you could do is keep TF from running while web access is detected: only start TF processing one minute after the last web access. That should ease things a lot. Especially when setting up for the first time, I can hardly click 2-3 links before it becomes unavailable. Use a timestamp that is updated on every webserver access and compare against it before image processing runs. |
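The timestamp idea above could be sketched roughly as follows. This is only an illustration under assumptions: `WebAccessGate`, `noteWebAccess` and `mayRunInference` are hypothetical names that do not exist in the project, and the caller is assumed to supply a millisecond clock.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of the project): remembers when the web
// server was last hit and tells the image-processing loop whether the
// quiet period has elapsed.
class WebAccessGate {
public:
    explicit WebAccessGate(uint32_t quiet_ms) : quiet_ms_(quiet_ms) {
        assert(quiet_ms > 0);  // a zero quiet period would make the gate pointless
    }

    // Call from every HTTP handler with the current time in milliseconds.
    void noteWebAccess(uint32_t now_ms) {
        last_access_ms_.store(now_ms);
        seen_access_.store(true);
    }

    // Call before starting a TF round. Unsigned subtraction keeps the
    // comparison correct when the millisecond counter wraps around.
    bool mayRunInference(uint32_t now_ms) const {
        if (!seen_access_.load()) {
            return true;  // no web access since boot
        }
        return (now_ms - last_access_ms_.load()) >= quiet_ms_;
    }

private:
    uint32_t quiet_ms_;
    std::atomic<uint32_t> last_access_ms_{0};
    std::atomic<bool> seen_access_{false};
};
```

On the ESP32 the clock value could come from something like `esp_timer_get_time() / 1000`; where exactly the two calls would be placed in the firmware is left open here.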
I did some investigation; the errors that occur have a distinct probability. Most popular crash was (20 times): 4 times: 2 times: once each:
PC: 0x401a76c1: tflite::(anonymous namespace)::Eval(TfLiteContext*, TfLiteNode*) at components/tfmicro/tensorflow/lite/micro/kernels/conv.cc line 55
PC: 0x4019da06: tflite::internal::GetFlatbufferTensorBuffer(tflite::Tensor const&, flatbuffers::Vector > const*) at components/tfmicro/tensorflow/lite/micro/micro_allocator.cc line 410
PC: 0x4018b6b7: ClassFlowControll::getReadoutAll[abi:cxx11] at /home/john/.platformio/packages/toolchain-xtensa32/xtensa-esp32-elf/include/c++/8.2.0/bits/stl_vector.h line 805
PC: 0x4019d714: flatbuffers::Vector ::Get(unsigned int) const at components/tfmicro/third_party/flatbuffers/include/flatbuffers/flatbuffers.h line 261
PC: 0x401bd1b1: tflite::BytesRequiredForTensor(tflite::Tensor const&, unsigned int*, unsigned int*, tflite::ErrorReporter*) at components/tfmicro/third_party/flatbuffers/include/flatbuffers/flatbuffers.h line 2468
flatbuffers::Offsettflite::SubGraph; flatbuffers::Vector::return_type = const tflite::SubGraph*; flatbuffers::uoffset_t = unsigned int] |
@s0170071 tflite and flatbuffers are part of the TensorFlow library, so debugging their internals is a pretty high effort. Do you have any ideas how to improve this? |
I looked into the code, looks like we have plenty of memory. Looks more like invalid objects and so.
I am pretty sure this should be |
You're totally right - corrected in the current rolling version. |
Can you check in ClassFlowPostProcessing.cpp line
? |
|
It's the line (ClassFlowControll.cpp) that causes the crash. To be more precise, calling flowpostprocessing->GetNumbers(); causes it. The flowpostprocessing pointer seems to be available, i.e. it is not NULL. But you must not access NUMBERS. Returning NULL in GetNumbers() and not using the result any further fixes the crash. Furthermore, NUMBERS is a vector that contains pointers to multiple NumberPost structs. That struct itself is dangerous as it contains strings.
Strings in structs are just pointers to a string object itself, which is variable in size. I put a printf debug statement into GetNumbers()
What I can't figure out is why that printf only shows once, after power-up and calling the website. Successive website calls do not trigger it. Maybe it is related to optimization. |
That is strange. Currently I don't see a reason why Removing all strings in the code is pain in the ass. I'm not considering this yet at all. |
And yet, this seems to be the root of all evil. If I return NULL in GetNumbers(), I can't get it to crash, no matter how hard I try. Is it running TF in that state? |
In ClassFlowControll.cpp, the code is:
The line near the end
is critical, since flowanalog and flowdigit may be uninitialized. |
No, this is not critical, as they are initialized with
|
I thought a bit about it. The vector array can be pretty big (depending on the number of numbers and ROIs). Maybe it would be better to work with just a pointer to NUMBERS, to avoid memory copying. I just uploaded a new rolling version with this change. |
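For illustration, the difference between copying the vector and handing out a pointer to it might look like the sketch below. It is not the project's code: `PostProcessingSketch`, `getNumbersByValue` and `readoutAll` are hypothetical, and `NumberPost` is reduced to two assumed string fields.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the project's NumberPost struct (fields assumed).
struct NumberPost {
    std::string name;
    std::string value;
};

// Hypothetical stand-in for the post-processing class that owns NUMBERS.
class PostProcessingSketch {
public:
    PostProcessingSketch() = default;
    PostProcessingSketch(const PostProcessingSketch&) = delete;
    PostProcessingSketch& operator=(const PostProcessingSketch&) = delete;
    ~PostProcessingSketch() {
        for (NumberPost* n : numbers_) delete n;
    }

    void add(const std::string& name, const std::string& value) {
        numbers_.push_back(new NumberPost{name, value});
    }

    // Returns a copy: a new vector of pointers is allocated on every call.
    std::vector<NumberPost*> getNumbersByValue() const { return numbers_; }

    // Returns a pointer to the internal vector: nothing is allocated.
    const std::vector<NumberPost*>* getNumbers() const { return &numbers_; }

private:
    std::vector<NumberPost*> numbers_;
};

// Caller that guards against a missing post-processing object and against
// null entries before dereferencing anything.
std::string readoutAll(const PostProcessingSketch* post) {
    if (post == nullptr) return "";
    const std::vector<NumberPost*>* numbers = post->getNumbers();
    std::string out;
    for (const NumberPost* n : *numbers) {
        if (n == nullptr) continue;
        if (!out.empty()) out += "\n";
        out += n->name;
        out += "\t";
        out += n->value;
    }
    return out;
}
```

The pointer is only valid as long as the owning object lives and the vector is not modified while the caller iterates, so this trades copying cost for a lifetime constraint.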
If I understood your code right, then I put a printf() in |
No, flowpostprocessing is initiated during the reading of the But you are totally right:
|
If there is no postprocessing section in the config file, |
Did you click the "Overview" menu and look at the serial output ? I didn't try your latest code, but my copy crashes very reliably if I click it a couple seconds after boot. |
If there is no postprocessing, the whole system will not make any sense and postprocessing cannot be disabled in the graphical setting. So this should not happen. |
Yes, but if you push it more than 1x during the neural network calculation it still restarts. Looks like a memory overflow in the combination of the http server and the neural network calculation. |
Do you have a jtag debugger?
On 21 September 2021 at 21:53:12, jomjol wrote:
… Yes, but if you push it more than 1x during the neural network calculation
it still restarts. Looks like a memory overflow in the combination of the
http server and the neural network calculation.
Really hard to debug!
|
To me it looks like a pointer problem. Heap is always > 10k and I have seen ESPs running with much less.
Try allocating some memory for a dummy array and see how large it can get before TF stops working.
Make sure you make it volatile and access it at least once.
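A minimal sketch of such a dummy allocation, assuming plain `malloc` from the default heap; `reserveDummy` and `releaseDummy` are hypothetical names, not project functions.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical probe: reserves `bytes` of heap and writes to the buffer
// through a volatile pointer so the compiler cannot optimize it away.
// Returns nullptr if the size is zero or the allocation fails.
volatile unsigned char* reserveDummy(std::size_t bytes) {
    if (bytes == 0) return nullptr;
    volatile unsigned char* buf =
        static_cast<volatile unsigned char*>(std::malloc(bytes));
    if (buf == nullptr) return nullptr;
    // Touch one byte every 256 bytes, plus the very last byte.
    for (std::size_t i = 0; i < bytes; i += 256) {
        buf[i] = 0xA5;
    }
    buf[bytes - 1] = 0x5A;
    return buf;
}

// Gives the dummy buffer back to the heap.
void releaseDummy(volatile unsigned char* buf) {
    std::free(const_cast<unsigned char*>(buf));
}
```

The experiment would then be to hold the buffer while a TF round runs and to raise `bytes` from boot to boot until the crash appears. If the failure shows up with plenty of heap left, that would point towards a pointer problem rather than exhaustion.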
|
Another thing that I observed:
later, these get processed:
right before the next Guru, in It shows 4. Shouldn't that be 5?
|
I can reproduce the crash if I push "Overview" several times during the neural network detections. Otherwise it runs smoothly. I assume a memory problem within the tflite code. I cannot debug this, as I don't have much insight into it. I will try a more recent version. Maybe that is more stable in this respect. |
I ordered a debugger. According to this there is no memory at this location. How do you feel about making some of the objects global / static ? |
Global and static could be a solution. But I need to create the content dynamically (I don't know the maximum number of ROIs), so it might only help partially. Currently I'm cleaning up parts of the code to reduce the dynamic memory usage (remove copies of strings and use pointers to the source, ...). Maybe that helps a bit. |
While you're at it... also replace string = string + string with string += string. It's more efficient: it doesn't create a duplicate on the heap.
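A small sketch of the two variants, with hypothetical function names. With `operator+` each step builds a temporary string holding the whole intermediate result, while `+=` appends into the existing buffer and only reallocates when its capacity is exceeded.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Variant with operator+: every iteration creates temporaries that hold
// the complete intermediate result before it is assigned back.
std::string joinWithPlus(const std::vector<std::string>& parts) {
    std::string out;
    for (const std::string& p : parts) {
        out = out + p + ",";
    }
    return out;
}

// Variant with operator+=: appends in place. Reserving the final size up
// front means the buffer is allocated at most once.
std::string joinWithAppend(const std::vector<std::string>& parts) {
    std::size_t total = 0;
    for (const std::string& p : parts) {
        total += p.size() + 1;  // +1 for the separator
    }
    std::string out;
    out.reserve(total);
    for (const std::string& p : parts) {
        out += p;
        out += ',';
    }
    return out;
}
```

Both functions produce the same text; the difference lies only in how much short-lived heap traffic they cause, which matters more on a fragmented embedded heap than on a desktop.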
On 24 September 2021 at 19:10:06, jomjol wrote:
… Most probably it is a leak in the server, because it usually only
crashes when you access the web page. If I leave it alone (only MQTT
transfer), it usually runs without problems for more than 100 times. If it
reboots 1x/2x a day, that is not really a problem. …
My debugger (esp-prog) arrived. I got it up and running with an easy blink example on an ESP32CAM module. The process is not straightforward. See here for a collection of pitfalls that I encountered. Took me two days. Next stop: running the AIontheedge software with the debugger. Give me some days, I'll keep you posted. |
Great news - looking forward to the update! Update: you need to compile AI-on-the-edge with esp-idf version 2.1.0 (=espressif 4.1). With 4.3 it will not run. |
Figured that. I think some project files are messed up. Have you considered updating the esp-idf version? I tried; it shows some errors with the exceptions. But maybe some of the trouble goes away with the up-to-date version? |
LOL, I just compiled with the latest IDF. I removed the exceptions and disabled the certificates -> compiles OK but has 100.7% flash memory usage. WTF! |
Oh no... https://community.platformio.org/t/issue-with-esp32-jtag-scan-chain-interrogation-failed/5267/10 Here they say that the SD card uses the JTAG pins. That leads to an error where the JTAG complains. I checked the ESP32cam schematic: this board also uses the JTAG pins for the SD card. Apart from that, I got the current IDF running, but it crashes frequently. With it, it never gets beyond [Alignment]. But on the bright side, this error is pretty reliable, even without web access. Fixing it would probably be a good thing. |
When accessing the web interface while the DNN processing takes place, the ESP crashes. Serial log shows:
IMHO, that triggers a déjà vu. The ESPEasy folks had a similar issue some time ago. Long story short: the ESP ran out of memory during a webserver request.