UBLOX_EVK_ODIN_W2: network-wifi test crashes #10815
Comments
cc @ARMmbed/team-ublox |
Internal Jira reference: https://jira.arm.com/browse/MBOCUSTRIA-1279 |
I am not able to reproduce the above scenario. Even after removing two of the AP's three antennas, no crash was seen. |
@kimlep01 have you tried with ARM and IAR also? I couldn't see it. |
Yes we run those tests with GCC, ARM and IAR binaries. I have seen crashes with all compilers. I was hoping that those line numbers would help tracking down why those assert failures happen. We have similar issues open with mtb ublox odin w2 also. My guess is that something goes wrong when the board tries to connect to a heavily utilised access point. |
Hi @ARMmbed/team-ublox This issue is still visible in our nightly tests. I have been able to reproduce this issue manually on my desk also:
I ran the test a few times and got these three different crashes:
mbed-os commit:
Contents of my mbed_app.json used (just password and unsecure AP name changed from our CI mbed_app.json):
|
@kimlep01 OK, we will provide a fix in the next PR. |
The error codes are identical to those in #9621. The code line has changed. Perhaps that will help? |
Still managed to reproduce the crash. Same mbed_app.json used as above, edited main.cpp to run CONNECT-DISCONNECT-REPEAT 20 times in a row, compiled with IAR 8.32.1 and ran on my desk with my phone as a Wi-Fi hotspot. Crash happened on the third time. Logs:
|
@kimlep01 Yes, let me reproduce it on my desk and see if I can provide a fix. |
Retested the PR with the following changes:
I could not see any more crashes. Tests would start failing pretty fast sometimes but no more crashing:
|
@kimlep01 The tests start failing because of weak signal strength, which returns a no-connection status. |
Hi @kimlep01, can you add the below hack `void OdinWiFiInterface::handle_wlan_status_connected(wlan_status_connected_s *wlan_connect)` in OdinWiFiInterface.cpp and test once again? It may solve the problem. Please give feedback; I will generate a PR after that. |
Retested with latest mbed-os master with this change:
I first made sure I could reproduce the issue still with clean master. I couldn't do that on my desk for some reason so I ran the test in the lab where our nightly tests run also. Test crashed on clean master and with the above change also. The line number was the same in both cases. Error message:
|
Hi @kimlep01, I am sharing a private branch for testing. Please verify the above issue with it: |
Can be reproduced with the above mentioned branch:
|
@kimlep01 Please see the logs below. I am not getting any crash after testing it more than 10 times. Can you please check whether J21 is shorted, and also try replacing the board with another odin_w2 board? |
We have 4 boards running in the lab. I tested all of them: The boards have jumper J21 shorted. I made a request to the infrastructure team to replace them just to be sure which will take a few days as I don't have access there. |
Changing the J21 jumpers did not help and I still could reproduce this both in CI and locally on my desk using the private branch. I ran the same test (20x connect-disconnect-repeat) with pyOCD debugger attached but all I could see is that on line 758 of OdinWiFiInterface.cpp cbWLAN_disconnect(handle) returned cbSTATUS_ERROR. No visibility beyond that.
This might feel tricky to reproduce sometimes but here is what I suggest:
The other way would be to provide a debug version of the driver for us with enough traces so we could help more. (I will be out of office until 7.10. Team ARMmbed/mbed-os-wan can assist until then if required) |
Hi @kimlep01 Thank you for writing the detailed procedure to reproduce the issue. We were in fact able to reproduce it and have hopefully fixed it. Hereby sharing our private branch with you: ublox_odin_driver_os_5_v3.7.1_rc3 Can you kindly test and let us know about your test results? |
Hello again, I retested with the new branch and unfortunately I could still reproduce the issue. I tested that I could reproduce the issue with all compilers and using two different boards. I don't get any test case errors prior to assert - it passes just fine until mbed assert happens. IAR8:
ARMC6:
GCC_ARM8:
|
Hi @kimlep01 That is strange. After the fix, we ran the test overnight and there was no assert. Can you please tell us how many times you tried running it? Can you try again after building clean, hard resetting the board, shorting the J21 jumper, and hard resetting the access point (if possible)? |
I hit the issue on the first or second run. Now it seems to assert reliably on line 761. I did a clean clone, and the boards are powered off before test runs. I tested locally with two boards (W260 & W262), swapped in different jumpers on J21, and restarted my phone (S10e), which acted as an unsecure AP, before the tests. I also tested inside an RF-BOX with the same outcome (couldn't restart the AP there). The mbed_app.json used can be found in my fourth comment (just change the unsecure AP name). Here is the modified test case content; I seem to have increased the timeout also:
|
Thank you for these insights. If you hit this on the first or second run, that's worrisome. We are barely able to reproduce it anymore, and even when we get the assert, it is after 700-800 connect/disconnect sequences. We are trying to create the scenario here that causes this assert to occur (like breaking the link in the middle of the handshake). So far we have observed that if the phone's Bluetooth is turned on while the phone is acting as an AP, the AP becomes unstable and there are a lot more failed connect attempts (not asserts). Can you help us identify a low-level scenario in which this can be reproduced, for example by sharing filtered Wireshark logs of the packets to and from the ODIN on the channel it tries to connect/disconnect on? Any other observations that could help us reproduce this would also be welcome. |
Hi @hamza-ubx . Thanks a lot for your efforts so far and the helpful PR, which improved the situation. I added some more logging, modified the From time to time connect fails with AUTH_FAILURE. I can see this log line in the logs like this:
Usually however, the next time we try to connect, after previous timeout/authentication error the connect goes into the assertion in To sum up:
I don't think it is LWIP not being able to set ip address that is the root cause. It is OK for the module to sometimes fail to connect to AP in a busy network and this should not lead to assertion. |
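To illustrate the point that a failed connect in a busy network should be recorded and survived rather than end in an assert, here is a minimal sketch of a connect/disconnect stress loop. Everything in it is an assumption made for illustration: `LoopStats`, `run_connect_disconnect` and the two constants are hypothetical names, the connect and disconnect steps are passed in as callbacks so the block runs without Mbed OS, and the numeric error value is only assumed to mirror `NSAPI_ERROR_AUTH_FAILURE`.

```cpp
#include <cstdio>
#include <functional>

// Local stand-ins for Mbed OS error values (assumed to mirror nsapi_types.h).
constexpr int ERR_OK = 0;
constexpr int ERR_AUTH_FAILURE = -3011;

// Hypothetical counters for one stress run.
struct LoopStats {
    int attempts = 0;
    int connected = 0;
    int failed = 0;
};

// Runs `cycles` connect/disconnect iterations. A failed connect is counted
// and logged, then the loop moves on to the next attempt. Disconnect is
// only called after a successful connect.
LoopStats run_connect_disconnect(int cycles,
                                 const std::function<int()> &connect,
                                 const std::function<int()> &disconnect)
{
    LoopStats stats;
    for (int i = 0; i < cycles; ++i) {
        ++stats.attempts;
        const int rc = connect();
        if (rc != ERR_OK) {
            ++stats.failed;
            std::printf("cycle %d: connect failed with %d\n", i, rc);
            continue;
        }
        ++stats.connected;
        disconnect();
    }
    return stats;
}
```

On a real board the two callbacks would wrap the interface's connect and disconnect calls. The counters then show how often a busy AP rejects the module, separately from whether the firmware stays alive.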
Thank you for the detailed analysis. We are continuing the investigation at our end and will share any information that we discover. |
So we have investigated the issue in more detail. For the first part, where you mentioned that the translation from the driver's error codes to Mbed's may not be correct: only when the driver returns one of the following is it translated to
For the connection timeout, it's not handled at the driver level; instead it is handled at the interface level. See here. Hence, we can correct this error translation at the interface level. For the second part, where you mentioned that the assert is still reproducible but relatively rare now, and that it mostly occurs when the previous connect/disconnect iteration fails with -3011: we tried to reproduce this scenario but were unable to. Whenever -3011 occurred, the next iteration usually succeeded. However, the assert did occur sometimes after around 800 to 1200 consecutive connect/disconnect cycles, and the iteration preceding the assert was always a successful connection/disconnection. For the assert, we still haven't found a concrete scenario under which it is reliably reproducible. It occurs randomly, sometimes after hundreds and sometimes after thousands of connect/disconnect cycles, and only with some APs (mostly mobile phone access points). Other APs work fine, and no assert has been observed in up to 10,000 connect/disconnect cycles. Have you found a concrete scenario under which it can be reproduced? |
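As a rough sketch of what an interface-level error translation could look like, the block below maps a failure reason to an Mbed-style error code, giving a connection timeout its own code instead of folding it into another one. The `ConnectFailure` enum and `translate_connect_failure` are hypothetical and are not the ODIN-W2 driver's real status names. The numeric constants are assumed to mirror the `nsapi_error_t` values in Mbed OS (`-3011` is the authentication failure discussed above).

```cpp
// Values assumed to mirror nsapi_error_t in Mbed OS nsapi_types.h.
constexpr int kErrOk = 0;
constexpr int kErrNoConnection = -3004;
constexpr int kErrNoSsid = -3008;
constexpr int kErrAuthFailure = -3011;
constexpr int kErrConnectionTimeout = -3017;

// Hypothetical reasons for a failed connect attempt (illustration only).
enum class ConnectFailure {
    None,
    WrongCredentials,
    ApNotFound,
    TimedOut,   // detected by the interface layer, not by the driver
    Unknown
};

// Translates a failure reason into an Mbed-style error code. Anything not
// explicitly recognised falls back to the generic "no connection" code.
int translate_connect_failure(ConnectFailure reason)
{
    switch (reason) {
        case ConnectFailure::None:
            return kErrOk;
        case ConnectFailure::WrongCredentials:
            return kErrAuthFailure;
        case ConnectFailure::ApNotFound:
            return kErrNoSsid;
        case ConnectFailure::TimedOut:
            return kErrConnectionTimeout;
        case ConnectFailure::Unknown:
        default:
            return kErrNoConnection;
    }
}
```

Keeping the timeout case separate lets a test distinguish a slow or busy AP from a genuine credentials problem. Which code each real driver status should map to is exactly the open question in this thread, so the mapping shown is only one possibility.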
Hi @hamza-ubx . Not sure I got it right, but indeed, sounds like only What about the two other errors: Regarding the original crash issue... |
I guess we can correct the error translation after we come up with appropriate meaning for each error code and find relevant For For For Regarding original crash issue, we are trying different things to reproduce the crash including the one mentioned in #9621. Hopefully this translation correction will also help analyze it better. |
Hi @hamza-ubx . Regarding While we are at this point, I am also thinking if It's also wise to ask @kjbracey-arm for an opinion to make sure we don't get something horribly wrong. Kevin, would you please let us know if these error code translations mentioned above make sense to you? @hamza-ubx , fingers crossed for the crash fix. If there is anything we can help with - let us know. For the reproduction scenario, I guess we are on the same page - the busier network, the weaker signal and the more extreme its handling and location, the more likely the crash becomes. I guess once per hour is not that bad... |
Great! We look forward to hearing from Kevin and will then make the corrections to the translation as discussed. About the And yeah, I guess we are on the same page for the crash issue reproduction. Hoping to find a concrete scenario to reproduce it and then find a fix. |
Hi @hamza-ubx , any success regarding the crash issue? @kimlep01 , I looked briefly at the nightly results, but couldn't see the crash in there. Have you applied some workaround? Is the issue still visible? |
Hi, the original issue is still visible in our CI. Happened twice last night in GCC_ARM-tests-netsocket-dns and in GCC_ARM tests-network-wifi.
|
We are seeing this issue occur less frequently now that the last optimizations have been applied. However, the issue is still there and we would like to fix it properly. This is a tricky one and requires dedicated effort. We are prioritizing it and will keep you posted as soon as we have a fix. |
Description
The UBLOX_EVK_ODIN_W2 target crashes in our nightly CI when running the network-wifi test. We have not seen this earlier because the test was running inside an RF-shielded box. Now the tests run in a very noisy environment, which most probably causes these crashes when the Odin tries to connect to the network. I can reproduce the crash from line 1826 when I re-run the test inside the test farm. It does not reproduce on my desk (not enough noise, I guess).
More common crash:
This crash has happened only in CI runs:
Edit - found one more crash:
mbed-os version (master):
The crashes don't seem to be compiler or test case specific. I have seen these crashes with several different Wi-Fi test cases from this same test suite.
How to reproduce:
Use a VERY noisy test environment for Wi-Fi
Edit tools/test_configs/WiFiInterface.json and give proper values for:
wifi-secure-ssid
wifi-unsecure-ssid
wifi-password
wifi-ch-secure
wifi-ch-unsecure
mbed test --compile -t GCC_ARM -m UBLOX_EVK_ODIN_W2 -n tests-network-wifi --app-config=tools/test_configs/WiFiInterface.json -DMBED_HEAP_STATS_ENABLED=1 -DMBED_STACK_STATS_ENABLED=1 -DMBED_TRAP_ERRORS_ENABLED=1 -DMBED_ALL_STATS_ENABLED -DSKIP_TIME_DRIFT_TESTS=1
mbedgt -m UBLOX_EVK_ODIN_W2 -n tests-network-wifi -V
(crashes 2/3 times in our environment)
Issue request type