depthai-core deadlocked in semaphores with OAK-D-Pro-PoE after 100+ connections #1105

Open
diablodale opened this issue Aug 20, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@diablodale
Contributor

See #415
When that repro case is run with depthai-core v2.27.0 for hundreds of connects, the app will eventually hang and not respond.

This is an improvement over #415, where the test failed after only 6 connects.
One test deadlocked after 135 connects. Another took 402 connects to deadlock.

The OAK-D-Pro-PoE responds to pings. The device itself could be ok.

The problem appears to be a deadlock in XLink semaphores. Running the test in a debugger, I can see 6+ threads, all waiting indefinitely on XLink semaphores in sem_wait(). Something is not signaling them.
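One way to make a lost signal visible, instead of leaving threads parked forever, is to give every wait a deadline and report the timeout. The following is a minimal sketch under that assumption. It is not XLink's implementation: `TimedSemaphore`, `post`, and `waitFor` are hypothetical names built only on standard C++ primitives.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical counting semaphore with a bounded wait (not XLink code).
class TimedSemaphore {
public:
    explicit TimedSemaphore(unsigned initial = 0) : count_(initial) {}

    // Release one permit and wake one waiting thread.
    void post() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
        }
        cv_.notify_one();
    }

    // Wait up to `timeout` for a permit.
    // Returns true if a permit was taken, false if the deadline passed first.
    bool waitFor(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate guards against spurious wakeups.
        if (!cv_.wait_for(lock, timeout, [this] { return count_ > 0; })) {
            return false;  // nobody posted in time: a candidate lost signal
        }
        --count_;
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    unsigned count_;
};
```

A caller that gets `false` back can log which semaphore timed out and which thread was waiting, which is usually easier to analyze than a set of threads blocked with no deadline.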

diablodale added the bug label Aug 20, 2024
@diablodale
Contributor Author

I have isolated and fixed a group of XLink bugs in its Windows implementation of semaphores, pthread conditions, and clocks. After applying multiple fixes, OAK PoE failures declined by an order of magnitude 🌠

A test run with the fixes made 2897 connections in 6.5 hours before failure. At the point of failure, VSCode itself failed, so I did not have access to the debugger. It is unclear to me whether VSCode failed and killed the test process, or the test process died and took VSCode down with it. Still, the test wrote a CSV log, and it shows 2897 successful test runs in 6.5 hours.

The OAK-D-Pro-PoE in the test has the recent bootloader 0.0.28 from https://github.com/luxonis/depthai-core/releases/tag/v2.26.0. Applying this firmware alone did not have any measurable effect on connection reliability. The order-of-magnitude improvement was due to the XLink bug fixes.

There may still be an OAK firmware/bootloader problem. After the test failure, the OAK-D-Pro-PoE did not pass spot testing, even after 2 hours with no client communicating with it.

  • can be IP pinged and responds to that ping
  • XLink example list_devices reports status: X_LINK_SUCCESS, name: 192.168.2.23, mxid: 18443010318EF50800, state: X_LINK_BOOTLOADER, protocol: X_LINK_TCP_IP, platform: X_LINK_MYRIAD_X
  • But it may not be healthy. The depthai-core test xlink_roundtrip_test fails with:
    C:\njs\depthai-core\build\tests>xlink_roundtrip_test.exe
    Randomness seeded to: 3520474196
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    xlink_roundtrip_test.exe is a Catch2 v3.4.0 host application.
    Run with -? for options
    -------------------------------------------------------------------------------
    Test XLinkIn->XLinkOut passthrough with random 4000x3000 frame
    -------------------------------------------------------------------------------
    C:\njs\depthai-core\tests\src\xlink_roundtrip_test.cpp(60)
    ...............................................................................
    C:\njs\depthai-core\tests\src\xlink_roundtrip_test.cpp(60): FAILED:
    due to unexpected exception with message:
      No available devices (1 connected, but in use)
    ===============================================================================
    test cases: 3 | 2 passed | 1 failed
    assertions: 3 | 2 passed | 1 failed
    

I stress "may not" because the xlink_roundtrip_test test itself has a bug. The depthai device search method occasionally fails with an OAK PoE device. The OAK PoE cannot always complete its reboot fast enough for this roundtrip test. I can readily reproduce random failures of this roundtrip test on my OAK PoE sensor, even after power-cycling it. Setting the envvar DEPTHAI_SEARCH_TIMEOUT=10000 does seem to help: I was not able to readily reproduce a failure of this roundtrip test with the longer timeout.
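For a test binary that should always run with the longer timeout, the variable can also be set from inside the process. This is a minimal sketch, assuming the library reads DEPTHAI_SEARCH_TIMEOUT at device-search time, so the call has to happen before any search starts. `setEnvVar` and `useLongSearchTimeout` are hypothetical helper names, not part of depthai-core.

```cpp
#include <stdlib.h>  // setenv (POSIX) or _putenv_s (Windows CRT)
#include <string>

// Sets an environment variable for the current process.
// Returns true on success.
inline bool setEnvVar(const std::string& name, const std::string& value) {
#ifdef _WIN32
    return _putenv_s(name.c_str(), value.c_str()) == 0;
#else
    return setenv(name.c_str(), value.c_str(), 1) == 0;  // 1 = overwrite existing value
#endif
}

// Applies the value used above. Assumption: this runs before the first
// device search, otherwise the library may already have read the old value.
inline bool useLongSearchTimeout() {
    return setEnvVar("DEPTHAI_SEARCH_TIMEOUT", "10000");
}
```

Setting the variable in the shell before launching the test achieves the same thing without touching the code.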

@moratom
Collaborator

moratom commented Aug 22, 2024

That's great to hear @diablodale!

Would you be willing to open a PR to the XLink repository with the changes you've made, so we can verify and mainline the fixes?

@diablodale
Contributor Author

No PR. Same answer as in March.

I don't provide code or detailed bug reports to Luxonis anymore. Your team didn't act on my high-quality PRs and bug reports, so I retracted or closed many of them and no longer submit new ones. themarpe can bring you up to speed if you need details.

The fixes are passing my reliability tests. The last test ran 4126 iterations of continuous connect, get data, disconnect, repeat with an OAK-D-Pro-PoE. Zero delays, errors, faults, or freezes. All data streams were valid. The sensor also kept working correctly in a few casual manual tests after this 4k run.

  • fixed a few more bugs in the xlink platform-wide semaphore code
  • added an xlink TRACE log level, and moved the env var read of XLINK_LEVEL from depthai-core to xlink itself
  • tidied some xlink log levels and text to use TRACE, making the logs more usable in deep work like I did this week
  • xlink is very bad at returning correct result values from functions. It often incorrectly mixes xLinkPlatformErrorCode_t, XLinkError_t, POSIX, and native OS result codes. They are not the same integer values and cannot be mixed without conversion. Zero and non-zero is not good enough -- xlink branches based on specific error int values. I found dozens of these bugs when I rewrote the xlink USB code, and found more this week in the semaphore code.
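The last point, mixing result-code domains, can be illustrated with a small sketch. The enums below are hypothetical stand-ins with made-up integer values, not the real xLinkPlatformErrorCode_t or XLinkError_t definitions. The idea is that each domain gets its own type and values cross a boundary only through an explicit conversion function.

```cpp
#include <cerrno>

// Hypothetical result-code domains with deliberately different integer
// values. These are illustrative only, not the XLink enums.
enum class PlatformResult : int { Success = 0, DeviceNotFound = -1, Error = -2, Timeout = -3 };
enum class LinkResult : int { Success = 0, DeviceNotFound = 5, Timeout = 6, Error = 7 };

// Explicit conversion from the platform domain to the link domain.
// Casting the integer directly would turn Timeout (-3) into a value
// that no LinkResult branch recognizes.
inline LinkResult toLinkResult(PlatformResult r) {
    switch (r) {
        case PlatformResult::Success:        return LinkResult::Success;
        case PlatformResult::DeviceNotFound: return LinkResult::DeviceNotFound;
        case PlatformResult::Timeout:        return LinkResult::Timeout;
        case PlatformResult::Error:          return LinkResult::Error;
    }
    return LinkResult::Error;  // unknown values degrade to a generic error
}

// Explicit conversion from POSIX errno-style codes to the link domain.
inline LinkResult fromErrno(int err) {
    if (err == 0) return LinkResult::Success;
    if (err == ETIMEDOUT) return LinkResult::Timeout;
    return LinkResult::Error;
}
```

Using `enum class` means the compiler rejects an accidental comparison or assignment across domains, so a caller that branches on a specific timeout value has to go through the conversion.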

This issue should give your team enough info to look and fix your code.
