Tasks left running after DDS exits. #379

Open
rbx opened this issue Jul 6, 2021 · 3 comments


rbx commented Jul 6, 2021

Affected DDS versions: 3.5.10, 3.5.14

One of the FairMQ tests that is executed with DDS (possibly several, but so far only reproduced with this one) frequently leaves behind running devices.

The test is:
https://github.com/FairRootGroup/FairMQ/blob/master/test/sdk/_topology.cxx#L264-L277
It runs this topology:
https://github.com/FairRootGroup/FairMQ/blob/master/test/sdk/test_topo.xml
(one task + a group of 5 tasks)

The test queues a state change operation via custom commands with a timeout of 1 ms; when that expires (the value is intentionally set low to exercise the timeout handling), the test proceeds to exit.
Frequently - in about 50% of cases - one of the tasks is left over. Specifically, it seems to be the single task, not one from the group.
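
To make the timing pattern concrete, here is a minimal, self-contained sketch of what the test effectively does (plain C++ with std::async standing in for the queued state-change request; this is not the actual FairMQ SDK code, which goes through the Topology API in the file linked above):

```cpp
// Illustrative sketch only: "queue an async operation, wait 1 ms for it,
// then proceed to exit regardless". In the real test the operation is a
// device state change issued through the FairMQ SDK; here std::async plays
// that role.
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

int main()
{
    using namespace std::chrono_literals;

    // Stand-in for the state-change request sent to the devices; the devices
    // need far longer to react than the timeout the test is willing to wait.
    auto stateChange = std::async(std::launch::async, [] {
        std::this_thread::sleep_for(100ms);
        return true;
    });

    // The 1 ms timeout is intentionally too short, so the timeout branch is
    // the expected path: report it and move on to teardown.
    if (stateChange.wait_for(1ms) == std::future_status::timeout) {
        std::cout << "state change timed out, proceeding to exit\n";
    }

    // Note: unlike the real test, where the controlling process simply exits
    // and DDS is expected to stop the tasks, the std::async future blocks in
    // its destructor here.
    return 0;
}
```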

Examining the device logs and comparing them to a successful run, the leftover device keeps running at the point where it would normally receive the device shutdown request (signal 15). Sending SIGINT or SIGTERM to the leftover device leads to an immediate and proper exit, so nothing is hanging on the device side.
Devices from the task group behave correctly in both failed and successful runs. (EDIT: I have now also seen devices from the group left over.)

Examining the DDS logs, one line stands out in the unsuccessful run:

2021-07-06 12:54:36.272234   err    dds-agent            <0x002bb252:0x00007f7f2269d640>    Error sending to : Broken pipe

rbx commented Jul 6, 2021

Attaching tmp dirs for both runs.

failure.zip
success.zip


rbx commented Jul 6, 2021

When running the above test repeatedly (`ctest -R "SDK.Topology.AsyncChangeStateTimeout" -V --repeat-until-fail 100`), I also get occasional failures that don't seem to be related to the issue above (not sure):

161: [ RUN      ] Topology.AsyncChangeStateTimeout
161: unknown file: Failure
161: C++ exception with description "Failed to start a new session. Error code 1; error: dds-session: error: invalid uuid string
161: " thrown in the test fixture's constructor.
161: [  FAILED  ] Topology.AsyncChangeStateTimeout (18 ms)


rbx commented Jul 6, 2021

There is also one run of this test where all of the devices, as well as some agents, keep running:

(screenshot attached)

The error in the log gives a bit more detail:

2021-07-06 14:07:46.551255   err    dds-agent            <0x002dd7b5:0x00007f01bd649640>    Error sending to : Broken pipe
2021-07-06 14:07:46.551363   err    dds-agent            <0x002dd7b5:0x00007f01b4e48640>    The received message doesn't have a handler: [] ID: 0; CRC: 0; data size (header+body): 16

Attaching its logs:
7a70a79b-461f-4d87-a5d4-ca88bf8db295.zip

(This is not the run with the C++ exception "Failed to start a new session. Error code 1; error: dds-session: error: invalid uuid string".)

@AnarManafov self-assigned this Jul 7, 2021
@AnarManafov added this to the 3.6 milestone Jul 7, 2021
@AnarManafov modified the milestones: 3.6, 3.8 Jan 11, 2022
@AnarManafov modified the milestones: 3.8, 3.9 Jan 12, 2024
@AnarManafov modified the milestones: 3.9, 3.10, 3.11 Apr 23, 2024