
logger: fix thread deadlock #14312

Merged 1 commit into PX4:master on Mar 6, 2020
Conversation

@dusan19 (Contributor) commented on Mar 6, 2020

Describe problem solved by this pull request
The logger sometimes doesn't create a log file.

If the logger is stopped at an unfortunate time, the buffer.close_file() call can be skipped in the log_writer_file run loop.

The lock gets unlocked here:
https://github.com/PX4/Firmware/blob/master/src/modules/logger/log_writer_file.cpp#L274-L277
which allows _buffers[0]._should_run to change to false. The file is not closed, because _buffers[0]._should_run was still true when that check happened. The logger now has no new data, so notify() is not called, which leaves
https://github.com/PX4/Firmware/blob/master/src/modules/logger/log_writer_file.cpp#L299
waiting indefinitely.

When the logger is started again, this while loop
https://github.com/PX4/Firmware/blob/master/src/modules/logger/log_writer_file.cpp#L90-L95
spins forever, waiting for the file to be closed, which is blocked by the pthread_cond_wait(&_cv, &_mtx), and no new log file is created.
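A minimal, self-contained sketch of this interleaving (names such as _should_run, _mtx and _cv mirror the logger code, but this is an illustration, not the actual PX4 implementation; the usleep() calls force the unlucky timing that normally only happens at random):

```cpp
// Toy reproduction of the lost-wakeup pattern described above.
// Build with: g++ -pthread deadlock.cpp && ./a.out   (hangs on purpose)
#include <pthread.h>
#include <unistd.h>
#include <cstdio>

static pthread_mutex_t _mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  _cv  = PTHREAD_COND_INITIALIZER;
static bool _should_run = true;   // cleared by the main thread on "logger stop"

static void *writer_run(void *)
{
	pthread_mutex_lock(&_mtx);

	while (true) {
		const bool running = _should_run;   // snapshot taken while locked

		// The real code unlocks here to do the (slow) file write...
		pthread_mutex_unlock(&_mtx);
		usleep(100000);                     // ..."logger stop" happens now
		pthread_mutex_lock(&_mtx);

		if (!running) {                     // stale snapshot: still true
			printf("closing file\n");       // never reached in the bad case
			break;
		}

		// The logger stopped, there is no new data, and nobody will
		// signal _cv again: this wait blocks forever, the file stays open.
		pthread_cond_wait(&_cv, &_mtx);
	}

	pthread_mutex_unlock(&_mtx);
	return nullptr;
}

int main()
{
	pthread_t writer;
	pthread_create(&writer, nullptr, writer_run, nullptr);

	usleep(50000);                          // let the writer pass its check
	pthread_mutex_lock(&_mtx);
	_should_run = false;                    // "logger stop", no notify follows
	pthread_mutex_unlock(&_mtx);

	pthread_join(writer, nullptr);          // hangs: deadlock reproduced
	return 0;
}
```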

Describe your solution
Do not wait for new data if the logger is no longer running; instead, run the loop once more so the remaining data can be written and the file closed.
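A hedged sketch of this fix, expressed against the toy program above rather than the verbatim PX4 diff: re-check the flag under the lock and only block while the logger is still running.

```cpp
// Replaces the unconditional wait at the bottom of the writer loop above.
// If the logger stopped during the unlocked write window, we skip the wait
// and loop once more: the stale 'running' snapshot is refreshed, the
// remaining data is flushed, and close_file() finally runs.
if (_should_run) {                      // fresh read, taken under the lock
	pthread_cond_wait(&_cv, &_mtx);
}
```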

Test data / coverage
Because this is a race condition, the problem reproduces at random times when stopping the logger; it occurs in SITL as well. I debugged the problem and tested the solution with various print statements. Since the failure is random, I validated the fix by catching situations in which the loop would have been stuck in the wait condition but was "saved" by the fix instead: one more loop iteration was performed and the file was closed.

Additional context
Our system basically never shuts down, which means that once this bug happens we no longer get logs until we notice they are missing (usually already too late) and reboot the drone.

@julianoes requested a review from @bkueng on Mar 6, 2020 at 11:50
@bkueng (Member) left a comment

Nice find and description! This looks correct.

Our system basically never shuts down

Out of curiosity, how long do you keep it running?

@bkueng merged commit d7a9b12 into PX4:master on Mar 6, 2020
@dusan19 (Contributor, Author) commented on Mar 6, 2020

Out of curiosity, how long do you keep it running?

It's the Sunflower Labs security system (I'm working together with @ndepal). In principle the system runs forever; it's not supposed to be switched off (although we think that the estimator, ekf2, sometimes breaks after the drone has been running for a long time).

@jkflying (Contributor) commented on Mar 6, 2020

although we think that the estimator ekf2, sometimes breaks after the drone is running for a long time

Do you have a log you can share? Open a bug on the ECL repo with the log and we can try to help.

@dusan19 (Contributor, Author) commented on Mar 6, 2020

Unfortunately, not having a log is what motivated this PR :(

@bkueng referenced this pull request in a later commit on Mar 26, 2020:
The file writer thread could get into a state where it blocked indefinitely
on pthread_cond_wait() (or rather until the logging stops).

This is very rare, and the following conditions must all be met:
- the buffer is almost empty (<4 KB filled), so that the writer thread does
  not write anything;
- an fsync call is scheduled (this happens once every second);
- the fsync call takes a long time (several hundred ms), during which the
  complete log buffer fills up.

The main thread would then get into the dropout state, where it no longer
calls _writer.notify().

Notifying the writer thread on every loop update of the main thread fixes
that. It does not increase resource usage.
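A hedged sketch of the pattern that commit message describes, with a hypothetical helper (write_new_messages_to_buffer() is not the actual PX4 API): the main thread signals the writer unconditionally on every loop iteration, and a spurious wake-up is harmless because the writer re-checks its predicate before sleeping again.

```cpp
#include <pthread.h>

extern pthread_mutex_t _mtx;            // shared with the writer thread
extern pthread_cond_t  _cv;
void write_new_messages_to_buffer();    // hypothetical: copies new log data
                                        // into the shared ring buffer

void main_thread_iteration()
{
	write_new_messages_to_buffer();

	// Unconditional wake-up on every iteration. A spurious signal is
	// cheap: the writer re-checks its predicate (enough data? still
	// running?) and goes back to sleep, so it cannot be starved even if
	// a slow fsync() stalls it while the buffer fills up.
	pthread_mutex_lock(&_mtx);
	pthread_cond_signal(&_cv);
	pthread_mutex_unlock(&_mtx);
}
```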