Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix several data races #488

Merged
merged 4 commits into from
Jan 28, 2025
Merged

Fix several data races #488

merged 4 commits into from
Jan 28, 2025

Conversation

ijackson
Copy link
Contributor

There were several places where the code in libfaketime.c used ad-hoc synchronisation (eg for doing the initialisation only once) using non-atomic C variables. In a multithreaded program, this is not correct. The C language rules are that data races like this are completely forbidden, and the compiler is allowed to optimise the code using the assumption that such races do not ever occur.

In practice, while trying to fix faketime in Debian trixie, I found that faketime misbehaved in CI, in the test suites of various other packages. Notably, the Ruby interpreter, but other cases too. Some cases involved stack traces, and others "impossible" results. In each case, I examined the output from the test case, and the apparently-relevant code in libfaketime.c. In each case, I found code in libfaketime.c that seemed to be to be incorrect. Fixing the data races, by using the correct constructions and the correct primitives, fixed all the tests.

It's likely that the approach I've taken won't compile on some platforms targeted by libfaketime. I have relied on the existence of pthread_once. For platforms without pthread_once, there will hopefully be an analogous function. If there are platforms without an anlogous function, there is an obvious implementation in terms of a mutex, at the cost of some loss of performance. I would strongly counsel against attempts to open-code using spin locks and atomics, unless the code can be both written, and separately reviewed, by at least two experts in these techniques. In any case, implementations using ordinary non-atomic variables cannot be correct and will lead to race bugs. Sorry that I haven't provided support for those other platforms.

I think the final patch "Don't use try locking calls for monotonic_conds_lock" will fix #464.

Together with #487 and #486, these race fixes fix the CI failures with packages' test suites, in Debian.

At the cost of no longer nicely detecting recursive initialisation
problems.

Fixes Debian bug #1093599
Otherwise we can use this in an uninitialised state, which is not
allowed.

We call ftpl_init in pthread_cond_init_232, but the application might
not have called that.  For example, it might have a static condition
variable set up with PTHREAD_COND_INITIALIZER.
This reverts commit 8ef74e3
   "Swapped out pthread_rwlock_xxlock() ..."

This could result in concurrent uses of pthread_cond_* erroneously
returning EAGAIN, which is not permitted by the spec and which the
application way well treat as a bug.  This seems to be happening in
gem2deb in ci.debian.net.

The commit message in 8ef74e3 says (rewrapped)

    Swapped out pthread_rwlock_xxlock(), which doesn't return if it
    can't obtain the lock, with pthread_rwlock_xxtrylock() followed by
    sched yield and error code return. The issue is sometimes a thread
    calling pthread_cond_init() or pthread_cond_destroy() can't
    acquire the lock when another thread is waiting on a condition
    variable notification via pthread_cond_timedwait(), and thus the
    thread calling pthread_cond_init() or pthread_cond_destroy() end
    up hanging indefinitely.

I don't think this is true.  The things that are done with
monotonic_conds_lock held are HASH_ADD_PTR HASH_FIND_PTR etc. on
monotonic_conds, which should all be fast and AFAICT don't in turn
take any locks.  So it shouldn't deadlock.

I conjecture that the underlying bug being experienced by the author
of "Swapped out pthread_rwlock_xxlock" was the lack of ftpl_init - ie,
access to an uninitialised pthread_rwlock_t.  That might result in a
hang.
@ijackson
Copy link
Contributor Author

(force pushed to fix a Debianism in one of the bug references in a commit message.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unexpected return values from pthread_cond_init_232, pthread_cond_destroy_232, pthread_cond_timedwait
2 participants