Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-work atomic_directory locking for faster / clearer failures. #1961

Merged
merged 5 commits into from
Oct 21, 2022

Conversation

jsirois
Copy link
Member

@jsirois jsirois commented Oct 20, 2022

Change atomic_directory to always grab an exclusive lock and use a
stable work directory per target directory to surface multiple lock
owners as up-front warnings instead of possibly slow corruptions.

@jsirois
Copy link
Member Author

jsirois commented Oct 20, 2022

The 1st commit is a pure re-name such that only the subsequent commits need be looked at to see changes to the locking code and call sites.

@jsirois
Copy link
Member Author

jsirois commented Oct 20, 2022

I tested this over in Pants with ./pants fmt lint check test --force src/python:: tests/python:: after removing the named caches dir. For what that's worth with the elusive-to-repro missing files bugs, I hit no errors of that sort or the new fail-fast errors that should occur in their place.

Copy link
Collaborator

@benjyw benjyw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we can get rid of FileLockStyle.BSD ?

@jsirois
Copy link
Member Author

jsirois commented Oct 20, 2022

Does this mean we can get rid of FileLockStyle.BSD ?

No. That's still needed in two cases where thread pools are used. Posix locks do not work on their own in the presence of threads for several reasons.

You may enjoy bits of my reading list:

Poettering (pulseaudio, systemd):

Allison (Samba):

SQLite source code:

@jsirois
Copy link
Member Author

jsirois commented Oct 20, 2022

If it's not clear, this PR fixes nothing and I can still spot no bugs before or after this change. I'm only aiming to make what was, afaict, correct code, simpler. The additional change of not using a uuid suffix for the workdir just doubles down on the correctness assertion by asking for / arranging a clear error up front to prove something is somehow wrong with the locking code or its assumptions.

Comment on lines 151 to 154
# If there is an error making the work_dir that means file-locking guarantees have failed
# somehow and another process has the lock and has made the work_dir already. We let the error
# from os.mkdir propagate in that case.
os.mkdir(atomic_dir.work_dir)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could also mean that something segfaulted or was sigkilled, such that it exited uncleanly without tearing down the working directory?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's true. I'll step back and think about this a bit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I just warn now to ensure we get stderr data but clean up stale workdirs to prevent a stop-in-your-tracks bug that requires manual intervention.

@jsirois jsirois merged commit dbd4c13 into pex-tool:main Oct 21, 2022
@jsirois jsirois deleted the locking/debug branch October 21, 2022 19:12
jsirois added a commit to jsirois/pex that referenced this pull request Nov 3, 2022
This was broken by pex-tool#1961 which did not account for the two unlocked
uses of AtomicDirectory by the pex.resolver for wheel building and
installing. Although those directories deserve a second look as
candidates for exclusive locking, we have seen no known issues with
those directories since introduction of AtomicDirectory up until pex-tool#1961.
As such, this change just restores the prior behavior for those two
cases.

Fixes pex-tool#1968
jsirois added a commit that referenced this pull request Nov 3, 2022
This was broken by #1961 which did not account for the two unlocked
uses of AtomicDirectory by the pex.resolver for wheel building and
installing. Although those directories deserve a second look as
candidates for exclusive locking, we have seen no known issues with
those directories since introduction of AtomicDirectory up until #1961.
As such, this change just restores the prior behavior for those two
cases.

Fixes #1968
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants