-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attempt to make the monarch more thread safe. #11189
Conversation
@check-spelling-bot ReportUnrecognized words, please review:
Previously acknowledged words that are now absentSPACEBAR Unregister xIcon yIconTo accept these unrecognized words as correct (and remove the previously acknowledged and now absent words), run the following commands... in a clone of the git@github.com:Rosefield/terminal.git repository
✏️ Contributor please read thisBy default the command suggestion will generate a file named based on your commit. That's generally ok as long as you add the file to your commit. Someone can reorganize it later.
If the listed items are:
See the 🔬 You can test your commits without appending to a PR by creating a new branch with that extra change and pushing it to your fork. The check-spelling action will run in response to your push -- it doesn't require an open pull request. By using such a branch, you can limit the number of typos your peers see you make. 😉 🗜️ If you see a bunch of garbageIf it relates to a ... well-formed patternSee if there's a pattern that would match it. If not, try writing one and adding it to a Patterns are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your lines. Note that patterns can't match multiline strings. binary-ish stringPlease add a file path to the File paths are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your files.
|
Since it doesn't seem like I can assign reviews, @zadjii-msft @DHowett @lhecker @leonMSFT I would appreciate plenty of eyes and advice on this. |
{ | ||
try | ||
std::shared_lock lock{ _peasantsMutex }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you hold this lock for the entire duration of the function you don't need to lock it again below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I deliberately wanted to separate the read only vs the rw portions of the lock to reduce contention if there was nothing to be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lock is largely uncontented and as such there shouldn't be any noticeable advantage to using read-only locking. Personally I'd just acquire an exclusive lock in all cases. But I don't mind the current code either.
src/cascadia/Remoting/Monarch.h
Outdated
std::unordered_map<uint64_t, winrt::Microsoft::Terminal::Remoting::IPeasant> _peasants; | ||
std::shared_mutex _peasantsMutex{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using
til::shared_mutex<std::unordered_map<uint64_t, IPeasant>> _peasants
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did originally consider it, but disregarded because the first iteration of this had a single std::recursive_mutex _mutex
that was used for locking both _peasants
and _mruPeasants
. Of course the current iteration does not have that, and I was able to separate out (and downgrade to) a std::shared_mutex
.
Once we are confident that this locking strategy is what we want I can take another look at using til::shared_mutex
, although I don't look forward to updating all of the remoting tests that rely on friend access.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If possibly I would like this refactor to be deferred for now. I took a shot at doing so, but it proved to be a bit more difficult than I hoped.
src/cascadia/Remoting/Monarch.h
Outdated
|
||
std::vector<Remoting::WindowActivatedArgs> _mruPeasants; | ||
std::recursive_mutex _mruPeasantsMutex{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there ever a situation where this mutex is locked recursively?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_getMostRecentPeasantID calls (through a couple hops) _clearOldMruEntries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right... That's unfortunate - recursive mutexes are really annoying (it uses critical sections brrrr).
I wonder if it would simplify the code if we just slapped a single mutex to the beginning of all public methods of this class and made all private methods non-thread-safe. It would get rid of the recursion and the atomics at least... The Monarch class has an extremely slow "throughput" (API calls per second) right?
Hmm... 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In normal circumstances, yes. The problem is with loading / quitting where there are potentially "many" (O(#windows)) calls, and of course that is where I found these problems in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I', not terribly worried about the perf here, there's not really going to be that many calls coming through here per second, and they're pretty much all user-input driven. Plus, O(N) where N=number of windows? That's gonna be pretty small unless you're actively trying to spawn 100's of windows per second and close them quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can take a look if there is an easy refactor to get rid of the recursive mutex for a shared_mutex, but I think it is still better to have the more granular locking in general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without rewriting _getMostRecentPeasantID
and _getPeasant
the easiest safe thing I can think of is to pass a std::unique_lock<...>*
to _getPeasant
and _clearOldMruEntries
as an argument and if that is null then lock in _clearOldMruEntries
. I'm not a huge fan of that approach because it is error prone, but functional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made _getPeasant
take an optional clearMruPeasantOnFailure
boolean, which is set to false in _getMostRecentPeasantID
. I then updated _getMostRecentPeasantID
to handle its own cleanup, and switched the recursive mutex to a shared mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spent more time trying to make sure we can't deadlock. I am ~90% confident, but have not proved, that the current version of the code cannot deadlock. Currently _getMostRecentPeasantID
(and transitively anything that calls it) is the only method that attempts to hold both locks at the same time. There should not be the case where someone else holds a unique lock on _peasantsMutex
and is waiting to get a lock on _mruPeasantsMutex
which would deadlock with _getMostRecentPeasantID
.
// If we are quitting we don't care about maintaining our list of | ||
// peasants anymore, and don't need to notify the host that something | ||
// changed. | ||
if (_quitting.load(std::memory_order_acquire)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need to sync the memory with Monarch::_handleQuitAll
?
Is there a reason _quitting
is needed? Why would a listener care if it got notified during shutdown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is necessary, but since each peasant calls this on close I don't want them waiting for the unique_lock
until after the _handleQuitAll
finishes. This would also trigger _GetNumberOfPeasants through a couple of event hops and that would be a bunch of needless churn and lock contention.
…r code we expect a dead peasant to have.
As a sanity check I tried #11083 with these changes merged locally, and was able to successfully open / save / quit / reload at least 34 windows multiple times without any troubles. Trying to use the base branch I regularly got heap corruption errors on load with the same saved state. The saved state for reference: Tried again with 128 windows, as a rough benchmark it takes ~1s after the quit confirmation for all windows to be visibly closed, and on open it takes ~18s for all windows to be visually open (showing a terminal). On load includes a sizeable amount of other applications freezing (such as the timer app) for multiple seconds, but that is probably reasonable behavior for spawning ~384 processes. |
…-thread-safe Conflicts: src/cascadia/Remoting/Monarch.cpp src/cascadia/UnitTests_Remoting/RemotingTests.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems good to me! Some of the other stuff lhecker mentioned might be nice too, but this is definitely better than before. Thanks for helping clean up some of my shortcomings 😄
void RequestHideNotificationIcon() { throw winrt::hresult_error{}; }; | ||
void RequestQuitAll() { throw winrt::hresult_error{}; }; | ||
void Quit() { throw winrt::hresult_error{}; }; | ||
void AssignID(uint64_t /*id*/) { throw winrt::hresult_error(winrt::hresult{ (int32_t)0x800706ba }); }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may want to actually use RPC_S_SERVER_UNAVAILABLE
(or RPC_E_SERVER_UNAVAILABLE
, unclear what the difference is)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where are these constants defined? I'm not seeing them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
winerror.h
, which I would have presumed was already included. Oh but that might be an NTSTATUS
, not a HRESULT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#define RPC_S_SERVER_UNAVAILABLE 1722L
Yah, that looks like a different constant
src/cascadia/Remoting/Monarch.h
Outdated
|
||
std::vector<Remoting::WindowActivatedArgs> _mruPeasants; | ||
std::recursive_mutex _mruPeasantsMutex{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I', not terribly worried about the perf here, there's not really going to be that many calls coming through here per second, and they're pretty much all user-input driven. Plus, O(N) where N=number of windows? That's gonna be pretty small unless you're actively trying to spawn 100's of windows per second and close them quickly.
…sive_mutex. This is done by making _getPeasant not remove from _mruPeasants on failure, and _getMostRecentPeasantID handle its own cleanup.
@check-spelling-bot ReportUnrecognized words, please review:
Previously acknowledged words that are now absentsid SPACEBAR Unregister xIcon yIconTo accept these unrecognized words as correct (and remove the previously acknowledged and now absent words), run the following commands... in a clone of the git@github.com:Rosefield/terminal.git repository
✏️ Contributor please read thisBy default the command suggestion will generate a file named based on your commit. That's generally ok as long as you add the file to your commit. Someone can reorganize it later.
If the listed items are:
See the 🔬 You can test your commits without appending to a PR by creating a new branch with that extra change and pushing it to your fork. The check-spelling action will run in response to your push -- it doesn't require an open pull request. By using such a branch, you can limit the number of typos your peers see you make. 😉 🗜️ If you see a bunch of garbageIf it relates to a ... well-formed patternSee if there's a pattern that would match it. If not, try writing one and adding it to a Patterns are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your lines. Note that patterns can't match multiline strings. binary-ish stringPlease add a file path to the File paths are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your files.
|
Curious ... does this do anything to help the issue where you can press win+backtick twice really fast and get two windows that think they're named |
I just tested, no it does not. In fact it might make it more common. I think fixing that would require separate synchronization in the monarch. Basically the following happens
There are locks each time |
…deadlock from being introduced.
…at take unique locks.
src/cascadia/Remoting/Monarch.cpp
Outdated
// I wish I had std::erase_if (C++20) but I don't yet | ||
// partition the array into [not dead | dead ] | ||
auto partition = std::stable_partition(_mruPeasants.begin(), _mruPeasants.end(), [&](const auto& p) { | ||
const auto id = p.PeasantID(); | ||
const auto it = std::find(peasantsToErase.cbegin(), peasantsToErase.cend(), id); | ||
// keep the element if it was not found in the list to erase. | ||
return it == peasantsToErase.cend(); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't you use std::remove_if
(erase-remove idiom)?
const auto it = std::remove_if(_mruPeasants.begin(), _mruPeasants.end(), [&](const auto& p) {
const auto id = p.PeasantID();
const auto it = std::find(peasantsToErase.cbegin(), peasantsToErase.cend(), id);
return it != peasantsToErase.cend();
});
_mruPeasants.erase(it, _mruPeasants.end());
Since you repeatedly do a std::find
over a vector it would be preferable to use a std::unordered_set
for peasantsToErase
. Afterwards you can write:
_mruPeasants.erase(std::remove_if(_mruPeasants.begin(), _mruPeasants.end(), [&](const auto& p) {
return peasantsToErase.count(p.PeasantID());
}), _mruPeasants.end());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is definitely clearer. For some reason I couldn't find remove_if
(maybe I was looking on std::vector
directly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't think there is value in using unordered_set, especially with an expected small number of values to erase, but that is mostly me being stubborn. Really the problem is that the C++ std library is deficient and there is no convenient .contains
on a vector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also view it in a different light: peasantsToErase
is almost always empty during runtime, so the performance benefit of using a vector normally doesn't apply. Neither container class allocates memory in such a case.
On the other hand if - due to some random misfortune - we do end up having a lot of peasants to erase, the unordered set will perform asymptotically better. It's also more readable in my opinion.
Although I honestly wouldn't block this PR over such a detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be more pedantic, unordered_set.contains
has worst case runtime of linear still :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The hash function we use through the STL (FNV1a) makes that quite unlikely however.
std::unordered_set
is about ~2x slower than std::vector
for such worst cases IIRC (with small data sets).
I'm fairly stoked on adding robin_hood::unordered_map
soon, as it uses a plain vector as the underlying container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
switched this and the other cases that used vectors to unordered_sets, and made the _clearOldMruEntries just take a set as parameter.
{ | ||
std::shared_lock lock{ _peasantsMutex }; | ||
names.reserve(_peasants.size()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW This makes me certain that we should create a thread safe proxy class for the Monarch in the future.
(Basically wrap the public methods of the class and make them thread-safe. The actual MonarchImpl class can then continue pretending threads don't exist.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is something to say about the simplicity of just throwing a std::unique_lock lock {_mutex}
at the top of everything. There is a minor concern about event handlers that call back up to the monarch but that is a solvable problem too. 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah if you consider how we got event handlers, it's clear that recursive mutexes are the best solution here. You already have me entirely convinced about that. 😄
In either case I suppose that's something for a different time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I'm still cool with this. I agree, this would probably be easier with one lock for all of the monarch internals, but I'm not going to stand in the way of correctness.
Hello @zadjii-msft! Because this pull request has the p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (
|
🎉 Handy links: |
Summary of the Pull Request
References
#11083
#11143
PR Checklist
Detailed Description of the Pull Request / Additional comments
While testing the save/quit features a number of issues were found that were caused by poor synchronization on the monarch, resulting in various unexpected crashes. Because this uses std collections, and I didn't see any builtin winrt multithreaded containers I went with the somewhat heavy-handed mutex approach.
e.g.
This also makes it so that on quit peasants don't try to become the monarch, and the monarch closes their peasant last to prevent elections from happening.
Validation Steps Performed
Create many windows (hold down ctrl-shift-n) then use the quit action from peasants/the monarch to make sure everything closes properly.