-
Notifications
You must be signed in to change notification settings - Fork 194
fix: multithreaded nested fixpoint iteration #882
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1033ffa to
08c9448
Compare
✅ Deploy Preview for salsa-rs canceled.
|
✅ Deploy Preview for salsa-rs canceled.
|
CodSpeed Performance ReportMerging #882 will not alter performanceComparing Summary
|
|
Interesting, I can't reproduce this shuttle panic locally. I'm, not at all, surprised by the perf regression because we simply skipped work before. |
src/function.rs
Outdated
| .into_iter() | ||
| .any(|head| head.database_key_index == self.database_key_index(input)) | ||
| }); | ||
| .is_some_and(|memo| !memo.revisions.verified_final.load(Ordering::Relaxed)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix number 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the doc comment on this method maybe needs updating now?
I'm also not sure if the name of the method or the returned struct are really sensible -- it doesn't seem like anything in this method is tied to the memo itself being a cycle head.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. The method itself has nothing to do with cycle heads. It's just that we only call the method on cycle heads. But something like provisional_status or similar would be better
3777168 to
f7bb021
Compare
d7f2d7a to
0cc827d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test looks so innocent, but it's everything but
a2b0107 to
336eaa1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes two bugs in the nested fixpoint iteration logic that caused incorrect results and panics in multithreaded scenarios. Key changes include:
- Adding a new test (cycle_nested_deep) to cover deeply nested cycle cases across multiple threads.
- Updating cycle head handling and memo verification logic to more correctly distinguish between provisional and final values.
- Enhancing logging and refactoring functions to include a new zalsa_local parameter for improved cycle memo management.
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/parallel/main.rs | Added module import for cycle_nested_deep test to cover new nested cycles. |
| tests/parallel/cycle_nested_deep.rs | New test covering a deeply nested-cycle scenario across multiple threads. |
| tests/parallel/cycle_a_t1_b_t2_fallback.rs | Adjusted imports to correctly reference KnobsDatabase and thread syncing utilities. |
| src/runtime.rs | Added debug logs for cycle detection and thread blocking enhancements. |
| src/ingredient.rs | Updated cycle head functions to return Final and use empty_cycle_heads for clarity. |
| src/function/memo.rs | Refactored provisional_retry to use a new await_heads function; added validate_same_iteration. |
| src/function/maybe_changed_after.rs | Updated to use validate_same_iteration for memo validation consistency. |
| src/function/fetch.rs | Updated function calls to include zalsa_local and implemented waiting for cycle heads in provisional memos. |
| src/function/accumulated.rs | Updated memo refresh logic to include zalsa_local. |
| src/function.rs | Updated cycle_head_kind and cycle_heads logic to align with the new cycle semantics. |
| src/cycle.rs | Refactored CycleHeadKind enum by replacing NotProvisional with Final. |
Comments suppressed due to low confidence (1)
src/ingredient.rs:70
- The documentation for this function should be updated to clarify that it now returns 'Final' instead of 'NotProvisional', ensuring that users understand the new handling of provisional memos.
fn cycle_head_kind(&self, zalsa: &Zalsa, input: Id) -> CycleHeadKind {
|
There's one more hang that we're seeing in mypy primer run. I think it's related to the |
carljm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to head out for a bit but submitting a partial review with some comments. Will come back to this later.
src/function.rs
Outdated
| .into_iter() | ||
| .any(|head| head.database_key_index == self.database_key_index(input)) | ||
| }); | ||
| .is_some_and(|memo| !memo.revisions.verified_final.load(Ordering::Relaxed)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the doc comment on this method maybe needs updating now?
I'm also not sure if the name of the method or the returned struct are really sensible -- it doesn't seem like anything in this method is tied to the memo itself being a cycle head.
…en't verified final (They should be validated by `validate_same_iteration` or wait for the cycle head
144a4e5 to
0815dc1
Compare
0815dc1 to
ebb9c52
Compare
…er cycle made progress (in which case there's a probably a new memo)
|
I'll need to clean this up a bit but this now fixes all known issues for ty. The third issue was that |
carljm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a quick look now, have to go -- can do a deeper review later if that would be useful. (I realize you said you plan to clean this up still, some of my comments probably duplicate what you already planned.)
This PR fixes two bugs related to nested fixpoint iteration that can lead threads to see incorrect results or panics due to #831
The main issue with our fixpoint iteration was that it assumed that a cycle would always be entered from the same query. This is given in single-threaded environments but cycles (especially if they have nested cyles) are likely entered from different queries.
The scenario is:
The tests push this to the extreme with 4 threads and multiple queries. Shuttle revealed multiple bugs in this scenario.
Closes #831
Closes #893
Huge shoutout to @ibraheemdev for adding shuttle support. Narrowing this down without a shuttle would have been extremely hard, if not impossible. Shuttle also gives me a lot of confidence that the fixes from this PR cover most if not all cases.
Issue 1:
cycle_head_kindonly returnedProvisionalif the memo's cycle heads contained itself (it's a provisional in its own cycle). The problem with this is that (even in non multi-threaded cases) we don't re-run queries that participate in an outer cycle: Specifically, in the added test:query_ddepends onquery_c's initial valuequery_edepends onquery_c's initial valuequery_cis an inner cycleBoth queries need to be re-run if
query_cfinished (as an inner cycle) and the outer cycle starts a second iteration, which includes starting a fresh round onquery_c. Previously,cycle_head_kindreturnedNotProvisionalbecause the provisional value forquery_cno longer contained itself incycle_heads; it only contains the cycle head of the outer query. Because of that,validate_provisionalconsideredquery_dandquery_eas final afterquery_ccompleted the first time (it setverified_finaltotrue).The fix is that any memo where
verified_finalis false should be considered provisional, andprovisional_retryshould retry (which might finalize the memos), andvalidate_provisionalshould returnfalse(and fallback tovalidate_same_iterationor re-execute).I suspect that this is from before we had
validate_same_iteration. This obviously leads to a significant performance regression (about 10%) because we now (correctly) run inner cycle queries more often.Issue 2
The second issue is due to a similar problem. Thread 'a' releases the lock on the inner cycle once it reaches a final value. It then iterates the outer cycle. Thread
bcan now acquire the lock on the inner cycle. The result is that we iterate different cyles of the same outer cycle at the same time. This is problematic because the different threads can override each other's provisional values with fixpoint initial values whenever they hit a cycle. Thread b re-executes the query because the cycle heads aren't finalized (the outer cycle is still iterating), andvalidate_same_iterationwill fail because thread b has a different query stack. This can lead to #831.The fix in this PR is that salsa blocks on all cycle heads before re-executing a provisional memo from the same revision.
Issue 3
validate_same_iterationdidn't work if a cycle runs on multiple threads because it only ever looked at the current query stack. This can result in hangs if the same query is re-executed repeatedly. The fix here is to look up the current memo and compare against its current iteration if requiring that memo results in a cycle with another thread, because this indicates that the two threads work on the same cycle.Test plan