Skip to content

Conversation

@MichaReiser
Copy link
Contributor

@MichaReiser MichaReiser commented May 27, 2025

This PR fixes two bugs related to nested fixpoint iteration that can lead threads to see incorrect results or panics due to #831

The main issue with our fixpoint iteration was that it assumed that a cycle would always be entered from the same query. This is given in single-threaded environments but cycles (especially if they have nested cyles) are likely entered from different queries.

The scenario is:

  • thread a: has a nested loop, it iterates the outermost loop
  • thread b: shares an inner cycle with thread a, but uses a different entry query than thread a.

The tests push this to the extreme with 4 threads and multiple queries. Shuttle revealed multiple bugs in this scenario.

Closes #831
Closes #893

Huge shoutout to @ibraheemdev for adding shuttle support. Narrowing this down without a shuttle would have been extremely hard, if not impossible. Shuttle also gives me a lot of confidence that the fixes from this PR cover most if not all cases.

Issue 1:

cycle_head_kind only returned Provisional if the memo's cycle heads contained itself (it's a provisional in its own cycle). The problem with this is that (even in non multi-threaded cases) we don't re-run queries that participate in an outer cycle: Specifically, in the added test:

  • query_d depends on query_c's initial value
  • query_e depends on query_c's initial value
  • query_c is an inner cycle

Both queries need to be re-run if query_c finished (as an inner cycle) and the outer cycle starts a second iteration, which includes starting a fresh round on query_c. Previously, cycle_head_kind returned NotProvisional because the provisional value for query_c no longer contained itself in cycle_heads; it only contains the cycle head of the outer query. Because of that, validate_provisional considered query_d and query_e as final after query_c completed the first time (it set verified_final to true).

The fix is that any memo where verified_final is false should be considered provisional, and provisional_retry should retry (which might finalize the memos), and validate_provisional should return false (and fallback to validate_same_iteration or re-execute).

I suspect that this is from before we had validate_same_iteration. This obviously leads to a significant performance regression (about 10%) because we now (correctly) run inner cycle queries more often.

Issue 2

The second issue is due to a similar problem. Thread 'a' releases the lock on the inner cycle once it reaches a final value. It then iterates the outer cycle. Thread b can now acquire the lock on the inner cycle. The result is that we iterate different cyles of the same outer cycle at the same time. This is problematic because the different threads can override each other's provisional values with fixpoint initial values whenever they hit a cycle. Thread b re-executes the query because the cycle heads aren't finalized (the outer cycle is still iterating), and validate_same_iteration will fail because thread b has a different query stack. This can lead to #831.

The fix in this PR is that salsa blocks on all cycle heads before re-executing a provisional memo from the same revision.

Issue 3

validate_same_iteration didn't work if a cycle runs on multiple threads because it only ever looked at the current query stack. This can result in hangs if the same query is re-executed repeatedly. The fix here is to look up the current memo and compare against its current iteration if requiring that memo results in a cycle with another thread, because this indicates that the two threads work on the same cycle.

Test plan

@MichaReiser MichaReiser force-pushed the deeply-nested-error branch 2 times, most recently from 1033ffa to 08c9448 Compare May 27, 2025 16:40
@netlify
Copy link

netlify bot commented May 27, 2025

Deploy Preview for salsa-rs canceled.

Name Link
🔨 Latest commit 2d00feb
🔍 Latest deploy log https://app.netlify.com/projects/salsa-rs/deploys/6835ea921972880008fb877a

@netlify
Copy link

netlify bot commented May 27, 2025

Deploy Preview for salsa-rs canceled.

Name Link
🔨 Latest commit 0c8420e
🔍 Latest deploy log https://app.netlify.com/projects/salsa-rs/deploys/683c0f892aa8be0008287b33

@codspeed-hq
Copy link

codspeed-hq bot commented May 27, 2025

CodSpeed Performance Report

Merging #882 will not alter performance

Comparing MichaReiser:deeply-nested-error (0c8420e) with master (80fb79e)

Summary

✅ 12 untouched benchmarks

@MichaReiser
Copy link
Contributor Author

MichaReiser commented May 27, 2025

Interesting, I can't reproduce this shuttle panic locally. I'm, not at all, surprised by the perf regression because we simply skipped work before.

src/function.rs Outdated
.into_iter()
.any(|head| head.database_key_index == self.database_key_index(input))
});
.is_some_and(|memo| !memo.revisions.verified_final.load(Ordering::Relaxed));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix number 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the doc comment on this method maybe needs updating now?

I'm also not sure if the name of the method or the returned struct are really sensible -- it doesn't seem like anything in this method is tied to the memo itself being a cycle head.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. The method itself has nothing to do with cycle heads. It's just that we only call the method on cycle heads. But something like provisional_status or similar would be better

@MichaReiser MichaReiser force-pushed the deeply-nested-error branch 2 times, most recently from 3777168 to f7bb021 Compare May 27, 2025 17:24
@MichaReiser MichaReiser force-pushed the deeply-nested-error branch 3 times, most recently from d7f2d7a to 0cc827d Compare May 29, 2025 08:19
Copy link
Contributor Author

@MichaReiser MichaReiser May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test looks so innocent, but it's everything but

@MichaReiser MichaReiser force-pushed the deeply-nested-error branch from a2b0107 to 336eaa1 Compare May 29, 2025 10:44
@MichaReiser MichaReiser requested a review from Copilot May 29, 2025 10:45
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes two bugs in the nested fixpoint iteration logic that caused incorrect results and panics in multithreaded scenarios. Key changes include:

  • Adding a new test (cycle_nested_deep) to cover deeply nested cycle cases across multiple threads.
  • Updating cycle head handling and memo verification logic to more correctly distinguish between provisional and final values.
  • Enhancing logging and refactoring functions to include a new zalsa_local parameter for improved cycle memo management.

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/parallel/main.rs Added module import for cycle_nested_deep test to cover new nested cycles.
tests/parallel/cycle_nested_deep.rs New test covering a deeply nested-cycle scenario across multiple threads.
tests/parallel/cycle_a_t1_b_t2_fallback.rs Adjusted imports to correctly reference KnobsDatabase and thread syncing utilities.
src/runtime.rs Added debug logs for cycle detection and thread blocking enhancements.
src/ingredient.rs Updated cycle head functions to return Final and use empty_cycle_heads for clarity.
src/function/memo.rs Refactored provisional_retry to use a new await_heads function; added validate_same_iteration.
src/function/maybe_changed_after.rs Updated to use validate_same_iteration for memo validation consistency.
src/function/fetch.rs Updated function calls to include zalsa_local and implemented waiting for cycle heads in provisional memos.
src/function/accumulated.rs Updated memo refresh logic to include zalsa_local.
src/function.rs Updated cycle_head_kind and cycle_heads logic to align with the new cycle semantics.
src/cycle.rs Refactored CycleHeadKind enum by replacing NotProvisional with Final.
Comments suppressed due to low confidence (1)

src/ingredient.rs:70

  • The documentation for this function should be updated to clarify that it now returns 'Final' instead of 'NotProvisional', ensuring that users understand the new handling of provisional memos.
fn cycle_head_kind(&self, zalsa: &Zalsa, input: Id) -> CycleHeadKind {

@MichaReiser MichaReiser marked this pull request as ready for review May 29, 2025 12:53
@MichaReiser MichaReiser requested a review from carljm May 29, 2025 12:59
@MichaReiser
Copy link
Contributor Author

There's one more hang that we're seeing in mypy primer run. I think it's related to the block_on_heads. At the end, only one thread is making progress. I somehow suspect that the problem is that execute keeps holding on to its lock while waiting. I wonder if we need to implement what I outlined in the todo: release the lock while blocking on any other query and then try to reclaim it (or at least, retry)

Copy link
Contributor

@carljm carljm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to head out for a bit but submitting a partial review with some comments. Will come back to this later.

src/function.rs Outdated
.into_iter()
.any(|head| head.database_key_index == self.database_key_index(input))
});
.is_some_and(|memo| !memo.revisions.verified_final.load(Ordering::Relaxed));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the doc comment on this method maybe needs updating now?

I'm also not sure if the name of the method or the returned struct are really sensible -- it doesn't seem like anything in this method is tied to the memo itself being a cycle head.

@MichaReiser MichaReiser force-pushed the deeply-nested-error branch from 144a4e5 to 0815dc1 Compare May 30, 2025 09:24
@MichaReiser MichaReiser force-pushed the deeply-nested-error branch from 0815dc1 to ebb9c52 Compare May 30, 2025 09:27
@MichaReiser
Copy link
Contributor Author

I'll need to clean this up a bit but this now fixes all known issues for ty.

The third issue was that validate_same_iteration only worked if all cycle participants ran on the same thread.

Copy link
Contributor

@carljm carljm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a quick look now, have to go -- can do a deeper review later if that would be useful. (I realize you said you plan to clean this up still, some of my comments probably duplicate what you already planned.)

@MichaReiser

This comment was marked as resolved.

@MichaReiser

This comment was marked as resolved.

@MichaReiser

This comment was marked as resolved.

@MichaReiser

This comment was marked as resolved.

@MichaReiser MichaReiser added this pull request to the merge queue Jun 1, 2025
Merged via the queue into salsa-rs:master with commit 2b51887 Jun 1, 2025
24 of 26 checks passed
@MichaReiser MichaReiser deleted the deeply-nested-error branch June 1, 2025 08:55
@github-actions github-actions bot mentioned this pull request May 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CycleStrategy::FallbackImmediate: Investigate shuttle failures assertion failed: existing.iteration_count == head.iteration_count

2 participants