Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backtracking optimizations #3

Open
wants to merge 18 commits into
base: main
Choose a base branch
from
Open

Conversation

art-w
Copy link

@art-w art-w commented Oct 3, 2022

(Follow-up on #2 )

Sorry for the big dirty PR! I was inspired to optimize the backtracking and it now terminates in a few seconds on the examples that I care about:

  • I started with some simple stuff, like not re-running the execution from scratch unless we really have to (= we are backtracking)
  • The backtracking steps happen when two "threads" access the same atomic: I thought it could be more precise, as we don't need to test the alternative ordering if both were doing an Atomic.get (we only want to backtrack when the atomic usages could yield a different outcome, so a "read" with a "write" etc.)
  • The backtracking was scheduling the alternate thread for only one operation: In general, this operation is not the one we wanted to permute (just a prerequisite) and so it had to do it multiple times before the real goal was achieved. In the example below, if we wanted to swap (A3) with (B3) then it would bubble the B operations one at a time:
AAA(A3)AAB1B2(B3)...
   \B1(A3)AAB2(B3)...
     \B2(A3)AA(B3)...
       \B3(A3)AA...

It now does a "big" backstep by scheduling all the operations to permute in the backtracking:

AAA(A3)AAB1B2(B3)...
   \B1B2(B3)(A3)AA...

This wasn't so simple though, as now the atomics and spawns can be created in a different order, changing their globally unique int identifiers. The fix to give them stable ids was to track the creations locally in each thread rather than globally (so rather than "I'm the Xth atomic ever created" we do "I'm the Xth atomic created by thread Yth... which was itself created by thread Zth from the root" with X,Y,Z local counters to each parent threads).

Furthermore, we sometimes want to backstep a thread operation to a point where the thread didn't exist yet... and so we need to also backstep its parent operations that led to our thread creation.

Anyway, I tested it intensively and hopefully I didn't mess up too badly but this was quite involved so there might be some bugs lurking... Let me know if I can do anything to ease the review :)

@art-w art-w mentioned this pull request Oct 3, 2022

let make v = if !tracing then perform (Make v) else
begin
let i = !atomics_counter in
atomics_counter := !atomics_counter + 1;
(Atomic.make v, i)
atomics_counter := !atomics_counter - 1;
Copy link
Contributor

@bartoszmodelski bartoszmodelski Oct 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not added in this PR, but I think we should change atomics_counter to atomic - otherwise might race outside tests.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes absolutely, I'll switch the buggy ref to atomic!
(For context, I went with negative identifiers there to detect that a trace is interacting with an atomic created outside the tested function, but didn't follow up yet on that line of thoughts)

| Read_write, Some ptr -> add ptr last_read, add ptr last_write
| _ -> assert false
in
let new_clock = IdMap.add j new_time clock in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clock, new_clock look to have no meaningful use

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes you are right! I'll remove them :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed!

@bartoszmodelski
Copy link
Contributor

Thanks for the PR, very cool stuff!

I'm still reviewing it - feel free handle the comments as they flow in or all at once later on. Fwiw, I've run a lot of test traces already and it's holding up perfectly well.

if List.for_all causal replay_steps
then if IdSet.mem proc_id pre_s.enabled
then Some replay_steps
else let is_parent k s = k > lower && k < time - upper && s.run.op = Spawn && s.run.obj_ptr = Some proc_id in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the k > lower && k < time - upper condition needed? That is, isn't a Spawn of some proc_id unique within a single execution (state)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the check on the interval is a defensive measure, we can't always backtrack to before our spawn... This whole function is really hard to read though :/

while IdMap.(cardinal (map_diff_set s.backtrack !dones)) > 0 do
let j, new_steps = IdMap.min_binding (map_diff_set s.backtrack !dones) in
let new_explored =
if !is_backtracking || state_planned <> [] then !dones else IdSet.empty in
Copy link
Contributor

@bartoszmodelski bartoszmodelski Oct 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will assign empty set to new_explored if not backtracking && state_planned is empty but I think !dones has to be an empty set in such a case anyway.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree!

dones := explored ;
s.backtrack <- IdMap.singleton proc_id state_planned
end ;
let is_backtracking = ref false in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this correctly, s.backtrack is "overloaded" to also handle the initial case, which is not backtracking but just continuing existing execution. Then loop relies on this always being the first element in s.backtrack, hence above boolean for special treatment. Do you reckon it'd be worth trying to split up these separate cases?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code kept evolving in the "round-robin" branch where we don't have to remember if we are backtracking or not (... the code ain't perfect there either though!) Do you think it's worth fixing in this PR?

| Some lst -> List.length lst > List.length replay_steps
then pre_s.backtrack <- IdMap.add j replay_steps pre_s.backtrack

let map_diff_set map set =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps map_subtract_set is more accurate?

@art-w
Copy link
Author

art-w commented Nov 21, 2022

Rebased and fixed some issues, thanks again for the review! :)

However, I then stumbled upon a new test where dscheck should report an issue but doesn't (I'm not sure if the bug was introduced in this PR, as the test doesn't terminate in previous versions). It's related to backtracking when updating an Atomic created by another domain... but I haven't fully debugged it yet.

@talex5
Copy link
Contributor

talex5 commented Dec 28, 2022

I then stumbled upon a new test where dscheck should report an issue but doesn't (I'm not sure if the bug was introduced in this PR, as the test doesn't terminate in previous versions)

I have a test that (correctly) fails with main but passes with this PR.

git clone https://github.com/talex5/eio.git --branch dscheck-fast-bug
make dscheck

With main (c74d8a6):

Fatal error: exception File "lib_eio/core/test_cells/cells.ml", line 104, characters 46-52: Assertion failed

With this PR:

Finished after 2428 runs.

However, I'm probably just confused about how this is supposed to be used. I opened #13 with some questions about that.

@talex5
Copy link
Contributor

talex5 commented Dec 28, 2022

OK, here's a simpler test-case that detects the bug with main but not with this PR.

module Atomic = Dscheck.TracedAtomic
                  
let test () =
  let cancelled = Atomic.make false in
  let max_requests = Atomic.make 0 in
  Atomic.spawn (fun () ->
      Atomic.set cancelled true;
      Atomic.decr max_requests;
    );
  Atomic.spawn (fun () ->
      ignore (Atomic.get max_requests);
      assert (Atomic.get cancelled)     (* This bug should be detected *)
    )

let () = Atomic.trace test

With main:

Fatal error: exception File "bin/main.ml", line 12, characters 6-12: Assertion failed

With this PR:

Finished after 2 runs.

@art-w
Copy link
Author

art-w commented Feb 23, 2023

Thanks for the two counter examples! I ended up rewriting the backstepping algorithm to be a lot more straightforward (... it's still too tricky for my liking, but the structure makes it harder for it to go wrong by missing a branch)

It's of course slower than before because it doesn't wrongly skip as many traces, but it can still complete the eio testsuite if you're patient enough ^^'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants