Backtracking optimizations #3

art-w · 2022-10-03T10:18:12Z

(Follow-up on #2)

Sorry for the big dirty PR! I was inspired to optimize the backtracking, and it now terminates in a few seconds on the examples that I care about. The main changes:

- I started with some simple stuff, like not re-running the execution from scratch unless we really have to (i.e. when we are backtracking).
- The backtracking steps happen when two "threads" access the same atomic. I thought this could be more precise: we don't need to test the alternative ordering if both were doing an Atomic.get. We only want to backtrack when the atomic usages could yield a different outcome, e.g. a "read" paired with a "write".
- The backtracking was scheduling the alternate thread for only one operation. In general, that operation is not the one we wanted to permute (just a prerequisite), so it took several backtracking rounds before the real goal was achieved. In the example below, if we wanted to swap (A3) with (B3), it would bubble the B operations up one at a time:

AAA(A3)AAB1B2(B3)...
   \B1(A3)AAB2(B3)...
     \B2(A3)AA(B3)...
       \B3(A3)AA...

It now does a "big" backstep by scheduling all the operations to permute in the backtracking:

AAA(A3)AAB1B2(B3)...
   \B1B2(B3)(A3)AA...
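As a toy illustration of the difference (not dscheck's actual code), the sketch below models a trace as a plain list of steps and performs the "big" backstep in one go. The `step` type, the `big_backstep` function, and the per-thread numbering are all assumptions made for this example; the real scheduler works on its own state records.

```ocaml
(* Hypothetical model: a step is (thread name, per-thread operation number). *)
type step = char * int

(* Move every step of [thread] up to and including operation [upto]
   in front of [pivot], keeping all other steps in their original order. *)
let big_backstep ~pivot ~thread ~upto (trace : step list) =
  let moved, rest =
    List.partition (fun (t, n) -> t = thread && n <= upto) trace
  in
  let rec insert = function
    | [] -> moved (* pivot not found: the moved steps simply go last *)
    | s :: tl when s = pivot -> moved @ (s :: tl)
    | s :: tl -> s :: insert tl
  in
  insert rest

(* Simplified version of the diagram: swap (A3) with (B3) in one step. *)
let trace =
  [ ('A', 1); ('A', 2); ('A', 3); ('A', 4); ('A', 5);
    ('B', 1); ('B', 2); ('B', 3) ]

let swapped = big_backstep ~pivot:('A', 3) ~thread:'B' ~upto:3 trace
```

Here `swapped` places B1, B2 and B3 directly before A3, instead of bubbling one B operation per backtracking round.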

This wasn't so simple though, as the atomics and spawns can now be created in a different order, which changes their globally unique int identifiers. The fix to give them stable ids was to track creations locally in each thread rather than globally: rather than "I'm the Xth atomic ever created", we do "I'm the Xth atomic created by thread Y, which was itself created by thread Z from the root", with X, Y, Z being counters local to each parent thread.
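A minimal sketch of the idea, assuming an identifier can be represented as the path of local counters from the root thread. The `thread` record and the `root`, `spawn` and `fresh_atomic` functions are hypothetical names for this example, not the ones used in src/tracedAtomic.ml.

```ocaml
(* Hypothetical: an identifier is the path of local counters from the root. *)
type id = int list

type thread = {
  id : id;
  mutable next_atomic : int; (* atomics created so far by this thread *)
  mutable next_child : int;  (* threads spawned so far by this thread *)
}

let root () = { id = []; next_atomic = 0; next_child = 0 }

(* "I'm the Xth atomic created by this thread". *)
let fresh_atomic t =
  let n = t.next_atomic in
  t.next_atomic <- n + 1;
  t.id @ [ n ]

(* "I'm the Yth thread created by this parent". *)
let spawn t =
  let n = t.next_child in
  t.next_child <- n + 1;
  { id = t.id @ [ n ]; next_atomic = 0; next_child = 0 }
```

With this scheme, two sibling threads can create their atomics in either order and each atomic still receives the same identifier, because no counter is shared between threads.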

Furthermore, we sometimes want to backstep a thread operation to a point where the thread didn't exist yet... and so we need to also backstep its parent operations that led to our thread creation.

Anyway, I tested it intensively and hopefully I didn't mess up too badly but this was quite involved so there might be some bugs lurking... Let me know if I can do anything to ease the review :)
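For reference, the refinement about pairs of Atomic.get described earlier can be pictured as a small dependency check. This is only a sketch under assumptions: the `op` constructors and the `conflicts` function are invented for this example, and only the `Read_write` category name appears in the diff under review.

```ocaml
(* Hypothetical operation kinds; the real dscheck type may differ. *)
type op = Get | Set | Exchange | Compare_and_swap | Fetch_and_add

type access = Read | Write | Read_write

let access_of = function
  | Get -> Read
  | Set -> Write
  | Exchange | Compare_and_swap | Fetch_and_add -> Read_write

(* Two steps need the alternative ordering explored only when they touch
   the same atomic and at least one of them may write to it. *)
let conflicts (op1, ptr1) (op2, ptr2) =
  ptr1 = ptr2
  &&
  match access_of op1, access_of op2 with
  | Read, Read -> false
  | _ -> true
```

Under this model, `conflicts (Get, 1) (Get, 1)` is false, so no backtracking point is recorded for two reads of the same atomic.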

bartoszmodelski · 2022-10-24T19:33:30Z

src/tracedAtomic.ml


 let make v = if !tracing then perform (Make v) else
    begin
      let i = !atomics_counter in
-      atomics_counter := !atomics_counter + 1;
-      (Atomic.make v, i)
+      atomics_counter := !atomics_counter - 1;


Not added in this PR, but I think we should change atomics_counter to an Atomic; otherwise it might race outside tests.

art-w: Yes absolutely, I'll switch the buggy ref to an atomic!
(For context, I went with negative identifiers there to detect when a trace interacts with an atomic created outside the tested function, but I haven't followed up on that line of thought yet.)
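One possible shape for that change, as a sketch only: the counter becomes a standard-library `Atomic.t` and each identifier is reserved with `Atomic.fetch_and_add`. The function name `make_untraced` and the initial value `-1` are assumptions for this example; the actual code lives inside `make` and its starting value is not shown in the diff.

```ocaml
(* Assumed initial value; identifiers count downwards, as in the diff. *)
let atomics_counter = Atomic.make (-1)

(* Hypothetical stand-in for the non-tracing branch of [make]. *)
let make_untraced v =
  (* fetch_and_add returns the previous value and decrements in one
     atomic step, so two domains can never be handed the same id. *)
  let i = Atomic.fetch_and_add atomics_counter (-1) in
  (Atomic.make v, i)
```

Compared with reading and then writing a `ref`, there is no window between the read and the update in which another domain could observe the same counter value.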

bartoszmodelski · 2022-10-24T19:36:13Z

src/tracedAtomic.ml

+      | Read_write, Some ptr -> add ptr last_read, add ptr last_write
+      | _ -> assert false
+    in
+    let new_clock = IdMap.add j new_time clock in


clock and new_clock appear to have no meaningful use.

art-w: Oh yes, you are right! I'll remove them :)

bartoszmodelski · 2022-10-24T19:42:06Z

Thanks for the PR, very cool stuff!

I'm still reviewing it - feel free to handle the comments as they flow in or all at once later on. Fwiw, I've run a lot of test traces already and it's holding up perfectly well.

bartoszmodelski · 2022-10-31T14:30:47Z

src/tracedAtomic.ml

+    if List.for_all causal replay_steps
+    then if IdSet.mem proc_id pre_s.enabled
+         then Some replay_steps
+         else let is_parent k s = k > lower && k < time - upper && s.run.op = Spawn && s.run.obj_ptr = Some proc_id in


Is the k > lower && k < time - upper condition needed? That is, isn't a Spawn of some proc_id unique within a single execution (state)?

art-w: Yes, the check on the interval is a defensive measure: we can't always backtrack to before our spawn... This whole function is really hard to read though :/

bartoszmodelski · 2022-10-31T16:11:55Z

src/tracedAtomic.ml

+  while IdMap.(cardinal (map_diff_set s.backtrack !dones)) > 0 do
+    let j, new_steps = IdMap.min_binding (map_diff_set s.backtrack !dones) in
+    let new_explored =
+      if !is_backtracking || state_planned <> [] then !dones else IdSet.empty in


This assigns the empty set to new_explored when we are not backtracking and state_planned is empty, but I think !dones has to be the empty set in that case anyway.

bartoszmodelski · 2022-10-31T17:15:54Z

src/tracedAtomic.ml

+      dones := explored ;
+      s.backtrack <- IdMap.singleton proc_id state_planned
+  end ;
+  let is_backtracking = ref false in


If I understand this correctly, s.backtrack is "overloaded" to also handle the initial case, which is not backtracking but just continuing the existing execution. The loop then relies on this always being the first element in s.backtrack, hence the boolean above for special treatment. Do you reckon it'd be worth trying to split up these separate cases?

art-w: This code kept evolving in the "round-robin" branch, where we don't have to remember whether we are backtracking or not (... the code ain't perfect there either though!). Do you think it's worth fixing in this PR?

bartoszmodelski · 2022-10-31T17:16:30Z

src/tracedAtomic.ml

+        | Some lst -> List.length lst > List.length replay_steps
+        then pre_s.backtrack <- IdMap.add j replay_steps pre_s.backtrack
+
+let map_diff_set map set =


perhaps map_subtract_set is more accurate?
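To make the naming question concrete: judging only from the call sites above (`map_diff_set s.backtrack !dones`), the helper presumably drops from the map every binding whose key is in the set. The body is not shown in the diff, so the following is a guess at its behaviour, written under the suggested name and with `IdMap` and `IdSet` assumed to be integer-keyed standard-library containers.

```ocaml
(* Assumed definitions; the real modules are declared elsewhere in the file. *)
module IdMap = Map.Make (Int)
module IdSet = Set.Make (Int)

(* Keep only the bindings whose key is NOT a member of [set]. *)
let map_subtract_set map set =
  IdMap.filter (fun key _ -> not (IdSet.mem key set)) map
```

Read this way, "subtract" describes the operation well: the result is the map minus the keys already explored.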

art-w · 2022-11-21T12:34:49Z

Rebased and fixed some issues, thanks again for the review! :)

However, I then stumbled upon a new test where dscheck should report an issue but doesn't (I'm not sure if the bug was introduced in this PR, as the test doesn't terminate in previous versions). It's related to backtracking when updating an Atomic created by another domain... but I haven't fully debugged it yet.

talex5 · 2022-12-28T11:24:23Z

> I then stumbled upon a new test where dscheck should report an issue but doesn't (I'm not sure if the bug was introduced in this PR, as the test doesn't terminate in previous versions)

I have a test that (correctly) fails with main but passes with this PR.

git clone https://github.com/talex5/eio.git --branch dscheck-fast-bug
make dscheck

With main (c74d8a6):

Fatal error: exception File "lib_eio/core/test_cells/cells.ml", line 104, characters 46-52: Assertion failed

With this PR:

Finished after 2428 runs.

However, I'm probably just confused about how this is supposed to be used. I opened #13 with some questions about that.

talex5 · 2022-12-28T19:43:19Z

OK, here's a simpler test-case that detects the bug with main but not with this PR.

module Atomic = Dscheck.TracedAtomic
                  
let test () =
  let cancelled = Atomic.make false in
  let max_requests = Atomic.make 0 in
  Atomic.spawn (fun () ->
      Atomic.set cancelled true;
      Atomic.decr max_requests;
    );
  Atomic.spawn (fun () ->
      ignore (Atomic.get max_requests);
      assert (Atomic.get cancelled)     (* This bug should be detected *)
    )

let () = Atomic.trace test

With main:

Fatal error: exception File "bin/main.ml", line 12, characters 6-12: Assertion failed

With this PR:

Finished after 2 runs.
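To see which schedule an exhaustive checker has to reach, the test case is small enough to enumerate by hand. The sketch below does not use dscheck at all: it is a plain brute-force model in which each thread is a list of symbolic operations (the `op` constructors are names made up for this example), every order-preserving interleaving is generated, and the assertion is simulated on each one.

```ocaml
(* Symbolic versions of the four operations in the test above. *)
type op = Set_cancelled | Decr_requests | Get_requests | Assert_cancelled

let thread1 = [ Set_cancelled; Decr_requests ]
let thread2 = [ Get_requests; Assert_cancelled ]

(* All interleavings of two sequences that preserve each sequence's order. *)
let rec interleavings xs ys =
  match xs, ys with
  | [], l | l, [] -> [ l ]
  | x :: xs', y :: ys' ->
    List.map (fun l -> x :: l) (interleavings xs' ys)
    @ List.map (fun l -> y :: l) (interleavings xs ys')

(* Run one schedule; return false if the assertion would fail. *)
let passes schedule =
  let cancelled = ref false and ok = ref true in
  List.iter
    (function
      | Set_cancelled -> cancelled := true
      | Assert_cancelled -> if not !cancelled then ok := false
      | Decr_requests | Get_requests -> ())
    schedule;
  !ok

let all = interleavings thread1 thread2
let failing = List.filter (fun s -> not (passes s)) all
```

Of the six interleavings, exactly one fails: the second thread runs both of its operations before the first thread sets `cancelled`. A search that stops after 2 runs cannot have covered it.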

art-w · 2023-02-23T01:07:21Z

Thanks for the two counterexamples! I ended up rewriting the backstepping algorithm to be a lot more straightforward (... it's still too tricky for my liking, but the structure makes it harder for it to go wrong by missing a branch).

It's of course slower than before because it doesn't wrongly skip as many traces, but it can still complete the eio testsuite if you're patient enough ^^'

art-w mentioned this pull request Oct 3, 2022

Round robin scheduler #4

Open


art-w force-pushed the optims branch from 6c48557 to 16c8663 on November 21, 2022 12:23

talex5 mentioned this pull request Dec 29, 2022

Make Eio.Condition lock-free ocaml-multicore/eio#397

Merged

talex5 mentioned this pull request Jan 25, 2023

Add cancellable lock-free synchronous channel ocaml-multicore/eio#413

Merged

art-w added 16 commits January 28, 2023 17:04

621e48e full replay only when necessary
b9f806e list optims
a497185 more precise backtracking
62c5e2e fixup num_runs
42d8418 fixup backtrack
0252bf1 backtracking: mark only last operation
5b606bb big bad backtracking steps
4e6f1b3 add simple test
e68277b fix: allow backtracking inside big steps
b923d3b fix: print replay trace for all unhandled errors
92d6015 stable domains/atomics identifiers
9d4fc10 fix: backtracking nested spawns
5bdc331 fix: atomic ops categorization
c2e08ba fix: atomic uid generation
26d1e7c simplify dead code
7fa6a54 fix: backtracking dependencies and process spawn/start

art-w added 2 commits February 4, 2023 22:37

d0c0803 use alcotest
9d7192d prettier output

art-w force-pushed the optims branch from 63aad4d to 9d7192d on February 23, 2023 00:34

This was referenced Feb 23, 2023

Fix memory leak #14

Merged

Add random tests generator #15

Open

art-w mentioned this pull request Mar 22, 2023

MPMC unbounded queue ocaml-multicore/saturn#35

Closed
