Initial OCaml 5 / Eio port #71

talex5 · 2023-07-11T16:06:16Z

This is a port of the solver service to OCaml 5 and Eio. Since most of the code in the old version was managing the child processes, which no longer exist, a lot of the code has changed. In particular, we had some bugs with the parent-child communication channel recently, and that code is now gone completely :-)

There are a few other advantages to the new version:

First, it loads opam files lazily, so start up is much faster. Running the original stress test (solving 3 packages in parallel) gives:

Before: 40.16s
After: 5.47s

That should help with development, and also if we have multiple clients wanting different opam-repository commits (this can be optimised further; we only cache the most recently used commit for now).

Also, the code is much simpler now. wc -l service/* worker/* went from 2261 lines before to 1152 lines at present (I only checked these because the tests have changed).

But the main reason is because we want an OCaml 5 service in production to shake out remaining bugs in OCaml, Eio, etc.

As well as getting the stress test to do more solves, and check that they succeed, I added test/test.ml with some unit-tests.

Notes:

In the steady state it is a bit slower. Doing another 10 rounds of solves with the same three packages, I got 1.19s/iter before and 1.26s/iter afterwards. This is likely due to more GC work from using a single process. Might be able to tune things a bit to improve this.
Spawning processes (e.g. to run git or opam) results in SIGCHLD signals, which can cause Lwt to crash (see SIGSEGV on OCaml 5 due to missing SA_ONSTACK ocsigen/lwt#981 and Race in worker_loop ocsigen/lwt#994), even if Lwt wasn't the one that spawned the process. I've made a PR on Lwt to fix these issues and I've pinned in in the solver's opam file.
OCaml 5.1 doesn't have PPC support (but it will be in 5.2), so the deployment will need updating to avoid that.
This version still has some Lwt code (e.g. git_unix), via the lwt_eio bridge. I think this is fine; we want to test the bridge too! We can port more of it to Eio later (in particular, porting git-unix will be useful to complete Eio's FS API).

service/stores.ml

avsm · 2023-07-11T16:22:13Z

stress/stress.ml

+  (* Main run *)
+  let n_iters = 10 in
+  Format.printf "Running another %d rounds...@." n_iters;
+  let before = Unix.gettimeofday () in


Eio could also export a monotonic/cputime clock measurement to make this sort of benchmarking more accurate -- we use gettimeofday in a few places in this diff already.

bin/main.ml

avsm · 2023-07-11T16:25:12Z

bin/main.ml

+  ()
+
+let export service ~on:socket =
+  Lwt_eio.run_lwt @@ fun () ->


If you use your Eio port of Capnp, we wouldn't have any Lwt dependency at all to this solver...

I'd like to go fully Eio if we can.

We're also using it for git_unix.

Most of the complexity in the previous version was about managing the child worker processes. With OCaml 5 this all goes away, and we just use multiple domains instead. Notes: - opam is not thread-safe, so we need a lock around all opam file parsing operations. See ocaml/opam#5591 - We now load opam files lazily, which makes testing much faster. - Epoch_lock is gone. It was just for managing sub-processes, but a simple Eio.Mutex.t will do now. This should also be much faster when switching between opam-repository commits. - The main executable is now in the `bin` directory. Simplifies the dune file. The command-line API has changed a bit. - test_service.ml is gone. It was only testing parsing of the communications between parent and child processes, which no longer exist. This also removed the need to functorise over opam repository. Instead, there are some new tests which update a local mock opam repository and use the solver's public API. - Default to `recommended_domain_count - 1` workers. - Removed uses of (non-threadsafe) Str module. - OCluster support no longer exports host metrics. That was just a quick hack for monitoring, but should really be a separate service. - Combined solver-service and solver-worker into a single package (but still two libraries). Although the plain solver service avoids some dependencies, we don't really have a use for it. It would probably make sense to move the child-pipe logic into ocaml-ci for local use, so it can avoid the runtime dependency if it wants to. - Errors parsing opam files are now reported as serious errors, not just as platforms without a solution. - The stress tests now do a few solves as a warm-up, then do another few rounds with the same packages (checking it gets the same results as during the warm-up). It also now checks that the warm-up succeeded (although it doesn't check that the selection is correct). Note that the stress tests now uses fixed platform variables. Previously, they tried to solve for the platform running the stress client, which seemed odd. - There is a single stress test executable which can test both the service and the cluster. Previously, they performed different tests. - Removed internal-workers from docker-compose, as the default is likely better. - Removed the OCluster connection logic from stress tests to simplify things. This means it won't auto-reconnect if the connection fails during the test, but that's probably a good thing.

This is useful to test against a staging pool.

talex5 · 2023-07-18T09:42:55Z

I tested a deployment going via the real scheduler (using the solver-test pool) on arm64-jade-1 (thanks to @mtelvers for help setting it up!). It's a little hard to get accurate benchmarks because the machine is also being used for builds at the same time, but here are my results.

All tests are using the new stress.exe:

dune exec -- stress/stress.exe cluster talex5.cap --count=50 --solver-pool=solver-test

Old solver (/usr/local/bin/solver-worker -c /etc/ocluster/solver.cap --name='arm64-jade-1.equinix.ci.dev' --state-dir=/var/cache/solver --capacity=40 --internal-workers=160 --verbosity=info):

Solved warm-up requests in: 65.90s
Running another 50 solves...
50/50 complete
Solved 50 requests in 17.43s (0.35s/iter)

The new solver can't use 160 cores (OCaml 5 is limited to 128 domains), but in practice anything above 6 seems to give rapidly diminishing returns anyway. However, we can run multiple instances of the new version. With three instances using 6 worker domains each, I get (/usr/local/bin/new/solver-service run-cluster -c /etc/ocluster/solver.cap --name='arm64-jade-{1,1b,1c}.equinix.ci.dev' --cache-dir=/var/cache/solver{,2,3} --capacity=12 --internal-workers=6 --verbosity=info):

Solved warm-up requests in: 24.80s
Running another 50 solves...
50/50 complete
Solved 50 requests in 17.84s (0.36s/iter)

For comparison, here's the new solver with a single 18-core instance (--capacity=36 --internal-workers=18):

Solved warm-up requests in: 37.22s
Running another 50 solves...
50/50 complete
Solved 50 requests in 38.85s (0.78s/iter)

Cold-start time (loading a new opam-repository Git commit) is always better with the new version.

In summary: the old version scales better with more cores, but the new version can get the same performance with fewer cores in total if we run more than one instance. Note: I didn't try optimising the old version (e.g. running with fewer workers); it's possible that's not optimal either.

Given that we only have one solver service in the prod cluster at the moment (but using a large number of cores), it would probably already be a benefit to deploy several instances of the new solver, using just a few cores on each machine.

talex5 · 2023-07-18T11:04:15Z

Updated benchmarks, with the machine otherwise idle:

The old server in its original configuration (up to 160 worker processes, though probably only 80 being used at once due to capacity 40 and 2 platforms per stress request):

# /usr/local/bin/old/solver-worker -c /etc/ocluster/solver.cap --name='arm64-jade-1.equinix.ci.dev' --state-dir=/var/cache/solver --capacity=40 --internal-workers=160 --verbosity=info
Solved warm-up requests in: 65.90s
Running another 50 solves...
50/50 complete
Solved 50 requests in 17.06s (0.34s/iter)

A single new solver with 6 worker domains:

# /usr/local/bin/new/solver-service run-cluster -c /etc/ocluster/solver.cap --name='arm64-jade-1.equinix.ci.dev' --cache-dir=/var/cache/solver --capacity=12 --internal-workers=6 --verbosity=info
Solved warm-up requests in: 44.86s
Running another 50 solves...
50/50 complete
Solved 50 requests in 45.44s (0.91s/iter)

A single new solver with 5 worker domains is faster:

# /usr/local/bin/new/solver-service run-cluster -c /etc/ocluster/solver.cap --name='arm64-jade-1.equinix.ci.dev' --cache-dir=/var/cache/solver --capacity=10 --internal-workers=5 --verbosity=info
Solved warm-up requests in: 38.07s
Running another 50 solves...
50/50 complete
Solved 50 requests in 35.43s (0.71s/iter)

A single new solver with 4 worker domains is faster still:

# /usr/local/bin/new/solver-service run-cluster -c /etc/ocluster/solver.cap --name='arm64-jade-1.equinix.ci.dev' --cache-dir=/var/cache/solver --capacity=8 --internal-workers=4 --verbosity=info
Solved warm-up requests in: 36.42s
Running another 50 solves...
50/50 complete
Solved 50 requests in 33.51s (0.67s/iter)

3 worker domains is slower, though:

# /usr/local/bin/new/solver-service run-cluster -c /etc/ocluster/solver.cap --name='arm64-jade-1.equinix.ci.dev' --cache-dir=/var/cache/solver --capacity=6 --internal-workers=3 --verbosity=info
Solved warm-up requests in: 43.55s
Running another 50 solves...
50/50 complete
Solved 50 requests in 41.76s (0.84s/iter)

2 instances of the new solver, 4 worker domains each:

Solved warm-up requests in: 23.63s
Running another 50 solves...
50/50 complete
Solved 50 requests in 18.21s (0.36s/iter)

3 instances of the new solver, 4 worker domains each, is faster than the old solver and using fewer cores:

Solved warm-up requests in: 20.90s
Running another 50 solves...
50/50 complete
Solved 50 requests in 14.79s (0.30s/iter)

mtelvers · 2023-07-18T11:12:22Z

Can the new solver instances be distributed across machines? Alpha said that the original solver could only have a single instance running servicing the solver pool. With that restriction removed and lower cpu requirements, could we deploy a solver work to many machines in the cluster along side the normal worker?

talex5 · 2023-07-18T11:16:54Z

Can the new solver instances be distributed across machines? Alpha said that the original solver could only have a single instance running servicing the solver pool.

I'm not aware of any such restriction (which would seem to defeat the purpose of using the cluster). Perhaps @moyodiallo remembers what the problem was?

moyodiallo · 2023-07-18T11:35:41Z

Can the new solver instances be distributed across machines? Alpha said that the original solver could only have a single instance running servicing the solver pool.

@mtelvers are you talking about the pool of process behind the service ? But with the old version it possible to have more than one solver-worker connected to the scheduler.

could we deploy a solver work to many machines in the cluster along side the normal worker?

Yes we can.

moyodiallo · 2023-07-18T11:52:07Z

@talex5 Thanks for the port, I read the code it's wonderful. Easier to read than the old version.

There's just one last thing, after that we can deploy it on the prod. It is about the switch when a job is canceled.

mtelvers · 2023-07-18T11:52:35Z

Can the new solver instances be distributed across machines? Alpha said that the original solver could only have a single instance running servicing the solver pool.

@mtelvers are you talking about the pool of process behind the service ? But with the old version it possible to have more than one solver-worker connected to the scheduler.

OK. It is different to what I remember, I originally wanted to use both jade-1 and jade-2 to distribute the solver service, but you advised that it was not possible. However, these are excellent developments!

moyodiallo · 2023-07-18T11:56:44Z

OK. It is different to what I remember, I originally wanted to use both jade-1 and jade-2 to distribute the solver service, but you advised that it was not possible. However, these are excellent developments!

Yeah my bad, this is because we was hitting on some bugs, more easier to track with one machine.

talex5 · 2023-07-18T14:24:38Z

There's just one last thing, after that we can deploy it on the prod. It is about the switch when a job is canceled.

Since solve jobs only take about a second to run, I just ignored cancellation. The only thing that takes much time is doing a git-fetch, and it's probably safest not to cancel that anyway.

moyodiallo · 2023-07-18T15:01:33Z

There's just one last thing, after that we can deploy it on the prod. It is about the switch when a job is canceled.

Since solve jobs only take about a second to run, I just ignored cancellation. The only thing that takes much time is doing a git-fetch, and it's probably safest not to cancel that anyway.

Yes, in the case of domains there's no way to cancel, we could avoid running a request if its switch is off.

moyodiallo · 2023-07-18T15:17:59Z

To put it in context:
Let say we're at the limit, the solver-worker receive 200 jobs from the scheduler and each jobs has 10 platforms (could be more). Suddenly the CI cancel all those jobs (that could happen when there's two consecutive push on opam-repository), now ending up solving 2000 requests per platform, before starting the new jobs.

talex5 · 2023-07-18T15:56:41Z

Let say we're at the limit, the solver-worker receive 200 jobs from the scheduler

Ideally, the solver should only be pulling jobs from the scheduler when it's ready to run them. Otherwise, it might be taking jobs that another worker could be starting immediately. So, it should only have a backlog of a few seconds.

In the stress tests above, for example, I used a capacity of 8, so there should be a maximum of 7 unnecessary jobs run if they all get cancelled.

With that said, I think it would be perfectly fine to add support to the solver for this. But there's always a risk of adding bugs this way. It looks like the current code responds to a cancelled job by killing the entire worker process:

solver-service/service/service.ml

Lines 129 to 136 in f14bc6f

    
                   ( Lwt_switch.add_hook_or_exec (Some switch) @@ fun () -> 
        
                     (* Release the worker before cancelling the promise of the request, in order to prevent the 
        
                      * workers's pool choosing the worker for another processing.*) 
        
                     if Lwt.state process = Lwt.Sleep then ( 
        
                       Worker_process.release worker; 
        
                       Lwt.cancel process; 
        
                       dispose worker) 
        
                     else Lwt.return_unit )

However, starting a replacement worker usually takes 30s or more, so this is probably a bad trade-off!

moyodiallo · 2023-07-19T07:40:29Z

However, starting a replacement worker usually takes 30s or more, so this is probably a bad trade-off!

I get it, it could be a bad trade-off, we're going to do some test about that situation to see how it behave with the new version.

If a job is cancelled by the time its worker domain is ready, just skip it. Requested by Alpha Issiaga DIALLO.

talex5 · 2023-07-19T10:13:11Z

I've pushed an extra commit adding support for cancellation now.

This makes it easy to benchmark changes to the solver without having to run a separate process.

talex5 · 2023-07-19T13:05:35Z

@kayceesrk wanted some GC stats, so I've pushed a commit adding a stress local mode, to run the solver in-process. That makes it easier to benchmark things without having to keep restarting the service. Here are the results on my 8-core x86_64 machine:

olly gc-stats './_build/default/stress/stress.exe local --cache-dir=./cache --count=10 --internal-workers=...'

1 worker  : GC overhead (% of wall time):	15.58% (1.27s/iter)
2 workers : GC overhead (% of wall time):	30.87% (0.75s/iter)
3 workers : GC overhead (% of wall time):	43.98% (0.60s/iter)
4 workers : GC overhead (% of wall time):	56.23% (0.50s/iter)
5 workers : GC overhead (% of wall time):	67.93% (0.47s/iter)
6 workers : GC overhead (% of wall time):	77.34% (0.48s/iter)
7 workers : GC overhead (% of wall time):	88.53% (0.45s/iter)

dune-project

This is more realistic, as we typically reject a large number of platforms.

This reduced the time for a local stress run with 10 jobs from 36.8s to 17.5s.

If we get a bad request, display the faulty JSON. If we get an exception, log the backtrace.

ocaml-multicore/eio#593

kayceesrk · 2023-07-28T13:03:33Z

It doesn't look like GC is the issue. This is running:

for i in 1 2 4 8 12 16 20 24 28 32; do olly gc-stats OCAMLRUNPARAM="v=0x400" "./_build/default/stress/stress.exe local --cache-dir ./cache --count=10 --internal-workers=$i" >> op 2>&1; done

on a machine with 16 cores and 32 hw threads. Something odd is going on with heap words though.

Internal workers	Allocated words	Minor words	Promoted words	Major words	Minor Collections	Major collections	Heap words	Top heap words	GC overhead (%)
1	41269478700	41254670221	854179642	868988121	157219	145	22709031	30940204	8.19
2	41269353013	41254541846	864377483	879188650	96966	120	25339318	31089532	12.51
4	41269471338	41254661969	867621809	882431178	53045	86	45524069	50378130	16.56
8	41269574198	41254762765	870674229	885485662	29568	60	81483517	87401394	18.28
12	41274350994	41259540213	873924394	888735175	21704	50	113651507	118256060	19.18
16	41274668096	41259857131	880376307	895187272	18413	46	141591008	146173236	19
20	41275384744	41260573456	885979002	900790290	16063	42	165394370	169711772	20.9
24	41275582949	41260770292	890406066	905218723	14648	40	209939555	214781312	18.58
28	41275669513	41260857701	886699819	901511631	12509	38	223300581	227967667	16.38
32	41275970071	41261156136	867731304	882545239	9426	37	190727588	196603130	8.54

kayceesrk · 2023-07-28T13:05:09Z

In order to debug the heap words things, I tried to build this on OCaml 5.0.0, and got the following error:

#=== ERROR while compiling lwt_eio.0.4 ========================================#
# context     2.1.2 | linux/x86_64 | ocaml-base-compiler.5.0.0 | pinned(git+https://github.com/talex5/lwt_eio.git#27688ec6aa551b9e510cd6a13747880d97451dc4#27688ec6)
# path        ~/.opam/5.0.0/.opam-switch/build/lwt_eio.0.4
# command     ~/.opam/opam-init/hooks/sandbox.sh build dune build -p lwt_eio -j 4 @install
# exit-code   1
# env-file    ~/.opam/log/lwt_eio-488846-7e7dfb.env
# output-file ~/.opam/log/lwt_eio-488846-7e7dfb.out
### output ###
# (cd _build/default && /home/kc/.opam/5.0.0/bin/ocamlc.opt -w -40 -g -bin-annot -I lib/.lwt_eio.objs/byte -I /home/kc/.opam/5.0.0/lib/bigstringaf -I /home/kc/.opam/5.0.0/lib/bytes -I /home/kc/.opam/5.0.0/lib/cstruct -I /home/kc/.opam/5.0.0/lib/domain-local-await -I /home/kc/.opam/5.0.0/lib/eio -I /home/kc/.opam/5.0.0/lib/eio/core -I /home/kc/.opam/5.0.0/lib/eio/unix -I /home/kc/.opam/5.0.0/lib[...]
# File "lib/lwt_eio.ml", line 69, characters 6-28:
# 69 |       Lwt_unix.handle_signal Sys.sigchld
#            ^^^^^^^^^^^^^^^^^^^^^^
# Error: Unbound value Lwt_unix.handle_signal
# (cd _build/default && /home/kc/.opam/5.0.0/bin/ocamlopt.opt -w -40 -g -I lib/.lwt_eio.objs/byte -I lib/.lwt_eio.objs/native -I /home/kc/.opam/5.0.0/lib/bigstringaf -I /home/kc/.opam/5.0.0/lib/bytes -I /home/kc/.opam/5.0.0/lib/cstruct -I /home/kc/.opam/5.0.0/lib/domain-local-await -I /home/kc/.opam/5.0.0/lib/eio -I /home/kc/.opam/5.0.0/lib/eio/core -I /home/kc/.opam/5.0.0/lib/eio/unix -I /home[...]
# File "lib/lwt_eio.ml", line 69, characters 6-28:
# 69 |       Lwt_unix.handle_signal Sys.sigchld
#            ^^^^^^^^^^^^^^^^^^^^^^
# Error: Unbound value Lwt_unix.handle_signal

It would be useful to check whether the heap words keep going up on 5.0 as well.

talex5 · 2023-07-31T15:12:13Z

Here's a small test that just allocates some bytes in a loop:

let dom0_worker = false

module Usage = struct
  type t = {
    wall : float;
    user : float;
    system : float;
  }

  let now () =
    let wall = Unix.gettimeofday () in
    let process = Unix.times () in
    { wall; user = process.tms_utime; system = process.tms_stime }

  let ( - ) a b =
    {
      wall = a.wall -. b.wall;
      user = a.user -. b.user;
      system = a.system -. b.system;
    }

  let pp f { wall; user; system } =
    Format.fprintf f "wall:%.2fs user:%.2fs system:%.2fs" wall user system
end

let run_worker () =
  for _ = 1 to 100_000_000 do
    ignore @@ Bytes.create 500
  done

let test workers =
  let u0 = Usage.now () in
  let domain_workers = if dom0_worker then workers - 1 else workers in
  let domains = List.init domain_workers (fun _ -> Domain.spawn run_worker) in
  if dom0_worker then
    run_worker ();
  List.iter Domain.join domains;
  let u1 = Usage.now () in
  Usage.(u1 - u0)


let () =
  Format.printf "%s / dom0_worker=%b@." Sys.ocaml_version dom0_worker;
  ignore @@ test 1;
  let t1 = test 1 in
  Format.printf "1 : %a@." Usage.pp t1;
  [2; 4; 8; 16; 32; 64; (if dom0_worker then 128 else 127)] |> List.iter (fun i ->
    let ti = test i in
    let speedup = t1.wall /. ti.wall in
    Format.printf "%d : %a (%.2f times faster)@." i Usage.pp ti speedup
  )

5.1.0-beta1 seems better than 5.0, but with 128 cores it still takes 30 times longer!

Results from the 160-core arm machine (wall time, lower is better):

Oddly, a previous run had 5.0.0/false being really fast (4.08s for 127 cores), but I can't reproduce that now! Might be something odd going on with CPU frequency scaling when it gets hot or something?

kayceesrk · 2023-08-01T08:40:03Z

The numbers are curious. So I did a number of things to remove potential noise. First, I run each configuration as a separate run of the program with the aim of avoiding GC impact from the previous runs. Next, I run each program multiple times using hyperfine, which removes the possibility of a noisy run affecting the result.

The code is here: https://github.com/kayceesrk/code-snippets/tree/master/500_501_regression

The core of the benchmark remains the same as the earlier one. https://github.com/kayceesrk/code-snippets/blob/d3d1838c4199f0aed85add42d686d7a83e2b69c6/500_501_regression/test2.ml#L6-L22

I ran the benchmark on an x86 machine with 16 physical cores, and 2 hardware threads per core. The machine has 2 NUMA domains:

$ lscpu | grep -E "^Thread|^Core|^Socket|^NUMA node\("
Thread(s) per core:              2
Core(s) per socket:              8
Socket(s):                       2
NUMA node(s):                    2

Here are the results:

x-axis is wall time in milliseconds. Lower is better. Ideally, the lines will be flat.

The results don't show the regression between 5.0 and 5.1 that you observe on the ARM machine. There is a drop in performance at 8 cores due to NUMA effects. Beyond 16, hyperthreading effects causes further slowdown.

talex5 · 2023-08-07T07:47:31Z

The results don't show the regression between 5.0 and 5.1 that you observe on the ARM machine

It might be just a typo, but all your series are labelled "5.0.0" in that graph - is it showing the 5.1 results?

kayceesrk · 2023-08-07T10:32:24Z

It was a typo. Fixed here:

Update the README example output for the current stress tests, which test on more platforms and are therefore slower than before.

talex5 · 2023-09-18T10:55:37Z

I've updated this for Eio 0.12 (which removed all the pins).

moyodiallo · 2023-09-19T14:27:22Z

I stopped the old solver-worker and sent a lot of requests, one of the stress test is 250 requests at same time to the 5 new solver-worker deployed at the moment, it's working very well.

Looks good to me, we can merge.

bin/main.ml

dune

dune-project

tmcgilchrist · 2023-09-20T04:08:33Z

worker/solver_worker.ml

+    let output = Custom.solve ~cancelled ~solver:t.solver ~log c in
+    Log_data.write log "Job succeeded\n";
+    (Ok output, "ok")
+  | _ ->


Suggested change

| _ ->

| Obuilder_build _ | Docker_build _ ->

Why? We don't care if OCluster adds more job types here.

Personal preference to use exhaustive pattern matching and get the warning IF more job types get added.

talex5 · 2023-09-22T11:52:02Z

Updated based on comments. Should be ready to merge now.

tmcgilchrist

This all looks fantastic @talex5 thank you.

The reduction in code size from LWT to Eio is impressive and makes the concurrent/parallel control code much clearer. Further work can be done on the scaling / garbage collection in future.

Out of interest this is what olly.exe gc-stats './_build/default/stress/stress.exe local --cache-dir=./cache --count=10 --internal-workers=?' looks like on a MacBookPro x86 4 physical (8 hyper threaded) cores.

Cores	Solves
1	Solved 10 requests in 104.65s (10.47s/iter) (1.53 solves/s)
2	Solved 10 requests in 44.29s (4.43s/iter) (3.61 solves/s)
3	Solved 10 requests in 38.84s (3.88s/iter) (4.12 solves/s)
4	Solved 10 requests in 33.80s (3.38s/iter) (4.73 solves/s)
6	Solved 10 requests in 28.12s (2.81s/iter) (5.69 solves/s)
7	Solved 10 requests in 43.26s (4.33s/iter) (3.70 solves/s)
8	Solved 10 requests in 41.30s (4.13s/iter) (3.87 solves/s)

avsm reviewed Jul 11, 2023

View reviewed changes

service/stores.ml Show resolved Hide resolved

avsm reviewed Jul 11, 2023

View reviewed changes

bin/main.ml Show resolved Hide resolved

avsm reviewed Jul 11, 2023

View reviewed changes

tmcgilchrist self-requested a review July 12, 2023 00:04

tmcgilchrist mentioned this pull request Jul 12, 2023

Eio port mirage/capnp-rpc#256

Closed

talex5 force-pushed the eio branch 3 times, most recently from dc82fdc to 4a62de2 Compare July 14, 2023 17:11

talex5 force-pushed the eio branch from 96199b0 to 12f49f6 Compare July 16, 2023 14:19

talex5 marked this pull request as ready for review July 16, 2023 14:27

Add stress.exe --solver-pool=POOL option

c44ce7b

This is useful to test against a staging pool.

Skip solving cancelled jobs

e48a506

If a job is cancelled by the time its worker domain is ready, just skip it. Requested by Alpha Issiaga DIALLO.

Add stress local mode

c09d51b

This makes it easy to benchmark changes to the solver without having to run a separate process.

tmcgilchrist reviewed Jul 26, 2023

View reviewed changes

dune-project Show resolved Hide resolved

talex5 added 3 commits July 26, 2023 09:42

Display solves/s in stress test output

27c7529

Use more compiler versions in the stress test

47e1fb4

This is more realistic, as we typically reject a large number of platforms.

Don't run the solver for unavailable packages

45450f6

This reduced the time for a local stress run with 10 jobs from 36.8s to 17.5s.

talex5 force-pushed the eio branch 2 times, most recently from 96383be to 0107d6c Compare July 27, 2023 12:08

talex5 added 2 commits July 28, 2023 10:03

Improve error reporting from solver worker

5d188e7

If we get a bad request, display the faulty JSON. If we get an exception, log the backtrace.

Pin eio for child process fix

173a813

ocaml-multicore/eio#593

talex5 force-pushed the eio branch from a7efcc9 to 173a813 Compare July 28, 2023 09:03

talex5 added 2 commits September 18, 2023 10:32

Update to OCaml 5.1 and Eio 0.12

88ebe7f

Update README

4d152ce

Update the README example output for the current stress tests, which test on more platforms and are therefore slower than before.

talex5 force-pushed the eio branch from 6db11e6 to 4d152ce Compare September 18, 2023 10:55

Sudha247 mentioned this pull request Sep 19, 2023

Non-terminating runtime events ocaml/ocaml#12584

Open

tmcgilchrist reviewed Sep 20, 2023

View reviewed changes

talex5 and others added 2 commits September 20, 2023 10:16

Updated based on Tim's review comments

6760ac3

Updates from review comments

036a7f6

talex5 mentioned this pull request Sep 22, 2023

Filter root packages by "available" condition #72

Merged

tmcgilchrist reviewed Sep 25, 2023

View reviewed changes

tmcgilchrist merged commit 6e935f5 into ocurrent:main Sep 25, 2023

tmcgilchrist mentioned this pull request Sep 25, 2023

Adapt solver-service for ocaml-ci and ocaml-docs-CI #17

Open

5 tasks

moyodiallo mentioned this pull request Sep 27, 2023

Add metrics about the solver-service behaviour #70

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Initial OCaml 5 / Eio port #71

Initial OCaml 5 / Eio port #71

talex5 commented Jul 11, 2023 •

edited

Loading

avsm Jul 11, 2023

avsm Jul 11, 2023

tmcgilchrist Jul 12, 2023

talex5 Jul 12, 2023

talex5 commented Jul 18, 2023

talex5 commented Jul 18, 2023

mtelvers commented Jul 18, 2023

talex5 commented Jul 18, 2023

moyodiallo commented Jul 18, 2023 •

edited

Loading

moyodiallo commented Jul 18, 2023 •

edited

Loading

mtelvers commented Jul 18, 2023

moyodiallo commented Jul 18, 2023

talex5 commented Jul 18, 2023

moyodiallo commented Jul 18, 2023

moyodiallo commented Jul 18, 2023

talex5 commented Jul 18, 2023 •

edited

Loading

moyodiallo commented Jul 19, 2023

talex5 commented Jul 19, 2023

talex5 commented Jul 19, 2023

kayceesrk commented Jul 28, 2023

kayceesrk commented Jul 28, 2023

talex5 commented Jul 31, 2023 •

edited

Loading

kayceesrk commented Aug 1, 2023 •

edited

Loading

talex5 commented Aug 7, 2023

kayceesrk commented Aug 7, 2023

talex5 commented Sep 18, 2023

moyodiallo commented Sep 19, 2023 •

edited

Loading

tmcgilchrist Sep 20, 2023

talex5 Sep 20, 2023

tmcgilchrist Sep 21, 2023

talex5 commented Sep 22, 2023

tmcgilchrist left a comment

Initial OCaml 5 / Eio port #71

Initial OCaml 5 / Eio port #71

Conversation

talex5 commented Jul 11, 2023 • edited Loading

avsm Jul 11, 2023

Choose a reason for hiding this comment

avsm Jul 11, 2023

Choose a reason for hiding this comment

tmcgilchrist Jul 12, 2023

Choose a reason for hiding this comment

talex5 Jul 12, 2023

Choose a reason for hiding this comment

talex5 commented Jul 18, 2023

talex5 commented Jul 18, 2023

mtelvers commented Jul 18, 2023

talex5 commented Jul 18, 2023

moyodiallo commented Jul 18, 2023 • edited Loading

moyodiallo commented Jul 18, 2023 • edited Loading

mtelvers commented Jul 18, 2023

moyodiallo commented Jul 18, 2023

talex5 commented Jul 18, 2023

moyodiallo commented Jul 18, 2023

moyodiallo commented Jul 18, 2023

talex5 commented Jul 18, 2023 • edited Loading

moyodiallo commented Jul 19, 2023

talex5 commented Jul 19, 2023

talex5 commented Jul 19, 2023

kayceesrk commented Jul 28, 2023

kayceesrk commented Jul 28, 2023

talex5 commented Jul 31, 2023 • edited Loading

kayceesrk commented Aug 1, 2023 • edited Loading

talex5 commented Aug 7, 2023

kayceesrk commented Aug 7, 2023

talex5 commented Sep 18, 2023

moyodiallo commented Sep 19, 2023 • edited Loading

tmcgilchrist Sep 20, 2023

Choose a reason for hiding this comment

talex5 Sep 20, 2023

Choose a reason for hiding this comment

tmcgilchrist Sep 21, 2023

Choose a reason for hiding this comment

talex5 commented Sep 22, 2023

tmcgilchrist left a comment

Choose a reason for hiding this comment

talex5 commented Jul 11, 2023 •

edited

Loading

moyodiallo commented Jul 18, 2023 •

edited

Loading

moyodiallo commented Jul 18, 2023 •

edited

Loading

talex5 commented Jul 18, 2023 •

edited

Loading

talex5 commented Jul 31, 2023 •

edited

Loading

kayceesrk commented Aug 1, 2023 •

edited

Loading

moyodiallo commented Sep 19, 2023 •

edited

Loading