Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpamFile.OPAM.read_from_string is not domain-safe #5591

Open
talex5 opened this issue Jun 30, 2023 · 3 comments
Open

OpamFile.OPAM.read_from_string is not domain-safe #5591

talex5 opened this issue Jun 30, 2023 · 3 comments

Comments

@talex5
Copy link
Contributor

talex5 commented Jun 30, 2023

It seems that trying to parse two opam files at the same time results in a crash:

let data = {|
opam-version: "2.0"
version: "2.1.5"
synopsis: "Format library for opam 2.1"
|}

let test () =
  try
    for _ = 1 to 1000 do
      let file = OpamFile.OPAM.read_from_string data in
      ignore file
    done
  with ex ->
    Printexc.print_backtrace stderr;
    Printf.eprintf "test.ml: %s\n%!" (Printexc.to_string ex)

let () =
  Printexc.record_backtrace true;               (* Does nothing?? *)
  let a = Domain.spawn test in
  let b = Domain.spawn test in
  Domain.join a;
  Domain.join b
$ dune exec ./test.exe
test.ml: At ./<none>:2:14-2:19::  
Parse error
fish: Job 1, 'dune exec ./test.exe' terminated by signal SIGSEGV (Address boundary error)

Different runs produce different errors, e.g.

test.ml: At ./<none>:3:7-3:8::         
Parse error
test.ml: Invalid_argument("index out of bounds")

I guess this is because it uses Stdlib.Parsing, which doesn't look thread-safe. Would be good to have some documentation on how to use the library safely with OCaml 5.

Tested with opam-format 2.1.5.

@talex5
Copy link
Contributor Author

talex5 commented Jun 30, 2023

Also, opamStd.ml's terminal_columns installs a signal handler that uses lazy (and isn't atomic):

opam/src/core/opamStd.ml

Lines 958 to 961 in d1fe3e8

try Sys.set_signal 28 (* SIGWINCH *)
(Sys.Signal_handle
(fun _ -> v := lazy (get_terminal_columns ())))
with Invalid_argument _ -> ()

This crashes under multicore if two threads try to log at the same time:

CamlinternalLazy.Undefined
Raised at CamlinternalLazy.force_gen_lazy_block in file "camlinternalLazy.ml", line 75, characters 9-24
Called from OpamStd.OpamFormat.reformat in file "src/core/opamStd.ml" (inlined), line 1280, characters 44-71
Called from OpamConsole.warning.(fun) in file "src/core/opamConsole.ml", line 584, characters 6-63

@avsm
Copy link
Member

avsm commented Jul 3, 2023

The segv is possibly due to the Stdlib.Parsing module being thread-unsafe and involving C bindings (https://github.com/ocaml/ocaml/blob/trunk/stdlib/parsing.ml). A standalone testcase would be suitable to create an issue on the ocaml/ocaml repo.

talex5 added a commit to talex5/solver-service that referenced this issue Jul 3, 2023
talex5 added a commit to talex5/solver-service that referenced this issue Jul 4, 2023
Most of the complexity in the previous version was about managing the
child worker processes. With OCaml 5 this all goes away, and we just
use multiple domains instead.

Notes:

- opam is not thread-safe, so we need a lock around all opam file
  parsing operations. See ocaml/opam#5591

- We now load opam files lazily, which makes testing much faster.
  This reduces the time for `stress/stress.exe` from 97.4s to 11.5s for
  me.

- Epoch_lock is gone. It was just for managing sub-processes, but a
  simple Eio.Mutex.t will do now. This should also be much faster when
  switching between opam-repository commits.

- The main executable is now in the `bin` directory. Simplifies the dune
  file.

- test_service.ml is gone. It was only testing parsing of the
  communications between parent and child processes, which no longer
  exist. This also removed the need to functorise over opam repository.

TODO:

- Further speed up the lazy loading. Maybe load all versions of a
  package at once?

- Bring back parent/child pipe mode.

- Integrate worker mode?

- Add full stress test (i.e. solving for many packages over many hours).
talex5 added a commit to talex5/solver-service that referenced this issue Jul 5, 2023
Most of the complexity in the previous version was about managing the
child worker processes. With OCaml 5 this all goes away, and we just
use multiple domains instead.

Notes:

- opam is not thread-safe, so we need a lock around all opam file
  parsing operations. See ocaml/opam#5591

- We now load opam files lazily, which makes testing much faster.
  This reduces the time for `stress/stress.exe` from 97.4s to 11.5s for
  me.

- Epoch_lock is gone. It was just for managing sub-processes, but a
  simple Eio.Mutex.t will do now. This should also be much faster when
  switching between opam-repository commits.

- The main executable is now in the `bin` directory. Simplifies the dune
  file.

- test_service.ml is gone. It was only testing parsing of the
  communications between parent and child processes, which no longer
  exist. This also removed the need to functorise over opam repository.

- Default to `recommended_domain_count - 1` workers.

TODO:

- Bring back parent/child pipe mode.

- Integrate worker mode?

- Add full stress test (i.e. solving for many packages over many hours).
talex5 added a commit to talex5/solver-service that referenced this issue Jul 5, 2023
Most of the complexity in the previous version was about managing the
child worker processes. With OCaml 5 this all goes away, and we just
use multiple domains instead.

Notes:

- opam is not thread-safe, so we need a lock around all opam file
  parsing operations. See ocaml/opam#5591

- We now load opam files lazily, which makes testing much faster.
  This reduces the time for `stress/stress.exe` from 97.4s to 11.5s for
  me.

- Epoch_lock is gone. It was just for managing sub-processes, but a
  simple Eio.Mutex.t will do now. This should also be much faster when
  switching between opam-repository commits.

- The main executable is now in the `bin` directory. Simplifies the dune
  file.

- test_service.ml is gone. It was only testing parsing of the
  communications between parent and child processes, which no longer
  exist. This also removed the need to functorise over opam repository.

- Default to `recommended_domain_count - 1` workers.

- Removed uses of (non-threadsafe) Str module.

TODO:

- Bring back parent/child pipe mode.

- Integrate worker mode?

- Add full stress test (i.e. solving for many packages over many hours).
talex5 added a commit to talex5/solver-service that referenced this issue Jul 6, 2023
Most of the complexity in the previous version was about managing the
child worker processes. With OCaml 5 this all goes away, and we just
use multiple domains instead.

Notes:

- opam is not thread-safe, so we need a lock around all opam file
  parsing operations. See ocaml/opam#5591

- We now load opam files lazily, which makes testing much faster.
  This reduces the time for `stress/stress.exe` from 97.4s to 11.5s for
  me.

- Epoch_lock is gone. It was just for managing sub-processes, but a
  simple Eio.Mutex.t will do now. This should also be much faster when
  switching between opam-repository commits.

- The main executable is now in the `bin` directory. Simplifies the dune
  file.

- test_service.ml is gone. It was only testing parsing of the
  communications between parent and child processes, which no longer
  exist. This also removed the need to functorise over opam repository.

- Default to `recommended_domain_count - 1` workers.

- Removed uses of (non-threadsafe) Str module.

- The `--sockpath` option is gone. You could already use
  `--address=unix:...` anyway.

- Move license to a separate file.

- OCluster support no longer extends the Docker builder. Therefore, the
  solver can no longer perform builds. It also no longer exports host
  metrics.

TODO:

- Reinit Git_unix after pulling. Otherwise, it doesn't notice the new
  commits.

- Integrate worker mode?

- Add full stress test (i.e. solving for many packages over many hours).
talex5 added a commit to talex5/solver-service that referenced this issue Jul 6, 2023
Most of the complexity in the previous version was about managing the
child worker processes. With OCaml 5 this all goes away, and we just
use multiple domains instead.

Notes:

- opam is not thread-safe, so we need a lock around all opam file
  parsing operations. See ocaml/opam#5591

- We now load opam files lazily, which makes testing much faster.
  This reduces the time for `stress/stress.exe` from 97.4s to 11.5s for
  me.

- Epoch_lock is gone. It was just for managing sub-processes, but a
  simple Eio.Mutex.t will do now. This should also be much faster when
  switching between opam-repository commits.

- The main executable is now in the `bin` directory. Simplifies the dune
  file.

- test_service.ml is gone. It was only testing parsing of the
  communications between parent and child processes, which no longer
  exist. This also removed the need to functorise over opam repository.

- Default to `recommended_domain_count - 1` workers.

- Removed uses of (non-threadsafe) Str module.

- The `--sockpath` option is gone. You could already use
  `--address=unix:...` anyway.

- Move license to a separate file.

- OCluster support no longer exports host metrics. That was just a quick
  hack for monitoring, but should really be a separate service.

TODO:

- Reinit Git_unix after pulling. Otherwise, it doesn't notice the new
  commits.

- Add full stress test (i.e. solving for many packages over many hours).
talex5 added a commit to talex5/solver-service that referenced this issue Jul 9, 2023
Most of the complexity in the previous version was about managing the
child worker processes. With OCaml 5 this all goes away, and we just
use multiple domains instead.

Notes:

- opam is not thread-safe, so we need a lock around all opam file
  parsing operations. See ocaml/opam#5591

- We now load opam files lazily, which makes testing much faster.
  This reduces the time for `stress/stress.exe` from 97.4s to 11.5s for
  me.

- Epoch_lock is gone. It was just for managing sub-processes, but a
  simple Eio.Mutex.t will do now. This should also be much faster when
  switching between opam-repository commits.

- The main executable is now in the `bin` directory. Simplifies the dune
  file.

- test_service.ml is gone. It was only testing parsing of the
  communications between parent and child processes, which no longer
  exist. This also removed the need to functorise over opam repository.
  Instead, there are some new tests which update a local mock opam
  repository and use the solver's public API.

- Default to `recommended_domain_count - 1` workers.

- Removed uses of (non-threadsafe) Str module.

- The `--sockpath` option is gone. You could already use
  `--address=unix:...` anyway.

- Move license to a separate file.

- OCluster support no longer exports host metrics. That was just a quick
  hack for monitoring, but should really be a separate service.

- Added `--cache-dir` argument to say where to keep the Git clones.

- Combined solver-service and solver-worker into a single package
  (but still two libraries). Although the plain solver service avoids
  some dependencies, we don't really have a use for it. It would
  probably make sense to move the child-pipe logic into ocaml-ci for
  local use, so it can avoid the runtime dependency if it wants to.

TODO:

- Add full stress test (i.e. solving for many packages over many hours).
talex5 added a commit to talex5/solver-service that referenced this issue Jul 9, 2023
Most of the complexity in the previous version was about managing the
child worker processes. With OCaml 5 this all goes away, and we just
use multiple domains instead.

Notes:

- opam is not thread-safe, so we need a lock around all opam file
  parsing operations. See ocaml/opam#5591

- We now load opam files lazily, which makes testing much faster.
  This reduces the time for `stress/stress.exe` from 97.4s to 11.5s for
  me.

- Epoch_lock is gone. It was just for managing sub-processes, but a
  simple Eio.Mutex.t will do now. This should also be much faster when
  switching between opam-repository commits.

- The main executable is now in the `bin` directory. Simplifies the dune
  file.

- test_service.ml is gone. It was only testing parsing of the
  communications between parent and child processes, which no longer
  exist. This also removed the need to functorise over opam repository.
  Instead, there are some new tests which update a local mock opam
  repository and use the solver's public API.

- Default to `recommended_domain_count - 1` workers.

- Removed uses of (non-threadsafe) Str module.

- The `--sockpath` option is gone. You could already use
  `--address=unix:...` anyway.

- Move license to a separate file.

- OCluster support no longer exports host metrics. That was just a quick
  hack for monitoring, but should really be a separate service.

- Added `--cache-dir` argument to say where to keep the Git clones.

- Combined solver-service and solver-worker into a single package
  (but still two libraries). Although the plain solver service avoids
  some dependencies, we don't really have a use for it. It would
  probably make sense to move the child-pipe logic into ocaml-ci for
  local use, so it can avoid the runtime dependency if it wants to.

TODO:

- Finish unit tests.

- Add full stress test (i.e. solving for many packages over many hours).
talex5 added a commit to talex5/solver-service that referenced this issue Jul 11, 2023
Most of the complexity in the previous version was about managing the
child worker processes. With OCaml 5 this all goes away, and we just
use multiple domains instead.

Notes:

- opam is not thread-safe, so we need a lock around all opam file
  parsing operations. See ocaml/opam#5591

- We now load opam files lazily, which makes testing much faster.
  This reduces the time for `stress/stress.exe` from 97.4s to 11.5s for
  me.

- Epoch_lock is gone. It was just for managing sub-processes, but a
  simple Eio.Mutex.t will do now. This should also be much faster when
  switching between opam-repository commits.

- The main executable is now in the `bin` directory. Simplifies the dune
  file.

- test_service.ml is gone. It was only testing parsing of the
  communications between parent and child processes, which no longer
  exist. This also removed the need to functorise over opam repository.
  Instead, there are some new tests which update a local mock opam
  repository and use the solver's public API.

- Default to `recommended_domain_count - 1` workers.

- Removed uses of (non-threadsafe) Str module.

- The `--sockpath` option is gone. You could already use
  `--address=unix:...` anyway.

- Move license to a separate file.

- OCluster support no longer exports host metrics. That was just a quick
  hack for monitoring, but should really be a separate service.

- Added `--cache-dir` argument to say where to keep the Git clones.

- Combined solver-service and solver-worker into a single package
  (but still two libraries). Although the plain solver service avoids
  some dependencies, we don't really have a use for it. It would
  probably make sense to move the child-pipe logic into ocaml-ci for
  local use, so it can avoid the runtime dependency if it wants to.

- Errors parsing opam files are now reported as serious errors, not just
  as platforms without a solution.

- The stress tests now do the original 3-package solve as a warm-up,
  then do another 10 rounds with the same 3 packages (checking it gets
  the same results as during the warm-up). It also now checks that the
  warm-up succeeded (although it doesn't check that the selection is
  correct).

  Note that the stress tests now uses fixed platform variables.
  Previously, they tried to solve for the platform running the stress
  client, which seemed odd.
talex5 added a commit to talex5/solver-service that referenced this issue Jul 11, 2023
Most of the complexity in the previous version was about managing the
child worker processes. With OCaml 5 this all goes away, and we just
use multiple domains instead.

Notes:

- opam is not thread-safe, so we need a lock around all opam file
  parsing operations. See ocaml/opam#5591

- We now load opam files lazily, which makes testing much faster.
  This reduces the time for `stress/stress.exe` from 97.4s to 11.5s for
  me.

- Epoch_lock is gone. It was just for managing sub-processes, but a
  simple Eio.Mutex.t will do now. This should also be much faster when
  switching between opam-repository commits.

- The main executable is now in the `bin` directory. Simplifies the dune
  file.

- test_service.ml is gone. It was only testing parsing of the
  communications between parent and child processes, which no longer
  exist. This also removed the need to functorise over opam repository.
  Instead, there are some new tests which update a local mock opam
  repository and use the solver's public API.

- Default to `recommended_domain_count - 1` workers.

- Removed uses of (non-threadsafe) Str module.

- The `--sockpath` option is gone. You could already use
  `--address=unix:...` anyway.

- Move license to a separate file.

- OCluster support no longer exports host metrics. That was just a quick
  hack for monitoring, but should really be a separate service.

- Added `--cache-dir` argument to say where to keep the Git clones.

- Combined solver-service and solver-worker into a single package
  (but still two libraries). Although the plain solver service avoids
  some dependencies, we don't really have a use for it. It would
  probably make sense to move the child-pipe logic into ocaml-ci for
  local use, so it can avoid the runtime dependency if it wants to.

- Errors parsing opam files are now reported as serious errors, not just
  as platforms without a solution.

- The stress tests now do the original 3-package solve as a warm-up,
  then do another 10 rounds with the same 3 packages (checking it gets
  the same results as during the warm-up). It also now checks that the
  warm-up succeeded (although it doesn't check that the selection is
  correct).

  Note that the stress tests now uses fixed platform variables.
  Previously, they tried to solve for the platform running the stress
  client, which seemed odd.
talex5 added a commit to talex5/solver-service that referenced this issue Jul 11, 2023
Most of the complexity in the previous version was about managing the
child worker processes. With OCaml 5 this all goes away, and we just
use multiple domains instead.

Notes:

- opam is not thread-safe, so we need a lock around all opam file
  parsing operations. See ocaml/opam#5591

- We now load opam files lazily, which makes testing much faster.
  This reduces the time for `stress/stress.exe` from 97.4s to 11.5s for
  me.

- Epoch_lock is gone. It was just for managing sub-processes, but a
  simple Eio.Mutex.t will do now. This should also be much faster when
  switching between opam-repository commits.

- The main executable is now in the `bin` directory. Simplifies the dune
  file.

- test_service.ml is gone. It was only testing parsing of the
  communications between parent and child processes, which no longer
  exist. This also removed the need to functorise over opam repository.
  Instead, there are some new tests which update a local mock opam
  repository and use the solver's public API.

- Default to `recommended_domain_count - 1` workers.

- Removed uses of (non-threadsafe) Str module.

- The `--sockpath` option is gone. You could already use
  `--address=unix:...` anyway.

- Move license to a separate file.

- OCluster support no longer exports host metrics. That was just a quick
  hack for monitoring, but should really be a separate service.

- Added `--cache-dir` argument to say where to keep the Git clones.

- Combined solver-service and solver-worker into a single package
  (but still two libraries). Although the plain solver service avoids
  some dependencies, we don't really have a use for it. It would
  probably make sense to move the child-pipe logic into ocaml-ci for
  local use, so it can avoid the runtime dependency if it wants to.

- Errors parsing opam files are now reported as serious errors, not just
  as platforms without a solution.

- The stress tests now do the original 3-package solve as a warm-up,
  then do another 10 rounds with the same 3 packages (checking it gets
  the same results as during the warm-up). It also now checks that the
  warm-up succeeded (although it doesn't check that the selection is
  correct).

  Note that the stress tests now uses fixed platform variables.
  Previously, they tried to solve for the platform running the stress
  client, which seemed odd.
talex5 added a commit to talex5/solver-service that referenced this issue Jul 16, 2023
Most of the complexity in the previous version was about managing the
child worker processes. With OCaml 5 this all goes away, and we just
use multiple domains instead.

Notes:

- opam is not thread-safe, so we need a lock around all opam file
  parsing operations. See ocaml/opam#5591

- We now load opam files lazily, which makes testing much faster.

- Epoch_lock is gone. It was just for managing sub-processes, but a
  simple Eio.Mutex.t will do now. This should also be much faster when
  switching between opam-repository commits.

- The main executable is now in the `bin` directory. Simplifies the dune
  file. The command-line API has changed a bit.

- test_service.ml is gone. It was only testing parsing of the
  communications between parent and child processes, which no longer
  exist. This also removed the need to functorise over opam repository.
  Instead, there are some new tests which update a local mock opam
  repository and use the solver's public API.

- Default to `recommended_domain_count - 1` workers.

- Removed uses of (non-threadsafe) Str module.

- OCluster support no longer exports host metrics. That was just a quick
  hack for monitoring, but should really be a separate service.

- Combined solver-service and solver-worker into a single package
  (but still two libraries). Although the plain solver service avoids
  some dependencies, we don't really have a use for it. It would
  probably make sense to move the child-pipe logic into ocaml-ci for
  local use, so it can avoid the runtime dependency if it wants to.

- Errors parsing opam files are now reported as serious errors, not just
  as platforms without a solution.

- The stress tests now do a few solves as a warm-up, then do another few
  rounds with the same packages (checking it gets the same results as
  during the warm-up). It also now checks that the warm-up succeeded
  (although it doesn't check that the selection is correct).

  Note that the stress tests now uses fixed platform variables.
  Previously, they tried to solve for the platform running the stress
  client, which seemed odd.

- There is a single stress test executable which can test both the
  service and the cluster. Previously, they performed different tests.

- Removed internal-workers from docker-compose, as the default is likely
  better.

- Removed the OCluster connection logic from stress tests to simplify
  things. This means it won't auto-reconnect if the connection fails
  during the test, but that's probably a good thing.
@kit-ty-kate
Copy link
Member

I had a shot at debugging this issue. The main problem is actually ocamlyacc that is not thread-safe (not just not domain-safe). Adding a mutex to opam-file-format makes it work but is fairly ugly. The OCaml manual encourages to use menhir instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants