Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ETXTBSY during Coursier fetch #13424

Open
jsirois opened this issue Oct 30, 2021 · 8 comments
Open

ETXTBSY during Coursier fetch #13424

jsirois opened this issue Oct 30, 2021 · 8 comments
Labels
backend: JVM JVM backend-related issues bug estimate: <1W

Comments

@jsirois
Copy link
Contributor

jsirois commented Oct 30, 2021

Observed on main @ cbee9ff

$ ./build-support/bin/generate_all_lockfiles.sh 
Bootstrapping pants_deps with requirements:
...
Successfully installed PyYAML-6.0 ansicolors-1.1.8 attrs-21.2.0 certifi-2021.10.8 charset-normalizer-2.0.7 fasteners-0.16.3 freezegun-1.1.0 humbug-0.2.7 idna-3.3 ijson-3.1.4 importlib-metadata-4.8.1 iniconfig-1.1.1 packaging-21.0 pex-2.1.53 pluggy-1.0.0 psutil-5.8.0 py-1.10.0 pyparsing-3.0.4 pystache-0.5.4 pytest-6.2.5 python-dateutil-2.8.2 requests-2.26.0 setproctitle-1.2.2 setuptools-57.5.0 six-1.16.0 toml-0.10.2 types-PyYAML-5.4.3 types-freezegun-0.1.4 types-requests-2.25.0 types-setuptools-57.0.0 types-toml-0.1.3 typing-extensions-3.10.0.2 urllib3-1.26.7 zipp-3.6.0
WARNING: You are using pip version 20.1.1; however, version 21.3.1 is available.
You should consider upgrading via the '/home/jsirois/.cache/pants/pants_dev_deps/Linux.x86_64.unknown.py37.venv/bin/python3.7 -m pip install --upgrade pip' command.
[=== 00:22 Building native code... ===]
   Compiling either v1.6.1
...
   Compiling engine v0.0.1 (/home/jsirois/dev/pantsbuild/jsirois-pants/src/rust/engine)
    Finished release [optimized + debuginfo] target(s) in 3m 23s
13:48:39.62 [INFO] Starting: Resolving plugins: hdrhistogram, toolchain.pants.plugin==0.15.0
13:48:53.02 [INFO] Completed: Resolving plugins: hdrhistogram, toolchain.pants.plugin==0.15.0
13:48:53.46 [INFO] Initializing scheduler...
13:48:53.59 [INFO] Scheduler initialized.
13:48:53.66 [WARN] Error loading access token: AuthError('Failed to load auth token (no default file or environment variable). Run `./pants auth-acquire` to set up authentication.')
13:48:53.66 [WARN] Auth failed - BuildSense plugin is disabled.
13:49:09.21 [INFO] Completed: Building generate_all_lockfiles_helper.pex with 10 requirements: PyYAML<7.0,>=6.0, ansicolors==1.1.8, packaging==21.0, pex==2.1.53, setuptools<58.0,>=56.0.0, toml==0.10.2, types-PyYAML==5.4.3, types-s... (65 characters truncated)
13:49:09.46 [INFO] Completed: Building local_dists.pex
13:49:10.97 [INFO] Starting: Resolving plugins: hdrhistogram, toolchain.pants.plugin==0.15.0
13:49:22.82 [INFO] Completed: Resolving plugins: hdrhistogram, toolchain.pants.plugin==0.15.0
13:49:23.31 [WARN] Error loading access token: AuthError('Failed to load auth token (no default file or environment variable). Run `./pants auth-acquire` to set up authentication.')
13:49:23.31 [WARN] Auth failed - BuildSense plugin is disabled.
13:49:54.28 [INFO] Completed: Building poetry.pex with 1 requirement: poetry==1.1.8
13:50:41.21 [INFO] Completed: Generate lockfile for flake8
13:53:30.78 [INFO] Completed: Generate lockfile for pytest
13:53:31.35 [INFO] Wrote lockfile for the resolve `pytest` to 3rdparty/python/lockfiles/pytest.txt
13:53:31.35 [INFO] Wrote lockfile for the resolve `flake8` to 3rdparty/python/lockfiles/flake8.txt
13:53:51.93 [INFO] Completed: Building dockerfile_parser.pex from dockerfile-parser_default_lockfile.txt
13:54:29.16 [ERROR] Exception caught: (pants.engine.internals.scheduler.ExecutionError)
  File "/home/jsirois/dev/pantsbuild/jsirois-pants/src/python/pants/bin/local_pants_runner.py", line 235, in _run_inner
    return self._perform_run(goals)
  File "/home/jsirois/dev/pantsbuild/jsirois-pants/src/python/pants/bin/local_pants_runner.py", line 174, in _perform_run
    return self._perform_run_body(goals, poll=False)
  File "/home/jsirois/dev/pantsbuild/jsirois-pants/src/python/pants/bin/local_pants_runner.py", line 196, in _perform_run_body
    poll_delay=(0.1 if poll else None),
  File "/home/jsirois/dev/pantsbuild/jsirois-pants/src/python/pants/init/engine_initializer.py", line 135, in run_goal_rules
    goal_product, params, poll=poll, poll_delay=poll_delay
  File "/home/jsirois/dev/pantsbuild/jsirois-pants/src/python/pants/engine/internals/scheduler.py", line 548, in run_goal_rule
    self._raise_on_error([t for _, t in throws])
  File "/home/jsirois/dev/pantsbuild/jsirois-pants/src/python/pants/engine/internals/scheduler.py", line 512, in _raise_on_error
    wrapped_exceptions=tuple(t.exc for t in throws),

Exception message: 1 Exception encountered:

Engine traceback:
  in select
  in pants.backend.experimental.python.user_lockfiles.generate_user_lockfile_goal
  in pants.engine.internals.graph.transitive_targets
  in pants.engine.internals.graph.transitive_dependency_mapping
  in pants.engine.internals.graph.resolve_targets (src/python/pants/backend/java/dependency_inference/PantsJavaParserLauncher.java:javaparser)
  in pants.engine.internals.graph.resolve_unexpanded_targets (src/python/pants/backend/java/dependency_inference/PantsJavaParserLauncher.java:javaparser)
  in pants.engine.internals.graph.resolve_dependencies (src/python/pants/backend/java/dependency_inference/PantsJavaParserLauncher.java:javaparser)
  in pants.backend.java.dependency_inference.rules.infer_java_dependencies_via_imports (src/python/pants/backend/java/dependency_inference/PantsJavaParserLauncher.java:javaparser)
  in pants.backend.java.dependency_inference.package_mapper.merge_first_party_module_mappings
  in pants.backend.java.dependency_inference.package_mapper.map_first_party_java_targets_to_symbols
  in pants.backend.java.dependency_inference.java_parser.resolve_fallible_result_to_analysis
  in pants.backend.java.dependency_inference.java_parser.analyze_java_source_dependencies
  in pants.backend.java.dependency_inference.java_parser_launcher.build_processors
  in pants.jvm.resolve.coursier_fetch.materialize_classpath
  in pants.jvm.resolve.coursier_fetch.coursier_fetch_lockfile
  in pants.jvm.resolve.coursier_fetch.coursier_fetch_one_coord
  in pants.engine.process.fallible_to_exec_result_or_raise
Traceback (most recent call last):
  File "/home/jsirois/dev/pantsbuild/jsirois-pants/src/python/pants/engine/process.py", line 278, in fallible_to_exec_result_or_raise
    local_cleanup=global_options.options.process_execution_local_cleanup,
pants.engine.process.ProcessExecutionFailure: Process 'Resolving with coursier: com.google.errorprone:error_prone_annotations:2.5.1' failed with exit code 126.
stdout:

stderr:
+ coursier_exe=./cs-x86_64-pc-linux
+ shift
+ json_output_file=coursier_report.json
+ shift
+ ./cs-x86_64-pc-linux fetch --json-output-file=coursier_report.json --intransitive com.google.errorprone:error_prone_annotations:2.5.1
coursier_wrapper_script.sh: line 8: ./cs-x86_64-pc-linux: Text file busy



Use --no-process-execution-local-cleanup to preserve process chroots for inspection.

Traceback (most recent call last):
  File "/home/jsirois/dev/pantsbuild/jsirois-pants/src/python/pants/bin/local_pants_runner.py", line 235, in _run_inner
    return self._perform_run(goals)
  File "/home/jsirois/dev/pantsbuild/jsirois-pants/src/python/pants/bin/local_pants_runner.py", line 174, in _perform_run
    return self._perform_run_body(goals, poll=False)
  File "/home/jsirois/dev/pantsbuild/jsirois-pants/src/python/pants/bin/local_pants_runner.py", line 196, in _perform_run_body
    poll_delay=(0.1 if poll else None),
  File "/home/jsirois/dev/pantsbuild/jsirois-pants/src/python/pants/init/engine_initializer.py", line 135, in run_goal_rules
    goal_product, params, poll=poll, poll_delay=poll_delay
  File "/home/jsirois/dev/pantsbuild/jsirois-pants/src/python/pants/engine/internals/scheduler.py", line 548, in run_goal_rule
    self._raise_on_error([t for _, t in throws])
  File "/home/jsirois/dev/pantsbuild/jsirois-pants/src/python/pants/engine/internals/scheduler.py", line 512, in _raise_on_error
    wrapped_exceptions=tuple(t.exc for t in throws),
pants.engine.internals.scheduler.ExecutionError: 1 Exception encountered:

Engine traceback:
  in select
  in pants.backend.experimental.python.user_lockfiles.generate_user_lockfile_goal
  in pants.engine.internals.graph.transitive_targets
  in pants.engine.internals.graph.transitive_dependency_mapping
  in pants.engine.internals.graph.resolve_targets (src/python/pants/backend/java/dependency_inference/PantsJavaParserLauncher.java:javaparser)
  in pants.engine.internals.graph.resolve_unexpanded_targets (src/python/pants/backend/java/dependency_inference/PantsJavaParserLauncher.java:javaparser)
  in pants.engine.internals.graph.resolve_dependencies (src/python/pants/backend/java/dependency_inference/PantsJavaParserLauncher.java:javaparser)
  in pants.backend.java.dependency_inference.rules.infer_java_dependencies_via_imports (src/python/pants/backend/java/dependency_inference/PantsJavaParserLauncher.java:javaparser)
  in pants.backend.java.dependency_inference.package_mapper.merge_first_party_module_mappings
  in pants.backend.java.dependency_inference.package_mapper.map_first_party_java_targets_to_symbols
  in pants.backend.java.dependency_inference.java_parser.resolve_fallible_result_to_analysis
  in pants.backend.java.dependency_inference.java_parser.analyze_java_source_dependencies
  in pants.backend.java.dependency_inference.java_parser_launcher.build_processors
  in pants.jvm.resolve.coursier_fetch.materialize_classpath
  in pants.jvm.resolve.coursier_fetch.coursier_fetch_lockfile
  in pants.jvm.resolve.coursier_fetch.coursier_fetch_one_coord
  in pants.engine.process.fallible_to_exec_result_or_raise
Traceback (most recent call last):
  File "/home/jsirois/dev/pantsbuild/jsirois-pants/src/python/pants/engine/process.py", line 278, in fallible_to_exec_result_or_raise
    local_cleanup=global_options.options.process_execution_local_cleanup,
pants.engine.process.ProcessExecutionFailure: Process 'Resolving with coursier: com.google.errorprone:error_prone_annotations:2.5.1' failed with exit code 126.
stdout:

stderr:
+ coursier_exe=./cs-x86_64-pc-linux
+ shift
+ json_output_file=coursier_report.json
+ shift
+ ./cs-x86_64-pc-linux fetch --json-output-file=coursier_report.json --intransitive com.google.errorprone:error_prone_annotations:2.5.1
coursier_wrapper_script.sh: line 8: ./cs-x86_64-pc-linux: Text file busy



Use --no-process-execution-local-cleanup to preserve process chroots for inspection.


Use -ldebug for more logs. 
See https://www.pantsbuild.org/v2.8/docs/troubleshooting for common issues.
Consider reaching out for help: https://www.pantsbuild.org/v2.8/docs/getting-help
@jsirois jsirois added bug backend: JVM JVM backend-related issues labels Oct 30, 2021
@jsirois
Copy link
Contributor Author

jsirois commented Oct 30, 2021

I'm pretty sure this is because the executable is not in the chroot (bash):

process_result = await Get(
ProcessResult,
Process(
argv=coursier.args(
[coursier_report_file_name, "--intransitive", request.coord.to_coord_str()],
wrapper=[bash.path, coursier.wrapper_script],
),
input_digest=coursier.digest,
output_directories=("classpath",),
output_files=(coursier_report_file_name,),
append_only_caches=coursier.append_only_caches,
env=coursier.env,
description=f"Resolving with coursier: {request.coord.to_coord_str()}",
level=LogLevel.DEBUG,
),
)

But my mitigation of yore looks explicitly for that condition to trigger fancy retries:
// See the documentation of the `CapturedWorkdir::run_in_workdir` method, but `exclusive_spawn`
// indicates the binary we're spawning was written out by the current thread, and, as such,
// there may be open file handles against it. This will occur whenever a concurrent call of this
// method proceeds through its fork point
// (https://pubs.opengroup.org/onlinepubs/009695399/functions/fork.html) while the current
// thread is in the middle of writing the binary and thus captures a clone of the open file
// handle, but that concurrent call has not yet gotten to its exec point
// (https://pubs.opengroup.org/onlinepubs/009695399/functions/exec.html) where the operating
// system will close the cloned file handle (via O_CLOEXEC being set on all files opened by
// Rust). To prevent a race like this holding this thread's binary open leading to an ETXTBSY
// (https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html) error, we
// maintain RwLock that allows non-`exclusive_spawn` binaries to spawn concurrently but ensures
// all such concurrent spawns have completed (and thus closed any cloned file handles) before
// proceeding to spawn the `exclusive_spawn` binary this thread has written.
//
// See: https://github.com/golang/go/issues/22315 for an excellent description of this generic
// unix problem.
let mut fork_exec = move || command.spawn(Stdio::piped(), Stdio::piped());
let mut child = {
if exclusive_spawn {
let _write_locked = self.spawn_lock.write().await;
// Despite the mitigations taken against racing our own forks, forks can happen in our
// process but outside of our control (in libraries). As such, we back-stop by sleeping and
// trying again for a while if we do hit one of these fork races we do not control.
const MAX_ETXTBSY_WAIT: Duration = Duration::from_millis(100);
let mut retries: u32 = 0;
let mut sleep_millis = 1;
let start_time = std::time::Instant::now();
loop {
match fork_exec() {
Err(e) => {
if e.raw_os_error() == Some(libc::ETXTBSY) && start_time.elapsed() < MAX_ETXTBSY_WAIT
{
tokio::time::sleep(std::time::Duration::from_millis(sleep_millis)).await;
retries += 1;
sleep_millis *= 2;
continue;
} else if retries > 0 {
break Err(format!(
"Error launching process after {} {} for ETXTBSY. Final error was: {:?}",
retries,
if retries == 1 { "retry" } else { "retries" },
e
));
} else {
break Err(format!("Error launching process: {:?}", e));
}
}
Ok(child) => break Ok(child),
}
}
} else {
let _read_locked = self.spawn_lock.read().await;
fork_exec().map_err(|e| format!("Error launching process: {:?}", e))
}
}?;

It's probably the case that we just always need to use the mitigation unconditionally.

@jsirois jsirois self-assigned this Oct 30, 2021
jsirois added a commit to jsirois/pants that referenced this issue Oct 30, 2021
Some of our Processes use system binaries to run scripts in the input
digest of the Process. This foiled the original ETXTBSY fix which was
only used when argv0 was a member of the input digest. Coursier fetch
was an example of this, using the system provided bash to execute a
script in the input digest which in turn executed a binary in the input
digest.

Fix this by just pessimistically assuming all Process invocations will
execute files in the input digest and always performing the mitigation.

Fixes pantsbuild#13424

[ci skip-build-wheels]
@jsirois
Copy link
Contributor Author

jsirois commented Oct 31, 2021

Hrm, I just realized always applying the ETXTBSY mitigation won't help this case at all.

To clarify this case, we have the following spawn chain:

pants engine
 -> /usr/bin/bash ./coursier_wrapper_script.sh
    -> ./cs-x86_64-pc-linux

In other words; although this fix originates from the pants engine code and mitigates the top level process encountering ETXTBSY, it does nothing for the child ./cs-x86_64-pc-linux process spawn since that spawn is fully controlled by bash and runs concurrent-to-or-after our process spawn of /usr/bin/bash ./coursier_wrapper_script.sh returns control to us and we exit our lock. In #13424 it was exactly that child process spawn from within bash that had the issue.

The only idea I have is to add an intrinsic shim launcher binary that can be requested as part of an input digest that contains the same code as the relevant part of the Pants engine. When that binary is present in the input digest, the pants engine executes it noop using the mitigation to prove its viable; then proceeds with the user process invocation. Inside the user process, instead of invoking processes directly in bash, they use the shim launcher binary. So instead of ./cs-x86_64-pc-linux ... the bash would use ./spawn ./cs-x86_64-pc-linux ... assuming the shim launcher binary was in the input digest at ./spawn.

@jsirois
Copy link
Contributor Author

jsirois commented Oct 31, 2021

I guess - more simply - the spawn shim could be requested in the Process struct and then the launch from Rust would be of [./spawn, *args]. The existing mitigation would be applied to this and implicitly prove ./spawn was viable before the bash script it indirectly launches in-turn tried to use ./spawn.

@tdyas
Copy link
Contributor

tdyas commented Oct 31, 2021

As background for other readers: Longer-term fixes in the Linux kernel still seem to face skepticism: I found the most recent attempt to get a O_CLOFORK open flag added to Linux (which would fix the issue by closing the descriptors upon fork so that the child processes never see them): https://lkml.org/lkml/2020/4/20/117.

Maybe we should start advocating for that change? Maybe even team up with other build systems (e.g., Bazel)?

@tdyas
Copy link
Contributor

tdyas commented Oct 31, 2021

For a near-term solution, would the following work (at least on Linux)?

Spawn a thread pool that will be used for writing out files to execution sandboxes. When the threads first boot, they call the unshare system call to unshare their file descriptors. Since the threads continue to share memory with the other threads in the process, they can still receive requests via a shared queue (since mutex / condvar should just depend on shared memory). Since the threads in the write-out thread pool will never fork and don't share file descriptors with the other threads in the process, any file descriptors used by a write-out thread should not leak into a child process.

For macOS (or other UNIX systems without an unshare or clone system call), Pants could spawn a Rust-only process to handle the sandbox setup and receive requests over a pipe.

There is some prior art for this: Buildbox (used by BuildGrid) has the notion of a "Local CAS Service" which includes sandbox setup via a StageTree gRPC request. https://gitlab.com/BuildGrid/buildbox/buildbox-common/-/blob/master/protos/build/buildgrid/local_cas.proto#L94

An alternate solution might just adopt the gRPC approach instead of using unshared threads. Pants will spawn a separate sandbox-setup process which receives gRPC requests to setup sandboxes. This might assist with using FUSE for sandbox setup by delegating the whole sandbox setup to a separate process which can choose how to setup the sandbox.

@jsirois
Copy link
Contributor Author

jsirois commented Nov 1, 2021

I like the separate process idea. There is 0 ambiguity then and it works for all OSes we'll support.

@stuhood
Copy link
Member

stuhood commented Nov 1, 2021

Doing this work from dedicated threads or a separate process sounds feasible. But I think that at least for the next few weeks we should likely hack in a workaround until we've finished stabilizing the JVM: essentially, edit the wrapper scripts to include a bash retry loop. That's obviously not scalable, but the board is pretty full: https://github.com/pantsbuild/pants/projects/22?fullscreen=true , and I'd like to clear up a few more unknowns before investing time here.

@stuhood
Copy link
Member

stuhood commented Dec 13, 2021

#13848 should have made this issue much less frequent, since we will materialize the coursier binary exactly once per run.

stuhood added a commit that referenced this issue Mar 16, 2022
…ies. (#14812)

The `exclusive_spawn` facility to avoid/retry for `ExecutableFileBusy` / "Text file busy" is triggered by having materialized `arg[0]` of a process into the process sandbox. But in the presence of a `Process::working_directory` and a relative path as `arg[0]`, the facility was not being triggered (since validation of `arg[0]` as a `RelativePath` would fail due to it escaping its root).

This relates to #13424 (which would remove the need for the `exclusive_spawn` facility), but does not fix it: only ensure that we handle an existing known case.

[ci skip-build-wheels]
stuhood added a commit to stuhood/pants that referenced this issue Mar 16, 2022
…ies. (pantsbuild#14812)

The `exclusive_spawn` facility to avoid/retry for `ExecutableFileBusy` / "Text file busy" is triggered by having materialized `arg[0]` of a process into the process sandbox. But in the presence of a `Process::working_directory` and a relative path as `arg[0]`, the facility was not being triggered (since validation of `arg[0]` as a `RelativePath` would fail due to it escaping its root).

This relates to pantsbuild#13424 (which would remove the need for the `exclusive_spawn` facility), but does not fix it: only ensure that we handle an existing known case.

[ci skip-build-wheels]
stuhood added a commit that referenced this issue Mar 16, 2022
…ies. (cherrypick of #14812) (#14816)

The `exclusive_spawn` facility to avoid/retry for `ExecutableFileBusy` / "Text file busy" is triggered by having materialized `arg[0]` of a process into the process sandbox. But in the presence of a `Process::working_directory` and a relative path as `arg[0]`, the facility was not being triggered (since validation of `arg[0]` as a `RelativePath` would fail due to it escaping its root).

This relates to #13424 (which would remove the need for the `exclusive_spawn` facility), but does not fix it: only ensure that we handle an existing known case.

[ci skip-build-wheels]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend: JVM JVM backend-related issues bug estimate: <1W
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants