ctx.actions.run fails to invoke batch scripts when running via RBE on Windows #11636

Closed
sunjayBhatia opened this issue Jun 24, 2020 · 15 comments
Labels
area-Windows Windows-specific issues and feature requests P3 We're not considering working on this, but happy to review a PR. (No assignee) stale Issues or PRs that are stale (no activity for 30 days) team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website type: bug

Comments

@sunjayBhatia

Description of the problem / feature request:

Executing a batch script with ctx.actions.run fails when running via RBE on Windows. Running the exact same build locally succeeds without a hitch. It appears Bazel may be passing a Unix-style path rather than a Windows path when executing the rule action in RBE. From the linked minimal repro, we see:

ERROR: C:/source/BUILD:7:1: Couldn't build file output_regular_path.txt: BatchExecuteWithRegularPath output_regular_path.txt failed (Exit 1): script_regular_path.bat failed: error executing command
  cd C:/_eb/execroot/ctx_actions_run_rbe
  SET SOME_ENV_VAR=some_value
  bazel-out/x64_windows-fastbuild/bin/script_regular_path.bat
Execution platform: @rbe_windows_msvc_cl//config:platform
'bazel-out' is not recognized as an internal or external command,
operable program or batch file.
ERROR: C:/source/BUILD:3:1: Couldn't build file output_windows_path.txt: BatchExecuteWithWindowsPath output_windows_path.txt failed (Exit 1): script_windows_path.bat failed: error executing command
  cd C:/_eb/execroot/ctx_actions_run_rbe
  SET SOME_ENV_VAR=some_value
  bazel-out/x64_windows-fastbuild/bin/script_windows_path.bat
Execution platform: @rbe_windows_msvc_cl//config:platform
'bazel-out' is not recognized as an internal or external command,
operable program or batch file.

It appears that cmd.exe is interpreting bazel-out as the command to run rather than the full path. This does not occur when running the rule locally. Purposefully using a Windows path string for the executable field of the rule action makes no difference: we get the same result as when we use a File type.

This issue could possibly be fixed by performing the appropriate quoting/passing the correct flags to cmd.exe or ensuring a Windows style path is always used.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

See: https://github.com/greenhouse-org/bazel-issue-repro/tree/master/ctx_actions_run_rbe

The README in this repo should have details for reproducing the problem and example failure output.

What operating system are you running Bazel on?

Windows

What's the output of bazel info release?

release 3.1.0

Any other information, logs, or outputs that you want to share?

We ran into this in the Envoy project as a result of bumping to the latest rules_go which switched from running go tool commands via bash shell to the native ctx.actions.run on Windows.

Using the --experimental_remote_grpc_log flag to generate the grpc log for a failing RBE build and the remote client tool we are able to see the failed and successful actions and representation of the commands that are run. It appears from the output below that Bazel is sending a command to the RBE service that uses an invalid path for a batch script.

I believe the invocation ID from this run was 8d806e2e-c3cb-4a69-becb-05571c76f75c, (but I may be mixing up attempts).

One of the failed actions from the issue repro (digest 89a618e0b38a6c9967729316db828b66f4a5e7d7764c514bcd83e87b9b32c8d5/141):

$ ./bazel-bin/remote_client $REMOTE_CLIENT_FLAGS --grpc_log=$PWD/grpc.log show_action --digest=89a618e0b38a6c9967729316db828b66f4a5e7d7764c514bcd83e87b9b32c8d5/141
Command [digest: e51efdab6899ffd275023f75f791bd380375d55cdab095f1ee81cb01c7305615/312]:
SOME_ENV_VAR=some_value \
  bazel-out/x64_windows-fastbuild/bin/script_regular_path.bat
 
Input files [total: 1, root Directory digest: 47f9f37722730278c0a8568b885c47f0a0fbb36926d0e2712bf1fefe1420c259/83]:
bazel-out [Directory digest: 65b4e6b8655301599641f88388ca0cdc2a06b7b1c91efd2984ef1b1039889e6e/95]
bazel-out/x64_windows-fastbuild [Directory digest: f20d65b30f1cca9b088733264a48e41dfd8884e251b3896a34b98a64ca935617/77]
bazel-out/x64_windows-fastbuild/bin [Directory digest: 357fe978c27bd53d468f1e982c8567691a8cc46a3010647af5597f3d568d6617/99]
bazel-out/x64_windows-fastbuild/bin/script_regular_path.bat [File content digest: 621fd3ac7095509c3d715705c884365ac4aa812e47cd10c5de77576dc5bc9461/70]
 
Output files:
bazel-out/x64_windows-fastbuild/bin/output_regular_path.txt
 
Output directories:
(none)
 
Platform:
properties {
  name: "OSFamily"
  value: "Windows"
}
properties {
  name: "container-image"
  value: "docker://gcr.io/envoy-ci/envoy-build-windows@sha256:02d4ff5c2e4c703944e4ec3770c5fa51cdfc6781f95607e91648e19c14b38346"
}

Another failed action looks very similar (digest 27e66ea85ab4978fb639e57e30206b8e5a598bade43746e55c46250e42c6d58d/141).

The successful action from the issue repro (digest 4306f162f55bdc0549d4d986d63ce2985e7a647b3c92e8f9fdef3b4b9ef903ad/139):

$ ./bazel-bin/remote_client $REMOTE_CLIENT_FLAGS --grpc_log=$PWD/grpc.log show_action --digest=4306f162f55bdc0549d4d986d63ce2985e7a647b3c92e8f9fdef3b4b9ef903ad/139
Command [digest: 95f4252117f582c920e0f5eb2fb9cc916468ae79d27d30facf011cef9c245da1/417]:
SOME_ENV_VAR=some_value \
  cmd.exe /S /C '(echo %cd% & echo --- & dir & echo --- & dir C:\ & echo --- & set) > bazel-out\x64_windows-fastbuild\bin\output_execution_environment.txt'
 
Input files [total: 0, root Directory digest: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855/0]:
 
Output files:
bazel-out/x64_windows-fastbuild/bin/output_execution_environment.txt
 
Output directories:
(none)
 
Platform:
properties {
  name: "OSFamily"
  value: "Windows"
}
properties {
  name: "container-image"
  value: "docker://gcr.io/envoy-ci/envoy-build-windows@sha256:02d4ff5c2e4c703944e4ec3770c5fa51cdfc6781f95607e91648e19c14b38346"
}

With the inputs field of the //:batch_execute_rule_windows_path rule commented out, the local build fails with:

> bazel --output_base=C:/_eb build --verbose_failures --keep_going //:batch_execute_rule_windows_path
Starting local Bazel server and connecting to it...
INFO: Analyzed target //:batch_execute_rule_windows_path (4 packages loaded, 6 targets configured).
INFO: Found 1 target...
ERROR: C:/source/BUILD:3:1: Couldn't build file output_windows_path.txt: BatchExecuteWithWindowsPath output_windows_path.txt failed (Exit -1): script_windows_path.bat failed: error executing command
  cd C:/_eb/execroot/ctx_actions_run_rbe
  SET SOME_ENV_VAR=some_value
  bazel-out/x64_windows-fastbuild/bin/script_windows_path.bat
Execution platform: @local_config_platform//:host. Note: Remote connection/protocol failed with: execution failed
Action failed to execute: java.io.IOException: ERROR: src/main/native/windows/process.cc(199): CreateProcessW("C:\_eb\execroot\ctx_actions_run_rbe\bazel-out\x64_windows-fastbuild\bin\script_windows_path.bat"): The system cannot find the file specified.
(error: 2)
Target //:batch_execute_rule_windows_path failed to build

This somewhat makes sense: there is no file input to the rule, and the path is converted properly. However, the fact that Bazel is trying to run a .bat file with CreateProcessW is a bit odd. The exact same error is output when we use an unaltered path string instead of a File type (without replacing / with \\) for the executable.

Running the same thing remotely, the build fails with the stack trace and error:

ERROR: C:/source/BUILD:3:1: Couldn't build file output_windows_path.txt: BatchExecuteWithWindowsPath output_windows_path.txt failed (Exit 34): java.io.IOException: com.google.devtools.build.lib.remote.ExecutionStatusException: INVALID_ARGUMENT: docker: Error response from daemon: container ebc98b3709257313a06f0c966af348bb1fdeba4baefaede90810ee34fd62bf31 encountered an error during CreateProcess: failure in a Windows system call: The system cannot find the file specified. (0x2) extra info: {"CommandLine":"bazel-out/x64_windows-fastbuild/bin/script_windows_path.bat","WorkingDirectory":"C:\\botcode\\w","Environment":{"HOST_CONTAINER_NAME":"rbe-container-34d0f566-c928-4b66-b765-361a4a5bd7c6","MSYS2_ARG_CONV_EXCL":"*","SOME_ENV_VAR":"some_value","TEMP":"C:\\Windows\\Temp","TMP":"C:\\Windows\\Temp","TMPDIR":"C:\\Windows\\Temp"},"CreateStdInPipe":true,"CreateStdOutPipe":true,"CreateStdErrPipe":true,"ConsoleSize":[0,0]}.
        at com.google.devtools.build.lib.remote.GrpcRemoteExecutor.executeRemotely(GrpcRemoteExecutor.java:192)
        at com.google.devtools.build.lib.remote.RemoteSpawnRunner.lambda$exec$0(RemoteSpawnRunner.java:324)
        at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:237)
        at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:116)
        at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:304)
        at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:238)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:126)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:96)
        at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:39)
        at com.google.devtools.build.lib.analysis.actions.SpawnAction.beginExecution(SpawnAction.java:327)
        at com.google.devtools.build.lib.actions.Action.execute(Action.java:124)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:961)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1109)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:1080)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:137)
        at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:80)
        at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:601)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:907)
        at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:297)
        at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)
        at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:399)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
Caused by: com.google.devtools.build.lib.remote.ExecutionStatusException: INVALID_ARGUMENT: docker: Error response from daemon: container ebc98b3709257313a06f0c966af348bb1fdeba4baefaede90810ee34fd62bf31 encountered an error during CreateProcess: failure in a Windows system cal$
ONV_EXCL":"*","SOME_ENV_VAR":"some_value","TEMP":"C:\\Windows\\Temp","TMP":"C:\\Windows\\Temp","TMPDIR":"C:\\Windows\\Temp"},"CreateStdInPipe":true,"CreateStdOutPipe":true,"CreateStdErrPipe":true,"ConsoleSize":[0,0]}.
        at com.google.devtools.build.lib.remote.GrpcRemoteExecutor.handleStatus(GrpcRemoteExecutor.java:69)
        at com.google.devtools.build.lib.remote.GrpcRemoteExecutor.getOperationResponse(GrpcRemoteExecutor.java:81)
        at com.google.devtools.build.lib.remote.GrpcRemoteExecutor.lambda$executeRemotely$0(GrpcRemoteExecutor.java:155)
        at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:237)
        at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:116)
        at com.google.devtools.build.lib.remote.GrpcRemoteExecutor.executeRemotely(GrpcRemoteExecutor.java:134)
        ... 24 more

It seems something along the way isn’t converting the executable path to a Windows path when we’re executing remotely, though the actual working directory seems to be set appropriately for the remote container environment.

@sunjayBhatia
Author

cc @wrowe @davinci26 @nigriMSFT

@oquenchil
Contributor

cc @meteorcloudy

@jin added the area-Windows, team-OSS, type: bug, and untriaged labels Jun 29, 2020
@meteorcloudy
Member

Yeah, Windows requires the path of the binary to run has to be a Windows style path.

I believe it succeeded locally because we preprocess argv0 (replacing / with \) in WindowsSubprocessFactory:

public static String processArgv0(String argv0) {
  // Normalize the path and make it Windows-style.
  // If argv0 is at least MAX_PATH (260 chars) long, createNativeProcess calls GetShortPathNameW
  // to obtain a 8dot3 name for it (thereby support long paths in CreateProcessA), but then argv0
  // must be prefixed with "\\?\" for GetShortPathNameW to work, so it also must be an absolute,
  // normalized, Windows-style path.
  // Therefore if it's absolute, then normalize it also.
  // If it's not absolute, then it cannot be longer than MAX_PATH, since MAX_PATH also limits the
  // length of file names.
  PathFragment argv0fragment = PathFragment.create(argv0);
  return argv0fragment.isAbsolute() ? argv0fragment.getPathString().replace('/', '\\') : argv0;
}
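
To make the behavior of the snippet above concrete, here is a minimal standalone sketch. It does not use Bazel's PathFragment; the `looksAbsolute` helper and the `Argv0Sketch` class name are hypothetical stand-ins introduced only for illustration. The point it shows is that only an absolute argv0 is rewritten, while a relative path such as the one in the failing actions is returned unchanged.

```java
final class Argv0Sketch {
    // Hypothetical stand-in for PathFragment.isAbsolute(): treats a path that starts with
    // a separator, or with a drive letter followed by ':' and a separator, as absolute.
    static boolean looksAbsolute(String path) {
        if (path.startsWith("/") || path.startsWith("\\")) {
            return true;
        }
        return path.length() >= 3
            && Character.isLetter(path.charAt(0))
            && path.charAt(1) == ':'
            && (path.charAt(2) == '/' || path.charAt(2) == '\\');
    }

    // Same shape as processArgv0 above: absolute paths get backslashes,
    // relative paths are passed through untouched.
    static String processArgv0(String argv0) {
        return looksAbsolute(argv0) ? argv0.replace('/', '\\') : argv0;
    }

    public static void main(String[] args) {
        System.out.println(processArgv0("C:/_eb/execroot/ctx_actions_run_rbe/script.bat"));
        System.out.println(
            processArgv0("bazel-out/x64_windows-fastbuild/bin/script_regular_path.bat"));
    }
}
```

Under these assumptions, the relative `bazel-out/...` path would still contain forward slashes after this step alone.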

@EricBurnett I assume there is no such preprocessing in the RBE Windows runner? Can we add this support so that no matter how the binary path is passed, we can always run it on RBE?

@ulfjack
Contributor

ulfjack commented Jun 29, 2020

It seems better (to me) to fix this on Bazel's side, than work around Bazel's shortcoming in the remote system. In the worst case, we can replace the path in the RemoteSpawnRunner, as long as we know that it's supposed to be run on Windows. Why does this work in all other cases, but not in this one?

@meteorcloudy
Member

OK, I did some debugging and found another place we made sure the first argument is a valid path to execute.

// SubprocessBuilder does not accept relative paths for the first argument, even though
// Command does. We sometimes get relative paths here, so we need to handle it.
File argv0 = new File(args.get(0));
if (!argv0.isAbsolute() && argv0.getParent() != null) {
  List<String> newArgs = new ArrayList<>(args);
  newArgs.set(0, new File(execRoot.getPathFile(), newArgs.get(0)).getAbsolutePath());
  args = ImmutableList.copyOf(newArgs);
}
subprocessBuilder.setArgv(args);
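
The same check can be tried in isolation with plain java.io.File. The sketch below is an illustration only: the class and method names are hypothetical, and the exec root is passed as a File rather than Bazel's Path type. A relative argv0 with a parent directory is resolved against the exec root, while a bare name such as cmd.exe is left alone. On Windows, java.io.File normalizes separators to backslashes, which would be consistent with the CreateProcessW path in the local error output quoted earlier in this issue.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AbsolutizeArgv0Sketch {
    // Resolves a relative argv0 that has a parent directory against execRoot;
    // absolute paths and bare command names are returned unchanged.
    static List<String> absolutizeArgv0(List<String> args, File execRoot) {
        File argv0 = new File(args.get(0));
        if (argv0.isAbsolute() || argv0.getParent() == null) {
            return args;
        }
        List<String> newArgs = new ArrayList<>(args);
        newArgs.set(0, new File(execRoot, args.get(0)).getAbsolutePath());
        return newArgs;
    }

    public static void main(String[] args) {
        // Exec root taken from the error output in this issue.
        File execRoot = new File("C:/_eb/execroot/ctx_actions_run_rbe");
        System.out.println(absolutizeArgv0(
            Arrays.asList("bazel-out/x64_windows-fastbuild/bin/script_regular_path.bat"),
            execRoot));
        System.out.println(absolutizeArgv0(Arrays.asList("cmd.exe", "/S", "/C"), execRoot));
    }
}
```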

@ulfjack, from the blame history, you are the author of this, do you think it makes sense to implement the same logic in RemoteSpawnRunner?

Another thing I want to point out is that CreateProcessW can only take a single command string, but we pass a list of arguments to the remote runner. So remote systems do need to pay attention to how they parse the arguments, ideally in the same way as:

// DO NOT quote argv0, createProcess will do it for us.
String argv0 = processArgv0(argv.get(0));
String argvRest = argv.size() > 1 ? escapeArgvRest(argv.subList(1, argv.size())) : "";
byte[] env = convertEnvToNative(builder.getEnv());
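
For readers unfamiliar with why a list of arguments needs care before it becomes a single command string, the following sketch applies the commonly documented quoting rules of the Microsoft C runtime argument parser (quote arguments containing whitespace or quotes, double backslashes that precede a quote). It is not Bazel's escapeArgvRest, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class WindowsCommandLineSketch {
    private static boolean needsQuoting(String arg) {
        for (char c : arg.toCharArray()) {
            if (c == ' ' || c == '\t' || c == '"') {
                return true;
            }
        }
        return false;
    }

    private static void appendBackslashes(StringBuilder sb, int count) {
        for (int i = 0; i < count; i++) {
            sb.append('\\');
        }
    }

    // Quotes one argument so that the MSVC-style parser recovers it verbatim.
    static String escapeArg(String arg) {
        if (!arg.isEmpty() && !needsQuoting(arg)) {
            return arg;
        }
        StringBuilder sb = new StringBuilder("\"");
        int backslashes = 0;
        for (char c : arg.toCharArray()) {
            if (c == '\\') {
                backslashes++;
                continue;
            }
            if (c == '"') {
                // Backslashes before a quote are doubled, plus one to escape the quote.
                appendBackslashes(sb, backslashes * 2 + 1);
            } else {
                appendBackslashes(sb, backslashes);
            }
            sb.append(c);
            backslashes = 0;
        }
        // Trailing backslashes are doubled so they do not escape the closing quote.
        appendBackslashes(sb, backslashes * 2);
        return sb.append('"').toString();
    }

    // Joins already-split arguments (everything after argv0) into one string.
    static String joinEscaped(List<String> args) {
        StringBuilder sb = new StringBuilder();
        for (String arg : args) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(escapeArg(arg));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinEscaped(Arrays.asList("/S", "a b", "C:\\dir with space\\")));
    }
}
```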

@EricBurnett

@meteorcloudy normally I'd say "bazel and RBE should both do what the RE-API specifies", but it's not totally clear on this point.

// The arguments to the command. The first argument must be the path to the
// executable, which must be either a relative path, in which case it is
// evaluated with respect to the input root, or an absolute path.
repeated string arguments = 1;

Elsewhere in the API (e.g. output_paths), relative paths are specified with '/' as a path specifier. I looked for 'absolute' but looks like the only place that made it into the spec is symlink targets, and it's (imho) wrong there: requiring absolute paths to start with '/'.

I think I've stated elsewhere my preference for absolute paths is that they be specified in os-specific terms - anything else is not expressive enough. (Windows alone has multiple valid absolute path formats). I'd suggest we document that here as well.

For relative paths, I'm not certain - sticking with '/' separated paths probably works, in which case it's the server responsibility to translate a "logical" relative path into the appropriate system-specific path for execution. That at least makes some sense, in that a relative path is really intended to be processed by the server, which has to at least transform it into the appropriate path for execution. (Noting our mis-specification that it's relative to the input root, but there may be a working_directory specified, in which case the server has to re-compute it based on the directory it's going to get executed in or make it absolute for execution anyways.)

I could also see the argument that it should be a relative path using os-specific separators, but given our use elsewhere in the API of '/' intentionally, I don't think that would do any harm here as well? In which case this becomes a request for spec clarification ( @bergsieker FYI), after which point it would become an RBE bug that we're not rewriting the path we execute from "logical" relative path to something system-specific for execution.
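
As an illustration of the "logical relative path" option discussed above, a server-side translation step could look roughly like the sketch below. This is not how RBE behaves today, and the class and method names are hypothetical. The only input taken from this issue is the OSFamily platform property, whose value is "Windows" in the actions shown earlier.

```java
final class LogicalPathSketch {
    // Treats paths starting with a separator or a drive letter plus ':' as absolute.
    static boolean isRelative(String path) {
        if (path.startsWith("/") || path.startsWith("\\")) {
            return false;
        }
        return !(path.length() >= 2
            && Character.isLetter(path.charAt(0))
            && path.charAt(1) == ':');
    }

    // Rewrites a '/'-separated relative argv0 into the executor's native form.
    // Absolute paths are assumed to already be OS-specific and are left alone.
    static String toExecutionPath(String argv0, String osFamily) {
        if (isRelative(argv0) && "Windows".equalsIgnoreCase(osFamily)) {
            return argv0.replace('/', '\\');
        }
        return argv0;
    }

    public static void main(String[] args) {
        System.out.println(toExecutionPath(
            "bazel-out/x64_windows-fastbuild/bin/script_regular_path.bat", "Windows"));
    }
}
```

Only the first argument is handled here; as noted in the thread, nested arguments cannot be rewritten this way without guessing at the command line.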

@ulfjack
Contributor

ulfjack commented Jun 29, 2020

The executable path may not be the first element of the arguments array, e.g., because of test and coverage wrappers, or a --run_under tool. If Bazel doesn't convert the paths correctly, every single such tool would have to do the translation, not just RBE.

@ulfjack
Contributor

ulfjack commented Jun 29, 2020

(or RBE would have to guess or reverse engineer the entire command line, which seems clearly undesirable)

@EricBurnett

Nested arguments should definitely be OS-specific, yes. I'm only thinking about the first one, which is special in the sense that the remote server must process and possibly modify it. But I'd also be fine choosing to specify that it must always be os-specific (in which case all arguments are then os-specific, as you'd probably expect) - I don't think it makes the remote server's job any harder, and may actually make it easier to process said relative path in an os-specific way.

@meteorcloudy
Member

If Bazel doesn't convert the paths correctly, every single such tool would have to do the translation, not just RBE.

@ulfjack Do you want to help fix the RemoteSpawnRunner?

@ulfjack
Contributor

ulfjack commented Jun 29, 2020

I can help fix the problem. I'm not sure RemoteSpawnRunner is the right place for the fix, though - we need to find out where the path is coming from. Things are a bit hectic right now over here but should slow down this week.

@philwo added the P3 label and removed the untriaged label Nov 26, 2020
@philwo
Member

philwo commented Nov 26, 2020

@ulfjack Hello :) How are things looking on this issue - would you have time to help with a fix?

I marked this P3 for now, meaning "we would happily take and review a fix for this".

@wrowe
Contributor

wrowe commented Jan 13, 2021

Yeah, Windows requires the path of the binary to run has to be a Windows style path.

FYI, this is simply not true. Any "quoted" path containing forward slashes should be invoked correctly. Only when paths are unquoted does it become ambiguous whether the path name segments are part of the name or additional /options.

Using a quoted batch file pathname upon invocation should resolve this issue, resolves spaces in pathnames, and is portable to all platforms.
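
A rough sketch of what that quoting suggestion might look like when building a cmd.exe command string is shown below. It only constructs the string and has not been verified against a Windows executor; the class and method names are hypothetical. It relies on the documented behavior of cmd.exe /S, which strips the first and last quote of the text after /C, so the script path itself stays quoted.

```java
final class QuotedBatchInvocationSketch {
    // Builds: cmd.exe /S /C ""<script>""
    // After /S strips the outer pair of quotes, cmd.exe sees "<script>".
    static String wrapBatchScript(String scriptPath) {
        if (scriptPath.indexOf('"') >= 0) {
            throw new IllegalArgumentException("path must not contain quotes: " + scriptPath);
        }
        return "cmd.exe /S /C \"\"" + scriptPath + "\"\"";
    }

    public static void main(String[] args) {
        System.out.println(
            wrapBatchScript("bazel-out/x64_windows-fastbuild/bin/script_regular_path.bat"));
    }
}
```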

@github-actions

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 2+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

@github-actions bot added the stale label Apr 26, 2023
@github-actions

This issue has been automatically closed due to inactivity. If you're still interested in pursuing this, please reach out to the triage team (@bazelbuild/triage). Thanks!

@github-actions bot closed this as not planned May 11, 2023