Fix (and reenable) run_test.test_consistent_command_line_encoding #1775

dslomov · 2016-09-14T15:24:21Z

Broken on CI like this:
http://ci.bazel.io/job/bazel-tests/BAZEL_VERSION=HEAD,PLATFORM_NAME=linux-x86_64/223/testReport/junit/(root)/(empty)/test_consistent_command_line_encoding/

(other platforms too).
Works locally on some machines.
Probably related to #1766

aehlig · 2016-09-15T09:39:10Z

Works locally on some machines.

Some remarks:

At least on our CI, it seems to coincide with the java version (the older
one, running on ubuntu 14.04, causing the problem).
The sequence '\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x84\xC3\x96\xC3\x9C\xC3\x9F'
gets "re-encoded" as '??????????????'---literally 14 question marks (0x3F)!

Klaus Aehlig
Google Germany GmbH, Erika-Mann-Str. 33, 80636 Muenchen
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschaeftsfuehrer: Matthew Scott Sucherman, Paul Terence Manicle

aehlig · 2016-09-16T12:44:18Z

The error happens before RunCommand does anything. printf-debugging with the
following diff

diff --git a/src/main/java/com/google/devtools/build/lib/runtime/commands/RunCommand.java b/src/main/java/com/google/devtools/build/lib/runtime/commands/RunCommand.java
index 26dda27..e3d761f 100644
--- a/src/main/java/com/google/devtools/build/lib/runtime/commands/RunCommand.java
+++ b/src/main/java/com/google/devtools/build/lib/runtime/commands/RunCommand.java
@@ -14,6 +14,10 @@

 package com.google.devtools.build.lib.runtime.commands;

+import static java.nio.charset.StandardCharsets.ISO_8859_1;
+import static java.nio.charset.StandardCharsets.UTF_8;
+
+
 import com.google.common.annotations.VisibleForTesting;
 import com.google.common.collect.ImmutableList;
 import com.google.common.collect.Lists;
@@ -130,6 +134,23 @@ public class RunCommand implements BlazeCommand  {
     // This list should look like: ["//executable:target", "arg1", "arg2"]
     List<String> targetAndArgs = options.getResidue();

+    // DEBUG ONLY
+    for (String arg: targetAndArgs) {
+      System.out.println("XXX");
+      System.out.println(arg);
+      System.out.print("UTF-8 bytes: ");
+      for (byte b : arg.getBytes(UTF_8)) {
+        System.out.print(String.format("%02x ", b));
+      }
+      System.out.println();
+      System.out.print("ISO-8859-1 bytes: ");
+      for (byte b : arg.getBytes(ISO_8859_1)) {
+        System.out.print(String.format("%02x ", b));
+      }
+      System.out.println();
+
+    }
+
     // The user must at the least specify an executable target.
     if (targetAndArgs.isEmpty()) {

Shows that the questionmarks already fall out of the parser (with
the problem probably being earlier); indeed, in the log of the failing
test we find the following.

WARNING: Running Bazel server needs to be killed, because the startup options are different.
XXX
//foo
UTF-8 bytes: 2f 2f 66 6f 6f
ISO-8859-1 bytes: 2f 2f 66 6f 6f
XXX
??????????????
UTF-8 bytes: ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
ISO-8859-1 bytes: 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f
____Loading package: foo
____Loading...

Klaus Aehlig
Google Germany GmbH, Erika-Mann-Str. 33, 80636 Muenchen
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschaeftsfuehrer: Matthew Scott Sucherman, Paul Terence Manicle

aehlig · 2016-09-16T13:33:54Z

Further printf-debugging shows that the problem is in the way bazel is started.

In src/main/cpp/blaze_util_posix.cc b/src/main/cpp/blaze_util_posix.cc
argv[n-1] consists of the octets c3 a4 c3 b6 c3 bc c3 84 c3 96 c3 9c c3 9f 00.
With that argv, we do execv(exe.c_str(), const_cast<char **>(argv));
In src/main/java/com/google/devtools/build/lib/bazel/BazelMain.java's
public static void main(String[] args), the last String of args consists
of question marks (UTF-8 bytes: ef bf bd ef bf bd ef bf bd ef bf bd ef bf
bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
ef bf bd, ISO-8859-1 bytes: 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f).

Klaus Aehlig
Google Germany GmbH, Erika-Mann-Str. 33, 80636 Muenchen
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschaeftsfuehrer: Matthew Scott Sucherman, Paul Terence Manicle

aehlig · 2016-09-16T18:02:06Z

OK, so adding

setenv("LANG", "en_US.UTF-8", 1);
setenv("LC_ALL", "en_US.UTF-8", 1);

in front of the the execv in src/main/cpp/blaze_util_posix.cc (and
including <stdlib.h>) the java invocation gets the arguments without
information loss: the String obtained has as UTF-8 encoding the correct
bytes c3 a4 c3 b6 c3 bc c3 84 c3 96 c3 9c c3 9f. However, when the
the subprocess is called, the string gets rewritten in the default
encoding (which we set to ISO-8859-1) and there it is the octet
sequence e4 f6 fc c4 d6 dc df.

On the other hand, setting the locale to ISO-8859-1 or C, the startup
of java already rewrites the argument to the question marks, so we have
no chance of recovering the original information.

So at the moment, I have no idea on how to fix that encoding mess. On
the other hand, that problem probably has existed since bazel was open
sourced, so probably it is not that urgent.

Klaus Aehlig
Google Germany GmbH, Erika-Mann-Str. 33, 80636 Muenchen
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschaeftsfuehrer: Matthew Scott Sucherman, Paul Terence Manicle

sgowroji · 2023-01-23T09:17:30Z

Hi there! We're doing a clean up of old issues and will be closing this one. Please reopen if you’d like to discuss anything further. We’ll respond as soon as we have the bandwidth/resources to do so.

Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale. This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass. Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK. * Replace ad-hoc conversion logic with the new consistent set of helper functions. * Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior. * Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line. * Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters. * Read the downloader config using Bazel's filesystem implementation. * Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this). * Correctly reencode path strings for `LocalDiffAwareness`. * Correctly reencode the value of `user.dir`. * Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing bazelbuild#1775. * Fix encoding issues in `TargetCompleteEvents`. * Fix encoding issues in `SubprocessFactory` implementations. * Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now. * Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target. Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths. Fixes bazelbuild#1775. Fixes bazelbuild#11602. Fixes bazelbuild#18293. Work towards #374. Work towards bazelbuild#23859. Closes bazelbuild#24010. PiperOrigin-RevId: 694114597 Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23

Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale. This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass. Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK. * Replace ad-hoc conversion logic with the new consistent set of helper functions. * Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior. * Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line. * Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters. * Read the downloader config using Bazel's filesystem implementation. * Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this). * Correctly reencode path strings for `LocalDiffAwareness`. * Correctly reencode the value of `user.dir`. * Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing bazelbuild#1775. * Fix encoding issues in `TargetCompleteEvents`. * Fix encoding issues in `SubprocessFactory` implementations. * Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now. * Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target. Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths. Fixes bazelbuild#1775. Fixes bazelbuild#11602. Fixes bazelbuild#18293. Work towards bazelbuild#374. Work towards bazelbuild#23859. Closes bazelbuild#24010. PiperOrigin-RevId: 694114597 Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23

dslomov added P1 I'll work on this now. (Assignee required) category: misc > misc labels Sep 14, 2016

dslomov assigned aehlig Sep 14, 2016

dslomov changed the title ~~Fix (and reenable) actions_env_test.test_consistent_command_line_encoding~~ Fix (and reenable) run_test.test_consistent_command_line_encoding Sep 14, 2016

aehlig added P2 We'll consider working on this in future. (Assignee optional) and removed P1 I'll work on this now. (Assignee required) labels Sep 17, 2016

aehlig added this to the 1.0 milestone Sep 17, 2016

aehlig removed their assignment Feb 1, 2020

jmmv mentioned this issue May 11, 2020

Inconsistent encoding when running bazel --batch run //foo -- "$arg" on OS X #1766

Closed

meisterT removed this from the 1.0 milestone May 12, 2020

meisterT added area-EngProd Bazel CI, infrastructure, bootstrapping, release, and distribution tooling and removed category: misc > misc labels May 15, 2020

jin added the type: bug label May 19, 2020

philwo added the team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website label Jun 15, 2020

sgowroji closed this as completed Jan 23, 2023

fmeum reopened this Oct 16, 2024

fmeum self-assigned this Oct 16, 2024

fmeum mentioned this issue Oct 16, 2024

Fix most Unicode encoding bugs #24010

Closed

copybara-service bot closed this as completed in a58fe3f Nov 7, 2024

iancha1992 mentioned this issue Nov 8, 2024

[8.0.0] Fix most Unicode encoding bugs #24260

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix (and reenable) run_test.test_consistent_command_line_encoding #1775

Fix (and reenable) run_test.test_consistent_command_line_encoding #1775

dslomov commented Sep 14, 2016

aehlig commented Sep 15, 2016

aehlig commented Sep 16, 2016

aehlig commented Sep 16, 2016

aehlig commented Sep 16, 2016

sgowroji commented Jan 23, 2023

Fix (and reenable) run_test.test_consistent_command_line_encoding #1775

Fix (and reenable) run_test.test_consistent_command_line_encoding #1775

Comments

dslomov commented Sep 14, 2016

aehlig commented Sep 15, 2016

aehlig commented Sep 16, 2016

aehlig commented Sep 16, 2016

aehlig commented Sep 16, 2016

sgowroji commented Jan 23, 2023