Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix (and reenable) run_test.test_consistent_command_line_encoding #1775

Closed
dslomov opened this issue Sep 14, 2016 · 5 comments
Closed

Fix (and reenable) run_test.test_consistent_command_line_encoding #1775

dslomov opened this issue Sep 14, 2016 · 5 comments
Assignees
Labels
area-EngProd Bazel CI, infrastructure, bootstrapping, release, and distribution tooling P2 We'll consider working on this in future. (Assignee optional) team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website type: bug

Comments

@dslomov
Copy link
Contributor

dslomov commented Sep 14, 2016

Broken on CI like this:
http://ci.bazel.io/job/bazel-tests/BAZEL_VERSION=HEAD,PLATFORM_NAME=linux-x86_64/223/testReport/junit/(root)/(empty)/test_consistent_command_line_encoding/

(other platforms too).
Works locally on some machines.
Probably related to #1766

@dslomov dslomov added P1 I'll work on this now. (Assignee required) category: misc > misc labels Sep 14, 2016
@dslomov dslomov changed the title Fix (and reenable) actions_env_test.test_consistent_command_line_encoding Fix (and reenable) run_test.test_consistent_command_line_encoding Sep 14, 2016
@aehlig
Copy link
Contributor

aehlig commented Sep 15, 2016

Works locally on some machines.

Some remarks:

  • At least on our CI, it seems to coincide with the java version (the older
    one, running on ubuntu 14.04, causing the problem).
  • The sequence '\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x84\xC3\x96\xC3\x9C\xC3\x9F'
    gets "re-encoded" as '??????????????'---literally 14 question marks (0x3F)!

Klaus Aehlig
Google Germany GmbH, Erika-Mann-Str. 33, 80636 Muenchen
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschaeftsfuehrer: Matthew Scott Sucherman, Paul Terence Manicle

@aehlig
Copy link
Contributor

aehlig commented Sep 16, 2016

The error happens before RunCommand does anything. printf-debugging with the
following diff

diff --git a/src/main/java/com/google/devtools/build/lib/runtime/commands/RunCommand.java b/src/main/java/com/google/devtools/build/lib/runtime/commands/RunCommand.java
index 26dda27..e3d761f 100644
--- a/src/main/java/com/google/devtools/build/lib/runtime/commands/RunCommand.java
+++ b/src/main/java/com/google/devtools/build/lib/runtime/commands/RunCommand.java
@@ -14,6 +14,10 @@

 package com.google.devtools.build.lib.runtime.commands;

+import static java.nio.charset.StandardCharsets.ISO_8859_1;
+import static java.nio.charset.StandardCharsets.UTF_8;
+
+
 import com.google.common.annotations.VisibleForTesting;
 import com.google.common.collect.ImmutableList;
 import com.google.common.collect.Lists;
@@ -130,6 +134,23 @@ public class RunCommand implements BlazeCommand  {
     // This list should look like: ["//executable:target", "arg1", "arg2"]
     List<String> targetAndArgs = options.getResidue();

+    // DEBUG ONLY
+    for (String arg: targetAndArgs) {
+      System.out.println("XXX");
+      System.out.println(arg);
+      System.out.print("UTF-8 bytes: ");
+      for (byte b : arg.getBytes(UTF_8)) {
+        System.out.print(String.format("%02x ", b));
+      }
+      System.out.println();
+      System.out.print("ISO-8859-1 bytes: ");
+      for (byte b : arg.getBytes(ISO_8859_1)) {
+        System.out.print(String.format("%02x ", b));
+      }
+      System.out.println();
+
+    }
+
     // The user must at the least specify an executable target.
     if (targetAndArgs.isEmpty()) {

Shows that the questionmarks already fall out of the parser (with
the problem probably being earlier); indeed, in the log of the failing
test we find the following.

WARNING: Running Bazel server needs to be killed, because the startup options are different.
XXX
//foo
UTF-8 bytes: 2f 2f 66 6f 6f
ISO-8859-1 bytes: 2f 2f 66 6f 6f
XXX
??????????????
UTF-8 bytes: ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
ISO-8859-1 bytes: 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f
____Loading package: foo
____Loading...

Klaus Aehlig
Google Germany GmbH, Erika-Mann-Str. 33, 80636 Muenchen
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschaeftsfuehrer: Matthew Scott Sucherman, Paul Terence Manicle

@aehlig
Copy link
Contributor

aehlig commented Sep 16, 2016

Further printf-debugging shows that the problem is in the way bazel is started.

  • In src/main/cpp/blaze_util_posix.cc b/src/main/cpp/blaze_util_posix.cc
    argv[n-1] consists of the octets c3 a4 c3 b6 c3 bc c3 84 c3 96 c3 9c c3 9f 00.
    With that argv, we do execv(exe.c_str(), const_cast<char **>(argv));
  • In src/main/java/com/google/devtools/build/lib/bazel/BazelMain.java's
    public static void main(String[] args), the last String of args consists
    of question marks (UTF-8 bytes: ef bf bd ef bf bd ef bf bd ef bf bd ef bf
    bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd ef bf bd
    ef bf bd, ISO-8859-1 bytes: 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f 3f).

Klaus Aehlig
Google Germany GmbH, Erika-Mann-Str. 33, 80636 Muenchen
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschaeftsfuehrer: Matthew Scott Sucherman, Paul Terence Manicle

@aehlig
Copy link
Contributor

aehlig commented Sep 16, 2016

OK, so adding

setenv("LANG", "en_US.UTF-8", 1);
setenv("LC_ALL", "en_US.UTF-8", 1);

in front of the the execv in src/main/cpp/blaze_util_posix.cc (and
including <stdlib.h>) the java invocation gets the arguments without
information loss: the String obtained has as UTF-8 encoding the correct
bytes c3 a4 c3 b6 c3 bc c3 84 c3 96 c3 9c c3 9f. However, when the
the subprocess is called, the string gets rewritten in the default
encoding (which we set to ISO-8859-1) and there it is the octet
sequence e4 f6 fc c4 d6 dc df.

On the other hand, setting the locale to ISO-8859-1 or C, the startup
of java already rewrites the argument to the question marks, so we have
no chance of recovering the original information.

So at the moment, I have no idea on how to fix that encoding mess. On
the other hand, that problem probably has existed since bazel was open
sourced, so probably it is not that urgent.

Klaus Aehlig
Google Germany GmbH, Erika-Mann-Str. 33, 80636 Muenchen
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschaeftsfuehrer: Matthew Scott Sucherman, Paul Terence Manicle

@aehlig aehlig added P2 We'll consider working on this in future. (Assignee optional) and removed P1 I'll work on this now. (Assignee required) labels Sep 17, 2016
@aehlig aehlig added this to the 1.0 milestone Sep 17, 2016
@aehlig aehlig removed their assignment Feb 1, 2020
@meisterT meisterT removed this from the 1.0 milestone May 12, 2020
@meisterT meisterT added area-EngProd Bazel CI, infrastructure, bootstrapping, release, and distribution tooling and removed category: misc > misc labels May 15, 2020
@jin jin added the type: bug label May 19, 2020
@philwo philwo added the team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website label Jun 15, 2020
@sgowroji
Copy link
Member

Hi there! We're doing a clean up of old issues and will be closing this one. Please reopen if you’d like to discuss anything further. We’ll respond as soon as we have the bandwidth/resources to do so.

@fmeum fmeum reopened this Oct 16, 2024
@fmeum fmeum self-assigned this Oct 16, 2024
iancha1992 pushed a commit to iancha1992/bazel that referenced this issue Nov 8, 2024
Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale.

This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass.

Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK.

* Replace ad-hoc conversion logic with the new consistent set of helper functions.
* Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior.
* Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line.
* Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters.
* Read the downloader config using Bazel's filesystem implementation.
* Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this).
* Correctly reencode path strings for `LocalDiffAwareness`.
* Correctly reencode the value of `user.dir`.
* Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing bazelbuild#1775.
* Fix encoding issues in `TargetCompleteEvents`.
* Fix encoding issues in `SubprocessFactory` implementations.
* Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now.
* Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target. Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths.

Fixes bazelbuild#1775.

Fixes bazelbuild#11602.

Fixes bazelbuild#18293.

Work towards #374.

Work towards bazelbuild#23859.

Closes bazelbuild#24010.

PiperOrigin-RevId: 694114597
Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-EngProd Bazel CI, infrastructure, bootstrapping, release, and distribution tooling P2 We'll consider working on this in future. (Assignee optional) team-OSS Issues for the Bazel OSS team: installation, release processBazel packaging, website type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants