Unable to use Bazel in a folder that has Japanese characters in it #23859

Open
PSmithsonp4 opened this issue Oct 3, 2024 · 0 comments
Labels: team-Core (Skyframe, bazel query, BEP, options parsing, bazelrc), type: bug, untriaged

PSmithsonp4 commented Oct 3, 2024

Description of the bug:

If you work in a folder whose path contains Japanese characters, running "bazel build" fails (even "bazel info release" fails). Here's what happens on my Mint 21 install:

/tmp/ワーク:$ bazel build //...
Starting local Bazel server and connecting to it...
ERROR: Client cwd '/tmp/ワーク' is not inside workspace '/tmp/???'
/tmp/ワーク:$ 
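
The '???' in the reported workspace path has the same length as the three characters of ワーク. That is what one would see if each character were replaced while converting the path to a charset that cannot represent it. The Python sketch below only illustrates that effect. It is an assumption about the symptom, not Bazel's actual code path.

```python
# Illustration only: lossy conversion of a path to a single-byte charset.
path = "/tmp/ワーク"

# Latin-1 has no Japanese characters, so each one is replaced by "?".
lossy = path.encode("latin-1", errors="replace").decode("latin-1")
print(lossy)

# The lossy path no longer matches the real working directory,
# so a "cwd is inside workspace" style check would fail.
print(path.startswith(lossy))
```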

That's an English installation of Mint 21. On Windows, I have a Japanese VM (i.e. the OS itself is in Japanese), and there I get a slightly different error:

C:\tmp\ウェブ>bazel build //...
FATAL: changing directory into c:\tmp\ウェブ failed: (error: 123): t@CAfBNgA܂̓{[ x̍\Ԉ��Ă܂B


C:\tmp\ウェブ>
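
The unreadable text after "(error: 123)" looks like a Japanese message in code page 932 (Shift-JIS) whose non-ASCII bytes were lost or misdecoded. Windows error 123 is ERROR_INVALID_NAME ("The filename, directory name, or volume label syntax is incorrect."). The sketch below assumes the localized message begins with ファイル名、ディレクトリ名、 (that wording is an assumption, not taken from the report) and keeps only the printable ASCII bytes of its cp932 encoding.

```python
# Assumed start of the Japanese text for Windows error 123 (not from the report).
message_start = "ファイル名、ディレクトリ名、"

def ascii_residue(text: str, codec: str = "cp932") -> str:
    """Encode text in a multi-byte code page and keep only printable ASCII bytes."""
    raw = text.encode(codec)
    return "".join(chr(b) for b in raw if 0x20 <= b < 0x7F)

residue = ascii_residue(message_start)
print(residue)
```

Under these assumptions the residue is t@CAfBNgA, which matches the start of the garbled text above.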

The contents of the BUILD and WORKSPACE file don't matter (as far as I can tell).

Which category does this issue belong to?

Core

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

On Linux, you could type -

mkdir /tmp/ワーク
cd /tmp/ワーク
touch BUILD
touch WORKSPACE
bazel build //...

Which operating system are you running Bazel on?

Mint 21, Rocky 9 and Windows 10

What is the output of bazel info release?

/tmp/ワーク:$ bazel info release
ERROR: Client cwd '/tmp/ワーク' is not inside workspace '/tmp/???'

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

I just installed release versions

What's the output of git remote get-url origin; git rev-parse HEAD ?

No response

If this is a regression, please try to identify the Bazel commit where the bug was introduced with bazelisk --bisect.

No response

Have you found anything relevant by searching the web?

I posted a question here - https://stackoverflow.com/questions/79047697/bazel-build-not-working-in-a-folder-with-non-ascii-japanese-characters
I found a similar issue reported here - #2550

Any other information, logs, or outputs that you want to share?

Since I can't get "bazel info release" output in the affected folder, here are the versions instead: 7.1.0 on Linux and 7.3.2 on Windows. That's what --version shows, and "bazel info release" reports the same when run in an ASCII-only folder.

github-actions bot added the team-Core (Skyframe, bazel query, BEP, options parsing, bazelrc) label on Oct 3, 2024
fmeum self-assigned this on Oct 16, 2024
copybara-service bot pushed a commit that referenced this issue Nov 5, 2024
This change patches the app manifest of the `java.exe` launcher in the embedded JDK to always use the UTF-8 codepage on Windows 1903 and later.

This is necessary because the launcher sets sun.jnu.encoding to the system code page, which by default is a legacy code page such as Cp1252 on Windows. This causes the JVM to be unable to interact with files whose paths contain Unicode characters not representable in the system code page, as well as command-line arguments and environment variables containing such characters.
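
To illustrate why a legacy code page breaks such paths: the folder name from this issue, ウェブ, has no representation in Cp1252, while UTF-8 encodes it without loss. This is a minimal Python sketch in which Python's `cp1252` codec stands in for the Windows system code page (an assumption for illustration only).

```python
name = "ウェブ"

def representable(text: str, codec: str) -> bool:
    """Return True if every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

print(representable(name, "cp1252"))  # legacy Western European code page
print(representable(name, "utf-8"))   # UTF-8 can encode any Unicode text
```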

The Windows VMs in CI are not running Windows 1903 or later yet, so this change can currently only be tested locally by running `bazel info character-encoding` and verifying that it prints `sun.jnu.encoding = UTF-8`.

Work towards #374
Work towards #18293
Work towards #23859

Closes #24172.

PiperOrigin-RevId: 693466466
Change-Id: I4914c21e846493a8880ac8c6f5e1afa9fae87366
bazel-io pushed a commit to bazel-io/bazel that referenced this issue Nov 6, 2024
github-merge-queue bot pushed a commit that referenced this issue Nov 7, 2024

Commit 7bb8d2b

Co-authored-by: Fabian Meumertzheim <fabian@meumertzhe.im>
copybara-service bot pushed a commit that referenced this issue Nov 7, 2024
Bazel aims to support arbitrary file system path encodings (even raw byte sequences) by attempting to force the JVM to use a Latin-1 locale for OS interactions. As a result, Bazel internally encodes `String`s as raw byte arrays with a Latin-1 coder and no encoding information. Whenever it interacts with encoding-aware APIs, this may require a reencoding of the `String` contents, depending on the OS and availability of a Latin-1 locale.

This PR introduces the concepts of *internal*, *Unicode*, and *platform* strings and adds dedicated optimized functions for converting between these three types (see the class comment on the new `StringEncoding` helper class for details). These functions are then used to standardize and fix conversion throughout the code base. As a result, a number of new end-to-end integration tests for the handling of Unicode in file paths, command-line arguments and environment variables now pass.
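
As a rough illustration of the internal versus Unicode distinction: an internal string holds the raw bytes of a path, one character per byte as in Latin-1, and has to be reencoded before it reaches an encoding-aware API. The sketch below is not the `StringEncoding` API. The function names are hypothetical, and it assumes the platform encoding is UTF-8.

```python
def unicode_to_internal(s: str) -> str:
    """Hypothetical helper: UTF-8 bytes of s, reinterpreted as Latin-1 chars."""
    return s.encode("utf-8").decode("latin-1")

def internal_to_unicode(s: str) -> str:
    """Hypothetical helper: recover the Unicode text from the raw bytes."""
    return s.encode("latin-1").decode("utf-8")

path = "/tmp/ワーク"
internal = unicode_to_internal(path)

# The internal form has one character per UTF-8 byte, so it is longer.
print(len(path), len(internal))
# Converting back recovers the original text.
print(internal_to_unicode(internal) == path)
```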

Full support for Unicode beyond the current active code page on Windows is left to a follow-up PR as it may require patching the embedded JDK.

* Replace ad-hoc conversion logic with the new consistent set of helper functions.
* Make more parts of the Bazel client's Windows implementation Unicode-aware. This also fixes the behavior of `SetEnv` on Windows, which previously would remove an environment variable if passed an empty value for it, which doesn't match the Unix behavior.
* Drop the `charset` parameter from all methods related to parameter files. The `ISO-8859-1` vs. `UTF-8` choice was flawed since Bazel's internal string representation doesn't maintain any encoding information - `ISO-8859-1` just meant "write out raw bytes", which is the only choice that matches what arguments would look like if passed on the command line.
* Convert server args to the internal string representation. The arguments for requests to the server were already converted to Bazel's internal string representation, which resulted in a mismatch between `--client_cwd` and `--workspace_directory` if the workspace path contains non-ASCII characters.
* Read the downloader config using Bazel's filesystem implementation.
* Make `MacOSXFsEventsDiffAwareness` UTF-8 aware. It previously used the `GetStringUTF` JNI method, which, despite its name, doesn't return the UTF-8 representation of a string, but modified CESU-8 (nobody ever wants this).
* Correctly reencode path strings for `LocalDiffAwareness`.
* Correctly reencode the value of `user.dir`.
* Correctly turn `ExecRequest` fields into strings for `ProcessBuilder` for `bazel --batch run`. This makes it possible to reenable the `test_consistent_command_line_encoding` test, fixing #1775.
* Fix encoding issues in `TargetCompleteEvents`.
* Fix encoding issues in `SubprocessFactory` implementations.
* Drop obsolete warning if `file.encoding` doesn't equal `ISO-8859-1` as file names are encoded with `sun.jnu.encoding` now.
* Consistently reencode internal strings passed into and out of `FileSystem` implementations, e.g. if reading a symlink target. Tests are added that verify the interaction between `FileSystem` implementations and the Java (N)IO APIs on Unicode file paths.
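
The `--client_cwd` / `--workspace_directory` item above can be pictured as follows: if one path is in the internal (raw byte) form and the other is still a Unicode string, a containment check fails even though both name the same directory. This is a minimal sketch assuming UTF-8 as the platform encoding. `to_internal` and `is_inside` are hypothetical names, not Bazel functions.

```python
def to_internal(s: str) -> str:
    """Hypothetical: UTF-8 bytes of s viewed as Latin-1 characters."""
    return s.encode("utf-8").decode("latin-1")

def is_inside(cwd: str, workspace: str) -> bool:
    """Naive containment check on path strings."""
    return cwd == workspace or cwd.startswith(workspace.rstrip("/") + "/")

workspace = "/tmp/ワーク"
cwd = "/tmp/ワーク"

# Mixed representations: same directory, but the strings differ.
mixed = is_inside(to_internal(cwd), workspace)
# Consistent representations: the check succeeds.
consistent = is_inside(to_internal(cwd), to_internal(workspace))
print(mixed, consistent)
```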

Fixes #1775.

Fixes #11602.

Fixes #18293.

Work towards #374.

Work towards #23859.

Closes #24010.

PiperOrigin-RevId: 694114597
Change-Id: I5bdcbc14a90dd1f0f34698aebcbd07cd2bde7a23
iancha1992 pushed a commit to iancha1992/bazel that referenced this issue Nov 8, 2024