-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unicode support on Windows #1195
Comments
Patch from this issue + the following patch seem to fix build using cmake + ninja in paths that contain non-ascii characters: diff --git a/src/disk_interface.cc b/src/disk_interface.cc
index 1b4135f..7691650 100644
--- a/src/disk_interface.cc
+++ b/src/disk_interface.cc
@@ -69,8 +69,13 @@ TimeStamp TimeStampFromFileTime(const FILETIME& filetime) {
}
TimeStamp StatSingleFile(const string& path, string* err) {
+ wstring widePath;
+ if (!UTF8ToWide(path.c_str(), widePath)) {
+ Fatal("cannot encode path");
+ }
+
WIN32_FILE_ATTRIBUTE_DATA attrs;
- if (!GetFileAttributesEx(path.c_str(), GetFileExInfoStandard, &attrs)) {
+ if (!GetFileAttributesExW((LPWSTR)widePath.c_str(), GetFileExInfoStandard, &attrs)) {
DWORD win_err = GetLastError();
if (win_err == ERROR_FILE_NOT_FOUND || win_err == ERROR_PATH_NOT_FOUND)
return 0;
@@ -250,8 +255,10 @@ int RealDiskInterface::RemoveFile(const string& path) {
void RealDiskInterface::AllowStatCache(bool allow) {
#ifdef _WIN32
+#if 0
use_cache_ = allow;
if (!use_cache_)
cache_.clear();
#endif
+#endif
} |
@zielmicha Can I ask why you've disabled the stat caching in your patch? |
It was simpler that way - otherwise I would have to modify the code that populates stat cache. And in my case, degraded performance didn't matter. |
Maybe I'm missing something, but to me it looks like that code should work fine with UTF-8 filenames without changes. I've tried to locate every place where paths or environment variables are concerned and changed them here: https://github.com/melak47/ninja So far it appears to work well:
I haven't tested utf-8 encoded env block files yet. |
Great work! 👍 with quick look it seems fine, but I don't really like this solution. I think it would be better to implement abstractions over |
I added these wrappers here which do what you suggest. Since almost everywhere, the paths are already Is that what you meant, or what kind of abstractions did you have in mind? |
Yeah exactly, not |
Well it's complicated... Only caller actually knows which encoding will be used by child so you can't really hardcode it. But most of time applications output in current console's codepage and that's their side and we can't do anything about it so we just have to support several encodings. Basicallly idea is that you've to decode from child process encoding to internally used encoding. I actually implemented large part of Unicode support in CMake which you can take as an example, but it's really shame we've to reimplement it here again... Take a look at That cmProcessOutput implementation supports all Windows codepages and it is stream interface so you can decode as soon as you get data and don't have to wait for whole output. This isn't that simple for multi-byte encodings... |
Doesn't that lead to lossy conversions, though?
Isn't the problem the other way around? We can control what the current code page is (using e.g. On my fork I modified ninja so it changes to CP_UTF8 / 65001 (plus restoring the previous code page on exit), and it works for |
There in CMake we just use whatever code page was set by user, it's not our responsibility to change that because it's global for whole Console process and if application crashes before it restores it will remain changed. User himself can set console to UTF-8 or to whichever he wants. Especially because some fonts don't support characters from console codepage so you would have to change font too. Otherwise user won't be able to see that output anyway.
That is correct, but it's not on us, encoding which
Encoding which will be used by application depends on that application itself, it's not worth trying to guess it. But default should cover most of cases so it would be best like it's in CMake (see cmProcessOutput) You can detect if output is file, console or pipe with
We can solve it exactly how that is in CMake, caller specifies encoding which to use for decoding. |
True, but that doesn't make the end result for the user any better.
I suppose.
I don't think it needs to be quite as complicated, since ninja waits for the subprocess to finish before doing anything with the output.
If I read this right:
So do we need to distinguish between file and pipe after all? Ninja would assume child processes write to the pipe using the current console code page unless overridden by the user? |
but you still can't decode it if you don't know which encoding was used by child process.
yeah, exactly but this is CMake specific, so while CMake does it this way, it doesn't mean other applications will do same.
For current CMake implementation no, but that's only for
It needs to be specified per process not global for all processes since different processes might use different encoding. With quick look it seems would need to add another parameter to something like this diff --git a/src/subprocess-win32.cc b/src/subprocess-win32.cc
index 4bab719..ed02e86 100644
--- a/src/subprocess-win32.cc
+++ b/src/subprocess-win32.cc
@@ -72,7 +72,8 @@ HANDLE Subprocess::SetupPipe(HANDLE ioport) {
return output_write_child;
}
-bool Subprocess::Start(SubprocessSet* set, const string& command) {
+bool Subprocess::Start(SubprocessSet* set, const string& command, const Encoding enc = Encoding::Auto) {
+ encoding = enc;
HANDLE child_pipe = SetupPipe(set->ioport_);
SECURITY_ATTRIBUTES security_attributes;
@@ -151,7 +152,10 @@ void Subprocess::OnPipeReady() {
}
if (is_reading_ && bytes)
- buf_.append(overlapped_buf_, bytes);
+ {
+ std::string decoded = DecodeData(overlapped_buf_, bytes, encoding);
+ buf_.append(decoded);
+ }
memset(&overlapped_, 0, sizeof(overlapped_));
is_reading_ = true;
@@ -223,9 +227,9 @@ BOOL WINAPI SubprocessSet::NotifyInterrupted(DWORD dwCtrlType) {
return FALSE;
}
-Subprocess *SubprocessSet::Add(const string& command, bool use_console) {
+Subprocess *SubprocessSet::Add(const string& command, bool use_console, Encoding enc = Encoding::Auto) {
Subprocess *subprocess = new Subprocess(use_console);
- if (!subprocess->Start(this, command)) {
+ if (!subprocess->Start(this, command, enc)) {
delete subprocess;
return 0;
}
diff --git a/src/build.cc b/src/build.cc
index a0c7ec8..cf85606 100644
--- a/src/build.cc
+++ b/src/build.cc
@@ -537,7 +537,8 @@ bool RealCommandRunner::CanRunMore() {
bool RealCommandRunner::StartCommand(Edge* edge) {
string command = edge->EvaluateCommand();
- Subprocess* subproc = subprocs_.Add(command, edge->use_console());
+ auto encoding = edge->GetCommandEncoding();
+ Subprocess* subproc = subprocs_.Add(command, edge->use_console(), encoding);
if (!subproc)
return false;
subproc_to_edge_.insert(make_pair(subproc, edge)); |
I meant we don't necessarily have to deal with the complexity of stream-decoding multibyte encodings,
I just wanted to confirm this is the behaviour you would implement in ninja (not what ninja should assume about other processes).
While doing some more testing, I found just such a process. Clang on windows appears to always output in UTF-8, regardless of code page :)
Thanks for pointing me in the right direction. Adding another special variable to rules/build statements was easier than I thought. One wrinkle remains, though. |
I don't really know much about that part, but it seems there's no other way than requiring encoding parameter because it can invoke any other application but 99% of time it will be Note that it doesn't use subprocess msvc_helper-win32.cc |
Windows implements WTF-16 not UTF-16 for wide filesystem APIs. See https://simonsapin.github.io/wtf-8/. Who knows where else they butcher UTF-16 in their wide APIs. |
@melak47 I was wondering if you tried to get this into the main branch? Are there reasons why it hasn't made it? |
Given the main reason to use Ninja is performance what sort of impact does this conversion have? |
@paulsapps Correctness is way more important than performance w.r.t. encoding. Anyway, the system calls themselves should have more overhead than Unicode conversion. |
@jgoshi I've just added a pull request for a unicode version of ninja, but haven't recieved any feedback on it yet. |
Thank you for putting that together. I hope it will be accepted. |
Tristan from the Microsoft Visual C++ Developer Experience team here. We're seeing some issues related to Unicode support in scenarios using Ninja through our CMake support. The issues are either around build failures (Ninja being unable to perform the build), or output returned as garbled characters. I've spent some time looking into the Ninja codebase to understand why, and below are the issues I've found. I hope it is of some use to have them documented here:
|
Allows Ninja to use descriptions, filenames and environment variables with characters outside of the ANSI codepage on Windows. Build manifests are now UTF-8 by default (this change needs to be emphasized in the release notes). WriteConsoleOutput doesn't support UTF-8, but it's deprecated on newer Windows 10 versions anyway (or as Microsoft likes to put it: "no longer a part of our ecosystem roadmap"). We'll use the VT100 sequence just as we do on Linux and macOS. https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page https://docs.microsoft.com/en-us/windows/console/writeconsoleoutput https://docs.microsoft.com/de-de/windows/console/console-virtual-terminal-sequences
Use UTF-8 on Windows 10 Version 1903, fix #1195
Since commit 00459e2 (Use UTF-8 on Windows 10 Version 1903, fix ninja-build#1195, 2021-02-17), `ninja` does not always expect `build.ninja` to be encoded in the system's ANSI code page. The expected encoding now depends on how `ninja` is built and the version of Windows on which it is running. Add a `-t wincodepage` tool that generators can use to ask `ninja` what encoding it expects. Issue: ninja-build#1195
Since commit 00459e2 (Use UTF-8 on Windows 10 Version 1903, fix ninja-build#1195, 2021-02-17), `ninja` does not always expect `build.ninja` to be encoded in the system's ANSI code page. The expected encoding now depends on how `ninja` is built and the version of Windows on which it is running. Add a `-t wincodepage` tool that generators can use to ask `ninja` what encoding it expects. Issue: ninja-build#1195
Since commit 00459e2 (Use UTF-8 on Windows 10 Version 1903, fix ninja-build#1195, 2021-02-17), `ninja` does not always expect `build.ninja` to be encoded in the system's ANSI code page. The expected encoding now depends on how `ninja` is built and the version of Windows on which it is running. Add a `-t wincodepage` tool that generators can use to ask `ninja` what encoding it expects. Issue: ninja-build#1195
Allows Ninja to use descriptions, filenames and environment variables with characters outside of the ANSI codepage on Windows. Build manifests are now UTF-8 by default (this change needs to be emphasized in the release notes). WriteConsoleOutput doesn't support UTF-8, but it's deprecated on newer Windows 10 versions anyway (or as Microsoft likes to put it: "no longer a part of our ecosystem roadmap"). We'll use the VT100 sequence just as we do on Linux and macOS. https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page https://docs.microsoft.com/en-us/windows/console/writeconsoleoutput https://docs.microsoft.com/de-de/windows/console/console-virtual-terminal-sequences
Since commit 00459e2 (Use UTF-8 on Windows 10 Version 1903, fix ninja-build#1195, 2021-02-17), `ninja` does not always expect `build.ninja` to be encoded in the system's ANSI code page. The expected encoding now depends on how `ninja` is built and the version of Windows on which it is running. Add a `-t wincodepage` tool that generators can use to ask `ninja` what encoding it expects. Issue: ninja-build#1195
Currently Ninja uses ANSI codepage and ANSI WinAPI functions on Windows. This isn't good because that doesn't allow to use filenames and environment variables outside of ANSI codepage.
Ninja should use Wide WinAPI functions everywhere and then decode/encode them to/from UTF-8 internally.
Also currently
ninja
rule files are expected to be in ANSI codepage which isn't portable between different Windows machines since they can use different ANSI codepages. Solution would be to use UTF-8 encoding for ninja files. I guess to be backward compatible there will be needed some flag to enable UTF-8 files.Incomplete example patch
The text was updated successfully, but these errors were encountered: