
Unicode support on Windows #1195

Closed
davispuh opened this issue Oct 19, 2016 · 22 comments · Fixed by #1915

@davispuh

Currently Ninja uses the ANSI code page and the ANSI WinAPI functions on Windows. This is a problem because it rules out filenames and environment variables with characters outside the ANSI code page.

Ninja should use the wide WinAPI functions everywhere and decode/encode to/from UTF-8 internally.

Also, ninja rule files are currently expected to be in the ANSI code page, which isn't portable between Windows machines since they can use different ANSI code pages. A solution would be to use UTF-8 encoding for ninja files. I guess that to stay backward compatible, some flag to enable UTF-8 files would be needed.

Incomplete example patch

diff --git a/src/subprocess-win32.cc b/src/subprocess-win32.cc
index 4bab719..82ec616 100644
--- a/src/subprocess-win32.cc
+++ b/src/subprocess-win32.cc
@@ -73,8 +73,12 @@ HANDLE Subprocess::SetupPipe(HANDLE ioport) {
 }

 bool Subprocess::Start(SubprocessSet* set, const string& command) {
-  HANDLE child_pipe = SetupPipe(set->ioport_);
+  wstring wideCommand;
+  if (!UTF8ToWide(command.c_str(), wideCommand)) {
+    Fatal("Subprocess::Start: Failed to encode command string to Wide string");
+  }

+  HANDLE child_pipe = SetupPipe(set->ioport_);
   SECURITY_ATTRIBUTES security_attributes;
   memset(&security_attributes, 0, sizeof(SECURITY_ATTRIBUTES));
   security_attributes.nLength = sizeof(SECURITY_ATTRIBUTES);
@@ -86,7 +90,7 @@ bool Subprocess::Start(SubprocessSet* set, const string& command) {
   if (nul == INVALID_HANDLE_VALUE)
     Fatal("couldn't open nul");

-  STARTUPINFOA startup_info;
+  STARTUPINFOW startup_info;
   memset(&startup_info, 0, sizeof(startup_info));
   startup_info.cb = sizeof(STARTUPINFO);
   if (!use_console_) {
@@ -106,7 +110,7 @@ bool Subprocess::Start(SubprocessSet* set, const string& command) {

   // Do not prepend 'cmd /c' on Windows, this breaks command
   // lines greater than 8,191 chars.
-  if (!CreateProcessA(NULL, (char*)command.c_str(), NULL, NULL,
+  if (!CreateProcessW(NULL, (LPWSTR)wideCommand.c_str(), NULL, NULL,
                       /* inherit handles */ TRUE, process_flags,
                       NULL, NULL,
                       &startup_info, &process_info)) {
diff --git a/src/util.cc b/src/util.cc
index e31fd1f..f40b0ab 100644
--- a/src/util.cc
+++ b/src/util.cc
@@ -90,6 +90,21 @@ void Error(const char* msg, ...) {
   fprintf(stderr, "\n");
 }

+#ifdef _WIN32
+bool UTF8ToWide(string utf8, wstring &wide) {
+  bool success = false;
+  const int wlength = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), int(utf8.size()), NULL, 0);
+  wchar_t* wdata = new wchar_t[wlength];
+  int r = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), int(utf8.size()), wdata, wlength);
+  if (r > 0) {
+    wide = wstring(wdata, wlength);
+    success = true;
+  }
+  delete[] wdata;
+  return success;
+}
+#endif
+
 bool CanonicalizePath(string* path, unsigned int* slash_bits, string* err) {
   METRIC_RECORD("canonicalize str");
   size_t len = path->size();
diff --git a/src/util.h b/src/util.h
index cbdc1a6..da6bbcd 100644
--- a/src/util.h
+++ b/src/util.h
@@ -99,6 +99,9 @@ bool Truncate(const string& path, size_t size, string* err);
 #endif

 #ifdef _WIN32
+/// Encode UTF8 string to Wide string
+bool UTF8ToWide(string utf8, wstring &wide);
+
 /// Convert the value returned by GetLastError() into a string.
 string GetLastErrorString();
zielmicha added a commit to husarion/ninja that referenced this issue Jan 31, 2017
@zielmicha

The patch from this issue plus the following patch seems to fix building with cmake + ninja in paths that contain non-ASCII characters:

diff --git a/src/disk_interface.cc b/src/disk_interface.cc
index 1b4135f..7691650 100644
--- a/src/disk_interface.cc
+++ b/src/disk_interface.cc
@@ -69,8 +69,13 @@ TimeStamp TimeStampFromFileTime(const FILETIME& filetime) {
 }
 
 TimeStamp StatSingleFile(const string& path, string* err) {
+  wstring widePath;
+  if (!UTF8ToWide(path.c_str(), widePath)) {
+    Fatal("cannot encode path");
+  }
+
   WIN32_FILE_ATTRIBUTE_DATA attrs;
-  if (!GetFileAttributesEx(path.c_str(), GetFileExInfoStandard, &attrs)) {
+  if (!GetFileAttributesExW((LPWSTR)widePath.c_str(), GetFileExInfoStandard, &attrs)) {
     DWORD win_err = GetLastError();
     if (win_err == ERROR_FILE_NOT_FOUND || win_err == ERROR_PATH_NOT_FOUND)
       return 0;
@@ -250,8 +255,10 @@ int RealDiskInterface::RemoveFile(const string& path) {
 
 void RealDiskInterface::AllowStatCache(bool allow) {
 #ifdef _WIN32
+#if 0
   use_cache_ = allow;
   if (!use_cache_)
     cache_.clear();
 #endif
+#endif
 }

@melak47

melak47 commented Mar 31, 2017

@zielmicha Can I ask why you've disabled the stat caching in your patch?

@zielmicha

It was simpler that way - otherwise I would have had to modify the code that populates the stat cache. And in my case, the degraded performance didn't matter.

@melak47

melak47 commented Apr 1, 2017

Maybe I'm missing something, but to me it looks like that code should work fine with UTF-8 filenames without changes.

I've tried to locate every place where paths or environment variables are concerned and changed them here: https://github.com/melak47/ninja

So far it appears to work well:

  • non-ANSI command-line arguments work (e.g. ninja -C ninja👍),
  • unicode in build.ninja works e.g. builddir=build👍, build $builddir\build👍.obj: cxx $root\src\build👍.cc

I haven't tested utf-8 encoded env block files yet.
Also, I haven't tried to change the printing to use wide strings, so unicode strings printed to the console will look like garbage.

@davispuh
Author

davispuh commented Apr 1, 2017

Great work! 👍 At a quick look it seems fine, but I don't really like this solution. I think it would be better to implement abstractions over fopen and such that take std::string instead; that abstraction would take care of converting the std::string to whichever encoding the platform needs, while internally UTF-8 would be used everywhere.

@melak47

melak47 commented Apr 1, 2017

I added these wrappers here

https://github.com/melak47/ninja/commit/f5973605d2fe002b85b5d3b5f19008bac14b5497#diff-772f489c7d0a32de3badbfbcb5fd200d

which do what you suggest. Since almost everywhere the paths are already std::strings, I suppose there would be no harm (no unnecessary conversions/temporaries) if I changed them to take const std::string& instead.

Is that what you meant, or what kind of abstractions did you have in mind?

@davispuh
Author

davispuh commented Apr 1, 2017

Yeah, exactly - not char*. Also, it might be better to use the WinAPI directly (CreateFileW/ReadFile) instead of _wfopen, but that's a bit more work.
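A minimal sketch of what such a wrapper could look like (hypothetical names, not taken from any of the patches above): callers always pass UTF-8 std::strings, and the conversion to wide characters happens in one place, at the Win32 boundary; off Windows it is a plain fopen pass-through.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical wrapper: UTF-8 everywhere inside the program, native
// encoding only at the platform API boundary.
FILE* OpenFile(const std::string& utf8_path, const std::string& mode) {
#ifdef _WIN32
  // Convert the UTF-8 path (and mode) to UTF-16 and use _wfopen.
  // Passing -1 makes MultiByteToWideChar process the trailing NUL too.
  auto widen = [](const std::string& s) {
    int n = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, NULL, 0);
    std::wstring w(n > 0 ? n : 0, L'\0');
    if (n > 0) {
      MultiByteToWideChar(CP_UTF8, 0, s.c_str(), -1, &w[0], n);
      w.resize(n - 1);  // drop the terminating NUL the API wrote
    }
    return w;
  };
  return _wfopen(widen(utf8_path).c_str(), widen(mode).c_str());
#else
  return fopen(utf8_path.c_str(), mode.c_str());
#endif
}
```

On Windows, CreateFileW/ReadFile could replace the CRT call as suggested, but the shape stays the same: UTF-8 in the program, wide characters only at the API boundary.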

@melak47

melak47 commented Apr 2, 2017

Hmm, capturing the output of child processes as Unicode is a bit tricky,
and requires temporarily changing the code page to CP_UTF8.

I started with the msvc tool because that is Windows-only code, and so far that seems to work:

[screenshot: console output]

It displays more or less correctly on the console, and when redirected to files/pipes it is plain UTF-8.

Additionally, if the output of ninja is displayed in an IDE, or a console wrapper like ConEmu, all of it should render correctly:
[screenshot: output rendered in an IDE/ConEmu]

@davispuh
Author

davispuh commented Apr 2, 2017

Well, it's complicated... Only the caller actually knows which encoding the child will use, so you can't really hardcode it. But most of the time applications output in the current console's code page; that's on their side and we can't do anything about it, so we just have to support several encodings. Basically, the idea is that you have to decode from the child process's encoding to the internally used encoding.

I actually implemented a large part of the Unicode support in CMake, which you can take as an example, but it's a real shame we have to reimplement it here again...

Take a look at

That cmProcessOutput implementation supports all Windows code pages and has a stream interface, so you can decode as soon as you get data and don't have to wait for the whole output. That isn't so simple for multi-byte encodings...
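To illustrate why stream decoding of multi-byte encodings is awkward: a pipe read can end in the middle of a sequence, so the decoder has to hold back the incomplete tail and prepend it to the next chunk. A sketch for UTF-8 only (hypothetical helper; what cmProcessOutput actually does for arbitrary Windows code pages is more involved):

```cpp
#include <cassert>
#include <string>

// Return the length of the longest prefix of `chunk` that ends on a
// complete UTF-8 sequence. Bytes past that point belong to a sequence
// whose tail hasn't arrived yet and must be carried into the next chunk.
// Invalid byte runs are passed through unchanged.
size_t CompleteUtf8Prefix(const std::string& chunk) {
  size_t n = chunk.size();
  size_t i = n;
  // Scan backwards over up to 3 continuation bytes (10xxxxxx).
  while (i > 0 && n - i < 3 &&
         (static_cast<unsigned char>(chunk[i - 1]) & 0xC0) == 0x80)
    --i;
  if (i == 0) return n;  // nothing but continuation bytes; pass through
  unsigned char lead = static_cast<unsigned char>(chunk[i - 1]);
  size_t expected = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3
                  : lead >= 0xC0 ? 2 : 1;
  size_t have = n - i + 1;
  return have < expected ? i - 1 : n;  // hold back an incomplete sequence
}
```

A caller would decode chunk.substr(0, CompleteUtf8Prefix(chunk)) and prepend the remaining bytes to the next read.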

@melak47

melak47 commented Apr 2, 2017

Doesn't that lead to lossy conversions, though?
For example, with the code page set to 850, if I CreateProcess("cl -nologo -c foo_bar_ᚾᚩᚱᚦᚹᛖᚪᚱᛞᚢᛗ_Интернета😀.cpp"), reading the output from the pipe yields "foo_bar_???????????_???????????.cpp" - knowing what code page was used to encode it doesn't help us decode the characters that were replaced because they can't be represented.

But most of time applications output in current console's codepage and that's their side and we can't do anything about it so we just have to support several encodings.

Isn't the problem the other way around? We can control what the current code page is (using e.g. SetConsoleOutputCP(CP_UTF8)). However, not all applications will respect that when their output is redirected to a file, and I don't see a way to detect when this happens (can't just search for ?s in the output), so you wouldn't know if you need to try decoding with some ANSI code page.

On my fork I modified ninja so it changes to CP_UTF8 / 65001 (plus restoring the previous code page on exit), and it works for cl.exe for example. link.exe on the other hand doesn't seem to respect it when redirected to a pipe/file, and replaces characters anyway - and I don't think there's anything we could do about this.

@davispuh
Author

davispuh commented Apr 3, 2017

In CMake we just use whatever code page was set by the user; it's not our responsibility to change it, because it's global for the whole console, and if the application crashes before restoring it, it will stay changed. The user can set the console to UTF-8 or whichever code page he wants. Especially since some fonts don't support all the characters of a given console code page, you would have to change the font too; otherwise the user wouldn't be able to see that output anyway.

For example, with the code page set to 850, if I CreateProcess("cl -nologo -c foo_bar_ᚾᚩᚱᚦᚹᛖᚪᚱᛞᚢᛗ_Интернета😀.cpp"), reading the output from the pipe yields "foo_bar_???????????_???????????.cpp" - knowing what code page was used to encode it doesn't help us decode the characters that were replaced because they can't be represented.

That is correct, but it's not on us: the encoding cl uses for pipes isn't our responsibility. We just take it as it is and decode from that encoding to our internal encoding. And if we use UTF-8 as the internal encoding, it's not lossy, because any code page decodes to UTF-8 without loss. cl is the application that loses that information, not us.

However, not all applications will respect that when their output is redirected to a file, and I don't see a way to detect when this happens (can't just search for ?s in the output), so you wouldn't know if you need to try decoding with some ASNI code page.

The encoding an application uses depends on the application itself; it's not worth trying to guess it. But the default should cover most cases, so it would be best to do it like CMake does (see cmProcessOutput): Auto (based on the console's code page), None, ANSI, OEM. Also note that the same application can use two different encodings, one when redirected to a file and another for a pipe.

You can detect whether the output is a file, console or pipe with GetFileType; look at BasicConsoleBuf::setActiveInputCodepage.

On my fork I modified ninja so it changes to CP_UTF8 / 65001 (plus restoring the previous code page on exit), and it works for cl.exe for example. link.exe on the other hand doesn't seem to respect it when redirected to a pipe/file, and replaces characters anyway - and I don't think there's anything we could do about this.

We can solve it exactly as in CMake: the caller specifies the encoding to use for decoding.

@melak47

melak47 commented Apr 4, 2017

cl is the application that loses that information, not us.

True, but that doesn't make the end result for the user any better.

User himself can set console to UTF-8 or to whichever he wants.

I suppose.
Having the build work is probably marginally more important than lossless capture of build output. Unicode filenames and build output are probably both niche problems anyway; mingw gcc and probably many other tools on Windows don't support them.

We can solve it exactly how that is in CMake, caller specifies encoding which to use for decoding.

I don't think it needs to be quite as complicated, since ninja waits for the subprocess to finish before doing anything with the output.

You can detect if output is file, console or pipe with GetFileType

If I read this right:

So do we need to distinguish between file and pipe after all?

Ninja would assume child processes write to the pipe using the current console code page unless overridden by the user?
I'm not sure if this override should be a commandline option, or something local to a particular build statement (or implemented as a wrapper like the msvc tool).

@davispuh
Author

davispuh commented Apr 4, 2017

I don't think it needs to be quite as complicated, since ninja waits for the subprocess to finish before doing anything with the output.

But you still can't decode it if you don't know which encoding was used by the child process.

If I read this right:

Yeah, exactly - but this is CMake-specific, so while CMake does it this way, it doesn't mean other applications will do the same.

So do we need to distinguish between file and pipe after all?

For the current CMake implementation, no - but that's only for cmake, ctest and cpack.

Ninja would assume child processes write to the pipe using the current console code page unless overridden by the user?
I'm not sure if this override should be a commandline option, or something local to a particular build statement (or implemented as a wrapper like the msvc tool).

It needs to be specified per process, not globally for all processes, since different processes might use different encodings.

At a quick look, it seems you would need to add another parameter to Subprocess::Start which takes an encoding (could be an alias like Auto, UTF8 and such), and then in Subprocess::OnPipeReady use that encoding to decode the output before appending it to buf_. The same encoding parameter would need to be added to SubprocessSet::Add, which would pass it on to Start; and an encoding binding like Edge::GetCommandEncoding would be needed so that RealCommandRunner::StartCommand could pass it to SubprocessSet::Add.

Something like this:

diff --git a/src/subprocess-win32.cc b/src/subprocess-win32.cc
index 4bab719..ed02e86 100644
--- a/src/subprocess-win32.cc
+++ b/src/subprocess-win32.cc
@@ -72,7 +72,8 @@ HANDLE Subprocess::SetupPipe(HANDLE ioport) {
   return output_write_child;
 }

-bool Subprocess::Start(SubprocessSet* set, const string& command) {
+bool Subprocess::Start(SubprocessSet* set, const string& command, const Encoding enc = Encoding::Auto) {
+  encoding = enc;
   HANDLE child_pipe = SetupPipe(set->ioport_);

   SECURITY_ATTRIBUTES security_attributes;
@@ -151,7 +152,10 @@ void Subprocess::OnPipeReady() {
   }

   if (is_reading_ && bytes)
-    buf_.append(overlapped_buf_, bytes);
+  {
+    std::string decoded = DecodeData(overlapped_buf_, bytes, encoding);
+    buf_.append(decoded);
+  }

   memset(&overlapped_, 0, sizeof(overlapped_));
   is_reading_ = true;
@@ -223,9 +227,9 @@ BOOL WINAPI SubprocessSet::NotifyInterrupted(DWORD dwCtrlType) {
   return FALSE;
 }

-Subprocess *SubprocessSet::Add(const string& command, bool use_console) {
+Subprocess *SubprocessSet::Add(const string& command, bool use_console, Encoding enc = Encoding::Auto) {
   Subprocess *subprocess = new Subprocess(use_console);
-  if (!subprocess->Start(this, command)) {
+  if (!subprocess->Start(this, command, enc)) {
     delete subprocess;
     return 0;
   }
diff --git a/src/build.cc b/src/build.cc
index a0c7ec8..cf85606 100644
--- a/src/build.cc
+++ b/src/build.cc
@@ -537,7 +537,8 @@ bool RealCommandRunner::CanRunMore() {

 bool RealCommandRunner::StartCommand(Edge* edge) {
   string command = edge->EvaluateCommand();
-  Subprocess* subproc = subprocs_.Add(command, edge->use_console());
+  auto encoding = edge->GetCommandEncoding();
+  Subprocess* subproc = subprocs_.Add(command, edge->use_console(), encoding);
   if (!subproc)
     return false;
   subproc_to_edge_.insert(make_pair(subproc, edge));

@melak47

melak47 commented Apr 4, 2017

but you still can't decode it if you don't know which encoding was used by child process.

I meant that we don't necessarily have to deal with the complexity of stream-decoding multi-byte encodings; we could just decode the whole collected output in Subprocess::Finish, for example.
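Decoding everything at the end sidesteps the chunk-boundary problem entirely. On Windows the conversion itself would be a MultiByteToWideChar(codepage, ...) / WideCharToMultiByte(CP_UTF8, ...) round trip; as a portable illustration of the one-shot idea, here is the same conversion for Latin-1, where each byte conveniently equals its Unicode code point (hypothetical helper, not from the fork):

```cpp
#include <cassert>
#include <string>

// Decode a buffer of Latin-1 bytes to UTF-8 in one pass. Latin-1 is used
// here only because each byte maps directly to the Unicode code point of
// the same value; a real implementation would dispatch on the code page.
std::string Latin1ToUtf8(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (unsigned char c : in) {
    if (c < 0x80) {
      out += static_cast<char>(c);
    } else {  // code points U+0080..U+00FF need two UTF-8 bytes
      out += static_cast<char>(0xC0 | (c >> 6));
      out += static_cast<char>(0x80 | (c & 0x3F));
    }
  }
  return out;
}
```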

yeah, exactly but this is CMake specific, so while CMake does it this way, it doesn't mean other applications will do same.

I just wanted to confirm this is the behaviour you would implement in ninja (not what ninja should assume about other processes).

It needs to be specified per process not global for all processes since different processes might use different encoding.

While doing some more testing, I found just such a process. Clang on windows appears to always output in UTF-8, regardless of code page :)

With quick look it seems would need to add another parameter to Subprocess::Start which takes encoding ...

Thanks for pointing me in the right direction. Adding another special variable to rules/build statements was easier than I thought.

One wrinkle remains, though.
With the main ninja doing all the de/encoding, ninja -t msvc -- cl ... subprocesses should just forward the output as is.
However, when the msvc tool is run with the -o FILE option, it runs the CLParser to filter the output, and that should operate on decoded output. We don't know what the user-provided encoding is at this point, though.
Adding another command-line flag to the msvc tool to also specify the encoding seems a bit clumsy, but the information would have to be conveyed somehow.

@davispuh
Author

davispuh commented Apr 4, 2017

With the main ninja doing all the de/encoding, ninja -t msvc -- cl ... subprocesses should just forward the output as is.
However, when the msvc tool is run with the -o FILE option, it runs the CLParser to filter the output, but that should operate on decoded output. We don't know the user provided encoding is at this point, though.
Adding another command-line flag to the msvc tool to also specify the encoding seems a bit clumsy, but the information would have to be conveyed somehow.

I don't really know much about that part, but it seems there's no way around requiring an encoding parameter, because it can invoke any other application. 99% of the time it will be cl, so the default can be cl's encoding; if someone runs something else, he would just do ninja -t msvc -e utf8 -- myapp

Note that it doesn't use Subprocess (see msvc_helper-win32.cc).

@amckinlay

Windows implements WTF-16, not UTF-16, for its wide filesystem APIs. See https://simonsapin.github.io/wtf-8/. Who knows where else they butcher UTF-16 in their wide APIs.

@jgoshi

jgoshi commented Jul 31, 2018

@melak47 I was wondering if you tried to get this into the main branch? Are there reasons why it hasn't made it?

@MrSapps

MrSapps commented Oct 7, 2018

Given that the main reason to use Ninja is performance, what sort of impact does this conversion have?

@amckinlay

amckinlay commented Oct 8, 2018

@paulsapps Correctness is way more important than performance w.r.t. encoding. Anyway, the system calls themselves should have more overhead than Unicode conversion.

@jhasse jhasse added the windows label Oct 30, 2018
@jlonnberg

@jgoshi I've just added a pull request for a Unicode version of ninja, but haven't received any feedback on it yet.

@jgoshi

jgoshi commented Apr 30, 2019

Thank you for putting that together. I hope it will be accepted.

@tristanlabelle

Tristan from the Microsoft Visual C++ Developer Experience team here. We're seeing some issues related to Unicode support in scenarios that use Ninja through our CMake support. The issues are either build failures (Ninja being unable to perform the build) or output returned as garbled characters.

I've spent some time looking into the Ninja codebase to understand why, and below are the issues I've found. I hope it is of some use to have them documented here:

  • Ninja build files are ANSI, so they cannot represent paths containing characters outside the ANSI code page. This causes Ninja to be unable to find the files specified in its build config files, and to fail.
  • Ninja uses the ANSI Win32 functions, so it cannot reference a file whose path contains characters outside the ANSI code page (this is what this issue is focused on).
  • Ninja reads build files and process output as binary, encoding-unaware, and then combines them in its own output (for example, printing a file path from its build files to the same stdout where it prints messages coming from child processes). The two encodings can differ, producing a stdout stream with mixed encodings, which is impossible to decode properly upstream.
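The third point is also why guessing the encoding after the fact is hopeless: a structural check can prove a buffer is not UTF-8, but many ANSI byte sequences happen to be valid UTF-8 too, so a mixed stream can pass the check and still be mis-decoded. A minimal validity-check sketch (hypothetical helper; it checks sequence structure only, not overlong forms or surrogates):

```cpp
#include <cassert>
#include <string>

// Return true if `s` is structurally valid UTF-8. Note the converse
// problem: some ANSI-encoded text (e.g. Latin-1 "Ã©", bytes C3 A9) is
// *also* valid UTF-8, so passing this check doesn't prove UTF-8 intent.
bool IsValidUtf8(const std::string& s) {
  size_t i = 0, n = s.size();
  while (i < n) {
    unsigned char c = static_cast<unsigned char>(s[i]);
    size_t len = c < 0x80 ? 1
               : (c & 0xE0) == 0xC0 ? 2
               : (c & 0xF0) == 0xE0 ? 3
               : (c & 0xF8) == 0xF0 ? 4 : 0;
    if (len == 0 || i + len > n) return false;
    for (size_t k = 1; k < len; ++k)  // continuation bytes: 10xxxxxx
      if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
        return false;
    i += len;
  }
  return true;
}
```

For example, the Latin-1 text "Ã©" is the byte pair C3 A9, which is also a perfectly valid UTF-8 encoding of "é" - same bytes, two incompatible readings.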

jhasse added a commit to jhasse/ninja that referenced this issue Feb 17, 2021
Allows Ninja to use descriptions, filenames and environment variables
with characters outside of the ANSI codepage on Windows. Build manifests
are now UTF-8 by default (this change needs to be emphasized in the
release notes).

WriteConsoleOutput doesn't support UTF-8, but it's deprecated on newer
Windows 10 versions anyway (or as Microsoft likes to put it: "no longer
a part of our ecosystem roadmap"). We'll use the VT100 sequence just as
we do on Linux and macOS.

https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page
https://docs.microsoft.com/en-us/windows/console/writeconsoleoutput
https://docs.microsoft.com/de-de/windows/console/console-virtual-terminal-sequences
jhasse added a commit that referenced this issue Feb 23, 2021
Use UTF-8 on Windows 10 Version 1903, fix #1195
bradking added a commit to bradking/ninja that referenced this issue Feb 26, 2021
Since commit 00459e2 (Use UTF-8 on Windows 10 Version 1903, fix ninja-build#1195,
2021-02-17), `ninja` does not always expect `build.ninja` to be encoded
in the system's ANSI code page.  The expected encoding now depends on
how `ninja` is built and the version of Windows on which it is running.

Add a `-t wincodepage` tool that generators can use to ask `ninja`
what encoding it expects.

Issue: ninja-build#1195
bradking added a commit to bradking/ninja that referenced this issue Feb 26, 2021
Since commit 00459e2 (Use UTF-8 on Windows 10 Version 1903, fix ninja-build#1195,
2021-02-17), `ninja` does not always expect `build.ninja` to be encoded
in the system's ANSI code page.  The expected encoding now depends on
how `ninja` is built and the version of Windows on which it is running.

Add a `-t wincodepage` tool that generators can use to ask `ninja`
what encoding it expects.

Issue: ninja-build#1195
bradking added a commit to bradking/ninja that referenced this issue Feb 26, 2021
Since commit 00459e2 (Use UTF-8 on Windows 10 Version 1903, fix ninja-build#1195,
2021-02-17), `ninja` does not always expect `build.ninja` to be encoded
in the system's ANSI code page.  The expected encoding now depends on
how `ninja` is built and the version of Windows on which it is running.

Add a `-t wincodepage` tool that generators can use to ask `ninja`
what encoding it expects.

Issue: ninja-build#1195
@jhasse jhasse added this to the 1.11.0 milestone Mar 19, 2021