Fix some UTF-8 issues on Windows. #9812

Merged
merged 1 commit into from
Oct 20, 2013

Conversation

nitric1
Contributor

@nitric1 nitric1 commented Oct 11, 2013

This fixes #9418 and #9618, and potential problems related to directory walking.

@@ -54,6 +54,12 @@
#include <assert.h>

#if defined(__WIN32__)
#ifndef WIN32_LEAN_AND_MEAN
Member

Can you add some comments explaining what these constants are and why they're defined?

@nitric1
Contributor Author

nitric1 commented Oct 12, 2013

Sorry for my short description. Some comments added.

On Windows, most of the char versions of these functions treat strings as ANSI rather than UTF-8, so code that uses them should switch to the wchar_t versions (such as wcsftime, _wstat, ...).

Related issue: #9822

Closes #9418
Closes #9618
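
To illustrate the conversion that calling the wchar_t versions implies, here is a minimal sketch in present-day Rust (not the 2013 API this pull request targets). The helper names `to_wide` and `from_wide` are hypothetical and are not taken from the patch.

```rust
/// Encode a UTF-8 string as a NUL-terminated UTF-16 buffer, the shape
/// that wchar_t-based Windows functions expect.
fn to_wide(s: &str) -> Vec<u16> {
    s.encode_utf16().chain(std::iter::once(0)).collect()
}

/// Decode a UTF-16 buffer back into UTF-8, stopping at the first NUL.
/// Returns None if the buffer contains an unpaired surrogate.
fn from_wide(buf: &[u16]) -> Option<String> {
    let len = buf.iter().position(|&u| u == 0).unwrap_or(buf.len());
    String::from_utf16(&buf[..len]).ok()
}
```

A file name such as "한글.txt" survives the round trip through `to_wide` and `from_wide` unchanged, which is the property the char (ANSI) versions cannot guarantee.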

@alexcrichton
Member

Hm, it's unfortunate that this probably means that most of our usage of Windows APIs is "broken" in the sense that we're not dealing with Unicode strings properly. This seems like something that's tricky to get right, so I'm wary of adding this sort of functionality without a way to prevent it from regressing. Could you also add some tests with Unicode filenames and such to make sure that these APIs work properly? The time zone problem may be tough to test, but it would be great if we were able to get it tested as well.

@alexcrichton
Member

Also, I'm still a bit curious as to why we need to use the wide versions everywhere. If one still takes char* to be a "null-terminated string", then most of UTF-8 still fits that description. So when you call something like strftime with a Unicode locale, what happens? Does Windows assert? Do we assert?

@klutzy
Contributor

klutzy commented Oct 12, 2013

Most winapi/libc functions taking char* just don't use UTF-8. This is an example I actually ran into before.
I made a meta-issue, #9822, since this looks bizarre to non-Windows people and it is problematic everywhere.

@nitric1
Contributor Author

nitric1 commented Oct 12, 2013

Sorry, "most of char version of functions" was not an accurate way to put it. Yes, it's safe to use char-based functions if they just treat the input as ASCII (atoi, strtol, ...) or as a null-terminated string (strcpy, strlen, ...). As far as I know, the locale-related (setlocale, strptime, ...) and I/O-related (fopen, readdir, ...) functions are the problematic ones.

If a locale-dependent format specifier (such as %A, %B, %X, or %Z) is given to strftime on Windows, it can return an ANSI string, which Rust cannot accept (the UTF-8 assertion fails). Conversely, if a non-ASCII UTF-8 string is given to fopen, the call fails.
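
The strftime half of this can be sketched in present-day Rust. The byte string below is an assumed example, not output captured from this pull request: it is what `%B` could plausibly yield for February in a French locale whose ANSI code page is Windows-1252. `latin1_to_string` is a hypothetical helper that is only correct for ISO-8859-1 (Windows-1252 agrees with it for the byte used here).

```rust
/// Assumed example of ANSI bytes a locale-dependent strftime field could
/// return: "février" with é stored as the single byte 0xE9.
const ANSI_MONTH: &[u8] = b"f\xE9vrier";

/// Decode ISO-8859-1 bytes: each byte maps to the code point of the same
/// value. Other ANSI code pages (CP949, Shift_JIS, ...) need real tables.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}
```

`std::str::from_utf8(ANSI_MONTH)` returns an error because 0xE9 announces a three-byte UTF-8 sequence that never arrives, while `latin1_to_string(ANSI_MONTH)` recovers "février". The wcsftime route mentioned earlier avoids guessing the code page at all.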

@alexcrichton
Member

So just to make sure I understand this correctly, if windows hands us a char*, then it works for all files, it's just not necessarily utf-8, so we use the W versions of functions to hand us utf-16 strings that we can re-encode in utf-8. And then if we call fopen with a string exercising non-ascii characters (but nonzero characters), the function just outright fails? Does it abort? Does it return failure? If it returns failure, what's the strerror reported for it?

@nitric1
Contributor Author

nitric1 commented Oct 13, 2013

> if windows hands us a char*, then it works for all files,

No, it doesn't work for files whose names contain non-ANSI[1] characters.

> so we use the W versions of functions to hand us utf-16 strings that we can re-encode in utf-8.

Another reason is that the A versions of the functions cannot handle some files, as I wrote above.

> And then if we call fopen with a string exercising non-ascii characters [...]

fopen returns NULL, and strerror reports "No such file or directory".

[1] "ANSI" here means the local system encoding, that is, a non-Unicode encoding such as ISO-8859-1, CP949, or Shift_JIS. Windows has (almost) full Unicode support, but the ANSI functions remain for backward compatibility.
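
One plausible reading of the "No such file or directory" result is that the char version of fopen interprets the UTF-8 bytes in the ANSI code page and therefore looks up a different name. The sketch below, in present-day Rust, assumes an ISO-8859-1 style code page, and `misread_utf8_as_latin1` is a hypothetical helper rather than anything in the patch.

```rust
/// Show the file name an ANSI (here: ISO-8859-1) function would see when
/// handed UTF-8 bytes: every byte becomes its own character.
fn misread_utf8_as_latin1(name: &str) -> String {
    name.bytes().map(|b| b as char).collect()
}
```

Under that assumption "é.txt" (bytes C3 A9 ...) is looked up as "Ã©.txt", a file that does not exist, whereas a pure ASCII name passes through unchanged.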

@alexcrichton
Member

OK that makes sense to me now, so could you do a few things?

  • Add some comments to std::os explaining why different versions are used for windows
  • Add some tests as well which create paths with some fun filenames, and then test that these functions succeed on windows?

@nitric1
Contributor Author

nitric1 commented Oct 18, 2013

Added comments, tests, and rebased.

Force push... :(

rust_path_exists
rust_path_exists_u16
Member

I think that this will fail to compile on osx and linux because these symbols aren't defined in rust_builtin.cpp. Feel free to just make them empty functions, that's been the pattern for platform-specific C++ helpers so far.
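
The stub pattern suggested here concerns C++ helpers in rust_builtin.cpp; the same idea can be sketched on the Rust side with `cfg` attributes. This is an analogue in present-day Rust under stated assumptions, not the code from the patch: the function body and the choice to return `false` from the stub are illustrative only.

```rust
/// Windows: interpret the UTF-16 units as a path and ask the file system.
#[cfg(windows)]
fn path_exists_u16(path: &[u16]) -> bool {
    use std::ffi::OsString;
    use std::os::windows::ffi::OsStringExt;
    use std::path::Path;
    // Ignore a trailing NUL terminator, if the caller supplied one.
    let len = path.iter().position(|&u| u == 0).unwrap_or(path.len());
    let os = OsString::from_wide(&path[..len]);
    Path::new(&os).exists()
}

/// Other platforms: a stub with the same signature, so that every target
/// defines the symbol even though only Windows uses it.
#[cfg(not(windows))]
fn path_exists_u16(_path: &[u16]) -> bool {
    false
}
```

Either way, callers compile unchanged on Linux and OS X, which is the failure the review comment anticipates.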

@alexcrichton
Member

Nice job, and thanks again! Just two things:

  • Would you mind rebasing these into one commit? It's a pretty good change for one decent-sized commit, and the commit can have all the details about what went on.
  • Minor comment about stub versions of rust_path_exists_u16 and friends for linux/osx

@nitric1
Contributor Author

nitric1 commented Oct 18, 2013

Thanks for the review! I think it's all done, but if there's something wrong, please let me know.

@nitric1
Contributor Author

nitric1 commented Oct 18, 2013

I didn't consider linux; fix amended.

bors added a commit that referenced this pull request Oct 18, 2013
This fixes #9418 and #9618, and potential problems related to directory walking.
@klutzy
Contributor

klutzy commented Oct 19, 2013

path_is_dir test failed on mac-64-nopt-t, but passed on mac-64-opt. Is it just flaky?

@alexcrichton
Member

Hm, this is using some test files in the OS's tmp directory, but builders can run simultaneously on one machine, so I don't think this plays nicely when that happens. Each test needs to choose a unique location for its paths/directories. There isn't the convenience of extra's tmpdir interface, but you could work around that by just generating some random strings to prepend to the path instead.
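
A per-test unique location can be sketched as follows in present-day Rust. Note the swapped-in technique: instead of random strings, this version combines the process id, a timestamp, and a counter, because the standard library has no random number generator. `unique_test_dir` is a hypothetical name, not the helper used in the patch.

```rust
use std::env;
use std::path::PathBuf;
use std::process;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

/// Distinguishes paths requested by the same process within one clock tick.
static COUNTER: AtomicUsize = AtomicUsize::new(0);

/// Build a path under the system temp directory that concurrent builders
/// on the same machine are very unlikely to share.
fn unique_test_dir(prefix: &str) -> PathBuf {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    let n = COUNTER.fetch_add(1, Ordering::Relaxed);
    env::temp_dir().join(format!("{}-{}-{}-{}", prefix, process::id(), nanos, n))
}
```

The function only computes a name; a test would still create the directory itself and remove it afterwards.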

… rust_localtime.

This makes these functions use the wchar_t versions of the APIs instead of the char versions.
@nitric1
Contributor Author

nitric1 commented Oct 20, 2013

Temporary path randomized; there will not be race conditions, hopefully.

bors added a commit that referenced this pull request Oct 20, 2013
This fixes #9418 and #9618, and potential problems related to directory walking.
@bors bors closed this Oct 20, 2013
@bors bors merged commit 3e53c92 into rust-lang:master Oct 20, 2013
Successfully merging this pull request may close these issues.

extra::time: test failure on Win32 non-English locale
4 participants