Support UTF-8 Everywhere #269

Closed
4A696D opened this issue Nov 7, 2017 · 11 comments

Comments

@4A696D

4A696D commented Nov 7, 2017

In order to support the UTF-8 Everywhere principle, please consider adding the following to hstring:

  • a constructor and assignment operator taking a std::string containing UTF-8 encoded text

  • a conversion operator returning a std::string containing UTF-8 encoded text

Thanks!

@kennykerr
Collaborator

We are very concerned about UTF-8 support on Windows, but very unlikely to add implicit conversions as it can have unintended performance consequences. Same reason there's no implicit conversion between std::string and std::wstring. We have some scenarios on Windows where this really hurts performance. It's a hard problem and we are working on it. Thanks for the feedback!

@4A696D
Author

4A696D commented Nov 9, 2017

Surely it's up to the cppwinrt user to decide whether the cost of conversion is acceptable. I've been doing UTF-8 Everywhere for several years with WinAPI apps and it hasn't caused a single performance issue. And supposing such an issue did arise, it should be pretty simple to solve (for instance by using UTF-16 in the affected section of code).

I think the reason there's no conversion between std::string and std::wstring is mainly that the standard library can't assume any particular character encoding. But cppwinrt is certainly in a position to make such an assumption. Indeed, I notice that the current version already includes a function to_hstring, which takes a std::string_view that is assumed to reference UTF-8 data!

If cppwinrt included the implicit conversions I requested, it really would be a lot more attractive to folks like me who use UTF-8 everywhere.

@MikeGitb

MikeGitb commented Nov 9, 2017

IMHO, things like string conversion should be explicit, as this can be not only a performance issue but also a correctness issue: a std::string might not actually contain a UTF-8 encoded string and, IIRC, NTFS paths might actually contain 16-bit values that don't form valid UTF-16 encoded code points (I hope I got the terminology correct).

That being said, those explicit conversions should be as convenient as possible.

@tim-weis

tim-weis commented Nov 9, 2017

I'd make a strong point against implementing conversion constructors and operators. Beyond the performance implications, there is the issue of correctness. While a wchar_t/std::wstring is implied to use UTF-16LE, there is no such convention for char/std::string. The latter could be ASCII, ANSI, UTF-8, Shift JIS, or any other character encoding (with ANSI being the most common).

Conversions must be explicit. Otherwise, the ambiguity around char/std::string will make for subtle bugs when the compiler is silenced by implicit conversions that may or may not do what you want.

Besides, the "UTF-8 Everywhere" mantra is more dogmatic than convincing. UTF-8 is great for information interchange (writing files to disk, sending data across a network, etc.). For a Windows application, I have yet to see a convincing argument against using UTF-16 internally throughout.
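The language mechanics behind this point can be sketched with two toy wrapper types that differ only in the `explicit` keyword. Both types are hypothetical illustrations, not anything in C++/WinRT, and the sketch assumes C++17.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical toy types (not part of C++/WinRT). They differ only in `explicit`.
struct implicit_text
{
    std::string bytes;
    implicit_text(std::string s) : bytes(std::move(s)) {}
};

struct explicit_text
{
    std::string bytes;
    explicit explicit_text(std::string s) : bytes(std::move(s)) {}
};

// Both can be constructed from a std::string when the caller asks for it...
static_assert(std::is_constructible_v<implicit_text, std::string>);
static_assert(std::is_constructible_v<explicit_text, std::string>);

// ...but only the non-explicit constructor lets the compiler convert silently,
// for example when a std::string is passed to a function taking the wrapper.
static_assert(std::is_convertible_v<std::string, implicit_text>);
static_assert(!std::is_convertible_v<std::string, explicit_text>);
```

With the explicit form, every conversion site is visible in the source, so the reader can see where an encoding assumption is being made.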

@4A696D
Author

4A696D commented Nov 9, 2017

Have you guys actually read the UTF-8 Everywhere manifesto?

@tim-weis

tim-weis commented Nov 9, 2017

Yes. And it isn't very convincing. UTF-8 is great for data interchange. It isn't exactly well suited as an internal representation for text in a Windows application.

As for this specific issue, you need to explain why assuming UTF-8 in a general-purpose library is more important than allowing it to interface easily with legacy code that uses ANSI encoding.

@MikeGitb

MikeGitb commented Nov 9, 2017

@4A696D: Yes, I have, and in any code that is portable I try to follow it (doing so in C++ is not always easy, though, as long as there is no standardized UTF-8 string). However, the fact of the matter is that Windows APIs (as well as Java and Qt, for that matter) mostly use wchars / UTF-16, and the round trip Windows API string -> UTF-8 -> Windows API string is not efficient, not always correct (although those are probably mostly very specific corner cases or bugs), and often simply not necessary.

As I said, there definitely should be an easy way to do the conversion, but it should not be hidden.

@MikeGitb

MikeGitb commented Nov 9, 2017

Also, we are coming pretty close to "If you have to do this all the time, you should rethink your design" territory.

@kennykerr
Collaborator

Our internal builds now have winrt::to_hstring for std::string_view to winrt::hstring conversion as well as winrt::to_string for std::wstring_view to std::string conversion. winrt::hstring is convertible to std::wstring_view. This should make it a lot more convenient to work with UTF-8 in C++/WinRT apps while remaining explicit.
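For readers without a C++/WinRT build at hand, the shape of such explicit helpers can be sketched in portable C++17. This is not the C++/WinRT implementation: `utf8_to_utf16` and `utf16_to_utf8` are hypothetical names, `std::u16string` stands in for `winrt::hstring`, and throwing on malformed input is an assumption made for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical portable stand-ins for explicit conversion helpers.
// NOT the C++/WinRT implementation; names and error handling are assumptions.

// Decode strict UTF-8 and re-encode as UTF-16. Throws on malformed input.
std::u16string utf8_to_utf16(std::string_view in)
{
    static constexpr char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u16string out;
    for (std::size_t i = 0; i < in.size();)
    {
        auto const b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (len > in.size() - i) throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k)
        {
            auto const b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) throw std::invalid_argument("bad continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogate code points, and values past U+10FFFF.
        if (cp < min_cp[len] || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            throw std::invalid_argument("invalid code point in UTF-8 input");
        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else // encode as a surrogate pair
        {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}

// Decode UTF-16 and re-encode as UTF-8. Throws on unpaired surrogates.
std::string utf16_to_utf8(std::u16string_view in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF)
        {
            if (i + 1 == in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        else if (cp >= 0xDC00 && cp <= 0xDFFF)
        {
            throw std::invalid_argument("unpaired low surrogate");
        }

        if (cp < 0x80)
        {
            out.push_back(static_cast<char>(cp));
        }
        else if (cp < 0x800)
        {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else if (cp < 0x10000)
        {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else
        {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

In a C++/WinRT app, the equivalent call sites would presumably use the helpers named in the comment above, `winrt::to_hstring` and `winrt::to_string`, so that each conversion stays visible in the calling code.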

@4A696D
Author

4A696D commented Nov 21, 2017

Thanks Kenny. However, this doesn't really offer cppwinrt users anything they couldn't already do for themselves. To avoid having to litter UTF-8 based code with endless to_hstring and to_string calls, one must modify cppwinrt itself, which is hardly ideal.

@kennykerr
Collaborator

Right, the helpers are merely provided as a convenience. Feel free to use them or not.
