Implement til::u8u16 and til::u16u8 conversion functions #4093
|
Manual comparisons: https://github.com/german-one/u8u16
|
Do you intend to finish all of these in this PR, or is this PR just for 1. and you plan to open more PRs for the rest?
You are welcome to submit this utility into our repository as well under the
|
This is for item 1. I'll open a new PR for each of the main items because I'm afraid it's too much for only one PR and I don't want to overwhelm you with the code review. It's already a lot of new code. And I want to get your OK for the hand-rolled conversions first.
Thanks for the hint! I'll do so.
Yes. And don't look at the absolute figures. I made the tests on my little Netbook which is slow as hell. But comfortable for coding on the couch 😁
OK, coming soon.
Yes but the example I posted above is just a typical single result. They don't differ much. |
miniksa
left a comment
I like where this is going.
My major concern is that there are two big entrypoints for how you want to consume the conversions:
- If you want to just consume everything and not deal with partials, you can use the more elegant U8ToU16/U16ToU8 functions.
- If you need to deal with partials, you have to stand up the entire class of UTF8ChunkToUTF16Converter and then hold onto it and pass through that. Also it's an extremely verbose name that's a lot to type (and prone to typographical error).
I'd rather it be a more uniform calling pattern and I'd prefer it if it were more elegant like the 1st one as that reminds me more of how things like STD/WIL/GSL implement it.
Finally, this could be a good candidate of things to join our "TIL" namespace that we just created earlier this week.
For example:
std::wstring til::u8u16(std::string_view in, bool discardInvalid = false)
std::wstring til::u8u16(std::string_view in, til::u8state& state, bool discardInvalid = false)
std::string til::u16u8(std::wstring_view in, bool discardInvalid = false)
std::string til::u16u8(std::wstring_view in, til::u16state& state, bool discardInvalid = false)
and the nothrow siblings
[[nodiscard]] HRESULT til::u8u16(std::string_view in, std::wstring& out, bool discardInvalid = false) noexcept
[[nodiscard]] HRESULT til::u8u16(std::string_view in, std::wstring& out, til::u8state& state, bool discardInvalid = false) noexcept
[[nodiscard]] HRESULT til::u16u8(std::wstring_view in, std::string& out, bool discardInvalid = false) noexcept
[[nodiscard]] HRESULT til::u16u8(std::wstring_view in, std::string& out, til::u16state& state, bool discardInvalid = false) noexcept
could be the 8 public entry points into this conversion operation and are effectively one-liners for all consumers.
Those who want to preserve partials across boundaries could provide the optional til::u8state or til::u16state which would just be public structs that held up to 3 utf8 leads or 1 utf16 high surrogate as applicable across calls.
til::u8state state;
while (true)
{
    std::array<byte, 4096> buffer;
    DWORD read;
    ... // read the thing
    const auto wideText = til::u8u16({buffer.data(), read}, state, true);
    ... // pass wideText on.
}
For exception-ready code, the returns can go straight into a nice const.
For non-exception ready code, the return can be a code and the output comes out a reference.
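To make the partials idea concrete, here is a minimal C++17 sketch of how a state struct could carry an incomplete UTF-8 sequence from one chunk to the next. It is not the code from this PR: `u8state_sketch`, `sequence_length`, `trailing_partial`, `decode_utf8` and `u8u16_sketch` are hypothetical names, and it returns `std::u16string` instead of `std::wstring` so that it behaves the same on platforms where `wchar_t` is not 16 bits wide. The policy of emitting one U+FFFD per invalid byte is also an assumption.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for the proposed til::u8state: caches up to three
// bytes of an incomplete UTF-8 sequence between calls.
struct u8state_sketch
{
    std::array<char, 3> partials{};
    std::size_t count = 0;
};

// Length of the sequence announced by a lead byte, or 0 if the byte cannot
// start a sequence (continuation bytes, 0xC0, 0xC1, 0xF5..0xFF).
inline std::size_t sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

// Number of bytes at the end of `in` that start a sequence which is still
// missing continuation bytes.
inline std::size_t trailing_partial(std::string_view in)
{
    const std::size_t maxBack = in.size() < 3 ? in.size() : 3;
    for (std::size_t back = 1; back <= maxBack; ++back)
    {
        const auto ch = static_cast<unsigned char>(in[in.size() - back]);
        if ((ch & 0xC0) == 0x80) continue; // continuation byte, keep looking
        return sequence_length(ch) > back ? back : 0;
    }
    return 0;
}

// Decodes input that contains no trailing partial. Each invalid byte becomes
// U+FFFD (assumed policy), or is dropped when discardInvalid is set.
inline std::u16string decode_utf8(std::string_view in, bool discardInvalid)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const auto lead = static_cast<unsigned char>(in[i]);
        const std::size_t len = sequence_length(lead);
        bool valid = len != 0 && i + len <= in.size();
        char32_t cp = len == 1 ? lead : lead & (0xFFu >> (len + 1));
        for (std::size_t k = 1; valid && k < len; ++k)
        {
            const auto trail = static_cast<unsigned char>(in[i + k]);
            valid = (trail & 0xC0) == 0x80;
            cp = (cp << 6) | (trail & 0x3Fu);
        }
        // Reject overlong forms, encoded surrogates and values past U+10FFFF.
        valid = valid && !(len == 3 && cp < 0x800) && !(len == 4 && cp < 0x10000)
                && !(cp >= 0xD800 && cp <= 0xDFFF) && cp <= 0x10FFFF;
        if (!valid)
        {
            if (!discardInvalid) out.push_back(u'\uFFFD');
            ++i; // resynchronize on the next byte
            continue;
        }
        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            cp -= 0x10000; // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}

inline std::u16string u8u16_sketch(std::string_view in, u8state_sketch& state, bool discardInvalid = false)
{
    // Glue the cached partials in front of the new chunk.
    std::string buffer(state.partials.data(), state.count);
    buffer.append(in);

    // Cache the new trailing partial and convert only the complete part.
    state.count = trailing_partial(buffer);
    buffer.copy(state.partials.data(), state.count, buffer.size() - state.count);
    buffer.resize(buffer.size() - state.count);
    return decode_utf8(buffer, discardInvalid);
}
```

With this shape the caller only keeps a small state object alive across reads, which matches the one-liner calling pattern in the loop above.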
@DHowett-MSFT and I were talking about this verbally, Dustin what do you think of that as a quick-spec?
I see your point. My intention was to have modules that can be recombined if you need a more specialized class in the future. But I agree that this might never be the case if I write proper overloads for the conversions as you propose.
As I understood "til" is for "Terminal Implementation Library". Is this still OK for code that is intended to get shared with conhost later?
In the throwing code we will lose the information whether or not invalid code points have been found. That is probably no problem at all. Just a heads-up. |
|
I followed @miniksa 's suggestion to commit Furthermore I found and removed a game stopper in the code. I thought I was smart and called a I still owe you the test results of the subsequent conversions of small strings. EDIT: There was a quirk in my test loop. Updated the figures above. I think I sorted it all out to be ready for the next code review. |
Yes, that's fine.
Lose what information specifically? The "whether or not there was an invalid codepoint"? As long as it doesn't look like it will influence the way that we write code (that is, the point of the throwing ones is so we can write
The u8u16 ones look great. Excellent. |
OK, I'll move the code into the til namespace then.
The
Unfortunately I don't know why. From my experience, iterating in steps of 16 bits is slower than iterating in steps of 32 bits. So my assumption is that the API functions process the string in steps of register size. That might be the reason why |
|
@miniksa @german-one I think u16u8 may be slow because of push_back: push_back still has to check whether the buffer is large enough, and each call still has to append the null terminator. One solution is to resize the temporary std::string to This solution may consume more memory. There are related articles here for reference on achieving faster UTF-16 <-> UTF-8 conversion. CppCon 2018: Bob Steagall “Fast Conversion From UTF-8 with C++, DFAs, and SSE Intrinsics” |
I'll perform some tests with your proposal tomorrow. Thanks!
I don't think so. I already use |
|
@german-one Using pointers directly must ensure that the buffer is large enough. In addition, I suggest you use explicit types such as char32_t char16_t in your code. |
Of course. Factor 3 is the worst case, so that's good enough.
No. Those are |
|
@fcharlie I added pointer versions to the test tool. The outcome is surprising and somewhat unexpected. While using pointers works great for the UTF-16 to UTF-8 conversion, it makes things worse in the UTF-8 to UTF-16 conversion. This is what the results look like: @miniksa Do you think I should commit the pointer version for |
miniksa
left a comment
@miniksa Do you think I should commit the pointer version for til::u16u8?
As long as all the pointers are managed internally to the function and you have sufficient bounds checking, then yes. We should do whatever it takes to squeeze max performance out of these little functions. Leave the comment behind explaining the decision, though, inside the code.
|
@miniksa Thanks for your guidance. One UTF-16 code unit can become three UTF-8 code units in the worst case. string::resize() throws if it fails, and that exception is caught. I left comments explaining why pointers are used and suppressed the code analysis warnings in this area of the code. |
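The worst-case preallocation discussed here can be sketched as follows. This is an illustration under assumptions, not the code that was committed: `u16u8_sketch` is a hypothetical name, it takes `std::u16string_view` rather than `std::wstring_view` to stay platform independent, and replacing unpaired surrogates with U+FFFD is an assumed policy. It requires C++17 for the non-const `std::string::data()`.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

inline std::string u16u8_sketch(std::u16string_view in)
{
    std::string out;
    // Worst case: one UTF-16 code unit becomes three UTF-8 code units.
    // A surrogate pair is two code units and yields four bytes, so it stays
    // below that bound. resize() throws if the allocation fails.
    out.resize(in.size() * 3);

    // Raw pointer writes are safe only because of the preallocation above;
    // they avoid the per-character capacity check of push_back.
    char* dst = out.data();

    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
        {
            // Valid surrogate pair: combine both code units into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        else if (cp >= 0xD800 && cp <= 0xDFFF)
        {
            cp = 0xFFFD; // unpaired surrogate (assumed replacement policy)
        }

        if (cp < 0x80)
        {
            *dst++ = static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            *dst++ = static_cast<char>(0xC0 | (cp >> 6));
            *dst++ = static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            *dst++ = static_cast<char>(0xE0 | (cp >> 12));
            *dst++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *dst++ = static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            *dst++ = static_cast<char>(0xF0 | (cp >> 18));
            *dst++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            *dst++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            *dst++ = static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    // Shrink to the number of bytes actually written.
    out.resize(static_cast<std::size_t>(dst - out.data()));
    return out;
}
```

The trade-off is the one raised earlier in the thread: the temporary buffer can be up to three times larger than strictly needed until the final resize.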
|
@german-one I took a look at your test code. The variable The benchmark difference may also be caused by branch prediction, caching, etc. Many people have researched fast u8 -> u16 conversion; you could read their articles or source code for better ideas (the ReactOS source is under GPLv2, so the license needs to be checked). In addition, you could try SIMD and similar instructions to implement the conversion. |
|
@fcharlie I already had a look at the links you previously posted and other examples. And I considered using lookup tables to avoid unnecessary branching and expensive calculations. Now that the code is moved into the @miniksa I read in #780 about the difficulties you will probably have determining string lengths when a string contains surrogate pairs. You may have already seen that the conversions between the UTF encodings always require calculating the code point in its UTF-32 representation. It would be quite simple to have additional functions that convert from and to a u32string. I could imagine that consuming UTF-32 is much easier because one code point is always represented by one UTF-32 code unit. Any thoughts on that? |
|
@german-one, I'm hesitant on the u32 versions. I'm concerned that:
I think the compromise I was going to go for is simply checking the lead surrogate range in the stream write. |
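One possible reading of "checking the lead surrogate range in the stream write" is sketched below in C++17. All names (`is_lead_surrogate`, `is_trail_surrogate`, `combine_surrogates`, `writable_length`, `count_code_points`) are hypothetical and only illustrate the range checks; how the stream write in the project actually handles this is not shown in the thread.

```cpp
#include <cstddef>
#include <string_view>

// UTF-16 lead (high) surrogates occupy 0xD800..0xDBFF,
// trail (low) surrogates occupy 0xDC00..0xDFFF.
constexpr bool is_lead_surrogate(char16_t u) noexcept { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_trail_surrogate(char16_t u) noexcept { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a valid pair into the UTF-32 code point it represents.
constexpr char32_t combine_surrogates(char16_t lead, char16_t trail) noexcept
{
    return static_cast<char32_t>(0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00));
}

// Length of the prefix that can be written right away. A lead surrogate at
// the very end is held back until its trail surrogate arrives.
constexpr std::size_t writable_length(std::u16string_view chunk) noexcept
{
    return (!chunk.empty() && is_lead_surrogate(chunk.back())) ? chunk.size() - 1 : chunk.size();
}

// Counts code points instead of code units: a valid pair counts once.
constexpr std::size_t count_code_points(std::u16string_view text) noexcept
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); ++i, ++count)
    {
        if (is_lead_surrogate(text[i]) && i + 1 < text.size() && is_trail_surrogate(text[i + 1]))
        {
            ++i; // skip the trail surrogate of the pair
        }
    }
    return count;
}
```

This keeps the data in UTF-16 and only pays for a range comparison per code unit, which is the kind of compromise that avoids a separate UTF-32 representation.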
|
@miniksa OK no problem. I just wanted to note that we are ready for that conversation, too. |
|
@msftbot make sure @miniksa signs off on this |
|
Hello @zadjii-msft! Because you've given me some instructions on how to help merge this pull request, I'll be modifying my merge approach. Here's how I understand your requirements for merging this pull request:
If this doesn't seem right to you, you can tell me to cancel these instructions and use the auto-merge policy that has been configured for this repository. Try telling me "forget everything I just told you". |
|
@msftbot merge this in 168 hours (this is to make sure this doesn't automerge before the 0.8 build snap. The rest of team can override me on this if they want) |
|
Hello @zadjii-msft! Because you've given me some instructions on how to help merge this pull request, I'll be modifying my merge approach. Here's how I understand your requirements for merging this pull request:
If this doesn't seem right to you, you can tell me to cancel these instructions and use the auto-merge policy that has been configured for this repository. Try telling me "forget everything I just told you". |
zadjii-msft
left a comment
Frankly, I don't totally understand u8/u16 conversion, but I trust that it's just bit magic like this. I think I'm okay with this. I got a little confused by the U8U16Test project at first, but I think I get it now. My comments aren't totally blocking IMO. I'm excited about the potential perf gains we can get from applying this in other places as well.
Thanks for the hard work!
…mments which explain why
|
PHEW 😅 FWIW Of course I trust you. And I'm sure you would have managed everything in no time. But I prefer to go the hard way in order to learn something new. Thank you so much for your git lessons and your patience! You've been a great help. |
|
@german-one Nice one! 👍 And... No problem of course. 😄 By the way, since I've seen many who don't know this: If you force-push something to GitHub, it'll show as: The "force-pushed" part in there is a clickable link that can be used to quickly check what has changed during a rebase by yourself or someone else (if you ever need to review some code). |
|
@miniksa Sorry for all that. For now I will definitely refrain from further experiments like this. |
@german-one do not be sorry at all. Do experiments as you wish. I definitely learned plenty from this one and you saved me the time of figuring this out myself one day in the future. I really appreciate it even if it will turn out more limited than we initially estimated. Your effort was extremely worthwhile. Maybe file another task for safe math because @skyline75489 is still helping me solve that experiment in another PR. |
|
I filed #4290 for the safe math.
That was rather related to my recent git experiments 😆 You probably didn't see the havoc I caused when I tried to update my fork with all the PRs that landed meanwhile. @lhecker helped a lot to sort this out. But yeah, I tried hard to never leave you in the dark. This led to the PR becoming a long thread of comments and additional code, but my intention was to give you the opportunity to follow all of the steps. |
|
@miniksa The point that I was not able to address yet is whether or not the files should be moved into the |
@DHowett-MSFT, if it is all in the header file, the linker will take care of it, right? We don't technically need it in a LIB/CPP? |
|
@miniksa I made it all templates in the header and moved it into FWIW I can't tell why the x86 build fails. Seems to be unrelated to these updates. |
Looks like x86 succeeded this time. |
Also, I don't have a better idea. But @DHowett-MSFT might. However, slightly clumsy templates are better than slightly clumsy other-code. So I'm inclined to call this acceptable because it should make the rest of the code look very nice and clean. |
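As a small illustration of the header-only question: function templates can be defined in a header that is included from many translation units, and the identical instantiations are merged at link time, while a non-template function defined in a header needs `inline` to get the same treatment. The sketch below uses a hypothetical namespace `til_sketch` and hypothetical helpers; it is unrelated to the actual contents of the header in this PR.

```cpp
#include <cstddef>
#include <string_view>

namespace til_sketch // hypothetical namespace, not the project's til
{
    // Function template: may live entirely in the header. Every translation
    // unit that uses it instantiates it, and the linker keeps one copy.
    template<class CharT>
    constexpr std::size_t count_ascii(std::basic_string_view<CharT> in) noexcept
    {
        std::size_t n = 0;
        for (const CharT ch : in)
        {
            // The cast maps negative char values to large numbers, so only
            // code units in 0x00..0x7F are counted.
            if (static_cast<char32_t>(ch) < 0x80) ++n;
        }
        return n;
    }

    // Non-template function: `inline` avoids multiple-definition errors when
    // the header is included from more than one source file.
    inline bool is_ascii(std::string_view in) noexcept
    {
        return count_ascii(in) == in.size();
    }
}
```

Under these rules no separate LIB/CPP is needed for such code, at the cost of compiling it in every translation unit that includes the header.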
miniksa
left a comment
This looks good to me. I'd still like @DHowett-MSFT to see it before we commit it, though.
|
Thanks @miniksa ! @DHowett-MSFT I guess you are afraid of the 2.2 k of new lines of code to review 😉 Don't worry, it's not that bad. It's rather only the ~450 lines in |
DHowett-MSFT
left a comment
I love this. Thank you. The templates look great!
Replace `utf8Parser` with `til::u8u16` in order to have the same conversion algorithms used in terminal and conhost. This PR addresses item 2 in this list:
1. ✉ Implement `til::u8u16` and `til::u16u8` (done in PR #4093)
2. ✔ **Unify UTF-8 handling using `til::u8u16` (this PR)**
2.1. ✔ **Update VtInputThread::_HandleRunInput()**
2.2. ✔ **Update ApiRoutines::WriteConsoleAImpl()**
2.3. ❌ (optional / ask the core team) Remove Utf8ToWideCharParser from the code base to avoid further use
3. ❌ Enable BOM discarding (follow up)
3.1. ❌ extend `til::u8u16` and `til::u16u8` with a 3rd parameter to enable discarding the BOM
3.2. ❌ Make use of the 3rd parameter to discard the BOM in all current function callers, or (optional / ask the core team) make it the default for `til::u8u16` and `til::u16u8`
4. ❌ Find UTF-16 to UTF-8 conversions and examine if they can be unified, too (follow up)
Closes #4086
Closes #3378
|
🎉 Once again, thanks for the contribution! This pull request was included in a set of conhost changes that was just |
Summary of the Pull Request
UTF8OutputPipeReader, move the pipe reading back to ConptyConnection::_OutputThread().
in user mode. Enable to toggle between ignoring invalid UTF-8 and replacing it with U+FFFD.
See Reconcile UTF-8 behavior in utf8ToWideCharParser.cpp #3378
PR Checklist
Detailed Description of the Pull Request / Additional comments
On my list:
in user mode (this PR)
1.1. ✔ Transpose my U8ToU16() and U16ToU8() C --> C++ (obsolete)
1.2. ✔ Implement functors for partials handling
1.3. ✔ Implement functors to do both the partials handling and the conversion task at once
1.4. ✔ Supersede Utf8OutPipeReader and remove it from the code base to avoid further use
2.1. ❌ Update VtInputThread::_HandleRunInput()
2.2. ❌ Update ApiRoutines::WriteConsoleAImpl()
2.3. ❌ (optional / ask the core team) Remove Utf8ToWideCharParser from the code base to avoid further use
3.1. ❌ Implement an enum class containing flags for U8ToU16() and U16ToU8() to enable discarding the BOM and/or invalids
3.2. ❌ Replace the 3rd parameter of U8ToU16(), U16ToU8(), and related functors with the enum and update the function code accordingly
3.3. ❌ Make use of the 3rd parameter to discard the BOM in all current functor callers, or (optional / ask the core team) make it the default for U8ToU16() and U16ToU8()
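Items 3.1 to 3.3 describe a flags enum that is still to be written. A rough C++17 sketch of what such flags could look like is given below; `conversion_flags`, `has_flag` and `strip_utf8_bom` are hypothetical names chosen for illustration, and the real parameter design is left open by the list above. The only fixed fact used here is that the UTF-8 encoded BOM is the byte sequence EF BB BF.

```cpp
#include <string_view>

// Hypothetical flag set; the names are not taken from the PR.
enum class conversion_flags : unsigned
{
    none = 0,
    discard_bom = 1u << 0,
    discard_invalid = 1u << 1,
};

constexpr conversion_flags operator|(conversion_flags a, conversion_flags b) noexcept
{
    return static_cast<conversion_flags>(static_cast<unsigned>(a) | static_cast<unsigned>(b));
}

constexpr bool has_flag(conversion_flags value, conversion_flags flag) noexcept
{
    return (static_cast<unsigned>(value) & static_cast<unsigned>(flag)) != 0;
}

// Drops a leading UTF-8 byte order mark (EF BB BF) when discard_bom is set.
// The returned view refers to the same underlying buffer as `in`.
constexpr std::string_view strip_utf8_bom(std::string_view in, conversion_flags flags) noexcept
{
    constexpr std::string_view bom{ "\xEF\xBB\xBF", 3 };
    if (has_flag(flags, conversion_flags::discard_bom) && in.substr(0, bom.size()) == bom)
    {
        in.remove_prefix(bom.size());
    }
    return in;
}
```

A scoped enum keeps the two options combinable in a single parameter, which is what replacing the current boolean 3rd parameter would require.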
Validation Steps Performed