Implement til::u8u16 and til::u16u8 conversion functions #4093

german-one · 2019-12-31T22:36:40Z

Summary of the Pull Request

Remove UTF8OutputPipeReader, move the pipe reading back to ConptyConnection::_OutputThread().
Implement UTF-8 <--> UTF-16 conversion ~~in user mode. Enable to toggle between ignoring invalid UTF-8 and replacing it with U+FFFD.~~ See Reconcile UTF-8 behavior in utf8ToWideCharParser.cpp #3378
Implement a re-usable partials handling.

PR Checklist

Closes Implement UTF-8 <--> UTF-16 conversion in user mode. #4092
CLA signed. If not, go over here and sign the CLA
Tests added/passed
Requires documentation to be updated
I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

Detailed Description of the Pull Request / Additional comments

On my list:

✔ Implement UTF-8 <--> UTF-16 conversion ~~in user mode~~ (this PR)
1.1. ✔ ~~Transpose my U8ToU16() and U16ToU8() C --> C++~~ (obsolet)
1.2. ✔ Implement functors for partials handling
1.3. ✔ Implement functors to do both the partials handling and the conversion task at once
1.4. ✔ Supersede Utf8OutPipeReader and remove it from the code base to avoid further use
❌ Unify UTF-8 handling (supersede Utf8ToWideCharParser)
2.1. ❌ Update VtInputThread::_HandleRunInput()
2.2. ❌ Update ApiRoutines::WriteConsoleAImpl()
2.3. ❌ (optional / ask the core team) Remove Utf8ToWideCharParser from the code base to avoid further use
❌ Enable BOM discarding
3.1. ❌ Implement an enum class containing flags for U8ToU16() and U16ToU8() to enable discarding both BOM and/or invalids
3.2. ❌ Replace the 3rd parameter of U8ToU16(), U16ToU8(), and related functors with the enum and update the function code accordingly
3.3. ❌ Make use of the 3rd parameter to discard the BOM in all current functor callers, or (optional / ask the core team) make it the default for U8ToU16() and U16ToU8()
❌ Find UTF-16 to UTF-8 conversions and examine if they can be unified, too

Validation Steps Performed

Unit tests implemented.
Loads of manual tests to evaluate the behavior of the implemented code.

german-one · 2019-12-31T22:54:20Z

Manual comparisons: https://github.com/german-one/u8u16

Note: All tests have been made with the same number (100,000) of code point U+20AC Euro Sign. It takes 3 UTF-8 code units and 1 UTF-16 code unit.
Note: The time measured includes the memory allocation (using make_unique for the API functions).
Note: All functions were called only once to perform the conversion of the whole string.

UTF-8 to UTF-16

MultiByteToWideChar
 length 100000000
 elapsed 2.23501
~~~~~
RtlUTF8ToUnicodeN
 NTSTATUS 0
 length 100000000
 elapsed 2.26018
~~~~~
U8ToU16
 HRESULT 0
 length 100000000
 elapsed 2.38592

UTF-16 to UTF-8

WideCharToMultiByte
 length 300000000
 elapsed 1.45804
~~~~~
RtlUnicodeToUTF8N
 NTSTATUS 0
 length 300000000
 elapsed 1.65566
~~~~~
U16ToU8
 HRESULT 0
 length 300000000
 elapsed 2.41888
~~~~~

miniksa · 2020-01-03T21:50:31Z

✔ Implement UTF-8 <--> UTF-16 conversion in user mode (this PR)
1.1. ✔ Transpose my U8ToU16() and U16ToU8() C --> C++
1.2. ✔ Implement functors for partials handling
1.3. ✔ Implement functors to do both the partials handling and the conversion task at once
1.4. ✔ Supersede Utf8OutPipeReader and remove it from the code base to avoid further use

❌ Unify UTF-8 handling (supersede Utf8ToWideCharParser)
2.1. ❌ Update VtInputThread::_HandleRunInput()
2.2. ❌ Update ApiRoutines::WriteConsoleAImpl()
2.3. ❌ (optional / ask the core team) Remove Utf8ToWideCharParser from the code base to avoid further use

❌ Enable BOM discarding
3.1. ❌ Implement an enum class containing flags for U8ToU16() and U16ToU8() to enable discarding both BOM and/or invalids
3.2. ❌ Replace the 3rd parameter of U8ToU16(), U16ToU8(), and related functors with the enum and update the function code accordingly
3.3. ❌ Make use of the 3rd parameter to discard the BOM in all current functor callers, or (optional / ask the core team) make it the default for U8ToU16() and U16ToU8()

❌ Find UTF-16 to UTF-8 conversions and examine if they can be unified, too

Do you intend to finish all of these in this PR, or is this PR just for 1. and you plan to open more PRs for the rest?

Manual comparisons: german-one/u8u16

You are welcome to submit this utility into our repository as well under the src\tools directory which is where we usually put things that we craft in the course of development to aid our work.

Note: All tests have been made with the same number (100,000) of code point U+20AC Euro Sign. It takes 3 UTF-8 code units and 1 UTF-16 code unit.

Note: The time measured includes the memory allocation (using make_unique for the API functions).

Note: All functions were called only once to perform the conversion of the whole string.

I presume you did this with an opt build of your code? I'm surprised that it's looking slower than calling out.
Can you do a batch of small conversions as a comparison instead of the one big conversion? The performance gain I'm expecting is removing the overhead of doing all the syscalls on conversion of tiny strings (1-10 characters long). If long strings are comparable and short strings many times are faster, then we're good and it works as I expect.
Did you run each test/situation multiple times and average the results?

german-one · 2020-01-03T22:56:20Z

Do you intend to finish all of these in this PR, or is this PR just for 1. and you plan to open more PRs for the rest?

This is for item 1. I'll open a new PR for each of the main items because I'm afraid it's too much for only one PR and I don't want to overwhelm you with the code review. It's already a lot of new code. And I want to get your OK for the hand-rolled conversions first.

You are welcome to submit this utility into our repository as well under the src\tools directory which is where we usually put things that we craft in the course of development to aid our work.

Thanks for the hint! I'll do so.

I presume you did this with an opt build of your code? I'm surprised that it's looking slower than calling out.

Yes. And don't look at the absolute figures. I made the tests on my little Netbook which is slow as hell. But comfortable for coding on the couch 😁

Can you do a batch of small conversions as a comparison instead of the one big conversion? The performance gain I'm expecting is removing the overhead of doing all the syscalls on conversion of tiny strings (1-10 characters long). If long strings are comparable and short strings many times are faster, then we're good and it works as I expect.

OK, coming soon.

Did you run each test/situation multiple times and average the results?

Yes but the example I posted above is just a typical single result. They don't differ much.

miniksa

I like where this is going.

My major concern is that there are two big entrypoints for how you want to consume the conversions:

If you want to just consume everything and not deal with partials, you can use the more elegant U8ToU16/U16ToU8 functions.
If you need to deal with partials, you have to stand up the entire class of UTF8ChunkToUTF16Converter and then hold onto it and pass through that. Also it's an extremely verbose name that's a lot to type (and prone to typographical error).

I'd rather it be a more uniform calling pattern and I'd prefer it if it were more elegant like the 1st one as that reminds me more of how things like STD/WIL/GSL implement it.

Finally, this could be a good candidate of things to join our "TIL" namespace that we just created earlier this week.

For example:
std::wstring til::u8u16(std::string_view in, bool discardInvalid = false)
std::wstring til::u8u16(std::string_view in, til::u8state& state, bool discardInvalid = false)
std::string til::u16u8(std::wstring_view in, bool discardInvalid = false)
std::string til::u16u8(std::wstring_view in, til::u16state& state, bool discardInvalid = false)

and the nothrow siblings
[[nodiscard]] HRESULT til::u8u16(std::string_view in, std::wstring& out, bool discardInvalid = false) nothrow
[[nodiscard]] HRESULT til::u8u16(std::string_view in, std::wstring& out, til::u8state& state, bool discardInvalid = false) nothrow
[[nodiscard]] HRESULT til::u16u8(std::wstring_view in, std::string& out, bool discardInvalid = false) nothrow
[[nodiscard]] HRESULT til::u16u8(std::wstring_view in, std::string& out, til::u16state& state, bool discardInvalid = false) nothrow

could be the 8 public entry points into this conversion operation and are effective one-liners for all consumers.

Those who want to preserve partials across boundaries could provide the optional til::u8state or til::u16state which would just be public structs that held up to 3 utf8 leads or 1 utf16 high surrogate as applicable across calls.

til::u8state state;
while (true)
{
   std::array<byte, 4096> buffer;
   DWORD read;
   ... // read the thing
   const auto wideText = til::u8u16({buffer.data(), read}, state, true);
   ... // pass wideText on.
}

For exception-ready code, the returns can go straight into a nice const.
For non-exception ready code, the return can be a code and the output comes out a reference.

@DHowett-MSFT and I were talking about this verbally, Dustin what do you think of that as a quick-spec?

src/cascadia/TerminalConnection/ConptyConnection.cpp

src/types/inc/Utf8Utf16Convert.hpp

german-one · 2020-01-04T11:18:59Z

I like where this is going.
My major concern is that there are two big entrypoints for how you want to consume the conversions:

If you want to just consume everything and not deal with partials, you can use the more elegant U8ToU16/U16ToU8 functions.
If you need to deal with partials, you have to stand up the entire class of UTF8ChunkToUTF16Converter and then hold onto it and pass through that. Also it's an extremely verbose name that's a lot to type (and prone to typographical error).

I see your point. My intention was to have it all modules that can be combined new if you need a more specialized class in future. But I agree that this might never be the case if I write proper overloads for the conversions like your proposals.

I'd rather it be a more uniform calling pattern and I'd prefer it if it were more elegant like the 1st one as that reminds me more of how things like STD/WIL/GSL implement it.
Finally, this could be a good candidate of things to join our "TIL" namespace that we just created earlier this week.

As I understood "til" is for "Terminal Implementation Library". Is this still OK for code that is intended to get shared with conhost later?

For exception-ready code, the returns can go straight into a nice const.
For non-exception ready code, the return can be a code and the output comes out a reference.

In the throwing code we will loose the information whether or not invalid codepoints have been found. That is probably no problem at all. Just a heads up.

german-one · 2020-01-05T23:32:29Z

I followed @miniksa 's suggestion to commit U8U16Test to the tools. There is comment TEST TOOL U8U16Test at the beginning of these files that hopefully makes it easier for you to ignore them in the code reviews.

Furthermore I found and removed a game stopper in the code. I thought I was smart and called a .shrink_to_fit() at the end of the main conversion functions to save memory space. That didn't have any big influence as long as the whole string was converted at once. But in subsequent calls of the functions it caused an huge loss of performance.

I still owe you the test results of the subsequent conversions of small strings.
The ..._WholeString tests still show the elapsed time of the conversion of the whole string at once.
The ..._Chuncs tests convert the same amount of code points in total but only 10 at a time.
(Ignore the ignore me value 😉 This is just a randomly picked character in order to ensure the compiler doesn't opt out the actual string creation.)

### UTF-8 To UTF-16 ###

~~~
test "MultiByteToWideChar_WholeString"
 ignore me 8364
 length 100000000
 elapsed 2.26513

~~~
test "RtlUTF8ToUnicodeN_WholeString"
 ignore me 8364
 NTSTATUS 0
 length 100000000
 elapsed 2.30559

~~~
test "u8u16_WholeString"
 ignore me 8364
 HRESULT 0
 length 100000000
 elapsed 2.03403

~~~
test "MultiByteToWideChar_Chunks"
 ignore me 8364
 length 100000000
 elapsed 5.43452

~~~
test "RtlUTF8ToUnicodeN_Chunks"
 ignore me 8364
 NTSTATUS 0
 length 100000000
 elapsed 5.12312

~~~
test "u8u16_Chunks"
 ignore me 8364
 HRESULT 0
 length 100000000
 elapsed 3.79081

### UTF-16 To UTF-8 ###

~~~
test "WideCharToMultiByte_WholeString"
 ignore me 130
 length 300000000
 elapsed 1.61061

~~~
test "RtlUnicodeToUTF8N_WholeString"
 ignore me 172
 NTSTATUS 0
 length 300000000
 elapsed 1.60669

~~~
test "u16u8_WholeString"
 ignore me 172
 HRESULT 0
 length 300000000
 elapsed 2.55012

~~~
test "WideCharToMultiByte_Chunks"
 ignore me 172
 length 300000000
 elapsed 4.40411

~~~
test "RtlUnicodeToUTF8N_Chunks"
 ignore me 130
 NTSTATUS 0
 length 300000000
 elapsed 3.84753

~~~
test "u16u8_Chunks"
 ignore me 226
 HRESULT 0
 length 300000000
 elapsed 3.94672

EDIT: There was a quirk in my test loop. Updated the figures above.

I think I sorted it all out to be ready for the next code review.

miniksa · 2020-01-06T23:55:08Z

I'd rather it be a more uniform calling pattern and I'd prefer it if it were more elegant like the 1st one as that reminds me more of how things like STD/WIL/GSL implement it.
Finally, this could be a good candidate of things to join our "TIL" namespace that we just created earlier this week.

As I understood "til" is for "Terminal Implementation Library". Is this still OK for code that is intended to get shared with conhost later?

Yes, that's fine.

For exception-ready code, the returns can go straight into a nice const.
For non-exception ready code, the return can be a code and the output comes out a reference.

In the throwing code we will loose the information whether or not invalid codepoints have been found. That is probably no problem at all. Just a heads up.

Lose what information specifically? The "whether or not there was an invalid codepoint"?

As long as it doesn't look like it will influence the way that we write code (that is, the point of the throwing ones is so we can write const auto u8 = til::u16u8(u16, true); and be reasonably assured we're getting what we need into a const std::string... then I'm good with it.

(... pretend this is a quote of the testing results ...)

The u8u16 ones look great. Excellent.
I'm a bit concerned about the u16u8 whole string performance loss. The chunks look comparable, though. Do we know what causes that?

german-one · 2020-01-07T00:31:46Z

Yes, that's fine.

OK, I'll move the code into the til namespace then.

Lose what information specifically? The "whether or not there was an invalid codepoint"?

As long as it doesn't look like it will influence the way that we write code (that is, the point of the throwing ones is so we can write const auto u8 = til::u16u8(u16, true); and be reasonably assured we're getting what we need into a const std::string... then I'm good with it.

The true is for ignoring invalid codepoints. They are not replaced with U+FFFD and just left out. The non-throwing ones would still return S_FALSE if an invalid character has been found. But it wouldn't make sense to throw in this case, would it?

The u8u16 ones look great. Excellent.
I'm a bit concerned about the u16u8 whole string performance loss. The chunks look comparable, though. Do we know what causes that?

Unfortunately I don't know why. Out of my experiences I can tell that iterations in steps of 16 bits is slower than steps of 32 bits. So my assumption is that the API functions process the string in steps of register size. That might be the reason why WideCharToMultiByte is noticeable faster than MultiByteToWideChar.

fcharlie · 2020-01-07T01:30:13Z

@miniksa @german-one I think the reason why u16u8 is slow may be caused by push_back, push_back still needs to check whether the buffer is large enough, and each call still needs to append NULL truncated string

One solution is to resize the temporary std :: string to 3 * wstr.size () and then use str.data () (non const) directly as the buffer, and then use resize to correct the string length .

This solution may consume more memory.

There are related articles here for reference to achieve faster UTF-16 <-> UTF-8 conversion.

CppCon 2018: Bob Steagall “Fast Conversion From UTF-8 with C++, DFAs, and SSE Intrinsics”

UTF-8 processing using SIMD (SSE4)

https://github.com/cyb70289/utf8

german-one · 2020-01-07T01:49:50Z

One solution is to resize the temporary std :: string to 3 * wstr.size () and then use str.data () (non const) directly as the buffer, and then use resize to correct the string length .

I'll perform some tests with your proposal tomorrow. Thanks!

This solution may consume more memory.

I don't think so. I already use .reserve() in the current code to allocate memory for three times as many code units. This was a performance improvement of ~30%.

fcharlie · 2020-01-07T02:02:45Z

@german-one Using pointers directly must ensure that the buffer is large enough. In addition, I suggest you use explicit types such as char32_t char16_t in your code.

german-one · 2020-01-07T02:17:09Z

@german-one Using pointers directly must ensure that the buffer is large enough.

Of course. Factor 3 is the worst case. so that's good enough.

In addition, I suggest you use explicit types such as char32_t char16_t in your code.

No. Those are least types. No idea how those types came into the standard. We need types with exact size. And I prefer to use the value_type of the wstring because that's what we compose out of the values.

german-one · 2020-01-07T17:33:11Z

@fcharlie I added pointer versions to the test tool. The outcome is surprising and somehow unexpected. While using pointers works great for the UTF-16 to UTF-8 conversion, it makes things worse in the UTF-8 to UTF-16 conversion. That's how the results look like:

### UTF-8 To UTF-16 ###

~~~
test "MultiByteToWideChar_WholeString"
 ignore me 8364
 length 100000000
 elapsed 2.26531

~~~
test "RtlUTF8ToUnicodeN_WholeString"
 ignore me 8364
 NTSTATUS 0
 length 100000000
 elapsed 2.27432

~~~
test "u8u16_WholeString"
 ignore me 8364
 HRESULT 0
 length 100000000
 elapsed 2.04221

~~~
test "u8u16_ptr_WholeString"
 ignore me 8364
 HRESULT 0
 length 100000000
 elapsed 2.41866

~~~
test "MultiByteToWideChar_Chunks"
 ignore me 8364
 length 100000000
 elapsed 5.4539

~~~
test "RtlUTF8ToUnicodeN_Chunks"
 ignore me 8364
 NTSTATUS 0
 length 100000000
 elapsed 5.11386

~~~
test "u8u16_Chunks"
 ignore me 8364
 HRESULT 0
 length 100000000
 elapsed 3.79609

~~~
test "u8u16_ptr_Chunks"
 ignore me 8364
 HRESULT 0
 length 100000000
 elapsed 4.10265

### UTF-16 To UTF-8 ###

~~~
test "WideCharToMultiByte_WholeString"
 ignore me 130
 length 300000000
 elapsed 1.60714

~~~
test "RtlUnicodeToUTF8N_WholeString"
 ignore me 130
 NTSTATUS 0
 length 300000000
 elapsed 1.61687

~~~
test "u16u8_WholeString"
 ignore me 226
 HRESULT 0
 length 300000000
 elapsed 2.59101

~~~
test "u16u8_ptr_WholeString"
 ignore me 130
 HRESULT 0
 length 300000000
 elapsed 1.4677

~~~
test "WideCharToMultiByte_Chunks"
 ignore me 226
 length 300000000
 elapsed 4.55823

~~~
test "RtlUnicodeToUTF8N_Chunks"
 ignore me 130
 NTSTATUS 0
 length 300000000
 elapsed 4.08254

~~~
test "u16u8_Chunks"
 ignore me 130
 HRESULT 0
 length 300000000
 elapsed 3.93425

~~~
test "u16u8_ptr_Chunks"
 ignore me 130
 HRESULT 0
 length 300000000
 elapsed 3.64682

@miniksa Do you think I should commit the pointer version for til::u16u8?

miniksa

@miniksa Michael Niksa FTE Do you think I should commit the pointer version for til::u16u8?

As long as all the pointers are managed internally to the function and you have sufficient bounds checking, then yes. We should do whatever it takes to squeeze max performance out of these little functions. Leave the comment behind explaining the decision, though, inside the code.

german-one · 2020-01-07T20:49:11Z

@miniksa Thanks for your guidance. One UTF-16 code unit can become three UTF-8 code units in the worst case. The string::resize() throws if it fails which is cought. I left comments why pointers are used and suppressed warnings in this code area for the code analysis.

fcharlie · 2020-01-08T01:22:39Z

@german-one I took a look at your test code. The variable testU16 contains only the code point u20AC. I suggest that you use more diverse test data so that it can be more in line with the actual situation.

The benchmark difference may also be caused by branch prediction, caching, etc. Excessive if-else may be slow, and there is an article about this problem: A Branchless UTF-8 Decoder
So we usually use the lookup table method to get the string length of UTF-8 Unicode code points, which is available in the LLVM source code: trailingBytesForUTF8 (this is also standard practice at unicode.org).

u8-> u16 encoding conversion, many people have done research, you can go to see their article or source code (reactOS source code based on GPLv2 it is necessary to check), maybe a better idea.

In addition, you can try to use SIMD and other instructions to achieve encoding conversion.

german-one · 2020-01-08T08:17:13Z

@fcharlie I already had a look at the links you previously posted and other examples. And I took in consideration to use lookup tables to avoid unnecessary branching and expensive calculations. Now that the code is moved into the til namespace I have better options to outsource auxiliary code into a details or internal sub-namespace. Also I agree with you that more tests are necessary, especial using different natural languages. Right now it's just a quick assessment to get an idea how it behaves compared with the API functions and if it would even make sense to implement an own conversion. And as it seems we would already gain a big benefit with the the current u8u16 function. Consider what we are still doing with the API functions. E.g. call it once to figure out if it fails and then crawl manually over the string to remove invalid characters. Or, call it once to find out how much memory space is required and then run it a second time to perform the actual conversion.

@miniksa I read about difficulties you will probably get to determine string lengths if it contains surrogate pairs in #780. You may have already seen that the conversions between the UTF encodings always require the calculation of the code point in UTF-32 representation. It would be quite simple to have additional functions that convert from and to a u32string. I could imagine that consuming UTF-32 is much easier because one character is always represented by one UTF-32 code unit. Any thoughts on that?

miniksa · 2020-01-08T19:08:31Z

@german-one, I'm hesitant on the u32 versions. I'm concerned that:

Converting everything further from u16 to u32 for the sake of dealing with u16 surrogates is going to make the "fast case" exceptionally slow.
Moving everything to u32 is going to consume way too much memory such as to only do the conversion once. (u8u32 instead of u8u16 then u16u32).

I think the compromise I was going to go for is simply checking the lead surrogate range in the stream write.

german-one · 2020-01-08T22:15:43Z

@miniksa OK no problem. I just wanted to note that we are ready for that conversation, too.

zadjii-msft · 2020-01-09T16:54:09Z

@msftbot make sure @miniksa signs off on this

ghost · 2020-01-09T16:54:14Z

Hello @zadjii-msft!

Because you've given me some instructions on how to help merge this pull request, I'll be modifying my merge approach. Here's how I understand your requirements for merging this pull request:

I'll only merge this pull request if it's approved by @miniksa

If this doesn't seem right to you, you can tell me to cancel these instructions and use the auto-merge policy that has been configured for this repository. Try telling me "forget everything I just told you".

zadjii-msft · 2020-01-09T16:54:47Z

@msftbot merge this in 168 hours

(this is to make sure this doesn't automerge before the 0.8 build snap. The rest of team can override me on this if they want)

ghost · 2020-01-09T16:54:51Z

Hello @zadjii-msft!

Because you've given me some instructions on how to help merge this pull request, I'll be modifying my merge approach. Here's how I understand your requirements for merging this pull request:

I'll only merge this pull request if it's approved by @miniksa
I won't merge this pull request until after the UTC date Thu, 16 Jan 2020 16:54:50 GMT, which is in 7 days

If this doesn't seem right to you, you can tell me to cancel these instructions and use the auto-merge policy that has been configured for this repository. Try telling me "forget everything I just told you".

zadjii-msft

Frankly, I don't totally understand u8/u16 conversion, but I trust that it's just bit magic like this. I think I'm okay with this. I got a little confused by the U8U16Test project at first, but I think I get it now. My comments aren't totally blocking IMO. I'm excited about the potential perf gains we can get from applying this in other places as well.

Thanks for the hard work!

src/tools/U8U16Test/U8U16Test.hpp

src/tools/U8U16Test/main.cpp

… til::u16u8

…uages

…mments which explain why

german-one · 2020-01-18T13:27:59Z

PHEW 😅
I repeated all the previous steps. No idea where I was going wrong before.

FWIW Of course I trust you. And I'm sure you would have managed everything in no time. But I prefer to go the hard way in order to learn something new. Thank you so much for your git lessons and your patience! You've been a great help.

lhecker · 2020-01-18T13:32:18Z

@german-one Nice on! 👍 And... No problem of course. 😄

By the way, since I've seen many who don't know this: If you force-push something to GitHub, it'll show as:

$name force-pushed the $name:$branch branch from $old to $new $age minutes ago

The "force-pushed" part in there is a link and clickable and can be used to quickly ensure what has changed during a rebase by yourself or someone else (if you ever need to review some code).

german-one · 2020-01-18T13:42:46Z

@miniksa Sorry for all that. For now I will definitely refrain from further experiments like this.
Do you want me to file a follow up for incorporating the chromium safe math here? Maybe I should wait until we have it in the til namespace anyway.

miniksa · 2020-01-18T15:57:26Z

@miniksa Sorry for all that. For now I will definitely refrain from further experiments like this.
Do you want me to file a follow up for incorporating the chromium safe math here? Maybe I should wait until we have it in the til namespace anyway.

@german-one do not be sorry at all. Do experiments as you wish. I definitely learned plenty from this one and you saved me the time of figuring this out myself one day in the future. I really appreciate it even if it will turn out more limited than we initially estimated. Your effort was extremely worthwhile.

Maybe file another task for safe math because @skyline75489 is still helping me solve that experiment in another PR.

german-one · 2020-01-18T16:55:37Z

I filed #4290 for the safe math.

@german-one do not be sorry at all. Do experiments as you wish.

That was rather related to my recent git experiments 😆 You probably didn't see the havoc I caused when I tried to update my fork with all the PRs that landed meanwhile. @lhecker helped a lot to sort this out.

But yea, I tried hard to never leave you in the dark. This led the PR becoming a long thread of comments and additional code, but my intention was to give you the opportunity to follow all of the steps.

german-one · 2020-01-22T16:43:39Z

@miniksa The point that I was not able to address yet is whether or not the files should be moved into the src\inc\til folder. Your concern was that the implementation is in a .cpp file which doesn't fit well. My idea is to move them into src\til and add Utf8Utf16Convert.hpp to the includes in src\inc\til.h. What do you think?

miniksa · 2020-01-22T22:16:50Z

@miniksa Michael Niksa FTE The point that I was not able to address yet is whether or not the files should be moved into the src\inc\til folder. Your concern was that the implementation is in a .cpp file which doesn't fit well. My idea is to move them into src\til and add Utf8Utf16Convert.hpp to the includes in src\inc\til.h. What do you think?

@DHowett-MSFT, if it is all in the header file, the linker will take care of it, right? We don't technically need it in a LIB/CPP?

german-one · 2020-01-26T00:43:45Z

@miniksa I made it all templates in the header and moved it into ./src/inc/til/. It looks a little clumsy with the enable_ifs though. Do you have any better idea?

FWIW I can't tell why the x86 build fails. Seems to be unrelated to these updates.

miniksa · 2020-01-27T18:23:49Z

@miniksa Michael Niksa FTE I made it all templates in the header and moved it into ./src/inc/til/. It looks a little clumsy with the enable_ifs though. Do you have any better idea?

FWIW I can't tell why the x86 build fails. Seems to be unrelated to these updates.

Looks like x86 succeeded this time.

miniksa · 2020-01-27T18:25:36Z

@miniksa Michael Niksa FTE I made it all templates in the header and moved it into ./src/inc/til/. It looks a little clumsy with the enable_ifs though. Do you have any better idea?

Also, I don't have a better idea. But @DHowett-MSFT might. However, slightly clumsy templates are better than slightly clumsy other-code. So I'm inclined to call this acceptable because it should make the rest of the code look very nice and clean.

miniksa

This looks good to me. I'd still like @DHowett-MSFT to see it before we commit it, though.

german-one · 2020-01-27T19:43:13Z

Thanks @miniksa !

@DHowett-MSFT I guess you are afraid of the 2.2 k of new lines of code to review 😉 Don't worry, it's not that bad. It's rather only the ~450 lines in u8u16convert.h. Most of the rest belongs to the testing tool.

DHowett-MSFT

I love this. Thank you. The templates look great!

Replace `utf8Parser` with `til::u8u16` in order to have the same conversion algorithms used in terminal and conhost. This PR addresses item 2 in this list: 1. ✉ Implement `til::u8u16` and `til::u16u8` (done in PR #4093) 2. ✔ **Unify UTF-8 handling using `til::u8u16` (this PR)** 2.1. ✔ **Update VtInputThread::_HandleRunInput()** 2.2. ✔ **Update ApiRoutines::WriteConsoleAImpl()** 2.3. ❌ (optional / ask the core team) Remove Utf8ToWideCharParser from the code base to avoid further use 3. ❌ Enable BOM discarding (follow up) 3.1. ❌ extend `til::u8u16` and `til::u16u8` with a 3rd parameter to enable discarding the BOM 3.2. ❌ Make use of the 3rd parameter to discard the BOM in all current function callers, or (optional / ask the core team) make it the default for `til::u8u16` and `til::u16u8` 4. ❌ Find UTF-16 to UTF-8 conversions and examine if they can be unified, too (follow up) Closes #4086 Closes #3378

DHowett-MSFT · 2020-04-08T19:59:46Z

🎉 Once again, thanks for the contribution!

This pull request was included in a set of conhost changes that was just
released with Windows Insider Build 19603.

This was referenced Jan 1, 2020

Reconcile UTF-8 behavior in utf8ToWideCharParser.cpp #3378

Closed

Certain invalid UTF-8 sequences can cause the output to fail #4086

Closed

miniksa reviewed Jan 3, 2020

View reviewed changes

zadjii-msft added Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Product-Meta The product is the management of the products. labels Jan 6, 2020

miniksa reviewed Jan 7, 2020

View reviewed changes

zadjii-msft mentioned this pull request Jan 8, 2020

Add support for commandline args to wt.exe #4023

Merged

5 tasks

ghost added the AutoMerge Marked for automatic merge by the bot when requirements are met label Jan 9, 2020

zadjii-msft approved these changes Jan 9, 2020

View reviewed changes

src/tools/U8U16Test/U8U16Test.hpp Show resolved Hide resolved

src/tools/U8U16Test/main.cpp Show resolved Hide resolved

src/tools/U8U16Test/main.cpp Show resolved Hide resolved

german-one added 6 commits January 18, 2020 14:12

pass formatting check

1aa0237

implement special treatment for ASCII code points into til::u8u16 and…

81cd3bb

… til::u16u8

pass static analysis

ba738ff

update performance tests - implement chunk conversion of natural lang…

bb674df

…uages

replace own conversion algorithms with platform API functions, add co…

3bebd50

…mments which explain why

PR feedback

cd54f74

add attribution to sample text files

95ba35e

german-one mentioned this pull request Jan 18, 2020

Replace intsafe functions with chromium safe math in Utf8Utf16Convert.cpp #4290

Closed

use templates to make it header-only, move to ./src/inc/til/

cc469a6

update comments

83fd45f

miniksa approved these changes Jan 27, 2020

View reviewed changes

DHowett-MSFT approved these changes Jan 30, 2020

View reviewed changes

DHowett-MSFT merged commit 32ea419 into microsoft:master Jan 30, 2020

german-one mentioned this pull request Jan 31, 2020

Unify UTF-8 handling using til::u8u16 & revise WriteConsoleAImpl #4422

Merged

5 tasks

german-one mentioned this pull request Jul 21, 2021

Release unneeded memory more eagerly from conhost #10738

Merged

3 tasks

lhecker mentioned this pull request Jul 24, 2024

Add support for markdown -> XAML parsing #17585

Merged

thaliaarchi mentioned this pull request Oct 16, 2025

[stdio][windows] Use MBTWC and WCTMB rust-lang/rust#107110

Merged

Implement til::u8u16 and til::u16u8 conversion functions #4093

Implement til::u8u16 and til::u16u8 conversion functions #4093

Uh oh!

Conversation

german-one commented Dec 31, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary of the Pull Request

PR Checklist

Detailed Description of the Pull Request / Additional comments

Validation Steps Performed

Uh oh!

german-one commented Dec 31, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

miniksa commented Jan 3, 2020

Uh oh!

german-one commented Jan 3, 2020

Uh oh!

miniksa left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

german-one commented Jan 4, 2020

Uh oh!

german-one commented Jan 5, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

miniksa commented Jan 6, 2020

Uh oh!

german-one commented Jan 7, 2020

Uh oh!

fcharlie commented Jan 7, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

german-one commented Jan 7, 2020

Uh oh!

fcharlie commented Jan 7, 2020

Uh oh!

german-one commented Jan 7, 2020

Uh oh!

german-one commented Jan 7, 2020

Uh oh!

miniksa left a comment

Choose a reason for hiding this comment

Uh oh!

german-one commented Jan 7, 2020

Uh oh!

fcharlie commented Jan 8, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

german-one commented Jan 8, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

miniksa commented Jan 8, 2020

Uh oh!

german-one commented Jan 8, 2020

Uh oh!

zadjii-msft commented Jan 9, 2020

Uh oh!

ghost commented Jan 9, 2020

Uh oh!

zadjii-msft commented Jan 9, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ghost commented Jan 9, 2020

Uh oh!

zadjii-msft left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

german-one commented Jan 18, 2020

Uh oh!

lhecker commented Jan 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

german-one commented Dec 31, 2019 •

edited

Loading

german-one commented Dec 31, 2019 •

edited

Loading

german-one commented Jan 5, 2020 •

edited

Loading

fcharlie commented Jan 7, 2020 •

edited

Loading

fcharlie commented Jan 8, 2020 •

edited

Loading

german-one commented Jan 8, 2020 •

edited

Loading

zadjii-msft commented Jan 9, 2020 •

edited

Loading

lhecker commented Jan 18, 2020 •

edited

Loading

german-one commented Jan 18, 2020 •

edited

Loading

german-one commented Jan 22, 2020 •

edited

Loading