Skip to content

Conversation

@german-one
Copy link
Contributor

@german-one german-one commented Dec 31, 2019

Summary of the Pull Request

  • Remove UTF8OutputPipeReader, move the pipe reading back to ConptyConnection::_OutputThread().
  • Implement UTF-8 <--> UTF-16 conversion in user mode. Enable to toggle between ignoring invalid UTF-8 and replacing it with U+FFFD. See Reconcile UTF-8 behavior in utf8ToWideCharParser.cpp #3378
  • Implement a re-usable partials handling.

PR Checklist

  • Closes Implement UTF-8 <--> UTF-16 conversion in user mode. #4092
  • CLA signed. If not, go over here and sign the CLA
  • Tests added/passed
  • Requires documentation to be updated
  • I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

Detailed Description of the Pull Request / Additional comments

On my list:

  1. ✔ Implement UTF-8 <--> UTF-16 conversion in user mode (this PR)
    1.1. ✔ Transpose my U8ToU16() and U16ToU8() C --> C++ (obsolet)
    1.2. ✔ Implement functors for partials handling
    1.3. ✔ Implement functors to do both the partials handling and the conversion task at once
    1.4. ✔ Supersede Utf8OutPipeReader and remove it from the code base to avoid further use
  2. ❌ Unify UTF-8 handling (supersede Utf8ToWideCharParser)
    2.1. ❌ Update VtInputThread::_HandleRunInput()
    2.2. ❌ Update ApiRoutines::WriteConsoleAImpl()
    2.3. ❌ (optional / ask the core team) Remove Utf8ToWideCharParser from the code base to avoid further use
  3. ❌ Enable BOM discarding
    3.1. ❌ Implement an enum class containing flags for U8ToU16() and U16ToU8() to enable discarding both BOM and/or invalids
    3.2. ❌ Replace the 3rd parameter of U8ToU16(), U16ToU8(), and related functors with the enum and update the function code accordingly
    3.3. ❌ Make use of the 3rd parameter to discard the BOM in all current functor callers, or (optional / ask the core team) make it the default for U8ToU16() and U16ToU8()
  4. ❌ Find UTF-16 to UTF-8 conversions and examine if they can be unified, too

Validation Steps Performed

  • Unit tests implemented.
  • Loads of manual tests to evaluate the behavior of the implemented code.

@german-one
Copy link
Contributor Author

german-one commented Dec 31, 2019

Manual comparisons: https://github.com/german-one/u8u16

  • Note: All tests have been made with the same number (100,000) of code point U+20AC Euro Sign. It takes 3 UTF-8 code units and 1 UTF-16 code unit.
  • Note: The time measured includes the memory allocation (using make_unique for the API functions).
  • Note: All functions were called only once to perform the conversion of the whole string.

  • UTF-8 to UTF-16
MultiByteToWideChar
 length 100000000
 elapsed 2.23501
~~~~~
RtlUTF8ToUnicodeN
 NTSTATUS 0
 length 100000000
 elapsed 2.26018
~~~~~
U8ToU16
 HRESULT 0
 length 100000000
 elapsed 2.38592

  • UTF-16 to UTF-8
WideCharToMultiByte
 length 300000000
 elapsed 1.45804
~~~~~
RtlUnicodeToUTF8N
 NTSTATUS 0
 length 300000000
 elapsed 1.65566
~~~~~
U16ToU8
 HRESULT 0
 length 300000000
 elapsed 2.41888
~~~~~

@miniksa
Copy link
Member

miniksa commented Jan 3, 2020

  1. ✔ Implement UTF-8 <--> UTF-16 conversion in user mode (this PR)
    1.1. ✔ Transpose my U8ToU16() and U16ToU8() C --> C++
    1.2. ✔ Implement functors for partials handling
    1.3. ✔ Implement functors to do both the partials handling and the conversion task at once
    1.4. ✔ Supersede Utf8OutPipeReader and remove it from the code base to avoid further use
  2. ❌ Unify UTF-8 handling (supersede Utf8ToWideCharParser)
    2.1. ❌ Update VtInputThread::_HandleRunInput()
    2.2. ❌ Update ApiRoutines::WriteConsoleAImpl()
    2.3. ❌ (optional / ask the core team) Remove Utf8ToWideCharParser from the code base to avoid further use
  3. ❌ Enable BOM discarding
    3.1. ❌ Implement an enum class containing flags for U8ToU16() and U16ToU8() to enable discarding both BOM and/or invalids
    3.2. ❌ Replace the 3rd parameter of U8ToU16(), U16ToU8(), and related functors with the enum and update the function code accordingly
    3.3. ❌ Make use of the 3rd parameter to discard the BOM in all current functor callers, or (optional / ask the core team) make it the default for U8ToU16() and U16ToU8()
  4. ❌ Find UTF-16 to UTF-8 conversions and examine if they can be unified, too

Do you intend to finish all of these in this PR, or is this PR just for 1. and you plan to open more PRs for the rest?

Manual comparisons: german-one/u8u16

You are welcome to submit this utility into our repository as well under the src\tools directory which is where we usually put things that we craft in the course of development to aid our work.

  • Note: All tests have been made with the same number (100,000) of code point U+20AC Euro Sign. It takes 3 UTF-8 code units and 1 UTF-16 code unit.
  • Note: The time measured includes the memory allocation (using make_unique for the API functions).
  • Note: All functions were called only once to perform the conversion of the whole string.
  • I presume you did this with an opt build of your code? I'm surprised that it's looking slower than calling out.
  • Can you do a batch of small conversions as a comparison instead of the one big conversion? The performance gain I'm expecting is removing the overhead of doing all the syscalls on conversion of tiny strings (1-10 characters long). If long strings are comparable and short strings many times are faster, then we're good and it works as I expect.
  • Did you run each test/situation multiple times and average the results?

@german-one
Copy link
Contributor Author

Do you intend to finish all of these in this PR, or is this PR just for 1. and you plan to open more PRs for the rest?

This is for item 1. I'll open a new PR for each of the main items because I'm afraid it's too much for only one PR and I don't want to overwhelm you with the code review. It's already a lot of new code. And I want to get your OK for the hand-rolled conversions first.

You are welcome to submit this utility into our repository as well under the src\tools directory which is where we usually put things that we craft in the course of development to aid our work.

Thanks for the hint! I'll do so.

I presume you did this with an opt build of your code? I'm surprised that it's looking slower than calling out.

Yes. And don't look at the absolute figures. I made the tests on my little Netbook which is slow as hell. But comfortable for coding on the couch 😁

Can you do a batch of small conversions as a comparison instead of the one big conversion? The performance gain I'm expecting is removing the overhead of doing all the syscalls on conversion of tiny strings (1-10 characters long). If long strings are comparable and short strings many times are faster, then we're good and it works as I expect.

OK, coming soon.

Did you run each test/situation multiple times and average the results?

Yes but the example I posted above is just a typical single result. They don't differ much.

Copy link
Member

@miniksa miniksa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like where this is going.

My major concern is that there are two big entrypoints for how you want to consume the conversions:

  1. If you want to just consume everything and not deal with partials, you can use the more elegant U8ToU16/U16ToU8 functions.
  2. If you need to deal with partials, you have to stand up the entire class of UTF8ChunkToUTF16Converter and then hold onto it and pass through that. Also it's an extremely verbose name that's a lot to type (and prone to typographical error).

I'd rather it be a more uniform calling pattern and I'd prefer it if it were more elegant like the 1st one as that reminds me more of how things like STD/WIL/GSL implement it.

Finally, this could be a good candidate of things to join our "TIL" namespace that we just created earlier this week.

For example:
std::wstring til::u8u16(std::string_view in, bool discardInvalid = false)
std::wstring til::u8u16(std::string_view in, til::u8state& state, bool discardInvalid = false)
std::string til::u16u8(std::wstring_view in, bool discardInvalid = false)
std::string til::u16u8(std::wstring_view in, til::u16state& state, bool discardInvalid = false)

and the nothrow siblings
[[nodiscard]] HRESULT til::u8u16(std::string_view in, std::wstring& out, bool discardInvalid = false) nothrow
[[nodiscard]] HRESULT til::u8u16(std::string_view in, std::wstring& out, til::u8state& state, bool discardInvalid = false) nothrow
[[nodiscard]] HRESULT til::u16u8(std::wstring_view in, std::string& out, bool discardInvalid = false) nothrow
[[nodiscard]] HRESULT til::u16u8(std::wstring_view in, std::string& out, til::u16state& state, bool discardInvalid = false) nothrow

could be the 8 public entry points into this conversion operation and are effective one-liners for all consumers.

Those who want to preserve partials across boundaries could provide the optional til::u8state or til::u16state which would just be public structs that held up to 3 utf8 leads or 1 utf16 high surrogate as applicable across calls.

til::u8state state;
while (true)
{
   std::array<byte, 4096> buffer;
   DWORD read;
   ... // read the thing
   const auto wideText = til::u8u16({buffer.data(), read}, state, true);
   ... // pass wideText on.
}

For exception-ready code, the returns can go straight into a nice const.
For non-exception ready code, the return can be a code and the output comes out a reference.

@DHowett-MSFT and I were talking about this verbally, Dustin what do you think of that as a quick-spec?

@german-one
Copy link
Contributor Author

I like where this is going.
My major concern is that there are two big entrypoints for how you want to consume the conversions:

If you want to just consume everything and not deal with partials, you can use the more elegant U8ToU16/U16ToU8 functions.
If you need to deal with partials, you have to stand up the entire class of UTF8ChunkToUTF16Converter and then hold onto it and pass through that. Also it's an extremely verbose name that's a lot to type (and prone to typographical error).

I see your point. My intention was to have it all modules that can be combined new if you need a more specialized class in future. But I agree that this might never be the case if I write proper overloads for the conversions like your proposals.

I'd rather it be a more uniform calling pattern and I'd prefer it if it were more elegant like the 1st one as that reminds me more of how things like STD/WIL/GSL implement it.
Finally, this could be a good candidate of things to join our "TIL" namespace that we just created earlier this week.

As I understood "til" is for "Terminal Implementation Library". Is this still OK for code that is intended to get shared with conhost later?

For exception-ready code, the returns can go straight into a nice const.
For non-exception ready code, the return can be a code and the output comes out a reference.

In the throwing code we will loose the information whether or not invalid codepoints have been found. That is probably no problem at all. Just a heads up.

@german-one
Copy link
Contributor Author

german-one commented Jan 5, 2020

I followed @miniksa 's suggestion to commit U8U16Test to the tools. There is comment TEST TOOL U8U16Test at the beginning of these files that hopefully makes it easier for you to ignore them in the code reviews.

Furthermore I found and removed a game stopper in the code. I thought I was smart and called a .shrink_to_fit() at the end of the main conversion functions to save memory space. That didn't have any big influence as long as the whole string was converted at once. But in subsequent calls of the functions it caused an huge loss of performance.

I still owe you the test results of the subsequent conversions of small strings.
The ..._WholeString tests still show the elapsed time of the conversion of the whole string at once.
The ..._Chuncs tests convert the same amount of code points in total but only 10 at a time.
(Ignore the ignore me value 😉 This is just a randomly picked character in order to ensure the compiler doesn't opt out the actual string creation.)

### UTF-8 To UTF-16 ###

~~~
test "MultiByteToWideChar_WholeString"
 ignore me 8364
 length 100000000
 elapsed 2.26513

~~~
test "RtlUTF8ToUnicodeN_WholeString"
 ignore me 8364
 NTSTATUS 0
 length 100000000
 elapsed 2.30559

~~~
test "u8u16_WholeString"
 ignore me 8364
 HRESULT 0
 length 100000000
 elapsed 2.03403

~~~
test "MultiByteToWideChar_Chunks"
 ignore me 8364
 length 100000000
 elapsed 5.43452

~~~
test "RtlUTF8ToUnicodeN_Chunks"
 ignore me 8364
 NTSTATUS 0
 length 100000000
 elapsed 5.12312

~~~
test "u8u16_Chunks"
 ignore me 8364
 HRESULT 0
 length 100000000
 elapsed 3.79081
### UTF-16 To UTF-8 ###

~~~
test "WideCharToMultiByte_WholeString"
 ignore me 130
 length 300000000
 elapsed 1.61061

~~~
test "RtlUnicodeToUTF8N_WholeString"
 ignore me 172
 NTSTATUS 0
 length 300000000
 elapsed 1.60669

~~~
test "u16u8_WholeString"
 ignore me 172
 HRESULT 0
 length 300000000
 elapsed 2.55012

~~~
test "WideCharToMultiByte_Chunks"
 ignore me 172
 length 300000000
 elapsed 4.40411

~~~
test "RtlUnicodeToUTF8N_Chunks"
 ignore me 130
 NTSTATUS 0
 length 300000000
 elapsed 3.84753

~~~
test "u16u8_Chunks"
 ignore me 226
 HRESULT 0
 length 300000000
 elapsed 3.94672

EDIT: There was a quirk in my test loop. Updated the figures above.

I think I sorted it all out to be ready for the next code review.

@zadjii-msft zadjii-msft added Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Product-Meta The product is the management of the products. labels Jan 6, 2020
@miniksa
Copy link
Member

miniksa commented Jan 6, 2020

I'd rather it be a more uniform calling pattern and I'd prefer it if it were more elegant like the 1st one as that reminds me more of how things like STD/WIL/GSL implement it.
Finally, this could be a good candidate of things to join our "TIL" namespace that we just created earlier this week.

As I understood "til" is for "Terminal Implementation Library". Is this still OK for code that is intended to get shared with conhost later?

Yes, that's fine.

For exception-ready code, the returns can go straight into a nice const.
For non-exception ready code, the return can be a code and the output comes out a reference.

In the throwing code we will loose the information whether or not invalid codepoints have been found. That is probably no problem at all. Just a heads up.

Lose what information specifically? The "whether or not there was an invalid codepoint"?

As long as it doesn't look like it will influence the way that we write code (that is, the point of the throwing ones is so we can write const auto u8 = til::u16u8(u16, true); and be reasonably assured we're getting what we need into a const std::string... then I'm good with it.

(... pretend this is a quote of the testing results ...)

The u8u16 ones look great. Excellent.
I'm a bit concerned about the u16u8 whole string performance loss. The chunks look comparable, though. Do we know what causes that?

@german-one
Copy link
Contributor Author

Yes, that's fine.

OK, I'll move the code into the til namespace then.

Lose what information specifically? The "whether or not there was an invalid codepoint"?

As long as it doesn't look like it will influence the way that we write code (that is, the point of the throwing ones is so we can write const auto u8 = til::u16u8(u16, true); and be reasonably assured we're getting what we need into a const std::string... then I'm good with it.

The true is for ignoring invalid codepoints. They are not replaced with U+FFFD and just left out. The non-throwing ones would still return S_FALSE if an invalid character has been found. But it wouldn't make sense to throw in this case, would it?

The u8u16 ones look great. Excellent.
I'm a bit concerned about the u16u8 whole string performance loss. The chunks look comparable, though. Do we know what causes that?

Unfortunately I don't know why. Out of my experiences I can tell that iterations in steps of 16 bits is slower than steps of 32 bits. So my assumption is that the API functions process the string in steps of register size. That might be the reason why WideCharToMultiByte is noticeable faster than MultiByteToWideChar.

@fcharlie
Copy link
Contributor

fcharlie commented Jan 7, 2020

@miniksa @german-one I think the reason why u16u8 is slow may be caused by push_back, push_back still needs to check whether the buffer is large enough, and each call still needs to append NULL truncated string

One solution is to resize the temporary std :: string to 3 * wstr.size () and then use str.data () (non const) directly as the buffer, and then use resize to correct the string length .

This solution may consume more memory.

There are related articles here for reference to achieve faster UTF-16 <-> UTF-8 conversion.

CppCon 2018: Bob Steagall “Fast Conversion From UTF-8 with C++, DFAs, and SSE Intrinsics”

UTF-8 processing using SIMD (SSE4)

https://github.com/cyb70289/utf8

@german-one
Copy link
Contributor Author

One solution is to resize the temporary std :: string to 3 * wstr.size () and then use str.data () (non const) directly as the buffer, and then use resize to correct the string length .

I'll perform some tests with your proposal tomorrow. Thanks!

This solution may consume more memory.

I don't think so. I already use .reserve() in the current code to allocate memory for three times as many code units. This was a performance improvement of ~30%.

@fcharlie
Copy link
Contributor

fcharlie commented Jan 7, 2020

@german-one Using pointers directly must ensure that the buffer is large enough. In addition, I suggest you use explicit types such as char32_t char16_t in your code.

@german-one
Copy link
Contributor Author

@german-one Using pointers directly must ensure that the buffer is large enough.

Of course. Factor 3 is the worst case. so that's good enough.

In addition, I suggest you use explicit types such as char32_t char16_t in your code.

No. Those are least types. No idea how those types came into the standard. We need types with exact size. And I prefer to use the value_type of the wstring because that's what we compose out of the values.

@german-one
Copy link
Contributor Author

@fcharlie I added pointer versions to the test tool. The outcome is surprising and somehow unexpected. While using pointers works great for the UTF-16 to UTF-8 conversion, it makes things worse in the UTF-8 to UTF-16 conversion. That's how the results look like:

### UTF-8 To UTF-16 ###

~~~
test "MultiByteToWideChar_WholeString"
 ignore me 8364
 length 100000000
 elapsed 2.26531

~~~
test "RtlUTF8ToUnicodeN_WholeString"
 ignore me 8364
 NTSTATUS 0
 length 100000000
 elapsed 2.27432

~~~
test "u8u16_WholeString"
 ignore me 8364
 HRESULT 0
 length 100000000
 elapsed 2.04221

~~~
test "u8u16_ptr_WholeString"
 ignore me 8364
 HRESULT 0
 length 100000000
 elapsed 2.41866

~~~
test "MultiByteToWideChar_Chunks"
 ignore me 8364
 length 100000000
 elapsed 5.4539

~~~
test "RtlUTF8ToUnicodeN_Chunks"
 ignore me 8364
 NTSTATUS 0
 length 100000000
 elapsed 5.11386

~~~
test "u8u16_Chunks"
 ignore me 8364
 HRESULT 0
 length 100000000
 elapsed 3.79609

~~~
test "u8u16_ptr_Chunks"
 ignore me 8364
 HRESULT 0
 length 100000000
 elapsed 4.10265
### UTF-16 To UTF-8 ###

~~~
test "WideCharToMultiByte_WholeString"
 ignore me 130
 length 300000000
 elapsed 1.60714

~~~
test "RtlUnicodeToUTF8N_WholeString"
 ignore me 130
 NTSTATUS 0
 length 300000000
 elapsed 1.61687

~~~
test "u16u8_WholeString"
 ignore me 226
 HRESULT 0
 length 300000000
 elapsed 2.59101

~~~
test "u16u8_ptr_WholeString"
 ignore me 130
 HRESULT 0
 length 300000000
 elapsed 1.4677

~~~
test "WideCharToMultiByte_Chunks"
 ignore me 226
 length 300000000
 elapsed 4.55823

~~~
test "RtlUnicodeToUTF8N_Chunks"
 ignore me 130
 NTSTATUS 0
 length 300000000
 elapsed 4.08254

~~~
test "u16u8_Chunks"
 ignore me 130
 HRESULT 0
 length 300000000
 elapsed 3.93425

~~~
test "u16u8_ptr_Chunks"
 ignore me 130
 HRESULT 0
 length 300000000
 elapsed 3.64682

@miniksa Do you think I should commit the pointer version for til::u16u8?

Copy link
Member

@miniksa miniksa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@miniksa Michael Niksa FTE Do you think I should commit the pointer version for til::u16u8?

As long as all the pointers are managed internally to the function and you have sufficient bounds checking, then yes. We should do whatever it takes to squeeze max performance out of these little functions. Leave the comment behind explaining the decision, though, inside the code.

@german-one
Copy link
Contributor Author

@miniksa Thanks for your guidance. One UTF-16 code unit can become three UTF-8 code units in the worst case. The string::resize() throws if it fails which is cought. I left comments why pointers are used and suppressed warnings in this code area for the code analysis.

@fcharlie
Copy link
Contributor

fcharlie commented Jan 8, 2020

@german-one I took a look at your test code. The variable testU16 contains only the code point u20AC. I suggest that you use more diverse test data so that it can be more in line with the actual situation.

The benchmark difference may also be caused by branch prediction, caching, etc. Excessive if-else may be slow, and there is an article about this problem: A Branchless UTF-8 Decoder
So we usually use the lookup table method to get the string length of UTF-8 Unicode code points, which is available in the LLVM source code: trailingBytesForUTF8 (this is also standard practice at unicode.org).

u8-> u16 encoding conversion, many people have done research, you can go to see their article or source code (reactOS source code based on GPLv2 it is necessary to check), maybe a better idea.

In addition, you can try to use SIMD and other instructions to achieve encoding conversion.

@german-one
Copy link
Contributor Author

german-one commented Jan 8, 2020

@fcharlie I already had a look at the links you previously posted and other examples. And I took in consideration to use lookup tables to avoid unnecessary branching and expensive calculations. Now that the code is moved into the til namespace I have better options to outsource auxiliary code into a details or internal sub-namespace. Also I agree with you that more tests are necessary, especial using different natural languages. Right now it's just a quick assessment to get an idea how it behaves compared with the API functions and if it would even make sense to implement an own conversion. And as it seems we would already gain a big benefit with the the current u8u16 function. Consider what we are still doing with the API functions. E.g. call it once to figure out if it fails and then crawl manually over the string to remove invalid characters. Or, call it once to find out how much memory space is required and then run it a second time to perform the actual conversion.

@miniksa I read about difficulties you will probably get to determine string lengths if it contains surrogate pairs in #780. You may have already seen that the conversions between the UTF encodings always require the calculation of the code point in UTF-32 representation. It would be quite simple to have additional functions that convert from and to a u32string. I could imagine that consuming UTF-32 is much easier because one character is always represented by one UTF-32 code unit. Any thoughts on that?

@miniksa
Copy link
Member

miniksa commented Jan 8, 2020

@german-one, I'm hesitant on the u32 versions. I'm concerned that:

  1. Converting everything further from u16 to u32 for the sake of dealing with u16 surrogates is going to make the "fast case" exceptionally slow.
  2. Moving everything to u32 is going to consume way too much memory such as to only do the conversion once. (u8u32 instead of u8u16 then u16u32).

I think the compromise I was going to go for is simply checking the lead surrogate range in the stream write.

@german-one
Copy link
Contributor Author

@miniksa OK no problem. I just wanted to note that we are ready for that conversation, too.

@zadjii-msft
Copy link
Member

@msftbot make sure @miniksa signs off on this

@ghost ghost added the AutoMerge Marked for automatic merge by the bot when requirements are met label Jan 9, 2020
@ghost
Copy link

ghost commented Jan 9, 2020

Hello @zadjii-msft!

Because you've given me some instructions on how to help merge this pull request, I'll be modifying my merge approach. Here's how I understand your requirements for merging this pull request:

  • I'll only merge this pull request if it's approved by @miniksa

If this doesn't seem right to you, you can tell me to cancel these instructions and use the auto-merge policy that has been configured for this repository. Try telling me "forget everything I just told you".

@zadjii-msft
Copy link
Member

zadjii-msft commented Jan 9, 2020

@msftbot merge this in 168 hours

(this is to make sure this doesn't automerge before the 0.8 build snap. The rest of team can override me on this if they want)

@ghost
Copy link

ghost commented Jan 9, 2020

Hello @zadjii-msft!

Because you've given me some instructions on how to help merge this pull request, I'll be modifying my merge approach. Here's how I understand your requirements for merging this pull request:

  • I'll only merge this pull request if it's approved by @miniksa
  • I won't merge this pull request until after the UTC date Thu, 16 Jan 2020 16:54:50 GMT, which is in 7 days

If this doesn't seem right to you, you can tell me to cancel these instructions and use the auto-merge policy that has been configured for this repository. Try telling me "forget everything I just told you".

Copy link
Member

@zadjii-msft zadjii-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frankly, I don't totally understand u8/u16 conversion, but I trust that it's just bit magic like this. I think I'm okay with this. I got a little confused by the U8U16Test project at first, but I think I get it now. My comments aren't totally blocking IMO. I'm excited about the potential perf gains we can get from applying this in other places as well.

Thanks for the hard work!

@german-one
Copy link
Contributor Author

PHEW 😅
I repeated all the previous steps. No idea where I was going wrong before.

FWIW Of course I trust you. And I'm sure you would have managed everything in no time. But I prefer to go the hard way in order to learn something new. Thank you so much for your git lessons and your patience! You've been a great help.

@lhecker
Copy link
Member

lhecker commented Jan 18, 2020

@german-one Nice on! 👍 And... No problem of course. 😄

By the way, since I've seen many who don't know this: If you force-push something to GitHub, it'll show as:

$name force-pushed the $name:$branch branch from $old to $new $age minutes ago

The "force-pushed" part in there is a link and clickable and can be used to quickly ensure what has changed during a rebase by yourself or someone else (if you ever need to review some code).

@german-one
Copy link
Contributor Author

@miniksa Sorry for all that. For now I will definitely refrain from further experiments like this.
Do you want me to file a follow up for incorporating the chromium safe math here? Maybe I should wait until we have it in the til namespace anyway.

@miniksa
Copy link
Member

miniksa commented Jan 18, 2020

@miniksa Sorry for all that. For now I will definitely refrain from further experiments like this.
Do you want me to file a follow up for incorporating the chromium safe math here? Maybe I should wait until we have it in the til namespace anyway.

@german-one do not be sorry at all. Do experiments as you wish. I definitely learned plenty from this one and you saved me the time of figuring this out myself one day in the future. I really appreciate it even if it will turn out more limited than we initially estimated. Your effort was extremely worthwhile.

Maybe file another task for safe math because @skyline75489 is still helping me solve that experiment in another PR.

@german-one
Copy link
Contributor Author

german-one commented Jan 18, 2020

I filed #4290 for the safe math.

@german-one do not be sorry at all. Do experiments as you wish.

That was rather related to my recent git experiments 😆 You probably didn't see the havoc I caused when I tried to update my fork with all the PRs that landed meanwhile. @lhecker helped a lot to sort this out.

But yea, I tried hard to never leave you in the dark. This led the PR becoming a long thread of comments and additional code, but my intention was to give you the opportunity to follow all of the steps.

@german-one
Copy link
Contributor Author

german-one commented Jan 22, 2020

@miniksa The point that I was not able to address yet is whether or not the files should be moved into the src\inc\til folder. Your concern was that the implementation is in a .cpp file which doesn't fit well. My idea is to move them into src\til and add Utf8Utf16Convert.hpp to the includes in src\inc\til.h. What do you think?

@miniksa
Copy link
Member

miniksa commented Jan 22, 2020

@miniksa Michael Niksa FTE The point that I was not able to address yet is whether or not the files should be moved into the src\inc\til folder. Your concern was that the implementation is in a .cpp file which doesn't fit well. My idea is to move them into src\til and add Utf8Utf16Convert.hpp to the includes in src\inc\til.h. What do you think?

@DHowett-MSFT, if it is all in the header file, the linker will take care of it, right? We don't technically need it in a LIB/CPP?

@german-one
Copy link
Contributor Author

@miniksa I made it all templates in the header and moved it into ./src/inc/til/. It looks a little clumsy with the enable_ifs though. Do you have any better idea?

FWIW I can't tell why the x86 build fails. Seems to be unrelated to these updates.

@miniksa
Copy link
Member

miniksa commented Jan 27, 2020

@miniksa Michael Niksa FTE I made it all templates in the header and moved it into ./src/inc/til/. It looks a little clumsy with the enable_ifs though. Do you have any better idea?

FWIW I can't tell why the x86 build fails. Seems to be unrelated to these updates.

Looks like x86 succeeded this time.

@miniksa
Copy link
Member

miniksa commented Jan 27, 2020

@miniksa Michael Niksa FTE I made it all templates in the header and moved it into ./src/inc/til/. It looks a little clumsy with the enable_ifs though. Do you have any better idea?

Also, I don't have a better idea. But @DHowett-MSFT might. However, slightly clumsy templates are better than slightly clumsy other-code. So I'm inclined to call this acceptable because it should make the rest of the code look very nice and clean.

Copy link
Member

@miniksa miniksa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I'd still like @DHowett-MSFT to see it before we commit it, though.

@german-one
Copy link
Contributor Author

Thanks @miniksa !

@DHowett-MSFT I guess you are afraid of the 2.2 k of new lines of code to review 😉 Don't worry, it's not that bad. It's rather only the ~450 lines in u8u16convert.h. Most of the rest belongs to the testing tool.

Copy link
Contributor

@DHowett-MSFT DHowett-MSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this. Thank you. The templates look great!

@DHowett-MSFT DHowett-MSFT merged commit 32ea419 into microsoft:master Jan 30, 2020
DHowett-MSFT pushed a commit that referenced this pull request Feb 4, 2020
Replace `utf8Parser` with `til::u8u16` in order to have the same
conversion algorithms used in terminal and conhost.

This PR addresses item 2 in this list:
1. ✉ Implement `til::u8u16` and `til::u16u8` (done in PR #4093)
2. ✔ **Unify UTF-8 handling using `til::u8u16` (this PR)**
    2.1. ✔ **Update VtInputThread::_HandleRunInput()**
    2.2. ✔ **Update ApiRoutines::WriteConsoleAImpl()**
    2.3. ❌ (optional / ask the core team) Remove Utf8ToWideCharParser from the code base to avoid further use
3. ❌ Enable BOM discarding (follow up)
    3.1. ❌ extend `til::u8u16` and `til::u16u8` with a 3rd parameter to enable discarding the BOM
    3.2. ❌ Make use of the 3rd parameter to discard the BOM in all current function callers, or (optional / ask the core team) make it the default for  `til::u8u16` and `til::u16u8` 
4. ❌ Find UTF-16 to UTF-8 conversions and examine if they can be unified, too (follow up)

Closes #4086
Closes #3378
@DHowett-MSFT
Copy link
Contributor

🎉 Once again, thanks for the contribution!

This pull request was included in a set of conhost changes that was just
released with Windows Insider Build 19603.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Product-Meta The product is the management of the products.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement UTF-8 <--> UTF-16 conversion in user mode.

6 participants