Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non-ascii in fgVNBasedIntrinsicExpansionForCall_ReadUtf8 #89383

Merged
merged 10 commits into from
Jul 25, 2023

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Jul 24, 2023

Jakob pointed me to WszWideCharToMultiByte API that seems to be enough to support non-ASCII values here too.
Closes #89375

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 24, 2023
@ghost ghost assigned EgorBo Jul 24, 2023
@ghost
Copy link

ghost commented Jul 24, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Jakob pointed me to WszWideCharToMultiByte API that seems to be enough to support non-ASCII values here too.
Closes #89375

Author: EgorBo
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@EgorBo EgorBo marked this pull request as ready for review July 24, 2023 13:46
Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks :)

@EgorBo
Copy link
Member Author

EgorBo commented Jul 24, 2023

Wrote a test to validate WszWideCharToMultiByte behaviour: https://gist.github.com/EgorBo/f65f1fd468a8644d33ac5224099d84d6 to match BCL's conversion

@stephentoub
Copy link
Member

What does WszWideCharToMultiByte do for invalid input?

@EgorBo
Copy link
Member Author

EgorBo commented Jul 24, 2023

What does WszWideCharToMultiByte do for invalid input?

Can you provide an example - I'm a bit confused on what's expected to be done with it

@stephentoub
Copy link
Member

Can you provide an example - I'm a bit confused on what's expected to be done with it

For example, this will throw an exception:

new UTF8Encoding(false, throwOnInvalidBytes: true).GetBytes("\uD800b")

because \uD800 is a high surrogate and it's not followed by a low surrogate.

Ideally the JIT simply wouldn't do anything if it detected the text was invalid and would just use the actual implementation of ReadUtf8, which would let UTF8Encoding do with it what it will, and you wouldn't need to match the handling for erroneous input.

@EgorBo
Copy link
Member Author

EgorBo commented Jul 24, 2023

Thanks, it seems that it still returns the same value as the BCL version, there is a WC_ERR_INVALID_CHARS flag for WszWideCharToMultiByte that makes it to return 0 for such input and forward it to the managed impl but not sure it's supported outside of Windows

I was also pointed to a minipal version for utf16-utf8 encoding (minipal_convert_utf16_to_utf8) that is used for evenstream, perhaps, I should use that instead? (+ extra size to clrjit I guess) cc @jkotas @elinor-fung (from #89036)

@am11
Copy link
Member

am11 commented Jul 24, 2023

WszWideCharToMultiByte is (one of many) wrapper(s) around WideCharToMultiByte from coreclr/pal, which in turn is a wrapper around minipal/utf8. I think the consensus with @jkotas in #85558 was that we remove these PAL wrappers and directly use minipal APIs in both coreclr and mono. But as it stands, all native code in runtime repo is using the same minipal APIs for utf16<->utf8 conversion.

ps [lowkey]: there is a room to vectorize that minipal/utf8 C implementation, matching the C# one, to accelerate both runtimes paths which frequently convert strings.

@EgorBo
Copy link
Member Author

EgorBo commented Jul 24, 2023

I'm no expert in cmake, I tried to add utf8.c from minipal to jit sources and I've got:

  C:\Program Files\Microsoft Visual Studio\2022\Preview\VC\Tools\MSVC\14.37.32820\include\yvals_core.h(28): STL1003: Unexpected compiler, expected C++ compiler.
  C:\Program Files\Microsoft Visual Studio\2022\Preview\VC\Tools\MSVC\14.37.32820\include\yvals_core.h(29): fatal error C1189: #error:  Error in C++ Standard Library usage

🙁 how do I include a .c file to C++ codebase?

@am11
Copy link
Member

am11 commented Jul 24, 2023

@EgorBo, seems like in "jit/static", we need to disable PCH for C sources. With this patch, build succeeds with VS 2022 (17.6.3):

diff --git a/src/coreclr/jit/CMakeLists.txt b/src/coreclr/jit/CMakeLists.txt
index d00b5b27fe5..daaf1aa1e38 100644
--- a/src/coreclr/jit/CMakeLists.txt
+++ b/src/coreclr/jit/CMakeLists.txt
@@ -176,6 +176,7 @@ set( JIT_SOURCES
   unwind.cpp
   utils.cpp
   valuenum.cpp
+  ${CLR_SRC_NATIVE_DIR}/minipal/utf8.c
 )

 if (CLR_CMAKE_TARGET_WIN32)
diff --git a/src/coreclr/jit/compiler.h b/src/coreclr/jit/compiler.h
index e752044acf8..38664a91e66 100644
--- a/src/coreclr/jit/compiler.h
+++ b/src/coreclr/jit/compiler.h
@@ -48,6 +48,8 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 #include "disasm.h"
 #endif

+#include <minipal/utf8.h>
+
 #include "codegeninterface.h"
 #include "regset.h"
 #include "jitgcinfo.h"
@@ -136,6 +138,7 @@ const int BAD_STK_OFFS = 0xBAADF00D; // for LclVarDsc::lvStkOffs
 //------------------------------------------------------------------------
 inline bool IsHfa(CorInfoHFAElemType kind)
 {
+    size_t test = minipal_get_length_utf8_to_utf16 (0, 0, 0);
     return kind != CORINFO_HFA_ELEM_NONE;
 }
 inline var_types HfaTypeFromElemKind(CorInfoHFAElemType kind)
diff --git a/src/coreclr/jit/static/CMakeLists.txt b/src/coreclr/jit/static/CMakeLists.txt
index 99ae15963b5..5a8052765cc 100644
--- a/src/coreclr/jit/static/CMakeLists.txt
+++ b/src/coreclr/jit/static/CMakeLists.txt
@@ -2,6 +2,10 @@ project(ClrJit)

 set_source_files_properties(${JIT_EXPORTS_FILE} PROPERTIES GENERATED TRUE)

+if(CLR_CMAKE_HOST_WIN32)
+set_source_files_properties("${CLR_SRC_NATIVE_DIR}/minipal/utf8.c" PROPERTIES SKIP_PRECOMPILE_HEADERS ON)
+endif(CLR_CMAKE_HOST_WIN32)
+
 add_library_clr(clrjit_obj
     OBJECT
     ${JIT_SOURCES}

@jakobbotsch
Copy link
Member

Can we just stick to WszWideCharToMultiByte that is already used in other places in the JIT? We can convert all of them to use minipal as a separate PR.

@EgorBo
Copy link
Member Author

EgorBo commented Jul 25, 2023

@EgorBo, seems like in "jit/static", we need to disable PCH for C sources. With this patch, build succeeds with VS 2022 (17.6.3):

diff --git a/src/coreclr/jit/CMakeLists.txt b/src/coreclr/jit/CMakeLists.txt
index d00b5b27fe5..daaf1aa1e38 100644
--- a/src/coreclr/jit/CMakeLists.txt
+++ b/src/coreclr/jit/CMakeLists.txt
@@ -176,6 +176,7 @@ set( JIT_SOURCES
   unwind.cpp
   utils.cpp
   valuenum.cpp
+  ${CLR_SRC_NATIVE_DIR}/minipal/utf8.c
 )

 if (CLR_CMAKE_TARGET_WIN32)
diff --git a/src/coreclr/jit/compiler.h b/src/coreclr/jit/compiler.h
index e752044acf8..38664a91e66 100644
--- a/src/coreclr/jit/compiler.h
+++ b/src/coreclr/jit/compiler.h
@@ -48,6 +48,8 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 #include "disasm.h"
 #endif

+#include <minipal/utf8.h>
+
 #include "codegeninterface.h"
 #include "regset.h"
 #include "jitgcinfo.h"
@@ -136,6 +138,7 @@ const int BAD_STK_OFFS = 0xBAADF00D; // for LclVarDsc::lvStkOffs
 //------------------------------------------------------------------------
 inline bool IsHfa(CorInfoHFAElemType kind)
 {
+    size_t test = minipal_get_length_utf8_to_utf16 (0, 0, 0);
     return kind != CORINFO_HFA_ELEM_NONE;
 }
 inline var_types HfaTypeFromElemKind(CorInfoHFAElemType kind)
diff --git a/src/coreclr/jit/static/CMakeLists.txt b/src/coreclr/jit/static/CMakeLists.txt
index 99ae15963b5..5a8052765cc 100644
--- a/src/coreclr/jit/static/CMakeLists.txt
+++ b/src/coreclr/jit/static/CMakeLists.txt
@@ -2,6 +2,10 @@ project(ClrJit)

 set_source_files_properties(${JIT_EXPORTS_FILE} PROPERTIES GENERATED TRUE)

+if(CLR_CMAKE_HOST_WIN32)
+set_source_files_properties("${CLR_SRC_NATIVE_DIR}/minipal/utf8.c" PROPERTIES SKIP_PRECOMPILE_HEADERS ON)
+endif(CLR_CMAKE_HOST_WIN32)
+
 add_library_clr(clrjit_obj
     OBJECT
     ${JIT_SOURCES}

Thank you for educting me on this part! Agree on the point that since we already use it I'll delay that to a separate pr

@jkotas
Copy link
Member

jkotas commented Jul 25, 2023

Thanks, it seems that it still returns the same value as the BCL version

Do we need tests for this?

Handling of invalid UTF8/UTF16 sequences is always a mess. I would be surprised if there are no subtle behavior differences between the managed Utf8 encoding implementation, the minipal Utf8 encoding implementation, and the Windows OS provided Utf8 encoding implementation that WszWideCharToMultiByte ends up calling on Windows.

@jkotas
Copy link
Member

jkotas commented Jul 25, 2023

how do I include a .c file to C++ codebase?

I think you would want to add utf8.c to https://github.com/dotnet/runtime/blob/main/src/coreclr/minipal/Windows/CMakeLists.txt. (Adding it to JIT sources would collide with utf8.c that is part of Win32 PAL that the JIT still links with.)

Can we just stick to WszWideCharToMultiByte that is already used in other places in the JIT? We can convert all of them to use minipal as a separate PR.

WszWideCharToMultiByte has different underlying implementation on Windows vs. Unix, so it will expose you to OS-specific bugs. The existing uses of WszWideCharToMultiByte do not influence codegen directly. Something to consider.

@am11
Copy link
Member

am11 commented Jul 25, 2023

I would be surprised if there are no subtle behavior differences between the managed Utf8 encoding implementation, the minipal Utf8 encoding implementation, and the Windows OS provided Utf8 encoding implementation that WszWideCharToMultiByte ends up calling on Windows.

Mono uses giconv->minipal on both Windows and Unix alike. Perhaps this is something we can consider for coreclr in the future, as a step to mitigate these differences?

Handling of invalid characters between minipal and managed implementation is pretty much the same (if not exactly the same). There are PAL and COM tests exercising the invalid codepoints; one slightest change in behavior and they fail. (during C++ to C conversion, I had to debug them quite a bit to achieve the precise behavior the tests were expecting; so if there still are differences in error modes, they should be minimum as a result of oversight / test gap)

@jkotas
Copy link
Member

jkotas commented Jul 25, 2023

To fix the stdbool build break, add a dummy stdbool.h to runtime\src\coreclr\pal\inc\rt\cpp, similar to other std*.h that are already there.

@EgorBo
Copy link
Member Author

EgorBo commented Jul 25, 2023

To fix the stdbool build break, add a dummy stdbool.h to runtime\src\coreclr\pal\inc\rt\cpp, similar to other std*.h that are already there.

Thanks, that works!

@EgorBo
Copy link
Member Author

EgorBo commented Jul 25, 2023

@jkotas does it look good now?

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@EgorBo EgorBo merged commit c33b304 into dotnet:main Jul 25, 2023
@EgorBo EgorBo deleted the utf16-utf8-non-ascii branch July 25, 2023 17:59
@ghost ghost locked as resolved and limited conversation to collaborators Aug 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fgVNBasedIntrinsicExpansionForCall_ReadUtf8 supports only ASCII
5 participants