Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Ported managed Utf8/Unicode encoder/decoder to C++ for usage in PAL #3522

Closed
wants to merge 1 commit into from
Closed

Conversation

wtgodbe
Copy link
Member

@wtgodbe wtgodbe commented Mar 4, 2016

I ported the algorithm from src\mscorlib\src\System\Text\UTF8Encoding.cs to src\pal\src\locale\utf8.cpp, for consistency/performance reasons. Also added a couple of tests. See #1725 for some conversation on this issue.

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 4, 2016

@janvorli @sergiy-k PTAL

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 5, 2016

@dotnet-bot test this please.

@adityamandaleeka
Copy link
Member

I think you need to update the license headers to the new one we're using.

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 5, 2016

@adityamandaleeka which files need to be changed? And where can I find the new header?

@janvorli
Copy link
Member

janvorli commented Mar 7, 2016

@wtgodbe Aditia meant the added test files where there is the old license header. Look e.g. here for the current header:
https://github.com/dotnet/coreclr/blob/master/src/pal/inc/pal.h

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 7, 2016

Alright, made the change to the license headers.

@janvorli
Copy link
Member

janvorli commented Mar 7, 2016

@wtgodbe since github doesn't allow me to comment on the utf8.cpp file, let me put my comments here:

  1. Please don't use STL (string / vector). We don't use it anywhere in the PAL, dynamic memory allocation with "new" operator in the PAL is a problem and I guess the STL would use that. It seems that it should be quite simple to change the code to not to use it.
  2. In the UTF8ToUnicode and UnicodeToUTF8 functions, you swallow the ArgumentException from the GetChars / GetBytes. I would add abort to the catch, since such failure is fatal and means there is a bug in the decoder and the caller would get broken result.
  3. The UTF8ToUnicode and UnicodeToUTF8 functions should consistently set last error in case of all errors, now it happens only for ERROR_INSUFFICIENT_BUFFER.

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 8, 2016

@janvorli Thanks for the suggestions. I've completed the first one, working on the second 2 now.

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 8, 2016

@janvorli how do I know what errors to set last error to?

@janvorli
Copy link
Member

janvorli commented Mar 9, 2016

@wtgodbe looking at MSDN doc for WideCharToMultiByte / MultiByteToWideChar as a source of inspiration, I can see the following errors:
ERROR_INSUFFICIENT_BUFFER. A supplied buffer size was not large enough, or it was incorrectly set to NULL.
ERROR_INVALID_FLAGS. The values supplied for flags were not valid.
ERROR_INVALID_PARAMETER. Any of the parameter values was invalid.
ERROR_NO_UNICODE_TRANSLATION. Invalid Unicode was found in a string.

So we can use these, it looks like translating DecoderFallbackException / EncoderFallbackException to ERROR_NO_UNICODE_TRANSLATION and the ArgumentException to ERROR_INVALID_PARAMETER would make sense.

@janvorli
Copy link
Member

janvorli commented Mar 9, 2016

Now that I look at the change again, It seems we can get rid of the body of the ThrowLastBytesRecursive with the exception of the throw. We don't use or even store the exception messages, so there is no point in carefully constructing it and then effectively throwing it away.

And one nit - it would make sense to add a notice that this is a port of the C# version from mscorlib to the file header.

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 9, 2016

Good point on ThrowLastBytes. Changes have been made.

@janvorli
Copy link
Member

janvorli commented Mar 9, 2016

@wtgodbe There is no need to set the last error before the abort(), the process will exit right away and nothing will be able to extract it.
But I would add catch (const DecoderFallbackException&) and catch (const EncoderFallbackException&) to the try with the GetByteCount and set the last error to ERROR_NO_UNICODE_TRANSLATION.

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 9, 2016

Here's the results of running a performance benchmark before & after my change (ran with Release bits) - looks like there's a very slight performance increase of about 1%, which seems pretty negligible (although the deviation in results for each iteration was small enough to indicate that the change is real and not noise). I let each run for 10 iterations before taking an average.

master - 7.5294 s
utf8 - 7.4741 s

Also @janvorli do you have any idea why the build would be failing on non-windows (although it seems to build fine on FreeBSD, then give silent errors on 2 UTF tests). Looks like the following is the error:

10:43:19 Generating native image for mscorlib.
10:43:22 Microsoft (R) CoreCLR Native Image Generator - Version 4.5.22220.0
10:43:22 Copyright (c) Microsoft Corporation. All rights reserved.
10:43:22
10:43:22 ./build.sh: line 225: 20584 Aborted (core dumped) $__BinDir/crossgen $__BinDir/mscorlib.dll
10:43:22 Failed to generate native image for mscorlib.

@janvorli
Copy link
Member

janvorli commented Mar 9, 2016

@wtgodbe, unrelated of the issues you are seeing, there is a problem with the order in which you catch the exceptions. You need to catch the fallback exceptions before the ArgumentException, since otherwise the fallback ones would always be caught as ArgumentException.

I'm not sure about the crossgen issue - I assume it works fine on your local machine, right?
As for the failing tests on FreeBSD, it seems that the only silent abort would be the one that you have in the catch in the UnicodeToUTF8 / UTF8ToUnicode. Maybe you can add some debug prints there to see.

I've also noticed that you have SetLastError in the DecoderExceptionFallbackBuffer::Fallback / EncoderExceptionFallbackBuffer::Fallback.. Since the whole decoder / encoder is exception based, please remove it from there - we set it in the catch for the exception now, which is a better place.

If the silent failure is the abort in the catch, I wonder if the GetByteCount ever throws the EncoderFallbackException / DecoderFallbackException. Maybe these are thrown just when running the GetBytes.

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 9, 2016

@janvorli it looks like I do have the crossgen failure on my machine, I just didn't notice it before because I was still able to build and run PAL tests. Now that I've added a couple debug print statements, I'm seeing that we hit an argument exception in getChars immediately before the crossgen failure.

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 11, 2016

@janvorli I think I've found the problem - the ArgumentException is getting triggered somewhere with the message "Ran out of buffer space". Looking at the code, it looks like I replaced a call to a function named "ThrowCharsOverflow" in the original code, with an ArgumentException for a char overflow in my port. ThrowCharsOverflow appears to only throw an exception when nothing has been decoded yet (that is, when chars == charStart, which means we haven't advanced our pointer to the src string yet). I'm thinking I should just add an equivalent function to my port, what do you think?

@janvorli
Copy link
Member

@wtgodbe yes, I would keep the port as close to the original code as possible.

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 11, 2016

@janvorli are you sure that we should be aborting on ArgumentExceptions? This means we're aborting when we have insufficient buffer space, which is what's causing the failure in Test 3 for WideCharToMultiByte (and vice-versa)

@janvorli
Copy link
Member

@wtgodbe Ah, I have not realized that the GetBytes have additional failure modes. Then we should handle the exceptions from the GetBytes the same way as we handle the ones from the GetByteCount. Just have a single try / catch there and put both the GetByteCount and GetBytes into it.

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 17, 2016

@dotnet-bot test this please

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 17, 2016

@janvorli is there an alternative to wcslen in PAL for OSX? It looks like the call to wcslen on line 888 in utf8.cpp is returning an unexpectedly high value on OSX, causing the test failure (it returns 15 on OSX there, 3 on Linux for the same string)

@janvorli
Copy link
Member

I guess the problem is that the string returned by fallback->GetDefaultString is not zero terminated. It is just a buffer for up-to 2 wide chars. It seems you need to use the fallback->GetMaxCharCount() instead of the wcslen

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 17, 2016

Should it still be 2 * GetMaxCharCount(), or just GetMaxCharCount()?

@janvorli
Copy link
Member

Hmm, it seems I was wrong, the string returned from the GetDefaultString should be zero terminated after all, unless there is a bug - e.g. trying to assign string longer than 1 character to it.
Btw, this

EncoderReplacementFallback((WCHAR*)L"?")

Should be

EncoderReplacementFallback(W("?"))

instead. It would not cause the issue though.

@janvorli
Copy link
Member

And this is wrong, maybe that's the real issue causing the trouble:

2 * wcslen((const wchar_t *)fallback->GetDefaultString());

PAL doesn't use wchar_t, which is 32 bit char on Unix, but WCHAR which is 16 bit char. So you get a pointer to 16 bit char and cast it to a pointer to 32 bit char. You need to use PAL_wcslen.

@janvorli
Copy link
Member

Or the fallback->GetMaxCharCount(), of course

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 17, 2016

Should I switch just

EncoderReplacementFallback((WCHAR*)L"?")

Or should I also switch

DecoderReplacementFallback((WCHAR*)L"?")

?

@janvorli
Copy link
Member

You should also not have the #include <wchar.h> in the file.
And I also wonder if we need these either:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

@janvorli
Copy link
Member

You shoyuld change all places that use L"xx" to W("xx") and remove the casts.

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 17, 2016

Are you sure about switching L"xx" to W("xx")? I'm getting compiler errors in those places now with the following:

candidate function not viable: no known conversion from 'const wchar_t *' to 'const WCHAR *' (aka 'const char16_t *') for 1st argument

@janvorli
Copy link
Member

That's weird - the "W" macro is used all over the PAL and it is defined as

#define W(str)  u##str

@janvorli
Copy link
Member

Maybe the problem is that you are missing #include "pal/palinternal.h"

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 17, 2016

I have that in itf8.h as #include < pal/palinternal.h > (I believe this is how it was in master). Should I switch to quotes?

@janvorli
Copy link
Member

Hmm, no, it should be included in the pal/utf8.h that you are already including.

@janvorli
Copy link
Member

You can try to generate preprocessed version of the utf8.cpp to see the reality.

  • Go to bin/obj/Linux.x64.Release/src/pal/src (or the .Debug)
  • run "make locale/utf8.cpp.i"

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 17, 2016

It reads:

EncoderReplacementFallback() : EncoderReplacementFallback(u"?")

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 17, 2016

Is that going to evaluate to a WCHAR* ? The constructor is defined as EncoderReplacementFallback(WCHAR* replacement)

@janvorli
Copy link
Member

WCHAR should be defined as char16_t (you should see it in the signatures in the .i files). And the u"?" is correct - char16_t, so the error message you were getting doesn't make sense to me.

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 17, 2016

Should I not have #include < wchar.h > in utf8.h?

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 17, 2016

I'm not seeing WCHAR defined to char16_t - the only place I see char16_t is
typedef char16_t __wchar_16_cpp__;

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 17, 2016

Actually I also see typedef __wchar_16_cpp__ WCHAR;, so I'm not sure either

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 17, 2016

There is also
typedef WCHAR *PWCHAR;
typedef WCHAR *LPWCH, *PWCH;
typedef const WCHAR *LPCWCH, *PCWCH;
typedef WCHAR *NWPSTR;
typedef WCHAR *LPWSTR, *PWSTR;
typedef const WCHAR *LPCWSTR, *PCWSTR;

@janvorli
Copy link
Member

typedef char16_t __wchar_16_cpp__;
and
typedef __wchar_16_cpp__ WCHAR;
transitively means
typedef char16_t WCHAR;
right?

@wtgodbe
Copy link
Member Author

wtgodbe commented Mar 17, 2016

Yes that's what I was thinking, so I'm not sure what the issue is

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants