Ported managed Utf8/Unicode encoder/decoder to C++ for usage in PAL #3522

wtgodbe · 2016-03-04T22:40:42Z

I ported the algorithm from src\mscorlib\src\System\Text\UTF8Encoding.cs to src\pal\src\locale\utf8.cpp, for consistency/performance reasons. Also added a couple of tests. See #1725 for some conversation on this issue.

wtgodbe · 2016-03-04T22:40:54Z

@janvorli @sergiy-k PTAL

wtgodbe · 2016-03-05T00:02:34Z

@dotnet-bot test this please.

adityamandaleeka · 2016-03-05T00:17:37Z

I think you need to update the license headers to the new one we're using.

wtgodbe · 2016-03-05T00:20:51Z

@adityamandaleeka which files need to be changed? And where can I find the new header?

janvorli · 2016-03-07T22:44:38Z

@wtgodbe Aditya meant the added test files, which still carry the old license header. See e.g. here for the current header:
https://github.com/dotnet/coreclr/blob/master/src/pal/inc/pal.h

wtgodbe · 2016-03-07T23:19:34Z

Alright, made the change to the license headers.

janvorli · 2016-03-07T23:52:39Z

@wtgodbe since github doesn't allow me to comment on the utf8.cpp file, let me put my comments here:

Please don't use STL (string / vector). We don't use it anywhere in the PAL; dynamic memory allocation with the "new" operator is a problem in the PAL, and I guess the STL would use that. It seems it should be quite simple to change the code not to use it.
In the UTF8ToUnicode and UnicodeToUTF8 functions, you swallow the ArgumentException from GetChars / GetBytes. I would add an abort to the catch, since such a failure is fatal: it means there is a bug in the decoder, and the caller would get a broken result.
The UTF8ToUnicode and UnicodeToUTF8 functions should consistently set the last error for all errors; right now that happens only for ERROR_INSUFFICIENT_BUFFER.

wtgodbe · 2016-03-08T23:49:30Z

@janvorli Thanks for the suggestions. I've completed the first one and am working on the other two now.

wtgodbe · 2016-03-08T23:55:20Z

@janvorli how do I know what errors to set last error to?

janvorli · 2016-03-09T00:13:06Z

@wtgodbe looking at MSDN doc for WideCharToMultiByte / MultiByteToWideChar as a source of inspiration, I can see the following errors:
ERROR_INSUFFICIENT_BUFFER. A supplied buffer size was not large enough, or it was incorrectly set to NULL.
ERROR_INVALID_FLAGS. The values supplied for flags were not valid.
ERROR_INVALID_PARAMETER. Any of the parameter values was invalid.
ERROR_NO_UNICODE_TRANSLATION. Invalid Unicode was found in a string.

So we can use these. Translating DecoderFallbackException / EncoderFallbackException to ERROR_NO_UNICODE_TRANSLATION and ArgumentException to ERROR_INVALID_PARAMETER would make sense.

janvorli · 2016-03-09T00:23:34Z

Now that I look at the change again, it seems we can get rid of the body of ThrowLastBytesRecursive, with the exception of the throw. We don't use or even store the exception messages, so there is no point in carefully constructing one and then effectively throwing it away.

And one nit - it would make sense to add a notice that this is a port of the C# version from mscorlib to the file header.

wtgodbe · 2016-03-09T00:29:25Z

Good point on ThrowLastBytes. Changes have been made.

janvorli · 2016-03-09T00:39:44Z

@wtgodbe There is no need to set the last error before the abort(); the process will exit right away and nothing will be able to read it.
But I would add catch (const DecoderFallbackException&) and catch (const EncoderFallbackException&) to the try around GetByteCount and set the last error to ERROR_NO_UNICODE_TRANSLATION there.

wtgodbe · 2016-03-09T20:55:47Z

Here are the results of running a performance benchmark before and after my change (with Release bits). There's a very slight performance increase of about 1%, which seems pretty negligible, although the deviation between iterations was small enough to indicate that the change is real and not noise. I let each run for 10 iterations before taking an average.

master - 7.5294 s
utf8 - 7.4741 s

Also @janvorli, do you have any idea why the build would be failing on non-Windows? (It seems to build fine on FreeBSD, but then gives silent errors on 2 UTF tests.) The following looks like the error:

10:43:19 Generating native image for mscorlib.
10:43:22 Microsoft (R) CoreCLR Native Image Generator - Version 4.5.22220.0
10:43:22 Copyright (c) Microsoft Corporation. All rights reserved.
10:43:22
10:43:22 ./build.sh: line 225: 20584 Aborted (core dumped) $__BinDir/crossgen $__BinDir/mscorlib.dll
10:43:22 Failed to generate native image for mscorlib.

janvorli · 2016-03-09T21:40:48Z

@wtgodbe, unrelated to the issues you are seeing, there is a problem with the order in which you catch the exceptions. You need to catch the fallback exceptions before the ArgumentException, since otherwise the fallback ones would always be caught as ArgumentException.

I'm not sure about the crossgen issue - I assume it works fine on your local machine, right?
As for the failing tests on FreeBSD, it seems that the only silent abort would be the one that you have in the catch in the UnicodeToUTF8 / UTF8ToUnicode. Maybe you can add some debug prints there to see.

I've also noticed that you have SetLastError in DecoderExceptionFallbackBuffer::Fallback / EncoderExceptionFallbackBuffer::Fallback. Since the whole decoder / encoder is exception based, please remove it from there. We now set it in the catch for the exception, which is a better place.

If the silent failure is the abort in the catch, I wonder if the GetByteCount ever throws the EncoderFallbackException / DecoderFallbackException. Maybe these are thrown just when running the GetBytes.
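
A minimal sketch of the catch ordering discussed above, assuming (as the comment implies) that the fallback exceptions derive from ArgumentException. The exception structs, the constants and the RunAndMapError helper are hypothetical stand-ins for illustration, not the code in utf8.cpp; the numeric values are the Win32 codes of the same names, and the mapping follows the suggestion made earlier in this thread.

```cpp
#include <cstdint>
#include <exception>

// Hypothetical stand-ins for the exception types named in the thread.
struct ArgumentException : std::exception {};
struct DecoderFallbackException : ArgumentException {};
struct EncoderFallbackException : ArgumentException {};

// Win32 values for the error codes mentioned in the thread.
constexpr std::uint32_t kErrorInvalidParameter = 87;        // ERROR_INVALID_PARAMETER
constexpr std::uint32_t kErrorNoUnicodeTranslation = 1113;  // ERROR_NO_UNICODE_TRANSLATION

// Runs `convert` and translates any of the exceptions above into an
// error code. Returns 0 when `convert` completes without throwing.
template <typename Fn>
std::uint32_t RunAndMapError(Fn convert)
{
    try
    {
        convert();
        return 0;
    }
    // Derived types go first. If the ArgumentException handler came
    // first, it would also catch both fallback exceptions.
    catch (const DecoderFallbackException&)
    {
        return kErrorNoUnicodeTranslation;
    }
    catch (const EncoderFallbackException&)
    {
        return kErrorNoUnicodeTranslation;
    }
    catch (const ArgumentException&)
    {
        return kErrorInvalidParameter;
    }
}
```

Calling RunAndMapError([] { throw DecoderFallbackException(); }) yields the ERROR_NO_UNICODE_TRANSLATION value; with the handlers in the opposite order it would yield the ERROR_INVALID_PARAMETER value instead.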

wtgodbe · 2016-03-09T22:41:14Z

@janvorli it looks like I do have the crossgen failure on my machine; I just didn't notice it before because I was still able to build and run PAL tests. Now that I've added a couple of debug print statements, I'm seeing that we hit an ArgumentException in GetChars immediately before the crossgen failure.

wtgodbe · 2016-03-11T20:59:20Z

@janvorli I think I've found the problem - the ArgumentException is getting triggered somewhere with the message "Ran out of buffer space". Looking at the code, it looks like I replaced a call to a function named "ThrowCharsOverflow" in the original code, with an ArgumentException for a char overflow in my port. ThrowCharsOverflow appears to only throw an exception when nothing has been decoded yet (that is, when chars == charStart, which means we haven't advanced our pointer to the src string yet). I'm thinking I should just add an equivalent function to my port, what do you think?

janvorli · 2016-03-11T21:02:46Z

@wtgodbe yes, I would keep the port as close to the original code as possible.

wtgodbe · 2016-03-11T21:20:31Z

@janvorli are you sure that we should be aborting on ArgumentExceptions? This means we're aborting when we have insufficient buffer space, which is what's causing the failure in Test 3 for WideCharToMultiByte (and vice-versa)

janvorli · 2016-03-11T21:30:25Z

@wtgodbe Ah, I had not realized that GetBytes has additional failure modes. Then we should handle the exceptions from GetBytes the same way as we handle the ones from GetByteCount. Just have a single try / catch there and put both the GetByteCount and GetBytes calls into it.

wtgodbe · 2016-03-17T18:05:46Z

@dotnet-bot test this please

wtgodbe · 2016-03-17T21:21:25Z

@janvorli is there an alternative to wcslen in PAL for OSX? It looks like the call to wcslen on line 888 in utf8.cpp is returning an unexpectedly high value on OSX, causing the test failure (it returns 15 on OSX there, 3 on Linux for the same string)

janvorli · 2016-03-17T21:33:12Z

I guess the problem is that the string returned by fallback->GetDefaultString is not zero terminated. It is just a buffer for up to 2 wide chars. It seems you need to use fallback->GetMaxCharCount() instead of wcslen.

wtgodbe · 2016-03-17T21:36:25Z

Should it still be 2 * GetMaxCharCount(), or just GetMaxCharCount()?

janvorli · 2016-03-17T21:47:20Z

Hmm, it seems I was wrong: the string returned from GetDefaultString should be zero terminated after all, unless there is a bug, e.g. trying to assign a string longer than 1 character to it.
Btw, this

EncoderReplacementFallback((WCHAR*)L"?")

Should be

EncoderReplacementFallback(W("?"))

instead. It would not cause the issue though.

janvorli · 2016-03-17T21:50:54Z

And this is wrong, maybe that's the real issue causing the trouble:

2 * wcslen((const wchar_t *)fallback->GetDefaultString());

PAL doesn't use wchar_t, which is a 32-bit char on Unix, but WCHAR, which is a 16-bit char. So you get a pointer to a 16-bit char and cast it to a pointer to a 32-bit char. You need to use PAL_wcslen.
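
To illustrate the point above: where wchar_t is 32 bits wide, wcslen on a cast pointer reads two 16-bit units per element and looks for a 32-bit zero, so it can run past the real terminator. A length routine has to step through the buffer in 16-bit units. The sketch below is a hypothetical helper written for this illustration (Utf16Length is not a PAL function, and it is not the implementation of PAL_wcslen); it assumes the string is zero terminated.

```cpp
#include <cstddef>

// Hypothetical helper: counts 16-bit code units up to, but not
// including, the terminating zero. Assumes `s` is zero terminated.
std::size_t Utf16Length(const char16_t* s)
{
    std::size_t n = 0;
    while (s[n] != u'\0')
    {
        ++n;
    }
    return n;
}
```

For example, Utf16Length(u"?") returns 1 regardless of how wide wchar_t is on the platform.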

janvorli · 2016-03-17T21:51:35Z

Or the fallback->GetMaxCharCount(), of course

wtgodbe · 2016-03-17T21:53:43Z

Should I switch just

EncoderReplacementFallback((WCHAR*)L"?")

Or should I also switch

DecoderReplacementFallback((WCHAR*)L"?")

?

janvorli · 2016-03-17T21:54:48Z

You should also not have the #include <wchar.h> in the file.
And I also wonder if we need these either:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

janvorli · 2016-03-17T21:55:33Z

You should change all places that use L"xx" to W("xx") and remove the casts.

wtgodbe · 2016-03-17T22:04:36Z

Are you sure about switching L"xx" to W("xx")? I'm getting compiler errors in those places now with the following:

candidate function not viable: no known conversion from 'const wchar_t *' to 'const WCHAR *' (aka 'const char16_t *') for 1st argument

janvorli · 2016-03-17T22:12:41Z

That's weird - the "W" macro is used all over the PAL and it is defined as

#define W(str)  u##str
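
A small sketch of what that macro yields, for comparison with an L"..." literal. The W definition is the one quoted above; the WCHAR alias is an assumption taken from the compiler message in this thread ('const WCHAR *' aka 'const char16_t *'), and FirstChar is a hypothetical function used only to show a call site.

```cpp
#include <type_traits>

// Same definition as quoted in the thread.
#define W(str) u##str

// Assumed alias for illustration, based on the compiler message above.
typedef char16_t WCHAR;

// W("?") pastes to u"?", an array of two const char16_t
// (the character plus the terminating zero).
static_assert(std::is_same<decltype(W("?")), const char16_t (&)[2]>::value,
              "W(\"?\") is a char16_t string literal");

// L"?" is an array of wchar_t, a distinct type even where the sizes match.
static_assert(std::is_same<decltype(L"?"), const wchar_t (&)[2]>::value,
              "L\"?\" is a wchar_t string literal");

// There is no implicit conversion between the two pointer types, which is
// why an L"..." argument is rejected where a const WCHAR* is expected.
static_assert(!std::is_convertible<const wchar_t*, const WCHAR*>::value,
              "wchar_t pointers do not convert to WCHAR pointers");

// Hypothetical function taking a PAL-style wide string.
inline WCHAR FirstChar(const WCHAR* s)
{
    return s[0];
}
```

With these definitions FirstChar(W("?")) compiles and returns u'?', while FirstChar(L"?") does not compile.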

janvorli · 2016-03-17T22:14:10Z

Maybe the problem is that you are missing #include "pal/palinternal.h"

wtgodbe · 2016-03-17T22:14:57Z

I have that in utf8.h as #include < pal/palinternal.h > (I believe this is how it was in master). Should I switch to quotes?

janvorli · 2016-03-17T22:15:17Z

Hmm, no, it should be included in the pal/utf8.h that you are already including.

janvorli · 2016-03-17T22:20:49Z

You can try to generate a preprocessed version of utf8.cpp to see what the compiler actually gets.

Go to bin/obj/Linux.x64.Release/src/pal/src (or the .Debug equivalent)
and run "make locale/utf8.cpp.i"

wtgodbe · 2016-03-17T22:24:45Z

It reads:

EncoderReplacementFallback() : EncoderReplacementFallback(u"?")

wtgodbe · 2016-03-17T22:25:44Z

Is that going to evaluate to a WCHAR* ? The constructor is defined as EncoderReplacementFallback(WCHAR* replacement)

janvorli · 2016-03-17T22:32:19Z

WCHAR should be defined as char16_t (you should see it in the signatures in the .i files). And u"?" is correct, since it is a char16_t literal, so the error message you were getting doesn't make sense to me.

wtgodbe · 2016-03-17T22:34:27Z

Should I not have #include < wchar.h > in utf8.h?

wtgodbe · 2016-03-17T22:35:06Z

I'm not seeing WCHAR defined to char16_t - the only place I see char16_t is
typedef char16_t __wchar_16_cpp__;

wtgodbe · 2016-03-17T22:35:55Z

Actually I also see typedef __wchar_16_cpp__ WCHAR;, so I'm not sure either

wtgodbe · 2016-03-17T22:36:21Z

There is also
typedef WCHAR *PWCHAR;
typedef WCHAR *LPWCH, *PWCH;
typedef const WCHAR *LPCWCH, *PCWCH;
typedef WCHAR *NWPSTR;
typedef WCHAR *LPWSTR, *PWSTR;
typedef const WCHAR *LPCWSTR, *PCWSTR;

janvorli · 2016-03-17T22:38:04Z

typedef char16_t __wchar_16_cpp__;
and
typedef __wchar_16_cpp__ WCHAR;
transitively means
typedef char16_t WCHAR;
right?

wtgodbe · 2016-03-17T22:39:41Z

Yes that's what I was thinking, so I'm not sure what the issue is

dnfclas added the cla-already-signed label Mar 4, 2016

dotnet-bot added the 2 - In Progress label Mar 4, 2016

Ported managed Utf8/Unicode encoder/decoder to C++ for usage in PAL

eef4ca5

wtgodbe closed this Mar 18, 2016

dotnet-bot removed the 2 - In Progress label Mar 18, 2016
