Ported managed Utf8/Unicode encoder/decoder to C++ for usage in PAL #3522
Conversation
@dotnet-bot test this please. |
I think you need to update the license headers to the new one we're using. |
@adityamandaleeka which files need to be changed? And where can I find the new header? |
@wtgodbe Aditya meant the test files you added, which still carry the old license header. Look e.g. here for the current header: |
Alright, made the change to the license headers. |
@wtgodbe since GitHub doesn't allow me to comment on the utf8.cpp file, let me put my comments here:
|
@janvorli Thanks for the suggestions. I've completed the first one, working on the second 2 now. |
@janvorli how do I know what errors to set last error to? |
@wtgodbe looking at MSDN doc for WideCharToMultiByte / MultiByteToWideChar as a source of inspiration, I can see the following errors: So we can use these, it looks like translating DecoderFallbackException / EncoderFallbackException to ERROR_NO_UNICODE_TRANSLATION and the ArgumentException to ERROR_INVALID_PARAMETER would make sense. |
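In code, the mapping discussed above might look like the following sketch. The error-code values are the standard Win32 ones, but the enum and function names here are stand-ins for illustration, not the port's actual definitions:

```cpp
#include <cstdint>

// Win32 error-code values (as documented for WideCharToMultiByte /
// MultiByteToWideChar); named with a _CODE suffix here to avoid
// clashing with any platform headers.
const uint32_t ERROR_INVALID_PARAMETER_CODE      = 87;
const uint32_t ERROR_NO_UNICODE_TRANSLATION_CODE = 1113;

// Hypothetical classification of the failures the ported code can hit.
enum class PortError { FallbackFailed, BadArgument };

// DecoderFallbackException / EncoderFallbackException map to
// ERROR_NO_UNICODE_TRANSLATION; ArgumentException maps to
// ERROR_INVALID_PARAMETER.
uint32_t MapToLastError(PortError e)
{
    switch (e) {
    case PortError::FallbackFailed:
        return ERROR_NO_UNICODE_TRANSLATION_CODE;
    default:
        return ERROR_INVALID_PARAMETER_CODE;
    }
}
```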
Now that I look at the change again, it seems we can get rid of the body of the ThrowLastBytesRecursive, with the exception of the throw itself. We don't use or even store the exception messages, so there is no point in carefully constructing one and then effectively throwing it away. And one nit - it would make sense to add a notice to the file header that this is a port of the C# version from mscorlib. |
Good point on ThrowLastBytes. Changes have been made. |
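After the simplification suggested above, the helper reduces to just the throw. A minimal sketch, with std::runtime_error standing in for the port's own exception type:

```cpp
#include <stdexcept>

// Since the carefully built message was never stored or used, the helper
// no longer constructs it; it only throws. The real function lives in
// utf8.cpp - this is an illustrative stand-in.
[[noreturn]] void ThrowLastBytesRecursive()
{
    throw std::runtime_error("fallback bytes overflow");
}
```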
@wtgodbe There is no need to set the last error before the abort(), the process will exit right away and nothing will be able to extract it. |
Here are the results of running a performance benchmark before & after my change (ran with Release bits) - looks like there's a very slight performance increase of about 1%, which seems pretty negligible (although the deviation in results across iterations was small enough to indicate that the change is real and not noise). I let each run go for 10 iterations before taking an average. master - 7.5294 s Also @janvorli do you have any idea why the build would be failing on non-Windows (although it seems to build fine on FreeBSD, then give silent errors on 2 UTF tests)? Looks like the following is the error:
|
@wtgodbe, unrelated to the issues you are seeing, there is a problem with the order in which you catch the exceptions. You need to catch the fallback exceptions before the ArgumentException, since otherwise the fallback ones would always be caught as ArgumentException. I'm not sure about the crossgen issue - I assume it works fine on your local machine, right? I've also noticed that you have SetLastError in DecoderExceptionFallbackBuffer::Fallback / EncoderExceptionFallbackBuffer::Fallback. Since the whole decoder / encoder is exception based, please remove it from there - we now set it in the catch for the exception, which is a better place. If the silent failure is the abort in the catch, I wonder if GetByteCount ever throws the EncoderFallbackException / DecoderFallbackException. Maybe these are thrown only when running GetBytes. |
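The ordering pitfall described above can be demonstrated in isolation. In the C# original the fallback exceptions derive from ArgumentException, so a catch for the base class placed first would swallow them; the types below are stand-ins for the port's own classes:

```cpp
#include <stdexcept>

// Stand-in hierarchy mirroring the C# original, where
// EncoderFallbackException / DecoderFallbackException derive from
// ArgumentException.
struct ArgumentException : std::runtime_error {
    ArgumentException() : std::runtime_error("bad argument") {}
};
struct DecoderFallbackException : ArgumentException {};

// Returns true only if the thrown exception reaches the fallback handler.
// Because the most-derived type is caught first, a fallback failure is
// classified correctly; swapping the two catch blocks would make the
// first handler unreachable.
bool CaughtAsFallback(bool fallbackFailure)
{
    try {
        if (fallbackFailure)
            throw DecoderFallbackException();
        throw ArgumentException();
    }
    catch (const DecoderFallbackException&) {  // most derived first
        return true;
    }
    catch (const ArgumentException&) {
        return false;
    }
}
```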
@janvorli it looks like I do have the crossgen failure on my machine, I just didn't notice it before because I was still able to build and run PAL tests. Now that I've added a couple debug print statements, I'm seeing that we hit an argument exception in getChars immediately before the crossgen failure. |
@janvorli I think I've found the problem - the ArgumentException is getting triggered somewhere with the message "Ran out of buffer space". Looking at the code, it looks like I replaced a call to a function named "ThrowCharsOverflow" in the original code, with an ArgumentException for a char overflow in my port. ThrowCharsOverflow appears to only throw an exception when nothing has been decoded yet (that is, when chars == charStart, which means we haven't advanced our pointer to the src string yet). I'm thinking I should just add an equivalent function to my port, what do you think? |
@wtgodbe yes, I would keep the port as close to the original code as possible. |
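Keeping the port close to the original, the ThrowCharsOverflow behavior described above might be sketched like this, with std::invalid_argument standing in for the port's ArgumentException:

```cpp
#include <stdexcept>

// Mirrors the mscorlib helper's behavior as described: the overflow is
// only fatal when nothing has been decoded yet (chars == charStart);
// otherwise the caller simply reports how much output fit.
void ThrowCharsOverflow(const char16_t* chars, const char16_t* charStart)
{
    if (chars == charStart)
    {
        // No output was produced and the buffer is already too small.
        throw std::invalid_argument("Ran out of buffer space");
    }
    // Some characters were already decoded; let the caller truncate.
}
```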
@janvorli are you sure that we should be aborting on ArgumentExceptions? This means we're aborting when we have insufficient buffer space, which is what's causing the failure in Test 3 for WideCharToMultiByte (and vice-versa) |
@wtgodbe Ah, I had not realized that GetBytes has additional failure modes. Then we should handle the exceptions from GetBytes the same way as we handle the ones from GetByteCount. Just have a single try / catch there and put both the GetByteCount and GetBytes calls into it. |
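The control flow suggested above, with one try/catch covering both phases, might look like this sketch. GetByteCount and GetBytes here are hypothetical stubs that only model the failure modes, not the real implementations in utf8.cpp:

```cpp
#include <stdexcept>

// Stub: worst case is 3 UTF-8 bytes per UTF-16 code unit.
int GetByteCount(const char16_t* /*src*/, int len)
{
    if (len < 0) throw std::invalid_argument("len");
    return len * 3;
}

// Stub: doesn't write output, only models the buffer-too-small failure.
int GetBytes(const char16_t* src, int srcLen, char* /*dst*/, int dstLen)
{
    if (dstLen < GetByteCount(src, srcLen))
        throw std::invalid_argument("buffer too small");
    return srcLen;
}

// A single try/catch around both calls, so exceptions from either phase
// are translated the same way.
int ConvertOrFail(const char16_t* src, int srcLen, char* dst, int dstLen)
{
    try {
        int needed = GetByteCount(src, srcLen);
        if (dst == nullptr)
            return needed;  // size-query mode
        return GetBytes(src, srcLen, dst, dstLen);
    }
    catch (const std::invalid_argument&) {
        return -1;  // the real code would SetLastError here
    }
}
```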
@dotnet-bot test this please |
@janvorli is there an alternative to wcslen in PAL for OSX? It looks like the call to wcslen on line 888 in utf8.cpp is returning an unexpectedly high value on OSX, causing the test failure (it returns 15 on OSX there, 3 on Linux for the same string) |
I guess the problem is that the string returned by fallback->GetDefaultString is not zero terminated. It is just a buffer for up-to 2 wide chars. It seems you need to use the fallback->GetMaxCharCount() instead of the wcslen |
Should it still be 2 * GetMaxCharCount(), or just GetMaxCharCount()? |
Hmm, it seems I was wrong - the string returned from GetDefaultString should be zero terminated after all, unless there is a bug - e.g. trying to assign a string longer than 1 character to it. EncoderReplacementFallback((WCHAR*)L"?") should be EncoderReplacementFallback(W("?")) instead. It would not cause the issue though. |
And this is wrong - maybe that's the real issue causing the trouble: 2 * wcslen((const wchar_t *)fallback->GetDefaultString()); PAL doesn't use wchar_t, which is a 32-bit char on Unix, but WCHAR, which is a 16-bit char. So you get a pointer to a 16-bit char and cast it to a pointer to a 32-bit char. You need to use PAL_wcslen. |
Or the fallback->GetMaxCharCount(), of course |
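A 16-bit analogue of PAL_wcslen, sketched here to show why the fix works: it walks char16_t units, whereas plain wcslen on Unix walks 32-bit wchar_t units and so reads far past the end of a WCHAR buffer - the overcount seen above. The function name is illustrative, not the PAL's:

```cpp
#include <cstddef>

// Counts 16-bit code units up to the terminating u'\0'. Calling wcslen
// on the same buffer would reinterpret pairs of UTF-16 units as single
// 32-bit wchar_t values and miss the terminator.
std::size_t u16_strlen(const char16_t* s)
{
    std::size_t n = 0;
    while (s[n] != u'\0')
        ++n;
    return n;
}
```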
Should I switch just
Or should I also switch
? |
You should also not have the #include <wchar.h> in the file.
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h> |
You should change all places that use L"xx" to W("xx") and remove the casts. |
Are you sure about switching L"xx" to W("xx")? I'm getting compiler errors in those places now with the following:
|
That's weird - the "W" macro is used all over the PAL and it is defined as #define W(str) u##str |
Maybe the problem is that you are missing #include "pal/palinternal.h" |
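The effect of the macro quoted above can be shown in isolation. The definition below is a copy for illustration (the real one comes from the PAL headers): token-pasting the u prefix onto the literal yields a char16_t string on every platform, unlike L"..", whose element type is the 32-bit wchar_t on Unix:

```cpp
// Copy of the PAL's W macro for illustration only.
#define W(str) u##str

// W("?") expands to u"?", a char16_t string literal: 16-bit elements,
// zero terminated.
const char16_t* question = W("?");
```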
I have that in utf8.h as #include <pal/palinternal.h> (I believe this is how it was in master). Should I switch to quotes? |
Hmm, no, it should be included in the pal/utf8.h that you are already including. |
You can try to generate a preprocessed version of utf8.cpp to see the reality.
|
It reads: EncoderReplacementFallback() : EncoderReplacementFallback(u"?") |
Is that going to evaluate to a WCHAR* ? The constructor is defined as EncoderReplacementFallback(WCHAR* replacement) |
WCHAR should be defined as char16_t (you should see it in the signatures in the .i files). And the u"?" is correct - char16_t, so the error message you were getting doesn't make sense to me. |
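A quick compile-time sanity check of the claim above; the typedef here is a stand-in for the PAL's own definition of WCHAR:

```cpp
#include <type_traits>

// Stand-in for the PAL typedef: WCHAR is a 16-bit character.
typedef char16_t WCHAR;

static_assert(sizeof(WCHAR) == 2, "WCHAR is 16-bit");

// A u".." literal decays to exactly the pointer type a WCHAR*
// constructor parameter expects (modulo const), so the reported
// compiler error shouldn't come from the literal itself.
static_assert(std::is_same<decltype(u"?" + 0), const WCHAR*>::value,
              "u\"..\" is a char16_t string");
```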
Should I not have #include <wchar.h> in utf8.h? |
I'm not seeing WCHAR defined as char16_t - the only place I see char16_t is |
Actually I also see |
There is also |
|
Yes that's what I was thinking, so I'm not sure what the issue is |
I ported the algorithm from src\mscorlib\src\System\Text\UTF8Encoding.cs to src\pal\src\locale\utf8.cpp, for consistency/performance reasons. Also added a couple of tests. See #1725 for some conversation on this issue.