Update Unicode data and optimize CharUnicodeInfo indexes #20983
Conversation
private static void PrintByteArray(string tableName, StreamWriter file, byte[] str)
{
    file.Write("\n    private static ReadOnlySpan<byte> " + tableName + " => new byte[" + str.Length + "]\n    {\n");
("  (start = 22, length = 2)
Would using $"..." (string interpolation) be better here?
I think it was this way in the Perl script also; it doesn't make much difference either way.
What data did we miss there?
@pentp, thank you for getting this done. Other than my minor comments, LGTM. I assume you ran the latest corefx tests against it, right?
// Note that & has lower precedence than addition, so don't forget the parentheses.
index = s_pCategoryLevel1Index[index + ((ch >> 4) & 0x000f)];
index = Unsafe.As<byte, ushort>(ref Unsafe.AsRef(in CategoryLevel2Index[(index << 6) + ((ch >> 3) & 0b111110)]));
This may or may not be aligned at ushort. It will cause unpredictable performance changes from build to build in the better case, and crashes on platforms with strict alignment requirements in the worse case. (The same comment applies to the other uses of Unsafe.As.)
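For illustration, here is a minimal sketch of the unaligned-read alternative discussed in this thread. The table and its values are hypothetical stand-ins, not the actual CategoryLevel2Index data:

```csharp
using System;
using System.Runtime.CompilerServices;

class AlignmentDemo
{
    // Hypothetical stand-in for an RVA-backed byte table.
    private static ReadOnlySpan<byte> Table => new byte[] { 0x01, 0x02, 0x03, 0x04 };

    static void Main()
    {
        // Unsafe.As<byte, ushort> assumes the address is 2-byte aligned, but
        // RVA static data gives no such guarantee. Unsafe.ReadUnaligned is
        // correct for any address, including on strict-alignment CPUs.
        ushort value = Unsafe.ReadUnaligned<ushort>(ref Unsafe.AsRef(in Table[1]));
        Console.WriteLine(value); // bytes 0x02, 0x03 -> 0x0302 = 770 on little-endian
    }
}
```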
We need https://github.com/dotnet/corefx/issues/26948 to make this right.
There are absolutely no alignment guarantees for static <PrivateImplementationDetails> fields? I can change it to Unsafe.ReadUnaligned, or should I change it back to a regular ushort[] array? I would guess that an unaligned read might actually be faster here.
For InternalGetNumericValue I'll just change to an unaligned read (it isn't 8-byte aligned in the current version either).
I do not think there is any alignment required by the spec, but the tools seem to produce at least 4-byte alignment in practice, based on a few experiments. So changing this to Unsafe.ReadUnaligned (just to be sure) should be OK.
Looking at https://www.unicode.org/Public/UCD/latest/ucd/DerivedAge.txt, it looks like the previous data was actually Unicode 8.0??? This is an output diff between the version in master and an updated version (still using the original Perl script): https://gist.github.com/pentp/944c49cebe3b7ae9c1f7539de16a5e6e Maybe I have confused some file versions somewhere (I have a bunch of branches and compilations of CoreCLR). In any case, I need to run the latest CoreFX tests.
index = NumericLevel2Index[(index << 4) + ((ch >> 4) & 0x000f)];
index = NumericLevel3Index[(index << 4) + (ch & 0x000f)];

Debug.Assert(BitConverter.IsLittleEndian);
We want the code to be reusable in Mono. Could you please change this to a dynamic check instead of an assert?
Should I add reversing logic for big-endian systems here, or just throw? I think the current implementation just returns garbage on big-endian systems in this function.
Add reversing logic for big-endian systems.
@jkotas this should be added inside #ifdef BIGENDIAN, right?
We prefer dynamic checks - that is what works for Mono, and the JIT is usually able to optimize them out completely. Mono is not set up for the BIGENDIAN ifdef; they have to find every place with a BIGENDIAN ifdef and replace it with a dynamic check when reusing our code.
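A minimal sketch of the dynamic-check pattern being asked for; the helper name and table contents are hypothetical:

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.CompilerServices;

class EndianDemo
{
    // Tables are assumed to be emitted little-endian by the generator tool.
    static ushort ReadTableUInt16(ReadOnlySpan<byte> table, int byteOffset)
    {
        ushort value = Unsafe.ReadUnaligned<ushort>(ref Unsafe.AsRef(in table[byteOffset]));
        // Dynamic check instead of an #ifdef: BitConverter.IsLittleEndian is a
        // JIT-time constant, so this branch is eliminated in the generated code,
        // and the same IL works unmodified on big-endian Mono targets.
        if (!BitConverter.IsLittleEndian)
            value = BinaryPrimitives.ReverseEndianness(value);
        return value;
    }

    static void Main()
    {
        ReadOnlySpan<byte> data = new byte[] { 0x34, 0x12, 0xFF, 0x00 };
        Console.WriteLine(ReadTableUInt16(data, 0).ToString("X4")); // 1234 on any endianness
    }
}
```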
Could you please test your changes with the tests at https://github.com/dotnet/corefx/tree/master/src/System.Globalization/tests from corefx master? Just to ensure we work fine.
I added big-endian support, tested with the CoreFX System.Globalization tests, and verified that the tests used UnicodeData.11.0.txt as the reference file.
// 12:4:4 index table of the Unicode category data.
private static ushort[] s_pCategoryLevel1Index = new ushort[]
// 11:5:4 index table of the Unicode category data.
private static ReadOnlySpan<byte> CategoryLevel1Index => new byte[2176]
I think this relies on quite advanced JIT optimization to work better than the previous code. Basically, we are generating IL which we hope will be executed differently than it is written: it uses C# getter syntax with newarr and expects the JIT to convert this into an allocation-free read from a static memory map.
The problem seems to be that this code will still compile with csc /langversion:7.1, but it will generate a newobj + InitializeArray pair for every access to the variable.
Presumably, the C# codegen should not be language-version specific but based on System.ReadOnlySpan`1<uint8>::.ctor(void*,int32) availability.
Right, this does not depend on any special JIT optimizations. This optimization is done by Roslyn: dotnet/roslyn#24621.
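The pattern under discussion, in a self-contained form (the property name and data values here are made up for illustration):

```csharp
using System;

class SpanDataDemo
{
    // With a new enough compiler, Roslyn emits this property as a ReadOnlySpan
    // wrapping the RVA static data blob directly - no newarr, no allocation.
    // With an older langversion it silently falls back to allocating a fresh
    // byte[] (plus InitializeArray) on every single access.
    private static ReadOnlySpan<byte> Level1Index => new byte[] { 10, 20, 30, 40 };

    static void Main()
    {
        Console.WriteLine(Level1Index[2]); // 30
    }
}
```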
I'd suggest Roslyn should fail to compile this code when it cannot make the optimization, since the fallback version is very slow.
"I'd suggest roslyn should fail to compile this code when it cannot make the optimization"
Open an issue on this in the Roslyn repo if you would like to see this happen. It will never happen in practice, since System.ReadOnlySpan`1<uint8>::.ctor(void*,int32) is present in all ReadOnlySpan implementations out there.
It'll actually happen in practice, because csc defaults to langversion:7, which generates the ugly code with newarr, not the optimized version.
@marek-safar, what are you arguing for? That we not take advantage of improved C# compiler support to improve the efficiency of coreclr?
As I suggested above, I think this feature should be enabled not based on /langversion but based on System.ReadOnlySpan`1<uint8>::.ctor(void*,int32) availability.
"As I suggested above I think this feature should be enabled not based on /langversion but based on System.ReadOnlySpan`1::.ctor(void*,int32) availability"
OK, thanks. Sounds like an issue for the dotnet/roslyn repo.
(…clr#20983)
* Update Unicode data and optimize CharUnicodeInfo indexes
* Add new GenUnicodeProp to tools
* Add licence headers
* Add big-endian support
Commit migrated from dotnet/coreclr@8241c49
Category data is updated to the latest Unicode 11 data (#20589 somehow missed some changes).
I converted the Perl script from #20864 (comment) to C# and included it in the Tools folder.
Changed the index structure and encoding, reducing index size from 37KB to 23KB.
I verified that all data returned for every Unicode character is identical to the previous implementation using the latest Unicode data.
Category data lookup times improved from 3.4ns to 2.4ns (1.4x faster) for char based lookups (from 5.7ns to 4.4ns for string based lookups). Tried with different kinds of text.
For comparison, char.GetUnicodeCategory for Latin1 data takes 1.0ns.
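The multi-level index scheme described above can be sketched with a toy two-level lookup. The real tables use an 11:5:4 split over the code point and real category values; the split widths, names, and numbers below are made up for illustration:

```csharp
using System;

class MultiLevelIndexDemo
{
    // Many code points share identical category blocks, so level 1 maps the
    // high bits of the code point to a block number, and the deduplicated
    // blocks live contiguously in level 2. Sharing blocks is what shrinks
    // the total table size.
    private static ReadOnlySpan<byte> Level1 => new byte[] { 0, 1, 1, 0 }; // 4 ranges of 4
    private static ReadOnlySpan<byte> Level2 => new byte[]
    {
        1, 1, 2, 2,   // block 0
        3, 3, 3, 3,   // block 1 (shared by ranges 1 and 2)
    };

    static byte GetCategory(int ch)
    {
        int block = Level1[(ch >> 2) & 0x3];      // high bits pick the block
        return Level2[(block << 2) + (ch & 0x3)]; // low bits index within it
    }

    static void Main()
    {
        Console.WriteLine(GetCategory(2)); // block 0, offset 2 -> 2
        Console.WriteLine(GetCategory(5)); // block 1, offset 1 -> 3
    }
}
```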