Update Unicode data and optimize CharUnicodeInfo indexes #20983
Conversation
private static void PrintByteArray(string tableName, StreamWriter file, byte[] str)
{
    file.Write("\n    private static ReadOnlySpan<byte> " + tableName + " => new byte[" + str.Length + "]\n    {\n");
("  (start = 22, length = 2)
Would using $"..." (string interpolation) be better here?
I think it was this way in the Perl script also; it doesn't make much difference either way.
What data did we miss there?
@pentp, thank you for getting this done. Other than my minor comments, LGTM. I assume you ran the latest corefx tests against it, right?
// Note that & has lower precedence than addition, so don't forget the parentheses.
index = s_pCategoryLevel1Index[index + ((ch >> 4) & 0x000f)];
index = Unsafe.As<byte, ushort>(ref Unsafe.AsRef(in CategoryLevel2Index[(index << 6) + ((ch >> 3) & 0b111110)]));
This may or may not be aligned at ushort. It will cause unpredictable performance changes from build to build in the better case, and crashes on platforms with strict alignment requirements in the worse case. (The same comment applies to the other uses of Unsafe.As.)
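For illustration, here is a minimal sketch of the unaligned-read alternative discussed in this thread. The table and its values are hypothetical stand-ins, not the actual CategoryLevel2Index data:

```csharp
using System;
using System.Runtime.CompilerServices;

class AlignmentDemo
{
    // Hypothetical stand-in for an RVA-backed byte table.
    private static ReadOnlySpan<byte> Table => new byte[] { 0x01, 0x02, 0x03, 0x04 };

    static void Main()
    {
        // Unsafe.As<byte, ushort> assumes the address is 2-byte aligned, but
        // RVA static data gives no such guarantee. Unsafe.ReadUnaligned is
        // correct for any address, including on strict-alignment CPUs.
        ushort value = Unsafe.ReadUnaligned<ushort>(ref Unsafe.AsRef(in Table[1]));
        Console.WriteLine(value); // bytes 0x02, 0x03 -> 0x0302 = 770 on little-endian
    }
}
```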
We need https://github.com/dotnet/corefx/issues/26948 to make this right.
There are absolutely no alignment guarantees for static <PrivateImplementationDetails> fields? I can change it to Unsafe.ReadUnaligned, or should I change it back to a regular ushort[] array? I would guess that an unaligned read might actually be faster here.
For InternalGetNumericValue I'll just change to an unaligned read (it isn't 8-byte aligned in the current version either).
I do not think there is any alignment required by the spec, but the tools seem to produce at least 4-byte alignment in practice, based on a few experiments. So changing this to Unsafe.ReadUnaligned (just to be sure) should be OK.
Looking at https://www.unicode.org/Public/UCD/latest/ucd/DerivedAge.txt, it looks like the previous data was actually Unicode 8.0??? This is an output diff between the version in master and an updated version (still using the original Perl script): https://gist.github.com/pentp/944c49cebe3b7ae9c1f7539de16a5e6e Maybe I have confused some file versions somewhere (I have a bunch of branches and compilations of CoreCLR). In any case, I need to run the latest CoreFX tests.
index = NumericLevel2Index[(index << 4) + ((ch >> 4) & 0x000f)];
index = NumericLevel3Index[(index << 4) + (ch & 0x000f)];

Debug.Assert(BitConverter.IsLittleEndian);
We want the code to be reusable in Mono. Could you please change this to a dynamic check instead of an assert?
Should I add reversing logic for big-endian systems here, or just throw? I think the current implementation just returns garbage on big-endian systems in this function.
Add reversing logic for big-endian systems.
@jkotas this should be added inside #ifdef BIGENDIAN, right?
We prefer dynamic checks - that is what works for Mono, and the JIT is usually able to optimize them out completely. Mono is not set up for the BIGENDIAN ifdef; they have to find every place with a BIGENDIAN ifdef and replace it with a dynamic check when reusing our code.
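A minimal sketch of the dynamic-check pattern being asked for; the helper name and table contents are hypothetical:

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.CompilerServices;

class EndianDemo
{
    // Tables are assumed to be emitted little-endian by the generator tool.
    static ushort ReadTableUInt16(ReadOnlySpan<byte> table, int byteOffset)
    {
        ushort value = Unsafe.ReadUnaligned<ushort>(ref Unsafe.AsRef(in table[byteOffset]));
        // Dynamic check instead of an #ifdef: BitConverter.IsLittleEndian is a
        // JIT-time constant, so this branch is eliminated in the generated code,
        // and the same IL works unmodified on big-endian Mono targets.
        if (!BitConverter.IsLittleEndian)
            value = BinaryPrimitives.ReverseEndianness(value);
        return value;
    }

    static void Main()
    {
        ReadOnlySpan<byte> data = new byte[] { 0x34, 0x12, 0xFF, 0x00 };
        Console.WriteLine(ReadTableUInt16(data, 0).ToString("X4")); // 1234 on any endianness
    }
}
```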
Could you please test your changes with the tests at https://github.com/dotnet/corefx/tree/master/src/System.Globalization/tests from corefx master? Just to ensure we work fine.
I added big-endian support, tested with the CoreFX System.Globalization tests, and verified that the tests used UnicodeData.11.0.txt as the reference file.
// 12:4:4 index table of the Unicode category data.
private static ushort[] s_pCategoryLevel1Index = new ushort[]
// 11:5:4 index table of the Unicode category data.
private static ReadOnlySpan<byte> CategoryLevel1Index => new byte[2176]
I think this relies on quite advanced JIT optimization to work better than the previous code. Basically, we are generating IL which we hope will be executed differently than it is written: it uses C# getter syntax with newarr and expects the JIT to convert this into an allocation-free read from a static memory map.
The problem seems to be that this code will still compile with csc /langversion:7.1, but it will generate a newobj + InitializeArray pair for every access to the variable.
Presumably, the C# codegen should not be language-version specific but based on System.ReadOnlySpan`1<uint8>::.ctor(void*,int32) availability.
Right, this does not depend on any special JIT optimizations. This optimization is done by Roslyn: dotnet/roslyn#24621.
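The pattern under discussion, in a self-contained form (the property name and data values here are made up for illustration):

```csharp
using System;

class SpanDataDemo
{
    // With a new enough compiler, Roslyn emits this property as a ReadOnlySpan
    // wrapping the RVA static data blob directly - no newarr, no allocation.
    // With an older langversion it silently falls back to allocating a fresh
    // byte[] (plus InitializeArray) on every single access.
    private static ReadOnlySpan<byte> Level1Index => new byte[] { 10, 20, 30, 40 };

    static void Main()
    {
        Console.WriteLine(Level1Index[2]); // 30
    }
}
```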
I'd suggest Roslyn should fail to compile this code when it cannot make the optimization, since the fallback version is very slow.
"I'd suggest roslyn should fail to compile this code when it cannot make the optimization"
Open an issue on this in the Roslyn repo if you would like to see this happen. It will never happen in practice, since System.ReadOnlySpan`1<uint8>::.ctor(void*,int32) is present in all ReadOnlySpan implementations out there.
It'll actually happen in practice, because csc defaults to langversion:7, which generates the ugly code with newarr, not the optimized version.
@marek-safar, what are you arguing for? That we not take advantage of improved C# compiler support to improve the efficiency of coreclr?
As I suggested above, I think this feature should be enabled not based on /langversion but based on System.ReadOnlySpan`1<uint8>::.ctor(void*,int32) availability.
"As I suggested above I think this feature should be enabled not based on /langversion but based on System.ReadOnlySpan`1::.ctor(void*,int32) availability"
OK, thanks. Sounds like an issue for the dotnet/roslyn repo.
(…clr#20983)
* Update Unicode data and optimize CharUnicodeInfo indexes
* Add new GenUnicodeProp to tools
* Add licence headers
* Add big-endian support
Commit migrated from dotnet/coreclr@8241c49
Category data is updated to the latest Unicode 11 data (#20589 somehow missed some changes).
I converted the Perl script from #20864 (comment) to C# and included it in the Tools folder.
Changed the index structure and encoding, reducing index size from 37KB to 23KB.
I verified that all data returned for every Unicode character is identical to the previous implementation using the latest Unicode data.
Category data lookup times improved from 3.4ns to 2.4ns (1.4x faster) for char based lookups (from 5.7ns to 4.4ns for string based lookups). Tried with different kinds of text.
For comparison, char.GetUnicodeCategory for Latin1 data takes 1.0ns.
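The multi-level index scheme described above can be sketched with a toy two-level lookup. The real tables use an 11:5:4 split over the code point and real category values; the split widths, names, and numbers below are made up for illustration:

```csharp
using System;

class MultiLevelIndexDemo
{
    // Many code points share identical category blocks, so level 1 maps the
    // high bits of the code point to a block number, and the deduplicated
    // blocks live contiguously in level 2. Sharing blocks is what shrinks
    // the total table size.
    private static ReadOnlySpan<byte> Level1 => new byte[] { 0, 1, 1, 0 }; // 4 ranges of 4
    private static ReadOnlySpan<byte> Level2 => new byte[]
    {
        1, 1, 2, 2,   // block 0
        3, 3, 3, 3,   // block 1 (shared by ranges 1 and 2)
    };

    static byte GetCategory(int ch)
    {
        int block = Level1[(ch >> 2) & 0x3];      // high bits pick the block
        return Level2[(block << 2) + (ch & 0x3)]; // low bits index within it
    }

    static void Main()
    {
        Console.WriteLine(GetCategory(2)); // block 0, offset 2 -> 2
        Console.WriteLine(GetCategory(5)); // block 1, offset 1 -> 3
    }
}
```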