
Update Unicode data and optimize CharUnicodeInfo indexes #20983

Merged 4 commits into dotnet:master on Nov 15, 2018

Conversation

pentp commented Nov 13, 2018

Category data is updated to the latest Unicode 11 data (#20589 somehow missed some changes).

I converted the Perl script from #20864 (comment) to C# and included it in the Tools folder.
Changed the index structure and encoding, reducing index size from 37KB to 23KB.
I verified that all data returned for every Unicode character is identical to the previous implementation using the latest Unicode data.

Category data lookup times improved from 3.4ns to 2.4ns (1.4x faster) for char-based lookups, and from 5.7ns to 4.4ns for string-based lookups. Tested with several different kinds of text.
For comparison, char.GetUnicodeCategory for Latin1 data takes 1.0ns.
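
To illustrate the general idea of a multi-level index (this sketch is hypothetical and does not reproduce the PR's actual tables, names, or bit layout): each level narrows the lookup with a few bits of the code point, and long runs of identical categories end up sharing table rows, which is what keeps the total size small.

// Hypothetical three-level lookup: the top bits pick a level-1 row, that row picks a
// level-2 row, and that row picks the final category value.
static byte GetCategoryValue(int codePoint, byte[] level1, byte[] level2, byte[] level3)
{
    int i1 = level1[codePoint >> 9];                          // top bits
    int i2 = level2[(i1 << 5) + ((codePoint >> 4) & 0x1F)];   // middle 5 bits
    return level3[(i2 << 4) + (codePoint & 0x0F)];            // low 4 bits
}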

pentp commented Nov 13, 2018

CC @tarekgh @jkotas


private static void PrintByteArray(string tableName, StreamWriter file, byte[] str)
{
file.Write("\n private static ReadOnlySpan<byte> " + tableName + " => new byte[" + str.Length + "]\n {\n");
Member:

would using $".." be better here?

Author (pentp):

I think it was this way in the Perl script also; it doesn't make much difference either way.
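
For reference, the quoted line rewritten with an interpolated string (the reviewer's suggestion) would look something like this; behavior is identical, only the syntax changes:

// {{ escapes a literal '{' inside an interpolated string.
file.Write($"\n private static ReadOnlySpan<byte> {tableName} => new byte[{str.Length}]\n {{\n");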

tarekgh commented Nov 14, 2018

> Category data is updated to the latest Unicode 11 data (#20589 somehow missed some changes).

what data did we miss there?

tarekgh commented Nov 14, 2018

@pentp thank you for getting this done. Other than my minor comments, LGTM. I assume you ran the latest corefx tests against it, right?

tarekgh commented Nov 14, 2018

CC @GrabYourPitchforks

// Note that & has lower precedence than addition, so don't forget the parentheses.
index = s_pCategoryLevel1Index[index + ((ch >> 4) & 0x000f)];
index = Unsafe.As<byte, ushort>(ref Unsafe.AsRef(in CategoryLevel2Index[(index << 6) + ((ch >> 3) & 0b111110)]));
Member:

This may or may not be aligned for ushort. It will cause unpredictable performance that changes from build to build in the better case, and crashes on platforms with strict alignment requirements in the worse case.

(Same comment for other uses of Unsafe.As.)

Author (pentp), Nov 14, 2018:

Are there absolutely no alignment guarantees for static <PrivateImplementationDetails> fields?

I can change it to Unsafe.ReadUnaligned, or should I change it back to a regular ushort[] array? I would guess that an unaligned read might actually be faster here.

For InternalGetNumericValue I'll just change to an unaligned read (it isn't 8-byte aligned in the current version either).

Member:

I do not think there is any alignment required by the spec, but the tools seem to produce at least 4-byte alignment in practice, based on a few experiments. So changing this to Unsafe.ReadUnaligned (just to be sure) should be ok.
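
A minimal sketch of the Unsafe.ReadUnaligned variant being discussed, applied to the line quoted above (same index computation, only the read changes; this is an illustration, not necessarily the exact merged code):

// Unaligned 16-bit read: correct regardless of how the underlying data blob happens
// to be aligned, unlike reinterpreting the byte reference with Unsafe.As<byte, ushort>.
index = Unsafe.ReadUnaligned<ushort>(
    ref Unsafe.AsRef(in CategoryLevel2Index[(index << 6) + ((ch >> 3) & 0b111110)]));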

pentp commented Nov 14, 2018

> Category data is updated to the latest Unicode 11 data (#20589 somehow missed some changes).

> what data did we miss there?

Looking at https://www.unicode.org/Public/UCD/latest/ucd/DerivedAge.txt, it looks like the previous data was actually Unicode 8.0???

This is an output diff between the version in master and an updated version (but still using the original Perl script): https://gist.github.com/pentp/944c49cebe3b7ae9c1f7539de16a5e6e

Maybe I have confused some file versions somewhere or something (I have a bunch of branches and compilations of CoreCLR). In any case, I need to run the latest CoreFX tests.

index = NumericLevel2Index[(index << 4) + ((ch >> 4) & 0x000f)];
index = NumericLevel3Index[(index << 4) + (ch & 0x000f)];

Debug.Assert(BitConverter.IsLittleEndian);
Member:

We want the code to be reusable in Mono. Could you please change this to a dynamic check instead of an assert?

Author (pentp):

Should I add reversing logic for big-endian systems here, or just throw?
The current implementation just returns garbage on big-endian systems in this function, I think.

Member:

Add reversing logic for big-endian systems.

Member:

@jkotas this should be added inside #ifdef BIGENDIAN, right?

Member:

We prefer dynamic checks - that is what works for Mono, and the JIT is usually able to optimize them out completely. Mono is not set up for the BIGENDIAN ifdef; they have to find all the places with the BIGENDIAN ifdef and replace them with dynamic checks when reusing our code.
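
A minimal sketch of the dynamic check being asked for, using the same table access as above (illustrative; it assumes the blob is stored little-endian and uses BinaryPrimitives.ReverseEndianness from System.Buffers.Binary to fix up the value on big-endian targets):

ushort value = Unsafe.ReadUnaligned<ushort>(
    ref Unsafe.AsRef(in CategoryLevel2Index[(index << 6) + ((ch >> 3) & 0b111110)]));
if (!BitConverter.IsLittleEndian)
{
    // BitConverter.IsLittleEndian is a JIT-time constant, so this branch is removed
    // entirely on little-endian targets.
    value = BinaryPrimitives.ReverseEndianness(value);
}
index = value;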

tarekgh commented Nov 14, 2018

Could you please test your changes with the tests at https://github.com/dotnet/corefx/tree/master/src/System.Globalization/tests from corefx master? Just to ensure we work fine.

pentp commented Nov 15, 2018

I added big-endian support and tested with CoreFX System.Globalization tests and verified that the test used UnicodeData.11.0.txt as the reference file.

tarekgh commented Nov 15, 2018

thanks @pentp

@jkotas are you ok to merge, or do you have any more comments?

@jkotas jkotas merged commit 8241c49 into dotnet:master Nov 15, 2018
@pentp pentp deleted the UnicodeData branch November 15, 2018 18:26
// 12:4:4 index table of the Unicode category data.
private static ushort[] s_pCategoryLevel1Index = new ushort[]
// 11:5:4 index table of the Unicode category data.
private static ReadOnlySpan<byte> CategoryLevel1Index => new byte[2176]


I think this relies on quite advanced JIT optimization to work better than the previous code. Basically we are generating IL code which we hope will be executed differently than it is written. It uses C# getter syntax with newarr and expects the JIT to convert this into an alloc-free read from a static memory map.

/cc @jkotas @vargaz


The problem seems to be that this code will still compile with csc /langversion:7.1, but it will generate a newobj + InitializeArray pair for every access to the variable.

marek-safar, Nov 19, 2018:

Presumably, the C# codegen should not be language-version specific, but based on the availability of System.ReadOnlySpan`1<uint8>::.ctor(void*,int32).

Member:

Right, this does not depend on any special JIT optimizations. This optimization is done by Roslyn: dotnet/roslyn#24621.


I'd suggest Roslyn should fail to compile this code when it cannot make the optimization, since the fallback version is very slow.

Member:

> I'd suggest Roslyn should fail to compile this code when it cannot make the optimization

Open an issue on this in the Roslyn repo if you would like to see this happen.

It will never happen in practice, since System.ReadOnlySpan`1<uint8>::.ctor(void*,int32) is present in all ReadOnlySpan implementations out there.


It'll actually happen in practice, because csc defaults to langversion:7, which generates the ugly code with newarr rather than the optimized version.

Member:

@marek-safar, what are you arguing for? That we not take advantage of improved C# compiler support to improve the efficiency of coreclr?


As I suggested above, I think this feature should be enabled not based on /langversion but based on the availability of System.ReadOnlySpan`1<uint8>::.ctor(void*,int32).

Member:

> As I suggested above, I think this feature should be enabled not based on /langversion but based on the availability of System.ReadOnlySpan`1::.ctor(void*,int32)

Ok, thanks. Sounds like an issue for the dotnet/roslyn repo.
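
For readers following the thread, the pattern under discussion looks like the snippet below (class name and data are placeholders). When the compiler applies the optimization from dotnet/roslyn#24621, the byte data is emitted into the PE image and the property returns a span directly over it with no allocation; with an older /langversion the same source falls back to allocating and initializing a new array on every access, which is the slow path debated above.

using System;

internal static class UnicodeTablesSample
{
    // Constant blob exposed as ReadOnlySpan<byte>. With the Roslyn optimization this
    // compiles to a span over static data; without it, each call allocates a byte[].
    private static ReadOnlySpan<byte> CategoryLevel1Index => new byte[] { 0x00, 0x10, 0x20, 0x30 };

    internal static byte FirstEntry() => CategoryLevel1Index[0];
}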

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…clr#20983)

* Update Unicode data and optimize CharUnicodeInfo indexes

* Add new GenUnicodeProp to tools

* Add licence headers

* Add big-endian support


Commit migrated from dotnet/coreclr@8241c49