Shrink unicode case-mapping LUTs by 24k #109216
Conversation
Since ASCII chars are already handled by a special case in the `to_lower` and `to_upper` functions, there's no need to waste space on them in the LUTs.
The majority of char case replacements are single-char replacements, so storing them as `[char; 3]` wastes a lot of space. This commit splits the replacement tables for both `to_lower` and `to_upper` into two separate tables: one with single-character mappings and one with multi-character mappings. This reduces the binary size of programs using all of these tables by roughly 24K.
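A minimal sketch of what such a split lookup might look like (the table names, contents, and `[char; 3]` return shape here are illustrative assumptions, not the actual generated tables in `core`):

```rust
/// Sorted single-char mappings: the common case of one `char` in, one out.
/// (Tiny hypothetical excerpt; the real tables are generated from Unicode data.)
static LOWERCASE_SINGLE: &[(char, char)] = &[('Γ', 'γ'), ('Δ', 'δ'), ('Θ', 'θ')];

/// Sorted multi-char mappings: the rare expansions of up to three chars.
static LOWERCASE_MULTI: &[(char, [char; 3])] = &[('İ', ['i', '\u{307}', '\0'])];

fn to_lower(c: char) -> [char; 3] {
    if c.is_ascii() {
        // ASCII fast path: these chars never appear in the tables at all.
        return [c.to_ascii_lowercase(), '\0', '\0'];
    }
    // First search the compact single-char table...
    if let Ok(i) = LOWERCASE_SINGLE.binary_search_by(|&(k, _)| k.cmp(&c)) {
        return [LOWERCASE_SINGLE[i].1, '\0', '\0'];
    }
    // ...and only fall back to the multi-char table on a miss.
    match LOWERCASE_MULTI.binary_search_by(|&(k, _)| k.cmp(&c)) {
        Ok(i) => LOWERCASE_MULTI[i].1,
        Err(_) => [c, '\0', '\0'], // no mapping: the char lowercases to itself
    }
}
```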
(rustbot has picked a reviewer for you, use r? to override)
Hey! It looks like you've submitted a new PR for the library teams!
cc @thomcc
It would be possible to avoid the second table lookup by storing an index into the table with the long expansions.
That is a very good idea! I'll see if I can get something working a bit later. EDIT: Seems to be working nicely @nikic; it changed the slowdown on uppercasing into a speedup there too 👍
The indices are encoded as `u32`s in the range of invalid `char`s, so if any mapping fails to parse as a `char`, we know to use the value as an index for lookup in the multi-table. This avoids the second binary search in cases where a multi-`char` mapping is needed. Idea from @nikic.
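A hedged sketch of that trick (the table names, contents, and the exact sentinel offset are assumptions for illustration, not the real generated tables in `core::unicode`):

```rust
/// A stored value is either a valid `char` (as `u32`) or, for multi-char
/// mappings, an index into LOWERCASE_MULTI shifted into the invalid range
/// above char::MAX. (Hypothetical excerpt and offset choice.)
const INDEX_BASE: u32 = char::MAX as u32 + 1; // 0x110000, never a valid char

static LOWERCASE_TABLE: &[(char, u32)] = &[
    ('İ', INDEX_BASE + 0),  // invalid as a char => index 0 into the multi table
    ('Δ', 'δ' as u32),      // common case: the value is itself a char
];

static LOWERCASE_MULTI: &[[char; 3]] = &[['i', '\u{307}', '\0']];

fn to_lower(c: char) -> [char; 3] {
    match LOWERCASE_TABLE.binary_search_by(|&(k, _)| k.cmp(&c)) {
        Err(_) => [c, '\0', '\0'], // not in the table: maps to itself
        Ok(i) => match char::from_u32(LOWERCASE_TABLE[i].1) {
            // Parses as a char: the common single-char mapping.
            Some(l) => [l, '\0', '\0'],
            // Fails to parse: the value is an index, so no second search.
            None => LOWERCASE_MULTI[(LOWERCASE_TABLE[i].1 - INDEX_BASE) as usize],
        },
    }
}
```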
@bors try @rust-timer queue
⌛ Trying commit 355e1dd with merge 7dff56ce084e1c8649af8ffed641f398d0e5ac9f...
☀️ Try build successful - checks-actions
r=me presuming perf looks ok
Finished benchmarking commit (7dff56ce084e1c8649af8ffed641f398d0e5ac9f): comparison URL.

Overall result: no relevant changes - no action needed.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never

Instruction count: this benchmark run did not return any relevant results for this metric.

Max RSS (memory usage): this is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: this benchmark run did not return any relevant results for this metric.
@bors r+ rollup=iffy
🌲 The tree is currently closed for pull requests below priority 100. This pull request will be tested once the tree is reopened.
@bors r+ rollup=never
⌛ Testing commit 54f55ef with merge e5101b55d472f766f6e09e9077d59c575e31d3dd...
💔 Test failed - checks-actions
The job failed. Click to see the possible cause of the failure (guessed by this bot).
Seems unrelated?
Let's see if it's spurious. @bors retry
☀️ Test successful - checks-actions |
Finished benchmarking commit (f421586): comparison URL.

Overall result: ✅ improvements - no action needed.

@rustbot label: -perf-regression

Instruction count: this is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): this is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: this is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
I was looking into the binary bloat of a small program using `str::to_lowercase` and `str::to_uppercase`, and noticed that the lookup tables used for case mapping had a lot of zero bytes in them. The reason is that since some characters map to up to three other characters when lower- or uppercased, the LUTs store a `[char; 3]` for each character. However, the vast majority of cases map to only a single new character; in other words, most of the entries are e.g. `(lowerc, [upperc, '\0', '\0'])`.
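For a concrete feel of how lopsided this is, the mappings are observable through the standard library (a quick demo, not part of the PR):

```rust
fn main() {
    // The overwhelmingly common case: one char maps to exactly one char.
    assert_eq!('Δ'.to_lowercase().collect::<String>(), "δ");

    // A rare two-char expansion: 'ß' uppercases to "SS".
    assert_eq!('ß'.to_uppercase().collect::<String>(), "SS");

    // U+0390 expands to three chars when uppercased, which is why
    // the tables need room for a [char; 3] at all.
    assert_eq!('\u{0390}'.to_uppercase().count(), 3);
}
```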
This PR introduces a new encoding scheme for these tables. The changes reduce the size of my test binary by about 24K.
I've also done some `#[bench]`marks on Unicode-heavy test data, and found that the performance of both `str::to_lowercase` and `str::to_uppercase` improves by up to 20%. These measurements are obviously very dependent on the character distribution of the data.

Someone else will have to decide whether this more complex scheme is worth it or not; I was just goofing around a bit and here's what came out of it 🤷‍♂️ No hard feelings if this isn't wanted!