Skip to content

Rollup of 5 pull requests #70499

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Mar 28, 2020
Merged

Rollup of 5 pull requests #70499

merged 27 commits into from
Mar 28, 2020

Conversation

Dylan-DPC-zz
Copy link

Successful merges:

Failed merges:

r? @ghost

Mark-Simulacrum and others added 27 commits March 20, 2020 12:11
If the unicode-downloads folder already exists, we likely just fetched the data,
so don't make any further network requests. Unicode versions are released rarely
enough that this doesn't matter much in practice.
Try chunk sizes between 1 and 64, selecting the one which minimizes the number
of bytes used. 16, the previous constant, turned out to be a rather good choice,
with 5/9 of the datasets still using it.

Alphabetic     : 3036 bytes    (- 19 bytes)
Case_Ignorable : 2136 bytes
Cased          : 934 bytes
Cc             : 32 bytes      (- 11 bytes)
Grapheme_Extend: 1774 bytes
Lowercase      : 985 bytes
N              : 1225 bytes    (- 41 bytes)
Uppercase      : 934 bytes
White_Space    : 97 bytes      (- 43 bytes)
Total table sizes: 11153 bytes (-114 bytes)
Currently the test file takes a while to compile -- 30 seconds or so -- but
since it's not going to be committed, and is just for local testing, that seems
fine.
This avoids wasting a small amount of space for some of the data sets.

The chunk resizing is caused by but not directly related to changes in this
commit.

Alphabetic     : 3036 bytes
Case_Ignorable : 2133 bytes    (- 3 bytes)
Cased          : 934 bytes
Cc             : 32 bytes
Grapheme_Extend: 1760 bytes    (-14 bytes)
Lowercase      : 985 bytes
N              : 1220 bytes    (- 5 bytes)
Uppercase      : 934 bytes
White_Space    : 97 bytes
Total table sizes: 11131 bytes (-22 bytes)
Previously, all words in the (deduplicated) bitset would be stored raw -- a full
64 bits (8 bytes). Now, those words that are equivalent to others through a
specific mapping are stored separately and "mapped" to the original when
loading; this shrinks the table sizes significantly, as each mapped word is
stored in 2 bytes (a 4x decrease from the previous).

The new encoding is also potentially non-optimal: the "mapped" byte is
frequently repeated, as in practice many mapped words use the same base word.

Currently we only support two forms of mapping: rotation and inversion. Note
that these are both guaranteed to map transitively if at all, and supporting
mappings for which this is not true may require a more interesting algorithm for
choosing the optimal pairing.

Updated sizes:

Alphabetic     : 2622 bytes     (-  414 bytes)
Case_Ignorable : 1803 bytes     (-  330 bytes)
Cased          : 808 bytes      (-  126 bytes)
Cc             : 32 bytes
Grapheme_Extend: 1508 bytes     (-  252 bytes)
Lowercase      : 901 bytes      (-   84 bytes)
N              : 1064 bytes     (-  156 bytes)
Uppercase      : 838 bytes      (-   96 bytes)
White_Space    : 91 bytes       (-    6 bytes)
Total table sizes: 9667 bytes   (-1,464 bytes)
This saves less bytes - by far - and is likely not the best operator to choose.
But for now, it works -- a better choice may arise later.

Alphabetic     : 2538 bytes   (- 84 bytes)
Case_Ignorable : 1773 bytes   (- 30 bytes)
Cased          : 790 bytes    (- 18 bytes)
Cc             : 26 bytes     (-  6 bytes)
Grapheme_Extend: 1490 bytes   (- 18 bytes)
Lowercase      : 865 bytes    (- 36 bytes)
N              : 1040 bytes   (- 24 bytes)
Uppercase      : 778 bytes    (- 60 bytes)
White_Space    : 85 bytes     (-  6 bytes)
Total table sizes: 9385 bytes (-282 bytes)
This ensures that what we test is what we get for final results as well.
This optimizes slightly better.

Alphabetic     : 2536 bytes
Case_Ignorable : 1771 bytes
Cased          : 788 bytes
Cc             : 24 bytes
Grapheme_Extend: 1488 bytes
Lowercase      : 863 bytes
N              : 1038 bytes
Uppercase      : 776 bytes
White_Space    : 83 bytes
Total table sizes: 9367 bytes  (-18 bytes; 2 bytes per set)
We find that it is common for large ranges of chars to be false -- and that
means that it is plausibly common for us to ask about a word that is entirely
empty. Therefore, we should make sure that we do not need to rotate bits or
otherwise perform some operation to map to the zero word; canonicalize it first
if possible.
LLVM seems to at least sometimes optimize better when the length comes directly
from the `len()` of the array vs. an equivalent integer.

Also, this allows easier copy/pasting of the function into compiler explorer for
experimentation.
This arranges for the sparser sets (everything except lower and uppercase) to be
encoded in a significantly smaller context. However, it is also a performance
trade-off (roughly 3x slower than the bitset encoding). The 40% size reduction
is deemed to be sufficiently important to merit this performance loss,
particularly as it is unlikely that this code is hot anywhere (and if it is,
paying the memory cost for a bitset that directly represents the data seems
worthwhile).

Alphabetic     : 1599 bytes     (- 937 bytes)
Case_Ignorable : 949 bytes      (- 822 bytes)
Cased          : 359 bytes      (- 429 bytes)
Cc             : 9 bytes        (-  15 bytes)
Grapheme_Extend: 813 bytes      (- 675 bytes)
Lowercase      : 863 bytes
N              : 419 bytes      (- 619 bytes)
Uppercase      : 776 bytes
White_Space    : 37 bytes       (-  46 bytes)
Total table sizes: 5824 bytes   (-3543 bytes)
In practice, for the two data sets that still use the bitset encoding (uppercase
and lowercase) this is not a significant win, so just drop it entirely. It costs
us about 5 bytes, and the complexity is nontrivial.
Mozilla's IRC service was shut down in March 2020. The official
instant messaging variant has been Discord for a while, and most of
the links were already replaced by rust-lang#61524.

This was the last line that came up with `irc.mozilla.org` or any
combination of "irc.*#[a-z]+" in a `git grep`:

    git grep -i -E "irc.*#[a-z]+"

As there is only one other link directly to Rust's discord, I used the
same Markdown link `[rust-discord]` as in `bootstrap/README.md` to
stay consistent. This might come in handy if the chat platform changes
at a later point again.

As an aside: for those interested in the use of IRC, Mozilla's [wiki]
still offers a lot of in-depth knowledge.

 [wiki]: https://wiki.mozilla.org/IRC
Add long error explanation for E0703

Add long explanation for the E0703 error code
Part of rust-lang#61137

r? @GuillaumeGomez
…t-dir, r=GuillaumeGomez

Create output dir in rustdoc markdown render

`rustdoc` command on a standalone markdown document fails because the output directory (which default to `doc/`) is not created, even when specified with the `--output` argument.

This PR adds the creation of the output directory before the file creation to avoid an unexpected error which is unclear.

I am not sure about the returned error code. I did not find a table explaining them. So I simply put the same error code that is returned when `File::create` fails because they are both related to file-system errors.

Resolve rust-lang#70431
…tolnay

Shrink Unicode tables (even more)

This shrinks the Unicode tables further, building upon the wins in rust-lang#68232 (the previous counts differ due to an interim Unicode version update, see rust-lang#69929.

The new data structure is slower by around 3x, on the benchmark of looking up every Unicode scalar value in each data set sequentially in every data set included. Note that for ASCII, the exposed functions on `char` optimize with direct branches, so ASCII will retain the same performance regardless of internal optimizations (or the reverse). Also, note that the size reduction due to the skip list (from where the performance losses come) is around 40%, and, as a result, I believe the performance loss is acceptable, as the routines are still quite fast. Anywhere where this is hot, should probably be using a custom data structure anyway (e.g., a raw bitset) or something optimized for frequently seen values, etc.

This PR updates both the bitset data structure, and introduces a new data structure similar to a skip list. For more details, see the [main.rs] of the table generator, which describes both. The commits mostly work individually and document size wins.

As before, this is tested on all valid chars to have the same results as nightly (and the canonical Unicode data sets), happily, no bugs were found.

[main.rs]: https://github.com/rust-lang/rust/blob/fb4a715e18b/src/tools/unicode-table-generator/src/main.rs

Set             | Previous |  New  |  % of old  | Codepoints | Ranges |
----------------|---------:|------:|-----------:|-----------:|-------:|
Alphabetic      |     3055 |  1599 |        52% |     132875 |    695 |
Case Ignorable  |     2136 |   949 |        44% |       2413 |    410 |
Cased           |      934 |   359 |        38% |       4286 |    141 |
Cc              |       43 |     9 |        20% |         65 |      2 |
Grapheme Extend |     1774 |   813 |        46% |       1979 |    344 |
Lowercase       |      985 |   867 |        88% |       2344 |    652 |
N               |     1266 |   419 |        33% |       1781 |    133 |
Uppercase       |      934 |   777 |        83% |       1911 |    643 |
White_Space     |      140 |    37 |        26% |         25 |     10 |
----------------|----------|-------|------------|------------|--------|
Total           |    11267 |  5829 |        51% |     -      |   -    |
…Gomez

Fix rustdoc.css CSS tab-size property

This fixes the CSS tab size property names which are called `tab-size` / `-moz-tab-size` and not `tab-width`

Old issue rust-lang#49155 and related PR rust-lang#50947

tab-size: https://developer.mozilla.org/en-US/docs/Web/CSS/tab-size
Replace last mention of IRC with Discord

Mozilla's IRC service was shut down in March 2020. The official instant messaging variant has been Discord for a while, and most of the links were already replaced by rust-lang#61524.

This was the last line that came up with `irc.mozilla.org` or any combination of "irc.*#[a-z]+" in a `git grep`:

    git grep -i -E "irc.*#[a-z]+"

As there is only one other link directly to Rust's discord, I used the same Markdown link `[rust-discord]` as in `bootstrap/README.md` to stay consistent. This might come in handy if the chat platform changes at a later point again.
@Dylan-DPC-zz
Copy link
Author

@bors r+ rollup=never p=5

@bors
Copy link
Collaborator

bors commented Mar 28, 2020

📌 Commit e3ccd5b has been approved by Dylan-DPC

@bors bors added the S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. label Mar 28, 2020
@bors
Copy link
Collaborator

bors commented Mar 28, 2020

⌛ Testing commit e3ccd5b with merge c52cee1...

@bors
Copy link
Collaborator

bors commented Mar 28, 2020

☀️ Test successful - checks-azure
Approved by: Dylan-DPC
Pushing c52cee1 to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Mar 28, 2020
@bors bors merged commit c52cee1 into rust-lang:master Mar 28, 2020
@bors bors mentioned this pull request Mar 28, 2020
@Dylan-DPC-zz Dylan-DPC-zz deleted the rollup-f9je1l8 branch March 28, 2020 20:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merged-by-bors This PR was explicitly merged by bors. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants