Metadata is too big for its own good #21482

Closed
alexcrichton opened this issue Jan 21, 2015 · 15 comments
Labels
A-metadata Area: Crate metadata metabug Issues about issues themselves ("bugs about bugs")

Comments

@alexcrichton
Member

Couldn't find a previous issue on this, so I'd like to open a tracking issue for this. We've known this for a long time, but the metadata format for the compiler is far too large and there are surely methods to shrink its size and impact. Today when I compile librustc, I get the following numbers:

  • librustc.rlib - 64MB
    • rustc.o - 12MB
    • rust.metadata.bin - 32MB
    • rustc.0.bytecode.deflate - 21MB

This means that the metadata is nearly three times as large as the code we're generating (32MB vs. 12MB). Another statistic is that 36% of the binary data of the nightly is metadata.

There are, however, a number of competing concerns around metadata:

  • Reading metadata needs to be fast. Rustc reads a lot of metadata for upstream crates, and it needs to quickly be able to read the minimum set of metadata for crates. This is currently achieved by storing metadata uncompressed in rlibs to allow LLVM to mmap it directly into the address space and page it in for reading.
  • Metadata needs to be fairly free-form to allow encoding various types of data into it. Ideally it's also extensible into the future so at some point we can use newer compilers against older libraries (this is currently not possible for other reasons).
  • All libraries need metadata available in them (currently). This means that if a library produces a dylib/rlib pair (the stdlib is one of the few that does this) then the metadata is duplicated among artifacts. It also means that metadata must be suitable to place inside of a dynamic library.
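
To illustrate the first point, here is a minimal Python sketch of why uncompressed storage helps: a memory-mapped file lets a reader touch only the bytes it needs. The file contents, the `read_slice` helper, and the offsets are hypothetical stand-ins, not rustc's actual layout or API.

```python
import mmap
import os
import tempfile


def read_slice(path, offset, length):
    """Return `length` bytes at `offset` without reading the whole file."""
    with open(path, "rb") as f:
        # Map the file read-only; the OS pages data in only when it is touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return bytes(view[offset:offset + length])


# Hypothetical stand-in for an uncompressed metadata blob (not the real format).
blob = b"HEADER" + bytes(1024 * 1024) + b"ITEM"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(blob)
    demo_path = tmp.name

# Fetch only the last four bytes; the megabyte in between is never copied.
tail = read_slice(demo_path, len(blob) - 4, 4)
os.remove(demo_path)
```

A compressed blob would have to be inflated before any such random access is possible, which is the trade-off described above.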

There are a few open issues on this already, but none of them are necessarily a silver bullet. Here's a smattering of wishlist ideas or various strategies.

More will likely be added to this over time as it's a metabug.

@alexcrichton alexcrichton added A-metadata Area: Crate metadata metabug Issues about issues themselves ("bugs about bugs") labels Jan 21, 2015
@lifthrasiir
Contributor

For your information, I had written a custom metadata decoder in Python for debugging #15309 (the table is a bit out of date, but still works), and out of curiosity I've also made some basic efforts to reduce the metadata overhead. The rewrite code does the job, and spews the following outputs for 2015-01-20 nightly's lib/libstd-4e7c5e5c.so:

# original compressed size, the total size of given binary (*.so).
2117863 5081718
# uncompressed size of various encoding strategies.
# - orig: the original (unaltered)
# - relax: optimized size fields. the original metadata uses lots of
#   4-byte-long sizes even when fewer bytes are sufficient;
#   reclaiming them will require some work.
# - no-label: relax + no `Label` node. used for debugging purposes but
#   never disabled afaik.
# - size-elide: relax + one-byte tag. all tags are <0x100, so ignoring EBML
#   (requires 2-byte encoding for tags >=0x80) gives some gains.
#   also do not add sizes for known fixed-size tags (e.g. `U64`).
# - size-elide-2-or-4: one-byte tag + another relaxation strategy.
#   uses different size encoding algorithm: 2 bytes (big endian)
#   for sizes <0x8000, 4 bytes with MSB set otherwise.
#   trade-off between size and performance.
# - size-elide-4: one-byte tag + fixed 4-byte-long size.
orig 16084126 relax 13577526 no-label 8654851 size-elide 13019868 size-elide-2-or-4 14418657 size-elide-4 17563943
# recompressed (zlib -9) size of above.
# note that the original compressed size is *not* optimal.
orig 2087004 relax 1991335 no-label 1747192 size-elide 1966400 size-elide-2-or-4 2014455 size-elide-4 2123731
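
As a rough illustration of the `size-elide-2-or-4` size encoding described in the comments above, here is a small Python sketch. The function names are made up for this example, and the 31-bit limit on the long form is an assumption; it is not the code of the actual rewrite script.

```python
import struct


def encode_size_2_or_4(size):
    """Encode a size as 2 big-endian bytes if < 0x8000, else 4 bytes with the MSB set."""
    if size < 0:
        raise ValueError("size must be non-negative")
    if size < 0x8000:
        # Short form: the top bit of the first byte is always clear.
        return struct.pack(">H", size)
    if size >= 0x80000000:
        # Assumption: the long form carries a 31-bit payload.
        raise ValueError("size does not fit in 31 bits")
    # Long form: set the most significant bit as the length marker.
    return struct.pack(">I", size | 0x80000000)


def decode_size_2_or_4(buf, pos=0):
    """Decode a size at `pos`; return (size, position after the size field)."""
    if buf[pos] & 0x80:
        (raw,) = struct.unpack_from(">I", buf, pos)
        return raw & 0x7FFFFFFF, pos + 4
    (raw,) = struct.unpack_from(">H", buf, pos)
    return raw, pos + 2
```

The decoder only has to test one bit to know the field width, which is the performance side of the trade-off; the cost is that even tiny sizes take two bytes.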

@alexcrichton
Member Author

Wow, those are some nice wins! I had no idea that existed! If we could implement some of those optimizations today that would be awesome.

@lifthrasiir
Contributor

@alexcrichton Does any breaking modification to metadata need a new snapshot? I'm afraid such modifications cannot easily be done incrementally.

@alexcrichton
Member Author

Thankfully you shouldn't need a snapshot: the stage N compiler conveniently only ever reads metadata generated by itself, so there are no bootstrapping issues.

@michaelwoerister
Member

storing metadata uncompressed in rlibs to allow LLVM to mmap it directly into the address space and page it in for reading.

Is this true? I was under the impression that:

  1. We zip compress LLVM bitcode before storing it
  2. We do not compress other metadata (ASTs mostly) which isn't used by LLVM.

@lifthrasiir
Contributor

@michaelwoerister Rustc uses LLVM's memory map abstraction to mmap the executable. LLVM itself does not use the metadata.

@michaelwoerister
Member

@lifthrasiir
Contributor

I'm currently working on two temporary but public branches:

  • One branch that does not change the metadata format itself (compact-metadata)
  • One branch that actively changes the metadata format (metadata-reform), which is intended to be rebased after the former

Any suggestions or patches would be appreciated.

bors added a commit that referenced this issue Mar 3, 2015
This is a series of individual but correlated changes to the metadata format. The changes are significant enough that it (finally) bumps the metadata encoding version. In brief, they altogether reduce the total size of stage1 binaries by 27% (!!!!). Almost every low-hanging fruit has been considered and fixed; see the individual commits for details.

Detailed library (not just metadata) size changes for x86_64-unknown-linux-gnu stage1 binaries (baseline being 3a96d6a):

````
   before     after  delta path
--------- --------- ------ --------------------------------
  1706146   1050412  38.4% liballoc-4e7c5e5c.rlib
   398576    152454  61.8% libarena-4e7c5e5c.rlib
    71441     56892  20.4% libarena-4e7c5e5c.so
 14424754   5084102  64.8% libcollections-4e7c5e5c.rlib
 39143186  14743118  62.3% libcore-4e7c5e5c.rlib
   195574    188150   3.8% libflate-4e7c5e5c.rlib
   153123    152603   0.3% libflate-4e7c5e5c.so
   477152    215262  54.9% libfmt_macros-4e7c5e5c.rlib
    77728     66601  14.3% libfmt_macros-4e7c5e5c.so
  1216936    684104  43.8% libgetopts-4e7c5e5c.rlib
   207846    181116  12.9% libgetopts-4e7c5e5c.so
   349722    147530  57.8% libgraphviz-4e7c5e5c.rlib
    60196     49197  18.3% libgraphviz-4e7c5e5c.so
   729842    259906  64.4% liblibc-4e7c5e5c.rlib
   349358    247014  29.3% liblog-4e7c5e5c.rlib
    88878     83163   6.4% liblog-4e7c5e5c.so
  1968508    732840  62.8% librand-4e7c5e5c.rlib
  1968204    696326  64.6% librbml-4e7c5e5c.rlib
   283207    206589  27.1% librbml-4e7c5e5c.so
 72369394  46401230  35.9% librustc-4e7c5e5c.rlib
 11941372  10498483  12.1% librustc-4e7c5e5c.so
  2717894   1983272  27.0% librustc_back-4e7c5e5c.rlib
   501900    464176   7.5% librustc_back-4e7c5e5c.so
    15058     12588  16.4% librustc_bitflags-4e7c5e5c.rlib
  4008268   2961912  26.1% librustc_borrowck-4e7c5e5c.rlib
   837550    785633   6.2% librustc_borrowck-4e7c5e5c.so
  6473348   6095470   5.8% librustc_driver-4e7c5e5c.rlib
  1448785   1433945   1.0% librustc_driver-4e7c5e5c.so
 95483688  94779704   0.7% librustc_llvm-4e7c5e5c.rlib
 43516815  43487809   0.1% librustc_llvm-4e7c5e5c.so
   938140    817236  12.9% librustc_privacy-4e7c5e5c.rlib
   182653    176563   3.3% librustc_privacy-4e7c5e5c.so
  4390288   3543284  19.3% librustc_resolve-4e7c5e5c.rlib
   872981    831824   4.7% librustc_resolve-4e7c5e5c.so
 18176426  14795426  18.6% librustc_trans-4e7c5e5c.rlib
  3657354   3480026   4.8% librustc_trans-4e7c5e5c.so
 16815076  13868862  17.5% librustc_typeck-4e7c5e5c.rlib
  3274439   3123898   4.6% librustc_typeck-4e7c5e5c.so
 21372308  14890582  30.3% librustdoc-4e7c5e5c.rlib
  4501971   4172202   7.3% librustdoc-4e7c5e5c.so
  8055028   2951044  63.4% libserialize-4e7c5e5c.rlib
   958101    710016  25.9% libserialize-4e7c5e5c.so
 30810208  15160648  50.8% libstd-4e7c5e5c.rlib
  6819003   5967485  12.5% libstd-4e7c5e5c.so
 58850950  31949594  45.7% libsyntax-4e7c5e5c.rlib
  9060154   7882423  13.0% libsyntax-4e7c5e5c.so
  1474310   1062102  28.0% libterm-4e7c5e5c.rlib
   345577    323952   6.3% libterm-4e7c5e5c.so
  2827854   1643056  41.9% libtest-4e7c5e5c.rlib
   517811    452519  12.6% libtest-4e7c5e5c.so
  2274106   1761240  22.6% libunicode-4e7c5e5c.rlib
--------- --------- ------ --------------------------------
499359187 363465583  27.2% total
````
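
The delta column appears to be the relative shrinkage, `(before - after) / before`, rounded to one decimal place. That formula is inferred from the numbers rather than stated in the commit message; a quick Python check against a few rows copied from the table (the `shrink_percent` helper is just for illustration):

```python
def shrink_percent(before, after):
    """Relative size reduction in percent, rounded to one decimal place."""
    return round(100.0 * (before - after) / before, 1)


# (before, after) byte counts copied from the table above.
rows = {
    "libcore-4e7c5e5c.rlib": (39143186, 14743118),
    "libflate-4e7c5e5c.so": (153123, 152603),
    "libstd-4e7c5e5c.rlib": (30810208, 15160648),
    "total": (499359187, 363465583),
}

for name, (before, after) in rows.items():
    print(f"{shrink_percent(before, after):5.1f}% {name}")
```

Note how the `.so` files shrink far less than the matching `.rlib` files, which fits the observation below that compressed metadata compacts less visibly than uncompressed metadata.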

Some notes:

* Uncompressed metadata compacts very well. It is less visible for compressed metadata but still it achieves about 5~10% reduction.
* *Every* commit is designed to reduce the metadata in one way. There is absolutely no negative size impact associated with the changes (that's why the table above doesn't contain a minus delta).
* I've confirmed that this compiles through `make all`, making it almost correct. Other platforms have to be tested though.
* Oh, I'll rebase this as soon as I have spare time, but I guess this needs an extensive review anyway.
* I haven't rigorously checked the encoder and decoder performance. I tried to minimize the impact (some encodings are actually simpler than the original), but I'm not sure.

Fixes #2743, #9303 (partially) and #21482.
@steveklabnik
Member

Given #22971 was merged, is this fixed? It's hard to tell from

Fixes #2743, #9303 (partially) and #21482.

which implies the first and last were total, and the middle one was partial?

@lifthrasiir
Contributor

@steveklabnik #2743 is fully fixed. I think I said #9303 is fixed partially because it does not really fix the naming issue ("Rename it from ebml to atom_trees, change all internal naming"), but in retrospect you can safely close that. I guess this metabug needs to stay open since the metadata reduction is ongoing work (my PR was a collection of low-hanging fruit) and we probably need a central place to discuss that.

@michaelwoerister
Member

Some updates pertaining to this issue:

@jonas-schievink
Contributor

#35764 has significantly reduced the size of metadata, since #[inline]d functions are no longer stored as ASTs. The metadata of libcore was more than halved!

rustup update saw quite an improvement:

info: downloading component 'rustc'
 38.2 MiB /  38.2 MiB (100 %)   1.8 MiB/s ETA:   0 s                
info: downloading component 'rust-std'
 46.1 MiB /  46.1 MiB (100 %)   1.8 MiB/s ETA:   0 s                

Before, it looked like this:

info: downloading component 'rustc'
 49.0 MiB /  49.0 MiB (100 %)   1.8 MiB/s ETA:   0 s                
info: downloading component 'rust-std'
 61.9 MiB /  61.9 MiB (100 %)   1.8 MiB/s ETA:   0 s                

@steveklabnik
Member

So, at what point is metadata small enough that this bug can be considered fixed?

@michaelwoerister
Member

1 byte, tops

@alexcrichton
Member Author

This is a super old and much less relevant issue now, so closing.
