The Grand Metadata Reform #22971
Commits on Mar 3, 2015
-
metadata: Avoid the use of raw `wr_str` or `write_all`.
They are, in conjunction with `start_tag` and `end_tag`, commonly used to write a document containing binary data of known size. However, `start_tag` always makes the length field 4 bytes long, which is almost never optimal (and requires the relaxation step to remedy). Using the `wr_tagged_*` methods directly is better for both readability and the resulting metadata size.
Commit ac20ded
-
metadata: New tag encoding scheme.
EBML tags are encoded as a variable-length unsigned int (vuint), which is clever but causes some tags to be encoded in two bytes even though there are really only about 180 tags. Assuming that there won't be, say, over 1,000 tags in the future, we can use a much more efficient encoding scheme. The new scheme supports at most 4,096 tags anyway. This also flattens a scattered tag namespace (did you know that 0xa9 is followed by 0xb0?) and makes room for autoserialized tags in 0x00 through 0x1f.
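To make the size argument concrete, here is a minimal sketch rather than the code from this commit. It assumes the usual EBML layout, where a one-byte vuint carries 7 payload bits, and pairs it with one hypothetical fixed scheme that fits the 4,096-tag limit: tags below 0xf0 take one byte, and everything else takes two bytes whose first byte has 0xf in the high nibble. The names `vuint_len`, `encode_tag` and `decode_tag` are made up for illustration.

```rust
/// Bytes needed to store `n` as an EBML-style vuint.
/// Assumed layout: the one-byte form carries 7 payload bits, the two-byte form 14.
fn vuint_len(n: u32) -> Option<usize> {
    match n {
        0..=0x7e => Some(1),
        0x7f..=0x3fff => Some(2),
        _ => None, // longer forms are omitted from this sketch
    }
}

/// Hypothetical fixed scheme for up to 4,096 tags (an assumption, not the
/// layout confirmed by the commit message).
fn encode_tag(tag: u16) -> Option<Vec<u8>> {
    if tag < 0xf0 {
        Some(vec![tag as u8])
    } else if tag < 0x1000 {
        Some(vec![0xf0 | (tag >> 8) as u8, (tag & 0xff) as u8])
    } else {
        None
    }
}

/// Inverse of `encode_tag`: returns the tag and the number of bytes consumed.
fn decode_tag(bytes: &[u8]) -> Option<(u16, usize)> {
    let first = *bytes.first()?;
    if first < 0xf0 {
        Some((first as u16, 1))
    } else {
        let second = *bytes.get(1)?;
        Some(((((first & 0x0f) as u16) << 8) | second as u16, 2))
    }
}
```

Under these assumptions a tag such as 0xa9 needs two bytes as a vuint but only one byte in the fixed scheme, which is where the saving for a tag set of roughly 180 entries comes from.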
Commit 38a965a
-
metadata: Introduce implicit lengths for auto-serialization.
Many auto-serialization tags are fixed-size (note: many ordinary tags are also fixed-size but for now this commit ignores them), so having an explicit length is a waste. This moves any auto-serialization tags with an implicit length before other tags, so a test for them is easy. A preliminary experiment shows this has at least 1% gain over the status quo.
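The idea can be sketched as follows. The tag numbers, the `FIRST_EXPLICIT_TAG` boundary and the helper names are hypothetical; the only property taken from the commit message is that fixed-size tags are numbered before all others, so a single comparison decides whether a length has to be written. The one-byte length form (`0x80 | len`) is likewise an assumption.

```rust
// Hypothetical tag numbers; only their ordering matters for this sketch.
const ES_U8: u8 = 0x00;
const ES_U16: u8 = 0x01;
const ES_U32: u8 = 0x02;
const ES_U64: u8 = 0x03;
/// First tag that still carries an explicit length (hypothetical value).
const FIRST_EXPLICIT_TAG: u8 = 0x04;
const ES_STR: u8 = 0x04;

/// Payload size implied by the tag, or `None` if a length must be written.
fn implicit_len(tag: u8) -> Option<usize> {
    const LENS: [usize; FIRST_EXPLICIT_TAG as usize] = [1, 2, 4, 8];
    LENS.get(tag as usize).copied()
}

/// Writes `tag`, an explicit length only when the tag needs one, then `data`.
fn write_tagged(buf: &mut Vec<u8>, tag: u8, data: &[u8]) {
    buf.push(tag);
    match implicit_len(tag) {
        Some(len) => assert_eq!(data.len(), len, "wrong payload size for fixed-size tag"),
        None => {
            // Assumed one-byte length form; longer lengths are out of scope here.
            assert!(data.len() < 0x80, "sketch only handles one-byte lengths");
            buf.push(0x80 | data.len() as u8);
        }
    }
    buf.extend_from_slice(data);
}
```

With this layout a `u32` costs five bytes (tag plus payload) instead of six, and the reader can recover the payload size from the tag alone.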
Commit c9840b6
-
metadata: Eliminate the `EsEnumBody` tag.
It doesn't serve any useful purpose. It *might* be useful when there are some tags that are generated by `Encodable` and not delimited by any tags, but IIUC that's not the case.

Previous:
                  <-------------------- len1 ------------------->
    EsEnum <len1> EsEnumVid <vid> EsEnumBody <len2> <arg1> <arg2>
                                                    <--- len2 -->

Now:
                  <----------- len1 ---------->
    EsEnum <len1> EsEnumVid <vid> <arg1> <arg2>
Commit 2f3aa0d
-
metadata: Bye bye `EsLabel`. No regrets.
For reference, while it was designed to be selectively enabled, it was essentially enabled throughout every snapshot and nightly as far as I can tell. This makes the usefulness of `EsLabel` itself questionable, as it was quite rare that `EsLabel` broke the build. It had consumed about 20 to 30% of metadata (!) and so this should be a huge win.
Commit 35c798b
-
metadata: Introduce `EsSub8` and `EsSub32` tags.
They replace the existing `EsEnumVid`, `EsVecLen` and `EsMapLen` tags altogether; their meaning can be easily inferred from the enclosing tag. An added benefit is that smaller variant ids or lengths are encoded more compactly (2 bytes instead of 5).
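A rough sketch of how the two tags could share the work, assuming fixed-size payloads and big-endian byte order. The tag numbers and the `write_sub`/`read_sub` helpers are hypothetical and only illustrate the 5-byte versus 2-byte difference.

```rust
// Hypothetical tag numbers for illustration only.
const ES_SUB8: u8 = 0x0a;
const ES_SUB32: u8 = 0x0b;

/// Writes a variant id or a length using the smallest of the two tags.
fn write_sub(buf: &mut Vec<u8>, n: u32) {
    if n <= u8::MAX as u32 {
        buf.push(ES_SUB8);
        buf.push(n as u8);
    } else {
        buf.push(ES_SUB32);
        buf.extend_from_slice(&n.to_be_bytes()); // byte order is an assumption
    }
}

/// Reads the value back; returns it with the number of bytes consumed.
fn read_sub(bytes: &[u8]) -> Option<(u32, usize)> {
    match *bytes.first()? {
        ES_SUB8 => Some((*bytes.get(1)? as u32, 2)),
        ES_SUB32 => {
            let b = bytes.get(1..5)?;
            Some((u32::from_be_bytes([b[0], b[1], b[2], b[3]]), 5))
        }
        _ => None,
    }
}
```

Because the enclosing tag (an enum, a vector or a map) already says what the number means, one pair of tags can stand in for all three of the old ones.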
Commit de00b85
-
metadata: Implement relaxation of short RBML lengths.
We try to move the data when the length can be encoded in a much smaller number of bytes. This interferes with indices and type abbreviations, however, so this commit introduces a public interface to get and mark a "stable" (i.e. not affected by relaxation) position of the current pointer. The relaxation logic only moves small data, currently at most 256 bytes, as moving data can be costly. There might be further opportunities to allow more relaxation by moving fields around, which I didn't seriously try.
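The mechanism can be sketched like this. It is a simplified model under stated assumptions, not the encoder from this commit: `start_tag` reserves a 4-byte length, `end_tag` shrinks it when the payload is small and nothing before it has been marked stable, and lengths use an assumed EBML-style vuint layout. The `Encoder` type, its fields and the `relax_limit` name are chosen for illustration.

```rust
/// Largest payload this sketch is willing to move (the commit mentions 256 bytes).
const RELAX_MAX_SIZE: usize = 0x100;

struct Encoder {
    buf: Vec<u8>,
    /// Offsets of the 4-byte length placeholders of the currently open tags.
    open: Vec<usize>,
    /// Bytes at or after a tag opened before this offset must not move.
    relax_limit: usize,
}

impl Encoder {
    fn new() -> Encoder {
        Encoder { buf: Vec::new(), open: Vec::new(), relax_limit: 0 }
    }

    fn start_tag(&mut self, tag: u8) {
        self.buf.push(tag);
        self.open.push(self.buf.len());
        self.buf.extend_from_slice(&[0u8; 4]); // length is unknown for now
    }

    fn write(&mut self, data: &[u8]) {
        self.buf.extend_from_slice(data);
    }

    /// Returns the current offset and guarantees it survives later relaxation.
    fn mark_stable_position(&mut self) -> usize {
        self.relax_limit = self.buf.len();
        self.relax_limit
    }

    fn end_tag(&mut self) {
        let len_pos = self.open.pop().expect("end_tag without start_tag");
        let data_start = len_pos + 4;
        let size = self.buf.len() - data_start;
        if self.relax_limit < len_pos && size <= RELAX_MAX_SIZE {
            // Assumed short forms: one byte below 0x80, otherwise two bytes.
            let short: Vec<u8> = if size < 0x80 {
                vec![0x80 | size as u8]
            } else {
                vec![0x40 | (size >> 8) as u8, size as u8]
            };
            let new_start = len_pos + short.len();
            self.buf[len_pos..new_start].copy_from_slice(&short);
            self.buf.copy_within(data_start.., new_start); // move payload down
            self.buf.truncate(new_start + size);
        } else {
            assert!(size < (1 << 28), "payload too large for a 4-byte length");
            let n = size as u32;
            let full = [0x10 | (n >> 24) as u8, (n >> 16) as u8, (n >> 8) as u8, n as u8];
            self.buf[len_pos..data_start].copy_from_slice(&full);
        }
    }
}
```

A three-byte payload written between `start_tag` and `end_tag` ends up with a one-byte length, whereas the same payload keeps its 4-byte length if `mark_stable_position` was called inside the tag, since moving it would invalidate the recorded offset.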
Commit 84e9a61
-
metadata: Space-optimize empty vectors and maps.
So that `EsVec 82 EsSub8 00` becomes `EsVec 80` now.
Commit 7b6e43c
-
metadata: Flatten `tag_table_id` and `tag_table_val` tags.
This avoids a biggish eight-byte `tag_table_id` tag in favor of autoserialized integer tags, which are smaller and can later be encoded in the optimal number of bytes. `NodeId` was u32 after all.

Previously:
                       <------------- len1 -------------->
    tag_table_* <len1> tag_table_id 88 <nodeid in 8 bytes>
                       tag_table_val <len2> <actual data>
                                            <-- len2 --->

Now:
                      <--------------- len --------------->
    tag_table_* <len> U32 <nodeid in 4 bytes> <actual data>
Commit 36a09a1
-
metadata: Compact integer encoding.
Previously, every auto-serialized tag was strongly typed. This is not strictly required, however, and relaxing it makes it possible to provide the optimal encoding for smaller integers. This commit repurposes the `EsI8`/`EsU8` through `EsI64`/`EsU64` tags to represent *any* integer within the given range: it is now possible to encode `42u64` as the two bytes `EsU8 0x2a`, for example. There are some limitations:
* It does not apply to non-auto-serialized tags, for obvious reasons. Fortunately, we have already eliminated the biggest source of such tags in favor of auto-serialized tags: `tag_table_id`.
* Bigger tags cannot be used to represent smaller types.
* Signed tags and unsigned tags do not mix.
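The scheme and its second limitation can be sketched for the unsigned case as follows. The `IntTag` enum and the helper functions are hypothetical stand-ins for the `EsU8` through `EsU64` tags, and big-endian payloads are an assumption.

```rust
/// Stand-ins for the `EsU8`..`EsU64` tags (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum IntTag { U8, U16, U32, U64 }

/// Picks the smallest tag whose range contains `v`.
fn encode_u64(v: u64) -> (IntTag, Vec<u8>) {
    if v <= u8::MAX as u64 {
        (IntTag::U8, vec![v as u8])
    } else if v <= u16::MAX as u64 {
        (IntTag::U16, (v as u16).to_be_bytes().to_vec())
    } else if v <= u32::MAX as u64 {
        (IntTag::U32, (v as u32).to_be_bytes().to_vec())
    } else {
        (IntTag::U64, v.to_be_bytes().to_vec())
    }
}

/// A `u64` reader accepts every unsigned tag.
fn decode_u64(tag: IntTag, payload: &[u8]) -> Option<u64> {
    let width = match tag {
        IntTag::U8 => 1,
        IntTag::U16 => 2,
        IntTag::U32 => 4,
        IntTag::U64 => 8,
    };
    if payload.len() != width {
        return None;
    }
    Some(payload.iter().fold(0u64, |acc, &b| (acc << 8) | b as u64))
}

/// A `u16` reader rejects bigger tags: they cannot represent smaller types.
fn decode_u16(tag: IntTag, payload: &[u8]) -> Option<u16> {
    match tag {
        IntTag::U8 | IntTag::U16 => decode_u64(tag, payload).map(|v| v as u16),
        IntTag::U32 | IntTag::U64 => None,
    }
}
```

Here `encode_u64(42)` yields the `U8` tag with the single payload byte `0x2a`, matching the `EsU8 0x2a` example, while a `u16` reader refuses a value that arrives under the `U32` tag.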
Commit fe73d38
-
metadata: Bump the metadata encoding version.
We have changed the encoding enough to bump that. Also added some notes about metadata encoding to librbml/lib.rs.
Commit ef3c7af
-
metadata: Reordered integral tags in the ascending order.
Also clarified the mysterious `_next_int` method.
Commit 2008b54