The Grand Metadata Reform #22971

Merged
merged 12 commits into from
Mar 3, 2015

Commits on Mar 3, 2015

  1. metadata: Avoid the use of raw wr_str or write_all.

    Combined with `start_tag` and `end_tag`, they are commonly used
    to write a document containing binary data of known size. However,
    `start_tag` always makes the length 4 bytes long, which is almost
    never optimal (and requires the relaxation step to remedy).
    Using the `wr_tagged_*` methods directly is better for both
    readability and the resulting metadata size.
    lifthrasiir committed Mar 3, 2015
    ac20ded
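To make the size difference concrete, here is a minimal sketch of EBML-style variable-length lengths next to the fixed four-byte form. It assumes the usual EBML layout, where the position of the first set bit marks the width; the `82`, `80` and `88` length bytes quoted in later commit messages are consistent with the one-byte form `0x80 | n`. The function names are hypothetical and are not the librbml API.

```rust
/// Encode `n` in the shortest EBML-style vuint form (sketch, assumed layout).
/// Returns `None` if `n` does not fit in the four-byte form.
fn encode_vuint(n: u32) -> Option<Vec<u8>> {
    if n < 0x7f {
        // One byte: 1xxxxxxx. 0xff is left unused here, as in EBML.
        Some(vec![0x80 | n as u8])
    } else if n < 0x4000 {
        // Two bytes: 01xxxxxx xxxxxxxx.
        Some(vec![0x40 | (n >> 8) as u8, n as u8])
    } else if n < 0x20_0000 {
        // Three bytes: 001xxxxx and two more bytes.
        Some(vec![0x20 | (n >> 16) as u8, (n >> 8) as u8, n as u8])
    } else {
        encode_vuint_fixed4(n).map(|b| b.to_vec())
    }
}

/// Always use the four-byte form (0001xxxx and three more bytes),
/// which is what a length placeholder written up front has to reserve.
fn encode_vuint_fixed4(n: u32) -> Option<[u8; 4]> {
    if n < 0x1000_0000 {
        Some([0x10 | (n >> 24) as u8, (n >> 16) as u8, (n >> 8) as u8, n as u8])
    } else {
        None
    }
}
```

For a two-byte payload the shortest form needs one length byte (`82`), while the fixed form spends four (`10 00 00 02`). When the size is known before writing, the short form can be emitted directly.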
  2. metadata: New tag encoding scheme.

    EBML tags are encoded as a variable-length unsigned int (vuint),
    which is clever but causes some tags to be encoded in two bytes
    even though there are only about 180 tags. Assuming that there
    won't be, say, over 1,000 tags in the future, we can use a much
    more efficient encoding scheme. The new scheme supports at most
    4,096 tags anyway.
    
    This also flattens a scattered tag namespace (did you know that
    0xa9 is followed by 0xb0?) and makes room for autoserialized tags
    in 0x00 through 0x1f.
    lifthrasiir committed Mar 3, 2015
    38a965a
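One way to fit 4,096 tags into at most two bytes is sketched below. The commit message states only the 4,096 limit and the range reserved for autoserialized tags. The split used here (a single literal byte below `0xf0`, otherwise a `0xfX` lead byte carrying the top four bits) is an assumption for illustration, and `encode_tag`/`decode_tag` are hypothetical names.

```rust
/// Encode a tag number in one or two bytes (sketch under an assumed split).
fn encode_tag(tag: u16) -> Option<Vec<u8>> {
    if tag < 0xf0 {
        // Small tags are a single literal byte.
        Some(vec![tag as u8])
    } else if tag < 0x1000 {
        // Larger tags: 0xf0 | top four bits, then the low eight bits.
        Some(vec![0xf0 | (tag >> 8) as u8, tag as u8])
    } else {
        // Beyond the 4,096-tag limit.
        None
    }
}

/// Decode a tag from the front of `bytes`, returning (tag, bytes consumed).
fn decode_tag(bytes: &[u8]) -> Option<(u16, usize)> {
    let first = *bytes.first()?;
    if first < 0xf0 {
        Some((first as u16, 1))
    } else {
        let second = *bytes.get(1)?;
        Some(((((first & 0x0f) as u16) << 8) | second as u16, 2))
    }
}
```

With this split every tag below `0xf0` costs one byte, so a namespace of roughly 180 tags never needs the two-byte form.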
  3. metadata: Introduce implicit lengths for auto-serialization.

    Many auto-serialization tags are fixed-size (note: many ordinary
    tags are also fixed-size but for now this commit ignores them),
    so having an explicit length is a waste. This moves any
    auto-serialization tags with an implicit length before other tags,
    so a test for them is easy. A preliminary experiment shows this
    has at least 1% gain over the status quo.
    lifthrasiir committed Mar 3, 2015
    c9840b6
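The idea of implicit lengths can be sketched as a reader that first checks whether a tag falls in the fixed-size range and only parses a length otherwise. The tag numbers and the `IMPLICIT_LENS` table below are hypothetical, and only the one-byte length form is handled to keep the sketch short.

```rust
/// Hypothetical numbering: tags 0..=3 carry payloads of 1, 2, 4 and 8 bytes.
/// Keeping them first makes the "is the length implicit?" test one lookup.
const IMPLICIT_LENS: [usize; 4] = [1, 2, 4, 8];

fn implicit_len(tag: u8) -> Option<usize> {
    IMPLICIT_LENS.get(tag as usize).copied()
}

/// Split one element off the front of `buf`, returning (tag, payload, rest).
fn read_element(buf: &[u8]) -> Option<(u8, &[u8], &[u8])> {
    let (&tag, rest) = buf.split_first()?;
    let (len, rest) = match implicit_len(tag) {
        // Fixed-size tag: no length byte is stored at all.
        Some(len) => (len, rest),
        // Other tags: read an explicit length (one-byte form `0x80 | n` only).
        None => {
            let (&b, rest) = rest.split_first()?;
            if (b & 0x80) == 0 {
                return None;
            }
            ((b & 0x7f) as usize, rest)
        }
    };
    if rest.len() < len {
        return None;
    }
    let (payload, rest) = rest.split_at(len);
    Some((tag, payload, rest))
}
```

Each fixed-size element saves its length byte, which adds up because auto-serialized integers are very common in metadata.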
  4. metadata: Eliminate the EsEnumBody tag.

    It doesn't serve any useful purpose. It *might* be useful when
    there are some tags that are generated by `Encodable` and
    not delimited by any tags, but IIUC it's not the case.
    
    Previous:
    
                      <-------------------- len1 ------------------->
        EsEnum <len1> EsEnumVid <vid> EsEnumBody <len2> <arg1> <arg2>
                                                        <--- len2 -->
    
    Now:
    
                      <----------- len1 ---------->
        EsEnum <len1> EsEnumVid <vid> <arg1> <arg2>
    lifthrasiir committed Mar 3, 2015
    2f3aa0d
  5. metadata: Bye bye EsLabel. No regrets.

    For reference, while it was designed to be selectively enabled,
    it was essentially enabled in every snapshot and nightly
    as far as I can tell. This makes the usefulness of `EsLabel` itself
    questionable, as it was quite rare that `EsLabel` broke the build.
    It consumed about 20-30% of metadata (!), so this should be
    a huge win.
    lifthrasiir committed Mar 3, 2015
    35c798b
  6. metadata: Introduce EsSub8 and EsSub32 tags.

    They replace the existing `EsEnumVid`, `EsVecLen` and `EsMapLen`
    tags altogether; their meaning can be easily inferred
    from the enclosing tag. An added benefit is that smaller
    variant ids and lengths are encoded more compactly
    (2 bytes instead of 5).
    lifthrasiir committed Mar 3, 2015
    de00b85
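The 5-to-2-byte saving can be sketched as follows: a value that fits in one byte gets the short tag, and anything else falls back to the 32-bit tag. The tag numbers are hypothetical, and big-endian byte order is an assumption of this sketch.

```rust
// Hypothetical tag numbers, chosen only for this sketch.
const ES_SUB8: u8 = 0x0e;
const ES_SUB32: u8 = 0x0f;

/// Encode a variant id or a length with the smallest of the two sub tags.
fn encode_sub(n: u32) -> Vec<u8> {
    if n <= u8::MAX as u32 {
        // One tag byte plus one payload byte.
        vec![ES_SUB8, n as u8]
    } else {
        // One tag byte plus four payload bytes (big-endian assumed).
        let mut out = vec![ES_SUB32];
        out.extend_from_slice(&n.to_be_bytes());
        out
    }
}
```

Most enums have far fewer than 256 variants and most vectors are short, so the two-byte form is the common case.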
  7. metadata: Implement relaxation of short RBML lengths.

    We try to move the data when the length can be encoded in
    a much smaller number of bytes. This interferes with indices and
    type abbreviations, however, so this commit introduces a public
    interface to get and mark a "stable" (i.e. not affected by
    relaxation) position of the current pointer.
    
    The relaxation logic only moves small data, currently at most
    256 bytes, as moving data can be costly. There might be
    further opportunities to allow more relaxation by moving fields
    around, which I didn't seriously try.
    lifthrasiir committed Mar 3, 2015
    84e9a61
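A minimal sketch of the relaxation step on an in-memory buffer, assuming EBML-style length bytes and a four-byte placeholder written when the tag is opened. The `start_tag`/`end_tag` names echo the commit messages, but these signatures and the `RELAX_MAX_SIZE` constant are hypothetical, not the actual librbml interface.

```rust
/// Payloads larger than this are never moved ("currently at most 256 bytes").
const RELAX_MAX_SIZE: usize = 0x100;

/// Write the tag and reserve four bytes for the length; returns their position.
fn start_tag(buf: &mut Vec<u8>, tag: u8) -> usize {
    buf.push(tag);
    let len_pos = buf.len();
    buf.extend_from_slice(&[0; 4]);
    len_pos
}

/// Fill in the length. Small payloads are shifted back so that the length
/// takes one or two bytes instead of four.
fn end_tag(buf: &mut Vec<u8>, len_pos: usize) {
    let start = len_pos + 4;
    let size = buf.len() - start;
    assert!(size < 0x1000_0000, "payload too large for a four-byte length");
    let width: usize = if size > RELAX_MAX_SIZE {
        4 // too costly to move: keep the placeholder width
    } else if size < 0x7f {
        1
    } else {
        2 // every size up to 256 fits in the two-byte form
    };
    // The marker bit sits just above the 7 * width value bits.
    let marked = (size as u32) | (1u32 << (7 * width));
    let bytes = marked.to_be_bytes();
    buf[len_pos..len_pos + width].copy_from_slice(&bytes[4 - width..]);
    if width < 4 {
        // Move the payload back over the unused placeholder bytes.
        buf.copy_within(start.., len_pos + width);
        buf.truncate(len_pos + width + size);
    }
}
```

Any offset recorded inside the payload before `end_tag` runs is off by up to three bytes afterwards, which is why positions used by indices have to be taken at points that relaxation cannot shift.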
  8. metadata: Space-optimize empty vectors and maps.

    So that `EsVec 82 EsSub8 00` becomes `EsVec 80` now.
    lifthrasiir committed Mar 3, 2015
    7b6e43c
  9. metadata: Flatten tag_table_id and tag_table_val tags.

    This avoids a biggish eight-byte `tag_table_id` tag in favor of
    autoserialized integer tags, which are smaller and can later be
    used to encode node ids in the optimal number of bytes. `NodeId` was
    u32 after all.
    
    Previously:
    
                           <------------- len1 -------------->
        tag_table_* <len1> tag_table_id 88 <nodeid in 8 bytes>
                           tag_table_val <len2> <actual data>
                                                <-- len2 --->
    
    Now:
    
                          <--------------- len --------------->
        tag_table_* <len> U32 <nodeid in 4 bytes> <actual data>
    lifthrasiir committed Mar 3, 2015
    36a09a1
  10. metadata: Compact integer encoding.

    Previously, every auto-serialized tag was strongly typed. However,
    this is not strictly required, and relaxing it allows
    the optimal encoding for smaller integers. This commit
    repurposes the `EsI8`/`EsU8` through `EsI64`/`EsU64` tags to represent
    *any* integer within the given range: it is now possible to encode
    `42u64` as the two bytes `EsU8 0x2a`, for example.
    
    There are some limitations:
    
    * It does not apply to non-auto-serialized tags for obvious reasons.
      Fortunately, we have already eliminated the biggest source of
      such tags in favor of auto-serialized tags: `tag_table_id`.
    * Bigger tags cannot be used to represent smaller types.
    * Signed tags and unsigned tags do not mix.
    lifthrasiir committed Mar 3, 2015
    fe73d38
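The compact integer encoding and the "bigger tags cannot represent smaller types" limitation can be sketched for the unsigned case as below. `UintTag` stands in for the `EsU8` through `EsU64` tags; the enum, the function names and the big-endian payload order are assumptions of this sketch.

```rust
/// Stand-ins for the `EsU8` .. `EsU64` tags, ordered by width.
#[derive(Clone, Copy, Debug, PartialEq, PartialOrd)]
enum UintTag {
    U8,
    U16,
    U32,
    U64,
}

impl UintTag {
    /// Payload width in bytes.
    fn width(self) -> usize {
        match self {
            UintTag::U8 => 1,
            UintTag::U16 => 2,
            UintTag::U32 => 4,
            UintTag::U64 => 8,
        }
    }
}

/// Pick the narrowest tag whose range contains `v`, whatever its static type.
fn encode_uint(v: u64) -> (UintTag, Vec<u8>) {
    let tag = if v <= u8::MAX as u64 {
        UintTag::U8
    } else if v <= u16::MAX as u64 {
        UintTag::U16
    } else if v <= u32::MAX as u64 {
        UintTag::U32
    } else {
        UintTag::U64
    };
    let bytes = v.to_be_bytes();
    (tag, bytes[8 - tag.width()..].to_vec())
}

/// Decode a value for a reader whose type is at most `widest` wide.
/// A tag wider than the expected type is rejected.
fn decode_uint(tag: UintTag, payload: &[u8], widest: UintTag) -> Option<u64> {
    if tag > widest || payload.len() != tag.width() {
        return None;
    }
    Some(payload.iter().fold(0u64, |acc, &b| (acc << 8) | b as u64))
}
```

A reader expecting a `u64` accepts any of the four tags, while a reader expecting a `u16` accepts only the first two. Signed tags would need a parallel set of functions, since the two families do not mix.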
  11. metadata: Bump the metadata encoding version.

    We have changed the encoding enough to bump that.
    Also added some notes about metadata encoding to librbml/lib.rs.
    lifthrasiir committed Mar 3, 2015
    ef3c7af
  12. metadata: Reordered integral tags in the ascending order.

    Also clarified the mysterious `_next_int` method.
    lifthrasiir committed Mar 3, 2015
    2008b54