The Grand Metadata Reform #22971

They are, with a conjunction of `start_tag` and `end_tag`, commonly used to write a document with a binary data of known size. However the use of `start_tag` makes the length always 4 bytes long, which is almost not optimal (requiring the relaxation step to remedy). Directly using `wr_tagged_*` methods is better for both readability and resulting metadata size.

EBML tags are encoded in a variable-length unsigned int (vuint), which is clever but causes some tags to be encoded in two bytes while there are really about 180 tags or so. Assuming that there wouldn't be, say, over 1,000 tags in the future, we can use much more efficient encoding scheme. The new scheme should support at most 4,096 tags anyway. This also flattens a scattered tag namespace (did you know that 0xa9 is followed by 0xb0?) and makes a room for autoserialized tags in 0x00 through 0x1f.

Many auto-serialization tags are fixed-size (note: many ordinary tags are also fixed-size but for now this commit ignores them), so having an explicit length is a waste. This moves any auto-serialization tags with an implicit length before other tags, so a test for them is easy. A preliminary experiment shows this has at least 1% gain over the status quo.

It doesn't serve any useful purpose. It *might* be useful when there are some tags that are generated by `Encodable` and not delimited by any tags, but IIUC it's not the case. Previous: <-------------------- len1 -------------------> EsEnum <len1> EsEnumVid <vid> EsEnumBody <len2> <arg1> <arg2> <--- len2 --> Now: <----------- len1 ----------> EsEnum <len1> EsEnumVid <vid> <arg1> <arg2>

For the reference, while it is designed to be selectively enabled, it was essentially enabled throughout every snapshot and nightly as far as I can tell. This makes the usefulness of `EsLabel` itself questionable, as it was quite rare that `EsLabel` broke the build. It had consumed about 20~30% of metadata (!) and so this should be a huge win.

They replace the existing `EsEnumVid`, `EsVecLen` and `EsMapLen` tags altogether; the meaning of them can be easily inferred from the enclosing tag. It also has an added benefit of encodings for smaller variant ids or lengths being more compact (5 bytes to 2 bytes).

We try to move the data when the length can be encoded in the much smaller number of bytes. This interferes with indices and type abbreviations however, so this commit introduces a public interface to get and mark a "stable" (i.e. not affected by relaxation) position of the current pointer. The relaxation logic only moves a small data, currently at most 256 bytes, as moving the data can be costly. There might be further opportunities to allow more relaxation by moving fields around, which I didn't seriously try.

So that `EsVec 82 EsSub8 00` becomes `EsVec 80` now.

This avoids a biggish eight-byte `tag_table_id` tag in favor of autoserialized integer tags, which are smaller and can be later used to encode them in the optimal number of bytes. `NodeId` was u32 after all. Previously: <------------- len1 --------------> tag_table_* <len1> tag_table_id 88 <nodeid in 8 bytes> tag_table_val <len2> <actual data> <-- len2 ---> Now: <--------------- len ---------------> tag_table_* <len> U32 <nodeid in 4 bytes> <actual data>

Previously every auto-serialized tags are strongly typed. However this is not strictly required, and instead it can be exploited to provide the optimal encoding for smaller integers. This commit repurposes `EsI8`/`EsU8` through `EsI64`/`EsU64` tags to represent *any* integers with given ranges: It is now possible to encode `42u64` as two bytes `EsU8 0x2a`, for example. There are some limitations: * It does not apply to non-auto-serialized tags for obvious reasons. Fortunately, we have already eliminated the biggest source of such tag in favor of auto-serialized tags: `tag_table_id`. * Bigger tags cannot be used to represent smaller types. * Signed tags and unsigned tags do not mix.

We have changed the encoding enough to bump that. Also added some notes about metadata encoding to librbml/lib.rs.

Also clarified the mysterious `_next_int` method.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The Grand Metadata Reform #22971

The Grand Metadata Reform #22971

Commits on Mar 3, 2015