Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend the implementation of size precomputation for binary codecs #69

Merged
merged 5 commits into from
Jun 14, 2021

Conversation

craigfe
Copy link
Member

@craigfe craigfe commented Jun 12, 2021

The implementation now makes a distinction between three states of knowledge about the size of a codec:

  • Static n: the encoding always occupies exactly n bytes;
  • Dynamic f: the encoding has variable size, pre-computed by f;
  • Unknown: the encoding size cannot be efficiently pre-computed.

This allows some of the work in determining the size of a codec to be computed at specialisation time, cutting down significantly on the allocation cost.

Additionally, the implementation is now also able to efficiently recover the size of an encoding given a pointer to the start of that encoding (i.e. without actually decoding the value). In this case, Dynamic f is a function that decodes the size of an encoding, and Unknown signifies that this cannot be done (e.g. for unboxed values).

Fixes #65. Depends on #68.

@craigfe
Copy link
Member Author

craigfe commented Jun 12, 2021

Performance change to the existing Repr.size_of is mostly within noise, but there are some wins due to not using an option monad:

/size_of/pair_string_int/
  minor-allocated (words)  10 -> 2

/size_of/record/
  minor-allocated (words)  34     -> 2
  monotonic-clock (ns)     38.627 -> 16.911

/size_of/tree/
  major-allocated (words)  73.6    -> 0
  minor-allocated (words)  327_648 -> 131_067
  monotonic-clock (ns)     591_883 -> 442_304

/size_of/triple_short_int/
  minor-allocated (words)  20 -> 2

craigfe added a commit to craigfe/irmin that referenced this pull request Jun 12, 2021
... using the improved `size_of` functions provided by the latest Repr
(mirage/repr#69).

The new implementation avoids decoding Inode / Node values from the read
buffer, which avoids needing to use the path dictionary and reduces the
number of allocations significantly. (As a result, certain corruptions
of the data file itself will no longer be noticed during index
reconstruction, but we have other integrity checks for those.)

Also exposes an `index_log_size` parameter to allow the user to override
the default. To compensate for using very large log sizes during
reconstruction, we now attempt a manual merge at the end of the process.
The implementation now makes a distinction between three states of
knowledge about the size of a codec:

- `Static n`: the encoding always occupies exactly `n` bytes;
- `Dynamic f`: the encoding has variable size, pre-computed by `f`;
- `Unknown`: the encoding size cannot be efficiently pre-computed.

This allows some of the work in determining the size of a codec to be
computed at specialisation time, cutting down significantly on the
allocation cost.

Additionally, the implementation is now also able to efficiently
_recover_ the size of an encoding given a pointer to the start of that
encoding (i.e. without actually decoding the value). In this case,
`Dynamic f` is a function that decodes the size of an encoding, and
`Unknown` signifies that this cannot be done (e.g. for unboxed values).

Fix mirage#65.
craigfe added a commit to craigfe/irmin that referenced this pull request Jun 12, 2021
... using the improved `size_of` functions provided by the latest Repr
(mirage/repr#69).

The new implementation avoids decoding Inode / Node values from the read
buffer, which avoids needing to use the path dictionary and reduces the
number of allocations significantly. (As a result, certain corruptions
of the data file itself will no longer be noticed during index
reconstruction, but we have other integrity checks for those.)

Also exposes an `index_log_size` parameter to allow the user to override
the default. To compensate for using very large log sizes during
reconstruction, we now attempt a manual merge at the end of the process.
Very similar to eb83002. The generic
pretty-printers are another example of an interpreter that has special
exceptions for "primitive" types, and introducing `Attribute` wrappers
around these types interfere with the heuristic.

In particular, this change caused Irmin tests to fail due to added
quotation marks around hashes when printed.
@craigfe
Copy link
Member Author

craigfe commented Jun 13, 2021

I've added f7d1a9a to this PR, which is a small fix necessary to get Irmin's CI to pass upstream here: mirage/irmin#1459.

craigfe added a commit to craigfe/irmin that referenced this pull request Jun 13, 2021
... using the improved `size_of` functions provided by the latest Repr
(mirage/repr#69).

The new implementation avoids decoding Inode / Node values from the read
buffer, which avoids needing to use the path dictionary and reduces the
number of allocations significantly. (As a result, certain corruptions
of the data file itself will no longer be noticed during index
reconstruction, but we have other integrity checks for those.)

Also exposes an `index_log_size` parameter to allow the user to override
the default. To compensate for using very large log sizes during
reconstruction, we now attempt a manual merge at the end of the process.
Copy link
Contributor

@Ngoguey42 Ngoguey42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great :)

test/repr/test_size_of.ml Outdated Show resolved Hide resolved
src/repr/binary_codec.ml Show resolved Hide resolved
src/repr/size.ml Show resolved Hide resolved
Comment on lines +110 to +115
let of_value =
let rec aux len n =
if n >= 0 && n < 128 then len else aux (len + 1) (n lsr 7)
in
fun n -> aux 1 n
in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the invariant of this encoding is that the number of necessary bytes is log_2(n) / 7 + 1 so we could write this function as

Suggested change
let of_value =
let rec aux len n =
if n >= 0 && n < 128 then len else aux (len + 1) (n lsr 7)
in
fun n -> aux 1 n
in
let of_value =
let l2 = log 2. in
let log2 n = int_of_float (ceil (log (float n) /. l2)) in
fun n -> log2 n / 7 + 1
in

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be surprised if the float operations (with the corresponding allocations) are faster than iterating up to 9 times on the integer, but maybe it's worth benchmarking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering about it. Do you want me to test it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're interested, go for it :) Bechamel should make any differences clear.

src/repr/binary_codec.ml Outdated Show resolved Hide resolved
src/repr/binary_codec.ml Outdated Show resolved Hide resolved
src/repr/binary_codec.ml Outdated Show resolved Hide resolved
@craigfe
Copy link
Member Author

craigfe commented Jun 14, 2021

Thanks all for the reviews.

@craigfe craigfe merged commit 30e4a2e into mirage:main Jun 14, 2021
craigfe added a commit to craigfe/irmin that referenced this pull request Jun 14, 2021
... using the improved `size_of` functions provided by the latest Repr
(mirage/repr#69).

The new implementation avoids decoding Inode / Node values from the read
buffer, which avoids needing to use the path dictionary and reduces the
number of allocations significantly. (As a result, certain corruptions
of the data file itself will no longer be noticed during index
reconstruction, but we have other integrity checks for those.)

Also exposes an `index_log_size` parameter to allow the user to override
the default. To compensate for using very large log sizes during
reconstruction, we now attempt a manual merge at the end of the process.
craigfe added a commit to craigfe/irmin that referenced this pull request Jun 14, 2021
... using the improved `size_of` functions provided by the latest Repr
(mirage/repr#69).

The new implementation avoids decoding Inode / Node values from the read
buffer, which avoids needing to use the path dictionary and reduces the
number of allocations significantly. (As a result, certain corruptions
of the data file itself will no longer be noticed during index
reconstruction, but we have other integrity checks for those.)

Also exposes an `index_log_size` parameter to allow the user to override
the default. To compensate for using very large log sizes during
reconstruction, we now attempt a manual merge at the end of the process.
craigfe added a commit to craigfe/opam-repository that referenced this pull request Jun 16, 2021
CHANGES:

- Add `Repr.{random,random_state}`, a pair of generic functions for sampling
  random instances of representable types. (mirage/repr#58, @craigfe)

- Add `Repr.Size`, which provides sizing functions for binary codecs that are
  more informative than the existing `Repr.size_of`. Types built using `Repr.v`
  and `Repr.like` must now pass a sizer built using `Repr.Size.custom_*`. (mirage/repr#69,
  @craigfe)
craigfe added a commit to craigfe/opam-repository that referenced this pull request Jun 16, 2021
CHANGES:

- Add `Repr.{random,random_state}`, a pair of generic functions for sampling
  random instances of representable types. (mirage/repr#58, @craigfe)

- Add `Repr.Size`, which provides sizing functions for binary codecs that are
  more informative than the existing `Repr.size_of`. Types built using `Repr.v`
  and `Repr.like` must now pass a sizer built using `Repr.Size.custom_*`. (mirage/repr#69,
  @craigfe)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support getting the size of an encoded value without decoding it
3 participants