Extend the implementation of size precomputation for binary codecs #69

craigfe · 2021-06-12T16:29:54Z

The implementation now makes a distinction between three states of knowledge about the size of a codec:

Static n: the encoding always occupies exactly n bytes;
Dynamic f: the encoding has variable size, pre-computed by f;
Unknown: the encoding size cannot be efficiently pre-computed.

This allows some of the work in determining the size of a codec to be computed at specialisation time, cutting down significantly on the allocation cost.

Additionally, the implementation is now also able to efficiently recover the size of an encoding given a pointer to the start of that encoding (i.e. without actually decoding the value). In this case, Dynamic f is a function that decodes the size of an encoding, and Unknown signifies that this cannot be done (e.g. for unboxed values).

Fixes #65. Depends on #68.

craigfe · 2021-06-12T16:50:18Z

Performance change to the existing Repr.size_of is mostly within noise, but there are some wins due to not using an option monad:

/size_of/pair_string_int/
  minor-allocated (words)  10 -> 2

/size_of/record/
  minor-allocated (words)  34     -> 2
  monotonic-clock (ns)     38.627 -> 16.911

/size_of/tree/
  major-allocated (words)  73.6    -> 0
  minor-allocated (words)  327_648 -> 131_067
  monotonic-clock (ns)     591_883 -> 442_304

/size_of/triple_short_int/
  minor-allocated (words)  20 -> 2

... using the improved `size_of` functions provided by the latest Repr (mirage/repr#69). The new implementation avoids decoding Inode / Node values from the read buffer, which avoids needing to use the path dictionary and reduces the number of allocations significantly. (As a result, certain corruptions of the data file itself will no longer be noticed during index reconstruction, but we have other integrity checks for those.) Also exposes an `index_log_size` parameter to allow the user to override the default. To compensate for using very large log sizes during reconstruction, we now attempt a manual merge at the end of the process.

The implementation now makes a distinction between three states of knowledge about the size of a codec: - `Static n`: the encoding always occupies exactly `n` bytes; - `Dynamic f`: the encoding has variable size, pre-computed by `f`; - `Unknown`: the encoding size cannot be efficiently pre-computed. This allows some of the work in determining the size of a codec to be computed at specialisation time, cutting down significantly on the allocation cost. Additionally, the implementation is now also able to efficiently _recover_ the size of an encoding given a pointer to the start of that encoding (i.e. without actually decoding the value). In this case, `Dynamic f` is a function that decodes the size of an encoding, and `Unknown` signifies that this cannot be done (e.g. for unboxed values). Fix mirage#65.

... using the improved `size_of` functions provided by the latest Repr (mirage/repr#69). The new implementation avoids decoding Inode / Node values from the read buffer, which avoids needing to use the path dictionary and reduces the number of allocations significantly. (As a result, certain corruptions of the data file itself will no longer be noticed during index reconstruction, but we have other integrity checks for those.) Also exposes an `index_log_size` parameter to allow the user to override the default. To compensate for using very large log sizes during reconstruction, we now attempt a manual merge at the end of the process.

Very similar to eb83002. The generic pretty-printers are another example of an interpreter that has special exceptions for "primitive" types, and introducing `Attribute` wrappers around these types interfere with the heuristic. In particular, this change caused Irmin tests to fail due to added quotation marks around hashes when printed.

craigfe · 2021-06-13T20:08:47Z

I've added f7d1a9a to this PR, which is a small fix necessary to get Irmin's CI to pass upstream here: mirage/irmin#1459.

... using the improved `size_of` functions provided by the latest Repr (mirage/repr#69). The new implementation avoids decoding Inode / Node values from the read buffer, which avoids needing to use the path dictionary and reduces the number of allocations significantly. (As a result, certain corruptions of the data file itself will no longer be noticed during index reconstruction, but we have other integrity checks for those.) Also exposes an `index_log_size` parameter to allow the user to override the default. To compensate for using very large log sizes during reconstruction, we now attempt a manual merge at the end of the process.

Ngoguey42

Great :)

test/repr/test_size_of.ml

src/repr/binary_codec.ml

src/repr/size.ml

mattiasdrp · 2021-06-14T13:52:53Z

src/repr/binary_codec.ml

+    let of_value =
+      let rec aux len n =
+        if n >= 0 && n < 128 then len else aux (len + 1) (n lsr 7)
+      in
+      fun n -> aux 1 n
+    in


I think that the invariant of this encoding is that the number of necessary bytes is log_2(n) / 7 + 1 so we could write this function as

Suggested change

let of_value =

let rec aux len n =

if n >= 0 && n < 128 then len else aux (len + 1) (n lsr 7)

in

fun n -> aux 1 n

in

let of_value =

let l2 = log 2. in

let log2 n = int_of_float (ceil (log (float n) /. l2)) in

fun n -> log2 n / 7 + 1

in

I'd be surprised if the float operations (with the corresponding allocations) are faster than iterating up to 9 times on the integer, but maybe it's worth benchmarking.

I was wondering about it. Do you want me to test it?

If you're interested, go for it :) Bechamel should make any differences clear.

src/repr/binary_codec.ml

@mattiasdrp

Spotted by @mattiasdrp; thanks!

craigfe · 2021-06-14T16:14:46Z

Thanks all for the reviews.

... using the improved `size_of` functions provided by the latest Repr (mirage/repr#69). The new implementation avoids decoding Inode / Node values from the read buffer, which avoids needing to use the path dictionary and reduces the number of allocations significantly. (As a result, certain corruptions of the data file itself will no longer be noticed during index reconstruction, but we have other integrity checks for those.) Also exposes an `index_log_size` parameter to allow the user to override the default. To compensate for using very large log sizes during reconstruction, we now attempt a manual merge at the end of the process.

@craigfe

CHANGES: - Add `Repr.{random,random_state}`, a pair of generic functions for sampling random instances of representable types. (mirage/repr#58, @craigfe) - Add `Repr.Size`, which provides sizing functions for binary codecs that are more informative than the existing `Repr.size_of`. Types built using `Repr.v` and `Repr.like` must now pass a sizer built using `Repr.Size.custom_*`. (mirage/repr#69, @craigfe)

@craigfe

CHANGES: - Add `Repr.{random,random_state}`, a pair of generic functions for sampling random instances of representable types. (mirage/repr#58, @craigfe) - Add `Repr.Size`, which provides sizing functions for binary codecs that are more informative than the existing `Repr.size_of`. Types built using `Repr.v` and `Repr.like` must now pass a sizer built using `Repr.Size.custom_*`. (mirage/repr#69, @craigfe)

craigfe force-pushed the refactor-size-of branch from e89e4ff to c39fb13 Compare June 12, 2021 16:30

craigfe mentioned this pull request Jun 12, 2021

irmin-pack: refactor Index reconstruction implementation mirage/irmin#1459

Merged

craigfe force-pushed the refactor-size-of branch from c39fb13 to 0668acf Compare June 12, 2021 23:17

Ngoguey42 approved these changes Jun 14, 2021

View reviewed changes

test/repr/test_size_of.ml Outdated Show resolved Hide resolved

mattiasdrp reviewed Jun 14, 2021

View reviewed changes

craigfe added 2 commits June 14, 2021 15:27

Strengthen tests of Static value sizers

fca0aa3

Fix bug in implementation of sizing of static-arg option types

dff2c7a

Spotted by @mattiasdrp; thanks!

craigfe force-pushed the refactor-size-of branch from 4244ba5 to dff2c7a Compare June 14, 2021 14:58

Minor comment clarification

6fbd021

craigfe merged commit 30e4a2e into mirage:main Jun 14, 2021

This was referenced Jun 16, 2021

[new release] repr-bench, repr, ppx_repr and repr-fuzz (0.4.0) ocaml/opam-repository#18875

Closed

[new release] repr-bench, repr, ppx_repr and repr-fuzz (0.4.0) ocaml/opam-repository#18876

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extend the implementation of size precomputation for binary codecs #69

Extend the implementation of size precomputation for binary codecs #69

craigfe commented Jun 12, 2021 •

edited

Loading

craigfe commented Jun 12, 2021

craigfe commented Jun 13, 2021

Ngoguey42 left a comment

mattiasdrp Jun 14, 2021

craigfe Jun 14, 2021

mattiasdrp Jun 14, 2021

craigfe Jun 14, 2021

craigfe commented Jun 14, 2021

Extend the implementation of size precomputation for binary codecs #69

Extend the implementation of size precomputation for binary codecs #69

Conversation

craigfe commented Jun 12, 2021 • edited Loading

craigfe commented Jun 12, 2021

craigfe commented Jun 13, 2021

Ngoguey42 left a comment

Choose a reason for hiding this comment

mattiasdrp Jun 14, 2021

Choose a reason for hiding this comment

craigfe Jun 14, 2021

Choose a reason for hiding this comment

mattiasdrp Jun 14, 2021

Choose a reason for hiding this comment

craigfe Jun 14, 2021

Choose a reason for hiding this comment

craigfe commented Jun 14, 2021

craigfe commented Jun 12, 2021 •

edited

Loading