-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend the implementation of size precomputation for binary codecs #69
Conversation
e89e4ff
to
c39fb13
Compare
Performance change to the existing
|
... using the improved `size_of` functions provided by the latest Repr (mirage/repr#69). The new implementation avoids decoding Inode / Node values from the read buffer, which avoids needing to use the path dictionary and reduces the number of allocations significantly. (As a result, certain corruptions of the data file itself will no longer be noticed during index reconstruction, but we have other integrity checks for those.) Also exposes an `index_log_size` parameter to allow the user to override the default. To compensate for using very large log sizes during reconstruction, we now attempt a manual merge at the end of the process.
The implementation now makes a distinction between three states of knowledge about the size of a codec: - `Static n`: the encoding always occupies exactly `n` bytes; - `Dynamic f`: the encoding has variable size, pre-computed by `f`; - `Unknown`: the encoding size cannot be efficiently pre-computed. This allows some of the work in determining the size of a codec to be computed at specialisation time, cutting down significantly on the allocation cost. Additionally, the implementation is now also able to efficiently _recover_ the size of an encoding given a pointer to the start of that encoding (i.e. without actually decoding the value). In this case, `Dynamic f` is a function that decodes the size of an encoding, and `Unknown` signifies that this cannot be done (e.g. for unboxed values). Fix mirage#65.
c39fb13
to
0668acf
Compare
... using the improved `size_of` functions provided by the latest Repr (mirage/repr#69). The new implementation avoids decoding Inode / Node values from the read buffer, which avoids needing to use the path dictionary and reduces the number of allocations significantly. (As a result, certain corruptions of the data file itself will no longer be noticed during index reconstruction, but we have other integrity checks for those.) Also exposes an `index_log_size` parameter to allow the user to override the default. To compensate for using very large log sizes during reconstruction, we now attempt a manual merge at the end of the process.
Very similar to eb83002. The generic pretty-printers are another example of an interpreter that has special exceptions for "primitive" types, and introducing `Attribute` wrappers around these types interfere with the heuristic. In particular, this change caused Irmin tests to fail due to added quotation marks around hashes when printed.
I've added f7d1a9a to this PR, which is a small fix necessary to get Irmin's CI to pass upstream here: mirage/irmin#1459. |
... using the improved `size_of` functions provided by the latest Repr (mirage/repr#69). The new implementation avoids decoding Inode / Node values from the read buffer, which avoids needing to use the path dictionary and reduces the number of allocations significantly. (As a result, certain corruptions of the data file itself will no longer be noticed during index reconstruction, but we have other integrity checks for those.) Also exposes an `index_log_size` parameter to allow the user to override the default. To compensate for using very large log sizes during reconstruction, we now attempt a manual merge at the end of the process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great :)
let of_value = | ||
let rec aux len n = | ||
if n >= 0 && n < 128 then len else aux (len + 1) (n lsr 7) | ||
in | ||
fun n -> aux 1 n | ||
in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the invariant of this encoding is that the number of necessary bytes is log_2(n) / 7 + 1
so we could write this function as
let of_value = | |
let rec aux len n = | |
if n >= 0 && n < 128 then len else aux (len + 1) (n lsr 7) | |
in | |
fun n -> aux 1 n | |
in | |
let of_value = | |
let l2 = log 2. in | |
let log2 n = int_of_float (ceil (log (float n) /. l2)) in | |
fun n -> log2 n / 7 + 1 | |
in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be surprised if the float operations (with the corresponding allocations) are faster than iterating up to 9 times on the integer, but maybe it's worth benchmarking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering about it. Do you want me to test it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're interested, go for it :) Bechamel should make any differences clear.
4244ba5
to
dff2c7a
Compare
Thanks all for the reviews. |
... using the improved `size_of` functions provided by the latest Repr (mirage/repr#69). The new implementation avoids decoding Inode / Node values from the read buffer, which avoids needing to use the path dictionary and reduces the number of allocations significantly. (As a result, certain corruptions of the data file itself will no longer be noticed during index reconstruction, but we have other integrity checks for those.) Also exposes an `index_log_size` parameter to allow the user to override the default. To compensate for using very large log sizes during reconstruction, we now attempt a manual merge at the end of the process.
... using the improved `size_of` functions provided by the latest Repr (mirage/repr#69). The new implementation avoids decoding Inode / Node values from the read buffer, which avoids needing to use the path dictionary and reduces the number of allocations significantly. (As a result, certain corruptions of the data file itself will no longer be noticed during index reconstruction, but we have other integrity checks for those.) Also exposes an `index_log_size` parameter to allow the user to override the default. To compensate for using very large log sizes during reconstruction, we now attempt a manual merge at the end of the process.
CHANGES: - Add `Repr.{random,random_state}`, a pair of generic functions for sampling random instances of representable types. (mirage/repr#58, @craigfe) - Add `Repr.Size`, which provides sizing functions for binary codecs that are more informative than the existing `Repr.size_of`. Types built using `Repr.v` and `Repr.like` must now pass a sizer built using `Repr.Size.custom_*`. (mirage/repr#69, @craigfe)
CHANGES: - Add `Repr.{random,random_state}`, a pair of generic functions for sampling random instances of representable types. (mirage/repr#58, @craigfe) - Add `Repr.Size`, which provides sizing functions for binary codecs that are more informative than the existing `Repr.size_of`. Types built using `Repr.v` and `Repr.like` must now pass a sizer built using `Repr.Size.custom_*`. (mirage/repr#69, @craigfe)
The implementation now makes a distinction between three states of knowledge about the size of a codec:
Static n
: the encoding always occupies exactlyn
bytes;Dynamic f
: the encoding has variable size, pre-computed byf
;Unknown
: the encoding size cannot be efficiently pre-computed.This allows some of the work in determining the size of a codec to be computed at specialisation time, cutting down significantly on the allocation cost.
Additionally, the implementation is now also able to efficiently recover the size of an encoding given a pointer to the start of that encoding (i.e. without actually decoding the value). In this case,
Dynamic f
is a function that decodes the size of an encoding, andUnknown
signifies that this cannot be done (e.g. for unboxed values).Fixes #65. Depends on #68.