-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor commit encoding / decoding #495
Conversation
Use `sats::{BufReader, BufWriter}` for decoding / encoding of `Commit` and associated types. This makes `decode` fallible (which is quite desirable, instead of panicking). As the `DecodeError` from sats is fairly sparse, also add some context about where exactly decoding failed. Lastly, add some documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great.
I left a few minor sorts of comments about comments and names.
I'd also like to verify this isn't a performance regression. I don't think this code is a bottleneck for us anymore, but it sure used to be.
Is the restart smoke test failure related?
|
||
pub fn datakey() -> impl Strategy<Value = DataKey> { | ||
prop_oneof![ | ||
prop::collection::vec(any::<u8>(), 0..31).prop_map(DataKey::from_data), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd comment on the reason for the two ranges.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub struct Write { | ||
pub operation: Operation, | ||
pub set_id: u32, | ||
pub set_id: u32, // aka table id | ||
#[cfg_attr(test, proptest(strategy = "arbitrary::datakey()"))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also comment about directing this to a hand-written strategy, or in general anywhere that the derive macro doesn't do the right thing on its own.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it fine to have those on the mod arbitrary
intentionally kept close the to type definition? See ea62cf8
#[cfg(test)] | ||
use proptest_derive::Arbitrary; | ||
|
||
/// A commit is one record in the write-ahead log. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A drive by thought: similarly to how we have a bunch of code that says set_id
but commented "this is the table id"
, I would love to settle on saying "write ahead log" everywhere, including data type names etc. Or some other name. But let's pick one!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd love to settle on WAL / write ahead log, maybe even collapsing CommitLog
and MessageLog
into a single type. Possibly not for this PR, though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'd love to have us try to take things that way. I agree on the naming, I think the commit log and message log names are implementation details and that "write ahead log" is the name that reflects the semantics the datastore needs to care about. But yeah not for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just adding my two cents here. I agree on coalescing on the name. I avoided WAL because historically that's referred to a log that is eventually compressed/deleted, but I think on balance WAL is probably the best name. We can include a comment about the fact that the WAL should never be deleted.
We should also note that it will include information which is not normally in a WAL, including reducer call event info and that it forms a Merkle DAG between commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The term MessageLog
was borrowed from Kafka.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted to include all of these in the upcoming formalization / design changes doc.
Hm, would this entail landing a separate patch with a benchmark before this, so we can track the difference? |
I don't think so. I meant regression at the level of in benchmarks / game performance, not in any new functionality or anything. Like if worldgen goes from N seconds to 1.1N seconds, I'd want to look at this again, but if it stays at N seconds, send it. |
benchmarks please |
Let's try this :) |
Benchmark resultsBenchmark ReportLegend:
All throughputs are single-threaded. Empty transaction
Single-row insertions
Multi-row insertions
Full table iterate
Find unique key
Filter
Serialize
Module: invoke with large arguments
Module: print bulk
Remaining benchmarks
|
Looks like it was a flake. |
let mut count = 0; | ||
|
||
if self.parent_commit_hash.is_none() { | ||
count += 1; | ||
} else { | ||
count += 1; | ||
count += self.parent_commit_hash.unwrap().data.len(); | ||
count += 1; // tag for option |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: collapse these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// [`DataKey`] is defined in `lib`, so we can't have an [`Arbitrary`] impl | ||
// for it just yet due to orphan rules. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'd do in this situation is to move the impl to that crate and expose the impl under a proptest
feature that can be enabled in dev-dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I tried to keep changes local for this one, but happy to send a follow-up defining Arbitrary
for Hash
and DataKey
. Perhaps that'd even increase probability for folks to write property tests ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds great 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a bug in encoded_len
for Write
.
// 1 for flags, 4 for set_id | ||
let mut count = 1 + 4; | ||
let mut count = self.operation.encoded_len(); | ||
count += 4; // set_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems wrong. Shouldn't it be 5 based on the operation encoding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s operation.len + 4 = 1+4 = 5, like before no?
@@ -91,21 +131,36 @@ impl Commit { | |||
count | |||
} | |||
|
|||
pub fn encode(&self, bytes: &mut Vec<u8>) { | |||
bytes.reserve(self.encoded_len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little unfortunate that we're losing this optimization, but maybe we can improve things by not using a Vec at all but a buffer pool or something in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I meant to make sure the caller takes care of this, will add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also comment on the scope of proptest_config.
It was not a bug, and I added a test to assure it's not.
* core: Refactor commit encoding / decoding Use `sats::{BufReader, BufWriter}` for decoding / encoding of `Commit` and associated types. This makes `decode` fallible (which is quite desirable, instead of panicking). As the `DecodeError` from sats is fairly sparse, also add some context about where exactly decoding failed. Lastly, add some documentation and (property) tests.
* core: Refactor commit encoding / decoding Use `sats::{BufReader, BufWriter}` for decoding / encoding of `Commit` and associated types. This makes `decode` fallible (which is quite desirable, instead of panicking). As the `DecodeError` from sats is fairly sparse, also add some context about where exactly decoding failed. Lastly, add some documentation and (property) tests.
Description of Changes
Use
sats::{BufReader, BufWriter}
for decoding / encoding ofCommit
and associated types. This makes
decode
fallible (which is quitedesirable, instead of panicking).
As the
DecodeError
from sats is fairly sparse, also add some contextabout where exactly decoding failed.
Also add documentation and property test.
API and ABI
If the API is breaking, please state below what will break
Expected complexity level and risk
1