[persist] Thread chunks of data from consolidation to the batch builder #29577
Conversation
Force-pushed from 0ec24a0 to 1045da9
In case it's interesting, here is my thinking for the batch-building estimates:
The remaining issue is that consolidation may in some circumstances produce chunks that are too small (if, e.g., we don't have enough data downloaded yet).
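To make that concrete, here is a minimal, hypothetical sketch (not the persist-client code; `coalesce_chunks` and `target_part_bytes` are made-up names) of buffering undersized consolidation output until it reaches a target part size:

```rust
// Illustrative only: group small chunks from consolidation into parts that are
// at least `target_part_bytes`, so a run of tiny chunks doesn't become a run
// of tiny blobs.
fn coalesce_chunks(chunks: Vec<Vec<u8>>, target_part_bytes: usize) -> Vec<Vec<Vec<u8>>> {
    let mut parts = Vec::new();
    let mut current: Vec<Vec<u8>> = Vec::new();
    let mut current_bytes = 0;
    for chunk in chunks {
        current_bytes += chunk.len();
        current.push(chunk);
        // Flush once we've met the target; a single large chunk may overshoot it.
        if current_bytes >= target_part_bytes {
            parts.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
    }
    // Whatever is left over becomes a final, possibly undersized, part.
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}

fn main() {
    let chunks = vec![vec![0u8; 4], vec![0u8; 3], vec![0u8; 5], vec![0u8; 2]];
    let parts = coalesce_chunks(chunks, 8);
    assert_eq!(parts.len(), 2); // [4 + 3 + 5 bytes] and a trailing [2 bytes]
}
```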
Overall LGTM! Just a couple of thoughts
src/persist-client/src/iter.rs (outdated)
// Keep a running estimate of the size left in the budget, returning None after it's
// exhausted. Note that we can't use take_while here... it's important that we pass
// on every value we receive from the iterator.
I might misunderstand what this comment means, but I'm pretty sure we stop returning elements once the budget == 0? Here's an example from the Rust Playground.
Yes, that's correct! The idea of the second half of this comment is that it would be bad for us to pull an element from the consolidating iter and not forward it along, so we can't use take_while. (Which drops the first non-matching element.) I'll rephrase.
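As a self-contained illustration of that distinction (a toy, not the persist code; the chunk sizes and budget are made up), compare `take_while` with a loop that stops pulling instead of dropping:

```rust
fn main() {
    // Pretend these are the byte sizes of chunks handed back by consolidation.
    let chunks = vec![4_usize, 3, 5, 2];
    let budget = 10_usize;

    // With `take_while`, the element for which the predicate returns false has
    // already been pulled from the underlying iterator, but it is dropped
    // rather than yielded.
    let mut source = chunks.iter().copied();
    let mut remaining = budget;
    let kept: Vec<_> = source
        .by_ref()
        .take_while(|size| {
            let had_budget = remaining > 0;
            remaining = remaining.saturating_sub(*size);
            had_budget
        })
        .collect();
    assert_eq!(kept, vec![4, 3, 5]);
    // The size-2 chunk was consumed while evaluating the predicate and is gone.
    assert_eq!(source.next(), None);

    // A pass-through version: stop *pulling* once the budget is exhausted, so
    // every chunk we do pull is forwarded, even the one that uses up the budget.
    let mut source = chunks.iter().copied();
    let mut remaining = budget;
    let mut forwarded = Vec::new();
    while remaining > 0 {
        let Some(size) = source.next() else { break };
        forwarded.push(size);
        remaining = remaining.saturating_sub(size);
    }
    assert_eq!(forwarded, vec![4, 3, 5]);
    // The size-2 chunk was never pulled, so nothing is dropped on the floor.
    assert_eq!(source.next(), Some(2));
}
```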
// This is a noop if there are no such updates.
// TODO: consider moving the individual updates to BatchBuilder?
let previous = self.buffer.drain();
self.flush_part(previous).await;
What's the rationale for always flushing the previous parts? It seems like it might be nice to flush the current updates with the previous into a single blob?
What's the rationale for always flushing the previous parts?
Simplest thing to do, basically... we only hit this case if someone was mixing the two types of calls, which never happens in practice.
It seems like it might be nice to flush the current updates with the previous into a single blob?
There's some risk that it outputs a blob that's larger than our target size. (So we could make it conditional, but for the above reason I'm not inclined to put much effort in!)
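For what it's worth, the conditional version hinted at above might look roughly like this sketch (purely illustrative; `merge_with_buffered` and the byte counts are made-up names), merging only when the combined blob stays under the target size:

```rust
/// Illustrative only: decide whether buffered updates can be folded into the
/// incoming flush without producing a blob larger than the target size.
fn merge_with_buffered(buffered_bytes: usize, incoming_bytes: usize, target_bytes: usize) -> bool {
    buffered_bytes + incoming_bytes <= target_bytes
}

fn main() {
    // Under the target: one combined blob is fine.
    assert!(merge_with_buffered(3 << 20, 4 << 20, 8 << 20));
    // Over the target: flush the buffered updates as their own part instead.
    assert!(!merge_with_buffered(6 << 20, 4 << 20, 8 << 20));
}
```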
Risk Summary: This pull request carries a high risk score of 82, driven by predictors such as the average age of files, cognitive complexity within files, and the delta of executable lines. Historically, PRs with these predictors are 102% more likely to cause a bug than the repository baseline. Notably, the observed bug trend in the repository is decreasing. Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.
Thanks all! Think I've addressed all the blockers, but if anything doesn't look right I'm happy to take it in a followup.
This is a vestige from back when we had two sets of schemas in compaction, which was weird and is long gone. Enforce that all the data in a batch has the same schema.
Following the analogous method on columnar records.
If this is more than a small fraction of bytes, it's worth tuning consolidation to return larger chunks of output.
…across_restarts To prevent stack overflows
Ran into a test failure after rebasing on main, and I think I've figured it out. Dennis's fix addressed it, but that fix was reverted by @jkosh44 in #29593. I don't see a rationale in the PR description - Joe, was that intentional, and/or would it be okay to unrevert? I've re-applied Dennis's fix on this branch, and it does seem to get the test passing again.
I think I understand what happened.
So, no, it was not intentional. Please unrevert bf8dcd9.
Ah, that'll do it! Thanks for chasing that down.
Sorry about that! I had to mix that in to get cargo test green.
No need to apologize -- far from it! I was in fact very happy to find out that someone had already figured out a fix for the weird test failure I was hitting...
The upshot is that it should now be possible to go from input to output in compaction without converting from codec-encoded data to structured data or vice-versa; the process can work in terms of structured data from top to bottom.
Motivation
Subtask for MaterializeInc/database-issues#7411.
Tips for reviewer
Chunking up compaction output is a little tricky, and I will probably want to make some changes in a followup. It's all behind a flag for now of course!
Checklist
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.