Fix merge large changes performance #1652
Reviewed apply and apply_test (only). Not sure how it passes tests; I think there's a bug, which would mean we need more test coverage. OR (even more likely) I don't understand the code. Either is a good reason to pause reviewing and give you a chance to respond!
Sorry...
iterValue, iterRange := iter.Value()
if iterValue == nil {
	if iterRange.IsTombstone() {
		// internal error but no data lost: deletion requested of a
This is not counted, I assume on purpose because it's probably a bug? Maybe still count unjustified deletes just to be sure...
applyFromIter only returns additions because it runs when there is no iterator to compare with, so it should see only additions. I could add "deletions" to it, but I don't think they should be counted. @arielshaqed what do you think?
Dunno. I'd count it as a separate category maybe, that way if it starts showing up we have a chance of someone complaining.
pkg/graveler/committed/apply.go
Outdated
switch {
case bytes.Compare(diffRange.MaxKey, sourceRange.MinKey) < 0:
	// insert diff
	writer.WriteRange(*diffRange)
Need to check all Write* values :-/
Still need to check all Write* values.
pkg/graveler/committed/apply.go
Outdated
	addIntoDiffSummary(&ret, graveler.DiffTypeRemoved, int(diffRange.Count))
	haveSource = source.Next()
	haveDiffs = diffs.Next()
default:
👉 I don't understand: are these all the options? What about when sourceRange and diffRange have overlapping keys?
source        [min].....[max]
------------------------------------------------------------------------------------------------
diffs    [min...........max]
In case of overlapping keys, we call Next on both ranges in order to compare keys.
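To make the cases in this thread concrete, here is a minimal sketch of how two ranges can be classified by their key bounds alone. The keyRange type, the position constants and classify are hypothetical names for illustration; only the bytes.Compare check on MaxKey and MinKey comes from the hunk above.

```go
package main

import (
	"bytes"
	"fmt"
)

// keyRange is a hypothetical stand-in for a range: only its key bounds matter here.
type keyRange struct {
	MinKey, MaxKey []byte
}

type position int

const (
	diffBeforeSource position = iota // the whole diff range sorts before the source range
	diffAfterSource                  // the whole diff range sorts after the source range
	overlapping                      // the key spans intersect, so keys must be compared one by one
)

// classify reports where diff sits relative to source, treating bounds as inclusive.
func classify(diff, source keyRange) position {
	switch {
	case bytes.Compare(diff.MaxKey, source.MinKey) < 0:
		return diffBeforeSource
	case bytes.Compare(source.MaxKey, diff.MinKey) < 0:
		return diffAfterSource
	default:
		return overlapping
	}
}

func main() {
	source := keyRange{MinKey: []byte("f"), MaxKey: []byte("k")}
	fmt.Println(classify(keyRange{[]byte("a"), []byte("c")}, source) == diffBeforeSource)
	fmt.Println(classify(keyRange{[]byte("m"), []byte("z")}, source) == diffAfterSource)
	fmt.Println(classify(keyRange{[]byte("c"), []byte("h")}, source) == overlapping)
}
```

Only the first two outcomes allow a whole range to be handled at once; the third is the case asked about here, where both iterators have to step into their ranges.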
Codecov Report
@@            Coverage Diff             @@
##           master    #1652      +/-   ##
==========================================
+ Coverage   39.28%   39.76%   +0.48%
==========================================
  Files         167      171       +4
  Lines       13563    14000     +437
==========================================
+ Hits         5328     5567     +239
- Misses       7471     7652     +181
- Partials      764      781      +17
@arielshaqed Thanks,
Thanks! Nice work, but many details in a large 🐘 PR.
A bit hard to review because some comments appear to have no responses and also no changes. Specifically I am unsure why we poll on cancellation.
I have some (ok, many) places I did not understand, and would prefer to go over those before I read the tests more deeply.
continue
switch typ {
case graveler.DiffTypeAdded:
	// exists on source, but not on dest
Can you document DiffTypeAdded better? I would expect this to be "exists on dest, but not on source" -- here it's the reverse?! (Maybe add pretty pictures...?)
This also wasn't added as part of this PR. The reason is that we are holding a diff iterator that does a 2-way diff between destination and source. I could change it if that makes more sense, but I'd prefer not to do it as part of this PR. @arielshaqed what do you think?
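One way to picture the direction being discussed: if the diff describes what turns destination into source, then a key that exists only on source shows up as "added". The sketch below uses plain maps and hypothetical names (diffType, twoWayDiff); it is an illustration of that reading, not the diff iterator from this PR.

```go
package main

import (
	"fmt"
	"sort"
)

type diffType string

const (
	added   diffType = "added"   // exists on source, but not on dest
	removed diffType = "removed" // exists on dest, but not on source
	changed diffType = "changed" // exists on both with different values
)

// twoWayDiff lists the changes that turn dest into source.
func twoWayDiff(dest, source map[string]string) map[string]diffType {
	out := make(map[string]diffType)
	for k, sv := range source {
		dv, ok := dest[k]
		switch {
		case !ok:
			out[k] = added
		case dv != sv:
			out[k] = changed
		}
	}
	for k := range dest {
		if _, ok := source[k]; !ok {
			out[k] = removed
		}
	}
	return out
}

func main() {
	dest := map[string]string{"a": "1", "b": "2", "gone": "x"}
	source := map[string]string{"a": "1", "b": "3", "new": "y"}
	d := twoWayDiff(dest, source)
	keys := make([]string, 0, len(d))
	for k := range d {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, d[k])
	}
}
```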
pkg/graveler/committed/meta_range.go
Outdated
Next() bool
// NextRange skips the current range
// possible only if we are currently inside a range
// the next value could be Range or Value
Why?
We can only get a range when one of the ranges is completely before the other range.
In case of overlapping ranges we cannot return a range.
Great optimization!!
I have made changes according to the comments and added tests.
pkg/graveler/committed/diff_test.go
Outdated
actualDiffKeys = append(actualDiffKeys, string(it.Value().Key))
diffs = append(diffs, it.Value())
diff, rng := it.Value()
actualDiffRanges = append(actualDiffRanges, newExpectedRange(rng))
Looks like you're adding an "expected range" into an array of "actual range" - which means maybe the object type should simply be "range" or "testRange"?
Both are already in use, so I'm changing it to diffTestRange.
@guy-har, it is very hard for me to review a monolithic PR such as this one. I wrote in the previous round that:
Please note that I am trying to be very explicit about both these requests. The second remains open -- we poll on cancellation throughout the code, and I am not sure why. About the second: You marked most changes as resolved, but not all -- some just vanish into an "outdated" section. This makes it harder for me to track what came of each one. Sometimes the comment is marked "resolved" but I do not understand the resolution. For instance I did not understand how you resolved this comment. A short comment might help, particularly when the fix is nontrivial. In any case, apologies for an expected long round-trip time.
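For readers following along, "polling on cancellation" here presumably refers to a non-blocking select on ctx.Done() inside a loop, like the hunk further down. A minimal sketch of that pattern follows; processAll and runDemo are hypothetical names, and nothing here is taken from the PR's actual loop bodies.

```go
package main

import (
	"context"
	"fmt"
)

// processAll visits items in order and checks for cancellation before each one.
// It returns how many items were visited and the context error, if any.
func processAll(ctx context.Context, items []string, visit func(string)) (int, error) {
	count := 0
	for _, item := range items {
		select {
		case <-ctx.Done():
			// Stop early: the caller gave up or a deadline passed.
			return count, ctx.Err()
		default:
			// Not cancelled: fall through and do one unit of work.
		}
		visit(item)
		count++
	}
	return count, nil
}

// runDemo cancels the context while visiting the second item.
func runDemo() (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	return processAll(ctx, []string{"a", "b", "c"}, func(s string) {
		if s == "b" {
			cancel()
		}
	})
}

func main() {
	n, err := runDemo()
	fmt.Println(n, err)
}
```

The poll only matters when a single loop can run for a long time without otherwise touching the context; whether that holds for every loop in this PR is exactly the open question.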
@arielshaqed I marked them resolved after changing according to the comments; I will go over them again.
Thanks!
Nothing major (I hope), but must do these before pulling:
- Document merge iterator NextRange and explain diff iterator NextRange when there is no range.
- AFAICT one test uses deep.Equal on a struct with lowercase field names, so it does very little.
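The second point can be illustrated with the standard library alone. The sketch below assumes, as the remark implies, that the comparison helper skips unexported (lowercase) fields; equalExported is a hypothetical stand-in written with reflect, not the deep.Equal implementation, and testRange is an invented type.

```go
package main

import (
	"fmt"
	"reflect"
)

// testRange has only unexported fields, like the struct mentioned in the review.
type testRange struct {
	id    string
	count int
}

// exportedRange carries the same kind of data in an exported field.
type exportedRange struct {
	ID string
}

// equalExported compares two structs of the same type field by field,
// skipping unexported fields (assumed behaviour of the helper under discussion).
func equalExported(a, b interface{}) bool {
	va, vb := reflect.ValueOf(a), reflect.ValueOf(b)
	if va.Type() != vb.Type() || va.Kind() != reflect.Struct {
		return false
	}
	for i := 0; i < va.NumField(); i++ {
		if va.Type().Field(i).PkgPath != "" {
			continue // unexported field: never looked at
		}
		if !reflect.DeepEqual(va.Field(i).Interface(), vb.Field(i).Interface()) {
			return false
		}
	}
	return true
}

func main() {
	a, b := testRange{"r1", 3}, testRange{"r2", 99}
	fmt.Println(equalExported(a, b))     // no field is compared, so this cannot fail
	fmt.Println(reflect.DeepEqual(a, b)) // reflect.DeepEqual does look at unexported fields
	fmt.Println(equalExported(exportedRange{"r1"}, exportedRange{"r2"}))
}
```

If the real helper behaves this way, the usual remedies are exporting the fields of the test struct or switching the helper to a mode that compares unexported fields, if it offers one.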
func incrementDiffSummary(d *graveler.DiffSummary, typ graveler.DiffType) {
	addIntoDiffSummary(d, typ, 1)
type applier struct {
I don't understand why this is an object. The only state it holds is a diff summary, which is anyway implemented as a function, and the two have* members. It is created once, then its apply method is called -- with no parameters. AFAICT it is just a function that has been represented as an object. Having it as an object makes me have to worry about whether it does actually have important state. Calls to functions that modify have* could be modeled as re-assigning some function-local state variables. Alternatively, if there is a useful abstraction for "iterator-and-the-last-value-from-next", then let's implement that and put the XIterator and haveX into a single class named something like IteratorWithEnd.
If the goal is to have a nice struct with all the used parameters in it then just have a struct, no need to attach methods to it. But I would really question why passing a single parameter named a to a function is clearer than passing the named parameters to that function. E.g. as it stands, the type signature for addIntoDiffSummary doesn't indicate that it actually changes only a.summary, but does not advance any iterators or write ranges. Sure, the call a.addIntoDiffSummary(typ, n) is shorter than addIntoDiffSummary(summary, type, n), but I claim that is just because it gives less information, and forces me to go read the code.
Also nit: applier sounds like Go naming for an interface.
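A rough sketch of the "iterator-and-the-last-value-from-next" idea follows, to show what such a wrapper could look like. Everything in it is hypothetical: the two-method Iterator interface is much smaller than the one in this PR, and iteratorWithEnd, sliceIterator and drain are invented for the example.

```go
package main

import "fmt"

// Iterator is a deliberately tiny, hypothetical iterator interface.
type Iterator interface {
	Next() bool
	Value() string
}

// iteratorWithEnd pairs an iterator with the result of its last Next call,
// so callers cannot read a value after the iterator is exhausted.
type iteratorWithEnd struct {
	it   Iterator
	have bool
}

func newIteratorWithEnd(it Iterator) *iteratorWithEnd {
	return &iteratorWithEnd{it: it, have: it.Next()}
}

// Have reports whether a current value is available.
func (i *iteratorWithEnd) Have() bool { return i.have }

// Advance moves to the next value; it is a no-op once the end was reached.
func (i *iteratorWithEnd) Advance() {
	if i.have {
		i.have = i.it.Next()
	}
}

// Value returns the current value, or false when there is none.
func (i *iteratorWithEnd) Value() (string, bool) {
	if !i.have {
		return "", false
	}
	return i.it.Value(), true
}

// sliceIterator is a trivial Iterator over a slice, used only for the demo.
type sliceIterator struct {
	items []string
	next  int
	cur   string
}

func (s *sliceIterator) Next() bool {
	if s.next >= len(s.items) {
		return false
	}
	s.cur = s.items[s.next]
	s.next++
	return true
}

func (s *sliceIterator) Value() string { return s.cur }

// drain collects every remaining value; it is safe on an empty iterator.
func drain(i *iteratorWithEnd) []string {
	var out []string
	for i.Have() {
		v, _ := i.Value()
		out = append(out, v)
		i.Advance()
	}
	return out
}

func main() {
	fmt.Println(drain(newIteratorWithEnd(&sliceIterator{items: []string{"a", "b"}})))
	fmt.Println(len(drain(newIteratorWithEnd(&sliceIterator{}))))
}
```

With the flag kept next to the iterator, a helper such as drain no longer depends on the caller having checked a separate haveX variable first.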
// applyFromSource applies all changes from source to writer.
func applyFromSource(ctx context.Context, logger logging.Logger, writer MetaRangeWriter, source Iterator) error {
// applyAll applies all changes from Iterator to writer and returns the number of writes
func (a *applier) applyAll(iter Iterator) (int, error) {
E.g. because of the iterator/have separation above, I think this function is unsafe to call on iterator X unless haveX.
case <-ctx.Done():
	return 0, ctx.Err()
default:
func (a *applier) hasChanges(summary graveler.DiffSummary) bool {
This is a pure function not a method.
break
}
}
}
return source.Err()
return count, iter.Err()
E.g. after calling this method the caller has to increment the correct statistics field -- even though it sort-of has access to the statistics (it just doesn't know what type to increment).
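One possible reading of this remark, sketched under assumptions: if the function that writes also receives the summary and the diff type, the caller no longer has to remember which field to bump afterwards. The DiffSummary shape (a Count map keyed by DiffType) and this applyAll signature are guesses made for the illustration, not the types from the graveler package.

```go
package main

import (
	"errors"
	"fmt"
)

// DiffType and DiffSummary are simplified, assumed shapes.
type DiffType int

const (
	DiffTypeAdded DiffType = iota
	DiffTypeRemoved
)

type DiffSummary struct {
	Count map[DiffType]int
}

// addIntoDiffSummary only touches the summary it is given, which the signature makes visible.
func addIntoDiffSummary(d *DiffSummary, typ DiffType, n int) {
	if d.Count == nil {
		d.Count = make(map[DiffType]int)
	}
	d.Count[typ] += n
}

// applyAll writes every key and records each successful write under typ.
func applyAll(keys []string, write func(string) error, summary *DiffSummary, typ DiffType) error {
	for _, k := range keys {
		if err := write(k); err != nil {
			return err
		}
		addIntoDiffSummary(summary, typ, 1)
	}
	return nil
}

func main() {
	var s DiffSummary
	err := applyAll([]string{"a", "b", "c"}, func(k string) error {
		if k == "c" {
			return errors.New("write failed")
		}
		return nil
	}, &s, DiffTypeAdded)
	fmt.Println(s.Count[DiffTypeAdded], err)
}
```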
	return string(rng.ID)
}

func TestMergeRange(t *testing.T) {
Nice test! 😎
Thanks 😄
pkg/graveler/committed/meta_range.go
Outdated
// the next value could be Range or Value
NextRange() bool
// Value returns a nil ValueRecord and a Range before starting a Range, or a Value and that Range when inside a Range.
// In contrast to Iterator, the DiffIterator might not have a current range - this could happen if the current value exists in two different ranges
Can we explain what NextRange does in this case?
@arielshaqed, I changed the documentation of the DiffIterator, PTAL
Great pic!
LEBTM (looks even better to me), thanks.
// DiffIterator might contain ranges without headers
// for example:
//
// left  [min].R1.[max]                    [min].R3.[max]      [min]...............R5..............[max]
// ------------------------------------------------------------------------------------------------
// right                 [min].R2.[max]  [min.....R4....max]       [min].R6.[max]    [min].R7.[max]
//
// R1 - will return as diff with header
// R2 - will return as diff with header
// R3 and R4 - could not return a header because we must enter the ranges in order to get some header values (such as count)
// R5 and R6 - same as R3 and R4
// R7 - in case R5 has no values in the R7 range, R7 would return as a diff with header
Neat explanation, great drawing! I would maybe rephrase the tense to make it a bit shorter.
// DiffIterator may contain ranges without headers. For instance, consider the diff of
// these ranges:
//
// left  [min].R1.[max]                    [min].R3.[max]      [min]...............R5..............[max]
// ------------------------------------------------------------------------------------------------
// right                 [min].R2.[max]  [min.....R4....max]       [min].R6.[max]    [min].R7.[max]
//
// R1 - returned as a diff inside range R1.
// R2 - returned as a diff inside range R2.
// R3 and R4 - not in any existing range (and creating a new range would require consuming them).
// R5 and R6 - as above, not in any existing range.
// R7 - returned as a diff inside range R7 if R5 has no keys inside the range of R7.
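As a companion to the documented behaviour, here is a small sketch of how a consumer might walk such an iterator: a nil value with a range is treated as a header that can be counted and skipped as a whole, while a value without a range is counted on its own. The fakeDiffIterator, its entry layout and countDiffs are hypothetical; only the Value/Next/NextRange shape is modelled on the interface comments quoted in this thread.

```go
package main

import "fmt"

type valueRecord struct{ Key string }

type rangeInfo struct {
	ID    string
	Count int
}

// entry is one step of the fake iterator: val == nil marks a range header,
// rng == nil marks a value that belongs to no single range.
type entry struct {
	val *valueRecord
	rng *rangeInfo
}

type fakeDiffIterator struct {
	entries []entry
	pos     int
}

func newFakeDiffIterator(entries []entry) *fakeDiffIterator {
	return &fakeDiffIterator{entries: entries, pos: -1}
}

func (it *fakeDiffIterator) Next() bool {
	it.pos++
	return it.pos < len(it.entries)
}

// NextRange skips everything that belongs to the current range.
// It must only be called while the iterator is positioned inside a range.
func (it *fakeDiffIterator) NextRange() bool {
	cur := it.entries[it.pos].rng
	for it.pos < len(it.entries) && it.entries[it.pos].rng == cur {
		it.pos++
	}
	return it.pos < len(it.entries)
}

func (it *fakeDiffIterator) Value() (*valueRecord, *rangeInfo) {
	e := it.entries[it.pos]
	return e.val, e.rng
}

// countDiffs counts differing keys, taking whole ranges at once where a header exists.
func countDiffs(it *fakeDiffIterator) int {
	total := 0
	have := it.Next()
	for have {
		val, rng := it.Value()
		if val == nil {
			total += rng.Count // header: the whole range differs
			have = it.NextRange()
		} else {
			total++ // single key, with or without a range
			have = it.Next()
		}
	}
	return total
}

// sampleEntries mimics R1 (with header), two loose keys, then R7 (with header).
func sampleEntries() []entry {
	r1 := &rangeInfo{ID: "R1", Count: 3}
	r7 := &rangeInfo{ID: "R7", Count: 2}
	return []entry{
		{nil, r1}, {&valueRecord{"a"}, r1}, {&valueRecord{"b"}, r1}, {&valueRecord{"c"}, r1},
		{&valueRecord{"x"}, nil}, {&valueRecord{"y"}, nil},
		{nil, r7}, {&valueRecord{"p"}, r7}, {&valueRecord{"q"}, r7},
	}
}

func main() {
	fmt.Println(countDiffs(newFakeDiffIterator(sampleEntries())))
}
```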
Thanks 🙏