Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix merge large changes performance #1652

Merged
merged 8 commits into from
Mar 31, 2021
Merged

Fix merge large changes performance #1652

merged 8 commits into from
Mar 31, 2021

Conversation

guy-har
Copy link
Contributor

@guy-har guy-har commented Mar 18, 2021

No description provided.

Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed apply and apply_test (only). Not sure how it passes tests; I think there's a bug, which would mean we need more test coverage. OR (even more likely) I don't understand the code. Either is a good reason to pause reviewing and give you a chance to respond!
Sorry...

iterValue, iterRange := iter.Value()
if iterValue == nil {
if iterRange.IsTombstone() {
// internal error but no data lost: deletion requested of a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not counted, I assume on purpose because it's a probably bug? Maybe still count unjustified deletes just to be sure...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applyFromIter only returns addition because it runs at when there is no iterator to compare with, so it should have only additions. I could add "deletions" to it, but I don't think It should be counted. @arielshaqed what do you think ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dunno. I'd count it as a separate category maybe, that way if it starts showing up we have a chance of someone complaining.

switch {
case bytes.Compare(diffRange.MaxKey, sourceRange.MinKey) < 0:
// insert diff
writer.WriteRange(*diffRange)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to check all Write* values :-/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to check all Write* values.

addIntoDiffSummary(&ret, graveler.DiffTypeRemoved, int(diffRange.Count))
haveSource = source.Next()
haveDiffs = diffs.Next()
default:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👉 I don't understand: are these all the options? What about when sourceRange and diffRange have overlapping keys?

source               [min].....[max]
        ------------------------------------------------------------------------------------------------
diffs       [min...........max]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of overlapping keys, we call next on both ranges In order to compare keys.

@codecov-io
Copy link

codecov-io commented Mar 21, 2021

Codecov Report

Merging #1652 (9ba730e) into master (35e5261) will increase coverage by 0.48%.
The diff coverage is 73.63%.

❗ Current head 9ba730e differs from pull request most recent head dda29e5. Consider uploading reports for the commit dda29e5 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1652      +/-   ##
==========================================
+ Coverage   39.28%   39.76%   +0.48%     
==========================================
  Files         167      171       +4     
  Lines       13563    14000     +437     
==========================================
+ Hits         5328     5567     +239     
- Misses       7471     7652     +181     
- Partials      764      781      +17     
Impacted Files Coverage Δ
pkg/graveler/committed/manager.go 24.13% <18.18%> (+24.13%) ⬆️
pkg/graveler/committed/diff_values.go 38.46% <38.46%> (ø)
pkg/graveler/committed/wrappers.go 42.85% <42.85%> (ø)
pkg/graveler/committed/apply.go 57.62% <56.39%> (+3.59%) ⬆️
pkg/graveler/committed/merge_iterator.go 90.14% <89.71%> (-1.11%) ⬇️
pkg/graveler/committed/diff.go 96.85% <95.09%> (-3.15%) ⬇️
pkg/graveler/committed/merge_value_iterator.go 100.00% <100.00%> (ø)
pkg/graveler/committed/meta_range.go 100.00% <100.00%> (ø)
pkg/graveler/committed/range.go 91.66% <100.00%> (+4.16%) ⬆️
pkg/graveler/committed/value.go 74.35% <0.00%> (-4.02%) ⬇️
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 35e5261...dda29e5. Read the comment docs.

@guy-har
Copy link
Contributor Author

guy-har commented Mar 21, 2021

@arielshaqed Thanks,
I made some changes and added some tests.
I am planning on adding some more tests.

@guy-har guy-har requested a review from arielshaqed March 21, 2021 07:58
Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Nice work, but many details in a large 🐘 PR.

A bit hard to review because some comments appear to have no responses and also no changes. Specifically I am unsure why we poll on cancellation.

I have some (ok, many) places I did not understand, and would prefer to go over those before I read the tests more deeply.

continue
switch typ {
case graveler.DiffTypeAdded:
// exists on source, but not on dest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document DiffTypeAdded better? I would expect this to be "exists on dest, but not on source" -- here it's the reverse?! (Maybe add pretty pictures...?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also wasn't added as part of this PR, the reason is that we are holding a diff iterator that does a 2 way diff between destination and source. I could also change it if it makes more sense. prefer not as part of this PR. @arielshaqed what do you think?

Next() bool
// NextRange skips the current range
// possible only if we are currently inside a range
// the next value could be Range or Value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could only get a range when one of the ranges is completely before the other range,
In case we have overlapping ranges we could not return range.

@johnnyaug
Copy link
Contributor

Great optimization!!

@guy-har
Copy link
Contributor Author

guy-har commented Mar 29, 2021

I have made changes according to the comments and added tests.
@arielshaqed @johnnyaug PTAL.

actualDiffKeys = append(actualDiffKeys, string(it.Value().Key))
diffs = append(diffs, it.Value())
diff, rng := it.Value()
actualDiffRanges = append(actualDiffRanges, newExpectedRange(rng))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you're adding an "expected range" into an array of "actual range" - which means maybe the object type should simply be "range" or "testRange"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both are already in use, changing to diffTestRange

@guy-har guy-har requested a review from johnnyaug March 30, 2021 06:23
@arielshaqed
Copy link
Contributor

@guy-har , it is very hard for me to review a monolithic PR such as this one. I wrote in the previous round that:

A bit hard to review because some comments appear to have no responses and also no changes. Specifically I am unsure why we poll on cancellation.

Please note that I am trying to be very explicit about both these requests. The second remains open -- we poll on cancellation throughout the code, I am not sure why.

About the second: You marked most changes as resolved, but not all -- some just vanish into an "outdated" section. This makes it harder for me to track what came of each one. Sometimes the comment is marked "resolved" and but I do not understand the resolution. For instance I did not understand how you resolved this comment.

A short comment might help, particularly when the fix is nontrivial.

In any case, apologies for an expected long round-trip time.

@guy-har
Copy link
Contributor Author

guy-har commented Mar 30, 2021

@arielshaqed I marked the resolved after changing according to the comment, I will go over them again.
according to polling on cancelation. I answered in both comments and marked as resolved, which made it not visible (sorry about that). My answer to the polling was that it was added as part of #1603, I basically just moved it around. We could decide we wan't to remove it, but I don't think it should be part of this PR.

Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Nothing major (I hope), but must do these before pulling:

  1. Document merge iterator NextRange and explain diff iterator NextRange when there is no range.
  2. AFAICT one test uses deep.Equal on a struct with lowercase field names, so it does very little.


func incrementDiffSummary(d *graveler.DiffSummary, typ graveler.DiffType) {
addIntoDiffSummary(d, typ, 1)
type applier struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this is an object. The only state it holds is a diff summary, which is anyway implemented as a function, and the two have* members. It is created once, then its apply method is called -- with no parameters. AFAICT it is just a function that has been represented as an object. Having it as an object makes me have to worry about whether it does actually have important state. Calls to functions that modify have* could be modeled as re-assigning some function-local state variables. Alternatively, if there is a useful abstraction for "iterator-and-the-last-value-from-next", then let's implement that and put the X Iterator and haveX into a single class named something like IteratorWithEnd.

If the goal is to have a nice struct with all the used parameters in it then just have a struct, no need to attach methods to it. But I would really question why passing a single parameter named a to a function is clearer than passing the named parameters to that function. E.g. as it stands, the type signature for addIntoDiffSumary doesn't indicate that it actually changes only a.summary, but does not advance any iterators or write ranges. Sure, the call a.addIntoDiffSummary(typ, n) is shorter than addIntoDiffSummary(summary, type, n), but I claim that is just because it gives less information, and forces me to go read the code.

Also nit: applier sounds like Go naming for an interface.

iterValue, iterRange := iter.Value()
if iterValue == nil {
if iterRange.IsTombstone() {
// internal error but no data lost: deletion requested of a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dunno. I'd count it as a separate category maybe, that way if it starts showing up we have a chance of someone complaining.

// applyFromSource applies all changes from source to writer.
func applyFromSource(ctx context.Context, logger logging.Logger, writer MetaRangeWriter, source Iterator) error {
// applyAll applies all changes from Iterator to writer and returns the number of writes
func (a *applier) applyAll(iter Iterator) (int, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. because of the iterator/have separation above, I think this function is unsafe to call on iterator X unless haveX.

case <-ctx.Done():
return 0, ctx.Err()
default:
func (a *applier) hasChanges(summary graveler.DiffSummary) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pure function not a method.

break
}
}
}
return source.Err()
return count, iter.Err()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. after calling this method the caller has to increment the correct statistics field -- even though it sort-of has access to the statistics (it just doesn't know what type to increment).

return string(rng.ID)
}

func TestMergeRange(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test! 😎

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 😄

// the next value could be Range or Value
NextRange() bool
// Value returns a nil ValueRecord and a Range before starting a Range, or a Value and that Range when inside a Range.
// In contrast to Iterator, the DiffIterator might not have a current range - this could happen if the current value exists in two different ranges
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we explain what NextRange does in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arielshaqed , I changed the documentation of the DiffIterator, PTAL

Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great pic!

LEBTM (looks even better to me), thanks.

Comment on lines +29 to +40
// DiffIterator might contain ranges without headers
// for example:
//
// left [min].R1.[max] [min].R3.[max] [min]...............R5..............[max]
// ------------------------------------------------------------------------------------------------
// right [min].R2.[max] [min.....R4....max] [min].R6.[max] [min].R7.[max]
//
// R1 - will return as diff with header
// R2 - will return as diff with header
// R3 and R4 - could not return a header because we must enter the ranges in order to get some header values (such as count)
// R5 and R6 - same as R3 and R4
// R7 - in case R5 has no values in the R7 range, R7 would return as a diff with header
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat explanation, great drawing! I would maybe rephrase the tense to make it a bit shorter.

Suggested change
// DiffIterator might contain ranges without headers
// for example:
//
// left [min].R1.[max] [min].R3.[max] [min]...............R5..............[max]
// ------------------------------------------------------------------------------------------------
// right [min].R2.[max] [min.....R4....max] [min].R6.[max] [min].R7.[max]
//
// R1 - will return as diff with header
// R2 - will return as diff with header
// R3 and R4 - could not return a header because we must enter the ranges in order to get some header values (such as count)
// R5 and R6 - same as R3 and R4
// R7 - in case R5 has no values in the R7 range, R7 would return as a diff with header
// DiffIterator may contain ranges without headers. For instance, consider the diff of
// these ranges:
//
// left [min].R1.[max] [min].R3.[max] [min]...............R5..............[max]
// ------------------------------------------------------------------------------------------------
// right [min].R2.[max] [min.....R4....max] [min].R6.[max] [min].R7.[max]
//
// R1 - returned as a diff inside range R1.
// R2 - returned as a diff inside range R2.
// R3 and R4 - not in any existing range (and creating a new range would require consuming them).
// R5 and R6 - as above, not in any existing range.
// R7 - returned as a diff inside range R7 if R5 has no keys inside the range of R7.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 🙏

@guy-har guy-har linked an issue Mar 31, 2021 that may be closed by this pull request
@guy-har guy-har changed the title Merge with diff Fix merge large changes performance Mar 31, 2021
@guy-har guy-har merged commit d5605ac into master Mar 31, 2021
@guy-har guy-har deleted the fix/merge-with-range branch March 31, 2021 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Merge large changes performance
4 participants