Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor part 2 - separate merge from diff iterator [change-log] optimize merge by 20 percent #2884

Merged
merged 17 commits into from
Feb 6, 2022

Conversation

eden-ohana
Copy link
Contributor

@eden-ohana eden-ohana commented Jan 23, 2022

closes #2784
Done - Refactor part 1: separate commit and merge

Refactor part 2:

  • Separate merge ​from diff iterator (base, source, dest iterators instead)
  • Merge iterator is not used anymore by the merge (didn't delete because compare iterator)
  • Apply is deleted (both merge and commit doesn't use it anymore)
  • No longer supporting diff summary for merge
  • Allowing merge with no changes
  • Optimize seek GE for base iterator

Next:

  • Simplify the diff iterator by removing all extra functionality for merge
  • Separate compare iterator from diff iterator

@eden-ohana eden-ohana added the exclude-changelog PR description should not be included in next release changelog label Jan 23, 2022
defer metaRangeIterator.Close()
summary, err = Apply(ctx, mwWriter, metaRangeIterator, diffs, &ApplyOptions{})

summary, err = Merge(ctx, mwWriter, baseIt, sourceIt, destIt)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove summary if it isn't supported anymore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

summary graveler.DiffSummary
}

// baseGE returns range from base iterator, which is greater or equal than the given key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name and description are quite confusing, it doesn't really do as described.
IIUC the function does seekGE but from the current location. This isn't intuitive when using iterators. please document well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 41 to 42
// baseGE returns value from base iterator, which is greater or equal than the given key
func (m *merger) baseKeyGE(key graveler.Key) (*graveler.ValueRecord, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as in baseRangeGE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 35 to 38
if err := m.base.Err(); err != nil {
return nil, err
}
return nil, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if err := m.base.Err(); err != nil {
return nil, err
}
return nil, nil
return nil, m.base.Err()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +64 to +67
if err := m.base.Err(); err != nil {
return nil, err
}
return nil, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if err := m.base.Err(); err != nil {
return nil, err
}
return nil, nil
return nil, m.base.Err()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if baseValue != nil && bytes.Equal(sourceValue.Identity, baseValue.Identity) { // dest deleted this record
m.haveSource = m.source.Next()
} else {
if baseValue != nil && bytes.Equal(sourceValue.Key, baseValue.Key) { // conflict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if baseValue != nil && bytes.Equal(sourceValue.Key, baseValue.Key) { // conflict
if baseValue != nil && bytes.Equal(sourceValue.Key, baseValue.Key) { // deleted by dest and changed by source

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if baseValue != nil && bytes.Equal(destValue.Identity, baseValue.Identity) { // source deleted this record
m.haveDest = m.dest.Next()
} else {
if baseValue != nil && bytes.Equal(destValue.Key, baseValue.Key) { // conflict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if baseValue != nil && bytes.Equal(destValue.Key, baseValue.Key) { // conflict
if baseValue != nil && bytes.Equal(destValue.Key, baseValue.Key) { // deleted by source added by dest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return graveler.ErrConflictFound
}
}
// both added the same record
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not precise, there is also the (more common) case of no change
sourceValue.Identity == destValue.Identity == baseValue.Identity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

m.haveDest = m.dest.NextRange()
return nil
} else {
return graveler.ErrConflictFound // conflict
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No conflicts on ranges

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

pkg/graveler/committed/merge.go Show resolved Hide resolved
Copy link
Contributor

@johnnyaug johnnyaug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work!
I didn't finish reviewing yet, but since there is already one critical comment I'm publishing my review. Will continue tomorrow

@@ -53,7 +53,7 @@ func TestIdentity(t *testing.T) {

resp, err := client.MergeIntoBranchWithResponse(ctx, repo, branch1, branch2, api.MergeIntoBranchJSONRequestBody{})
require.NoError(t, err, "error during merge")
require.NotNil(t, resp.JSON400, "merge should fail since there are no changes between the branches")
require.Nil(t, resp.JSON400, "allow merge with no changes between the branches")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a strange test to check specifically that the 400 error is empty - should probably check that the returned code is not empty (200?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 346 to 348
if err != nil {
t.Fatal(err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if err != nil {
t.Fatal(err)
}
testutil.Must(t, err)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

"fmt"
"testing"

"github.com/go-test/deep"
"github.com/stretchr/testify/assert"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

organize imports

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not done - there's an empty line here

@@ -84,7 +86,6 @@ func Test_merge(t *testing.T) {
destRange []testRange
expectedActions []writeAction
conflictExpectedIdx *int
expectedSummary graveler.DiffSummary
expectedErr error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always nil in this test: not needed anymore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -239,8 +235,14 @@ func Test_merge(t *testing.T) {
conflictExpectedIdx: nil,
expectedActions: []writeAction{
{
action: actionTypeWriteRange,
rng: committed.Range{ID: "base:k1-k3", MinKey: committed.Key("k1"), MaxKey: committed.Key("k3"), Count: 2, EstimatedSize: 1024},
action: actionTypeWriteRecord,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this change mean there was a regression in the optimization? (also applies to other changes in this test)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's not a common case so I think we can skip it

// applyBothRanges applies merging when both source and dest on range header
func (m *merger) applyBothRanges(sourceRange *Range, destRange *Range) error {
switch {
case sourceRange.ID == destRange.ID: // source and dest added the same range
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// source and dest added the same range

Or, more likely, nobody added anything, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

func (m *merger) baseRangeGE(key graveler.Key) (*Range, error) {
for {
_, baseRange := m.base.Value()
if baseRange != nil && bytes.Compare(baseRange.MaxKey, key) >= 0 {
Copy link
Contributor

@johnnyaug johnnyaug Jan 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for iterating rather than doing SeekGE?
Is it a performance thing?


Naming suggestion: can be called something like "findRangeInBaseIterator" (not good enough, just suggesting a sentiment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, It's a performance improvement. Instead of doing seekGE from the beginning of base iterator each time, it continues from its current location.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do a benchmark to validate that these two functions really improve something, and document it?

Comment on lines 193 to 207
if baseRange != nil && (sourceRange.ID == baseRange.ID || destRange.ID == baseRange.ID) {
if sourceRange.ID == baseRange.ID { // dest added changes
err = m.writeRange(destRange)
} else {
err = m.writeRange(sourceRange) // source added changes
}
if err != nil {
return err
}
m.haveSource = m.source.NextRange()
m.haveDest = m.dest.NextRange()
} else { // enter both ranges
m.haveSource = m.source.Next()
m.haveDest = m.dest.Next()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, more readable like this, but it's matter of preference:

Suggested change
if baseRange != nil && (sourceRange.ID == baseRange.ID || destRange.ID == baseRange.ID) {
if sourceRange.ID == baseRange.ID { // dest added changes
err = m.writeRange(destRange)
} else {
err = m.writeRange(sourceRange) // source added changes
}
if err != nil {
return err
}
m.haveSource = m.source.NextRange()
m.haveDest = m.dest.NextRange()
} else { // enter both ranges
m.haveSource = m.source.Next()
m.haveDest = m.dest.Next()
}
if baseRange == nil {
fallthrough;
}
if sourceRange.ID == baseRange.ID {
err = m.writeRange(destRange)
} else if destRange.ID == baseRange.ID {
err = m.writeRange(sourceRange)
} else {
fallthrough;
}
if err != nil {
return err
}
m.haveSource = m.source.NextRange()
m.haveDest = m.dest.NextRange()

m.haveSource = m.source.NextRange()
return nil
}
return graveler.ErrConflictFound // conflict - both changed this range
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+add test case that covers this code

return nil
}
if baseRange == nil || destRange.ID == baseRange.ID { // source added this range
err = m.writeRange(sourceRange)
Copy link
Contributor

@johnnyaug johnnyaug Jan 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a bug here, adding a test case which I think will reflect it (add this to Test_merge).
I hope this is a valid case.
@guy-har, would love your input here as well.

		"is it a bug?": {
			baseRange: []testRange{
				{
					rng:     committed.Range{ID: "base:b-c", MinKey: committed.Key("b"), MaxKey: committed.Key("c"), Count: 2, EstimatedSize: 1024},
					records: []testValueRecord{{key: "b", identity: "b"}, {key: "c", identity: "c"}},
				},
				{
					rng:     committed.Range{ID: "base:d-e", MinKey: committed.Key("d"), MaxKey: committed.Key("e"), Count: 2, EstimatedSize: 1024},
					records: []testValueRecord{{key: "d", identity: "d"}, {key: "e", identity: "e"}},
				},
			},
			sourceRange: []testRange{
				{
					rng:     committed.Range{ID: "source:a-d", MinKey: committed.Key("a"), MaxKey: committed.Key("d"), Count: 2, EstimatedSize: 1024},
					records: []testValueRecord{{key: "a", identity: "a"}, {key: "b", identity: "b"}, {key: "d", identity: "d"}},
				},
				{
					rng:     committed.Range{ID: "source:e", MinKey: committed.Key("e"), MaxKey: committed.Key("e"), Count: 2, EstimatedSize: 1024},
					records: []testValueRecord{{key: "e", identity: "e1"}},
				},
			},
			destRange: []testRange{
				{
					rng:     committed.Range{ID: "base:b-c", MinKey: committed.Key("b"), MaxKey: committed.Key("c"), Count: 2, EstimatedSize: 1024},
					records: []testValueRecord{{key: "b", identity: "b"}, {key: "c", identity: "c"}},
				},
				{
					rng:     committed.Range{ID: "dest:d-e", MinKey: committed.Key("d"), MaxKey: committed.Key("e"), Count: 2, EstimatedSize: 1024},
					records: []testValueRecord{{key: "d", identity: "d1"}, {key: "e", identity: "e"}},
				},
			},
			conflictExpectedIdx: nil,
			expectedActions: []writeAction{
				{
					action:   actionTypeWriteRecord,
					key:      "a",
					identity: "a",
				},
				{
					action:   actionTypeWriteRecord,
					key:      "b",
					identity: "b",
				},
				{
					action:   actionTypeWriteRecord,
					key:      "d",
					identity: "d1",
				},
				{
					action:   actionTypeWriteRecord,
					key:      "e",
					identity: "e1",
				},
			},
		}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the bug and added this test, thanks!

Copy link
Contributor

@guy-har guy-har left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requesting changes because the fast forward optimization for source == base was removed

Comment on lines 111 to 114
if source == base {
// no changes on source
return "", summary, graveler.ErrNoChanges
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this removed? even if it's not an error anymore we could still return destination

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it wasn't supposed to be removed. I added it back without error.

@@ -640,7 +640,7 @@ type CommittedManager interface {
// Merge applies changes from 'source' to 'destination', relative to a merge base 'base' and
// returns the ID of the new metarange and a summary of diffs. This is similar to a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// returns the ID of the new metarange and a summary of diffs. This is similar to a
// returns the ID of the new metarange. This is similar to a

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +691 to +693
sourceRange: []testRange{
{
rng: committed.Range{ID: "source:a-d", MinKey: committed.Key("a"), MaxKey: committed.Key("d"), Count: 2, EstimatedSize: 1024},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, the test is called dest change key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed its name

expectedErr: graveler.ErrNoChanges,
expectedErr: nil,
},
"source and dest change same range": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"source and dest change same range": {
"dest removed range and added range after source removed range edges": {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this is the best name, but it's a bit of a weird test, it's not that both changed the same range. It's that source changes are included in dest ranges.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

},
},
},
"dest over source left": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"dest over source left": {
"dest removed all source added": {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +763 to +799
"same identity different key": {
baseRange: []testRange{
{
rng: committed.Range{ID: "base:a-d", MinKey: committed.Key("a"), MaxKey: committed.Key("d"), Count: 2, EstimatedSize: 1024},
records: []testValueRecord{{key: "a", identity: "a"}, {key: "d", identity: "d"}},
},
},
sourceRange: []testRange{
{
rng: committed.Range{ID: "source:a-d", MinKey: committed.Key("a"), MaxKey: committed.Key("d"), Count: 4, EstimatedSize: 1024},
records: []testValueRecord{{key: "a", identity: "a"}, {key: "a1", identity: "a"}, {key: "c", identity: "c"}},
},
},
destRange: []testRange{
{
rng: committed.Range{ID: "dest:a-d", MinKey: committed.Key("a"), MaxKey: committed.Key("d"), Count: 2, EstimatedSize: 1024},
records: []testValueRecord{{key: "a", identity: "a"}, {key: "d", identity: "d"}},
},
},
conflictExpectedIdx: nil,
expectedActions: []writeAction{
{
action: actionTypeWriteRecord,
key: "a",
identity: "a",
},
{
action: actionTypeWriteRecord,
key: "a1",
identity: "a",
},
{
action: actionTypeWriteRecord,
key: "c",
identity: "c",
},
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍

@eden-ohana eden-ohana requested a review from guy-har January 31, 2022 14:12
@@ -46,7 +46,6 @@ type CommitLog struct {
}

type MergeResult struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This struct is not really needed anymore

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -18,27 +18,8 @@ type compareIterator struct {
err error
}

type mergeIterator struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting!

func (m *merger) baseRangeGE(key graveler.Key) (*Range, error) {
for {
_, baseRange := m.base.Value()
if baseRange != nil && bytes.Compare(baseRange.MaxKey, key) >= 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do a benchmark to validate that these two functions really improve something, and document it?

"fmt"
"testing"

"github.com/go-test/deep"
"github.com/stretchr/testify/assert"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not done - there's an empty line here

}

// moveBaseToGERange moves base iterator (from current point) to range which is greater or equal than the given key
func (m *merger) moveBaseToGERange(key graveler.Key) (*Range, error) {
Copy link
Contributor

@johnnyaug johnnyaug Feb 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Range which is greater or equal than the given key" is not too well-defined: does it mean that this range starts after this key?
I think this method can be changed so that the code calling it will become simpler (see other comments)

m.haveSource = m.source.NextRange()
return nil
}
if baseRange == nil || destRange.ID == baseRange.ID { // source added this range
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check destRange.ID == baseRange.ID is irrelevant, and causes an optimization bug. At this point, destRange starts after sourceRange ends. Whether or not baseRange is the same range is not interesting. Only thing that matters is if baseRange intersects with sourceRange or not.

@eden-ohana eden-ohana requested a review from johnnyaug February 3, 2022 11:27
Copy link
Contributor

@johnnyaug johnnyaug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all the comments. Looks very good!
Minor comments + would still love to see a benchmark justifying the existence of getNextGEKey

}
}
return result
writeResponse(w, http.StatusOK, MergeResult{Reference: res}) // optimize returning unknown summary = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment not clear - can be removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return nil, nil
}

// getNextOverlappingFromBase moves base iterator (from current point) to the next overlap range with rangeToOverlap
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// getNextOverlappingFromBase moves base iterator (from current point) to the next overlap range with rangeToOverlap
// getNextOverlappingFromBase moves base iterator from its current position to the next range overlapping with rangeToOverlap

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@eden-ohana eden-ohana changed the title refactor part 2 - separate merge from diff iterator Refactor part 2 - separate merge from diff iterator [change-log] optimize merge by 20 percent Feb 6, 2022
@eden-ohana eden-ohana added include-changelog PR description should be included in next release changelog and removed exclude-changelog PR description should not be included in next release changelog labels Feb 6, 2022
@eden-ohana
Copy link
Contributor Author

Thanks for the review @guy-har @johnnyaug !
The refactor includes optimization in nextBaseGEkey - instead of moving the base iterator to the next key from the beginning each time, it moves from its current position (without scanning unnecessary ranges).
Before the change the merge took 1.15m. After the optimization it took 56s.

@eden-ohana eden-ohana merged commit 797c391 into master Feb 6, 2022
@eden-ohana eden-ohana deleted the 2784/merge-refactor-2 branch February 6, 2022 16:19
@guy-har
Copy link
Contributor

guy-har commented Feb 6, 2022

@eden-ohana

GREAT JOB 👍

We really needed this refactor, Thanks!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
include-changelog PR description should be included in next release changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor the merge operation
3 participants