Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FindMergeBase - Removing redundant commit node visits #2968

Merged
merged 6 commits into from
Feb 27, 2022

Conversation

itaidavid
Copy link
Contributor

@itaidavid itaidavid commented Feb 26, 2022

Fixes #2962

This PR is about an efficiency issue discovered in FindMergeBase
The function performs a BFS from 2 nodes, in order to find its best common ancestor
The BFS analyzes the parent commit nodes for each of the commit nodes and their parent commit nodes in turn
Since a certain commit can act as a parent for more than one commit, it is possible that it is met multiple times during the BFS and it is added multiple times. Once this is done, the entire ancestry chain is also processed multiple times and in case grid-like commits graph, this can quicky lead to a lot of redundant processing of the same nodes

This code removes the redundant processing as nodes are only processed (queued and scanned) once, but their 'reached from' status is still updated and checked each time these nodes are met. The first commit node to be reached from both source commits, is the requested ancestor - same as the old logic

Performance improvement:
Running locally, a diff that previously ran millions of iterations, ran 155 iterations instead, and finished within seconds.

@itaidavid itaidavid added the include-changelog PR description should be included in next release changelog label Feb 26, 2022
Copy link
Contributor

@talSofer talSofer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! LGTM!

Let's have another set of eyes on this change.

Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Consider also consulting @johnnyaug about this code :-)

Adding comments about how we might further greatly speed up this function to the issue: while the current algorithm seems very precise and fast, ACAIFT it might performs many redundant and intentionally slow database accesses for each GetCommit call.

reached[parent] |= commitFlags
if reached[parent]&fromLeft != 0 && reached[parent]&fromRight != 0 {
// commit was reached from both left and right nodes
// Since it is not certain that the commit exist in the commits queue, a call
Copy link
Contributor

@talSofer talSofer Feb 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@itai-david Can you give an example when the parent commit is not in the queue at this point?
at the first time the parent is reached it will be enqueued in line 51.

@johnnyaug @arielshaqed you may know the answer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment is misleading - it can be removed.
We simply don't have the commit in memory at this point, so GetCommit is called.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we cache commits in memory, it will be. Otherwise we might have read it or not, depending on the structure of the graph. Agree wtih @johnnyaug the comment can be removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think any comment is needed - it confuses more than it clears anything

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the comment, thanks

Copy link
Contributor

@johnnyaug johnnyaug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Sorry for causing this bug.
Left minor comments.

Comment on lines 65 to 69
commit, err := getter.GetCommit(ctx, repositoryID, parent)
if err != nil {
return nil, err
}
return commit, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
commit, err := getter.GetCommit(ctx, repositoryID, parent)
if err != nil {
return nil, err
}
return commit, nil
return getter.GetCommit(ctx, repositoryID, parent)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +29 to +40
commit, err := getCommitAndEnqueue(ctx, getter, &queue, repositoryID, leftID)
if err != nil {
return nil, err
}
if leftID == rightID {
return commit, nil
}

_, err = getCommitAndEnqueue(ctx, getter, &queue, repositoryID, rightID)
if err != nil {
return nil, err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for this part?
Wouldn't it work with the hypothetical commit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnnyaug - The fix is relying on the reached map to identify commit nodes that were already visited. Since the initial state marks the 2 target commit as reached, they will not be queued for further traversal when "reached" from the hypothetical node, so I had to enqueue both of the nodes, as part of the initial state

Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition to the test!

reached[parent] |= commitFlags
if reached[parent]&fromLeft != 0 && reached[parent]&fromRight != 0 {
// commit was reached from both left and right nodes
// Since it is not certain that the commit exist in the commits queue, a call
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we cache commits in memory, it will be. Otherwise we might have read it or not, depending on the structure of the graph. Agree wtih @johnnyaug the comment can be removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
include-changelog PR description should be included in next release changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Diff takes more than 4 minutes to complete
4 participants