-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FindMergeBase - Removing redundant commit node visits #2968
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch! LGTM!
Let's have another set of eyes on this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Consider also consulting @johnnyaug about this code :-)
Adding comments about how we might further greatly speed up this function to the issue: while the current algorithm seems very precise and fast, ACAIFT it might performs many redundant and intentionally slow database accesses for each GetCommit
call.
reached[parent] |= commitFlags | ||
if reached[parent]&fromLeft != 0 && reached[parent]&fromRight != 0 { | ||
// commit was reached from both left and right nodes | ||
// Since it is not certain that the commit exist in the commits queue, a call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@itai-david Can you give an example when the parent commit is not in the queue at this point?
at the first time the parent is reached it will be enqueued in line 51.
@johnnyaug @arielshaqed you may know the answer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment is misleading - it can be removed.
We simply don't have the commit in memory at this point, so GetCommit is called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we cache commits in memory, it will be. Otherwise we might have read it or not, depending on the structure of the graph. Agree wtih @johnnyaug the comment can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think any comment is needed - it confuses more than it clears anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the comment, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Sorry for causing this bug.
Left minor comments.
commit, err := getter.GetCommit(ctx, repositoryID, parent) | ||
if err != nil { | ||
return nil, err | ||
} | ||
return commit, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit, err := getter.GetCommit(ctx, repositoryID, parent) | |
if err != nil { | |
return nil, err | |
} | |
return commit, nil | |
return getter.GetCommit(ctx, repositoryID, parent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
commit, err := getCommitAndEnqueue(ctx, getter, &queue, repositoryID, leftID) | ||
if err != nil { | ||
return nil, err | ||
} | ||
if leftID == rightID { | ||
return commit, nil | ||
} | ||
|
||
_, err = getCommitAndEnqueue(ctx, getter, &queue, repositoryID, rightID) | ||
if err != nil { | ||
return nil, err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for this part?
Wouldn't it work with the hypothetical commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnnyaug - The fix is relying on the reached
map to identify commit nodes that were already visited. Since the initial state marks the 2 target commit as reached, they will not be queued for further traversal when "reached" from the hypothetical node, so I had to enqueue both of the nodes, as part of the initial state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice addition to the test!
reached[parent] |= commitFlags | ||
if reached[parent]&fromLeft != 0 && reached[parent]&fromRight != 0 { | ||
// commit was reached from both left and right nodes | ||
// Since it is not certain that the commit exist in the commits queue, a call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we cache commits in memory, it will be. Otherwise we might have read it or not, depending on the structure of the graph. Agree wtih @johnnyaug the comment can be removed.
Fixes #2962
This PR is about an efficiency issue discovered in
FindMergeBase
The function performs a BFS from 2 nodes, in order to find its best common ancestor
The BFS analyzes the parent commit nodes for each of the commit nodes and their parent commit nodes in turn
Since a certain commit can act as a parent for more than one commit, it is possible that it is met multiple times during the BFS and it is added multiple times. Once this is done, the entire ancestry chain is also processed multiple times and in case grid-like commits graph, this can quicky lead to a lot of redundant processing of the same nodes
This code removes the redundant processing as nodes are only processed (queued and scanned) once, but their 'reached from' status is still updated and checked each time these nodes are met. The first commit node to be reached from both source commits, is the requested ancestor - same as the old logic
Performance improvement:
Running locally, a diff that previously ran millions of iterations, ran 155 iterations instead, and finished within seconds.