-
Notifications
You must be signed in to change notification settings - Fork 10
Improve binary search performance time for diff creation #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
When sections of old_data and new_data are found to be equal in the course of searching for matches, short circuit at that point. Signed-off-by: Patrick McCarty <patrick.mccarty@intel.com>
This change will need a lot more testing, but it fixes the performance issue #2. Feedback welcome! |
return search(I, old, oldsize, new, newsize, st, x, pos); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this else is not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, yes, but this is a style preference. I think having the else
block keeps the code in a more maintainable state, since it is linked (conceptually) to the memcmp
result.
/* As a special case, short circuit for the first exact match | ||
* between old_data and new_data, since future exact matches | ||
* will have shorter length. */ | ||
*pos = I[en]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before assign a value to pos
, you should validate that it is not NULL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function that populates the I
array will never add NULL values because the values are offsets into old_data (so, greater than or equal to 0).
@phmccarty do you have metrics/numbers to see how much this change improve the performance? |
+1 to the code changes, but I think we now need to add some more heuristics and testing to ensure that the algorithm is not too greedy. |
Looks good to me in general, but as @tmarcu mentions could be too greedy. Perhaps for the new unconditional condition added could instead of simply }else{ instead be conditioned on absolute size of length since you now have that in a variable? And maybe start with something big (500k?), since it's the bigs where the issue's been noticed? |
I started looking into this issue again recently, and I want to take a much different approach to solving it. I'll open a new PR for the new proposed changes. |
@phmccarty Can you share the link of other PR you proposed above to open to solve it with the different approach? |
When sections of old_data and new_data are found to be equal in the
course of searching for matches, short circuit at that point.