Improve binary search performance time for diff creation #3

Closed
phmccarty wants to merge 1 commit

Conversation

phmccarty
Contributor

When sections of old_data and new_data are found to be equal in the
course of searching for matches, short circuit at that point.

Signed-off-by: Patrick McCarty <patrick.mccarty@intel.com>
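
For context, a minimal sketch of the short-circuit idea, assuming the upstream bsdiff-style search() over the suffix array I (the same signature that appears in the review below); cmp is a placeholder and this is not the actual diff:

/* Sketch only, not the exact patch. Each step of the binary search over the
 * suffix array I compares a slice of old_data against new_data. When the
 * slices compare equal, any later exact match can only be shorter (the
 * rationale quoted in the in-diff comment below), so the search can stop
 * here instead of recursing further. */
x = st + (en - st) / 2;
cmp = memcmp(old + I[x], new, MIN(oldsize - I[x], newsize));
if (cmp == 0) {
        *pos = I[x];                          /* offset of the match in old_data */
        return MIN(oldsize - I[x], newsize);  /* length of the exact match */
}
if (cmp < 0)
        return search(I, old, oldsize, new, newsize, x, en, pos);
return search(I, old, oldsize, new, newsize, st, x, pos);

The review snippet further down shows the actual patch assigning *pos = I[en]; the sketch uses the probe index x only to keep the example self-contained.
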
phmccarty mentioned this pull request Jun 7, 2017
@phmccarty
Contributor Author

This change will need a lot more testing, but it fixes the performance issue reported in #2.

Feedback welcome!

return search(I, old, oldsize, new, newsize, st, x, pos);
} else {

This else is not needed.

Contributor Author

Technically, yes, but this is a style preference. I think having the else block keeps the code in a more maintainable state, since it is linked (conceptually) to the memcmp result.
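
Purely for illustration (placeholder condition, not the actual diff), the style question is the usual else-after-a-return one:

/* Both forms behave identically, because each branch returns. */
if (cmp < 0) {
        return search(I, old, oldsize, new, newsize, x, en, pos);
} else {
        return search(I, old, oldsize, new, newsize, st, x, pos);
}

/* Without the else: */
if (cmp < 0)
        return search(I, old, oldsize, new, newsize, x, en, pos);
return search(I, old, oldsize, new, newsize, st, x, pos);

Keeping the else makes the pairing with the memcmp result explicit, at the cost of an extra branch keyword; dropping it flattens the control flow.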

/* As a special case, short circuit for the first exact match
* between old_data and new_data, since future exact matches
* will have shorter length. */
*pos = I[en];

Before assigning a value to pos, you should validate that it is not NULL.

Contributor Author

The function that populates the I array will never add NULL values because the values are offsets into old_data (so, greater than or equal to 0).
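
For reference, a hedged sketch of where I comes from, assuming the upstream bsdiff setup (this repository may differ in details):

/* Sketch only: upstream bsdiff allocates I and fills it with a suffix sort,
 * so every entry is a byte offset into old_data, i.e. a value >= 0. */
off_t *I = malloc((oldsize + 1) * sizeof(off_t));  /* one entry per suffix of old_data */
off_t *V = malloc((oldsize + 1) * sizeof(off_t));
qsufsort(I, V, old, oldsize);   /* suffix sort: I[k] is an offset in [0, oldsize] */
free(V);                        /* V is only needed during the sort */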

devimc commented Jun 8, 2017

@phmccarty do you have metrics/numbers showing how much this change improves performance?

@phmccarty
Contributor Author

@devimc So far, I only have the numbers for creating diffs between the files posted in issue #2.

With the v1.0.2 release, run time is 64 minutes, and with this patch, it is 3 seconds. So, a big improvement for this case :-)

tmarcu commented Jun 8, 2017

+1 to the code changes, but I think we now need to add some more heuristics and testing to ensure that the algorithm is not too greedy.

tpepper commented Jun 8, 2017

Looks good to me in general, but as @tmarcu mentions, it could be too greedy. Perhaps instead of the unconditional } else { that was added, the short circuit could be conditioned on the absolute length of the match, since you now have that in a variable? And maybe start with something big (500k?), since it's the big files where the issue has been noticed?
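
A hedged sketch of that suggestion (the threshold and the names are placeholders, nothing settled in this thread):

/* Sketch only: gate the short circuit on the length of the exact match, so
 * the greedy early exit fires only for large matches. 500 * 1024 mirrors
 * the 500k figure above and is purely a placeholder. */
#define SHORT_CIRCUIT_MIN_LEN (500 * 1024)

len = MIN(oldsize - I[x], newsize);
if (cmp == 0 && len >= SHORT_CIRCUIT_MIN_LEN) {
        *pos = I[x];
        return len;
}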

@phmccarty
Contributor Author

I started looking into this issue again recently, and I want to take a very different approach to solving it. I'll open a new PR with the proposed changes.

phmccarty closed this Dec 19, 2017
zxnmhd commented Mar 30, 2018

@phmccarty Can you share the link to the other PR you mentioned above that solves this with the different approach?

@phmccarty
Contributor Author

@zxnmhd Yes, I implemented the solution in #6
