Improve binary search performance time for diff creation #3

Closed
phmccarty wants to merge 1 commit

Conversation

phmccarty
Contributor

When sections of old_data and new_data are found to be equal in the
course of searching for matches, short circuit at that point.

Signed-off-by: Patrick McCarty <patrick.mccarty@intel.com>
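
For context, a minimal sketch of the short-circuit idea, assuming the upstream bsdiff-style search() over the suffix array I (the same signature that appears in the review below); cmp is a placeholder and this is not the actual diff:

/* Sketch only, not the exact patch. Each step of the binary search over the
 * suffix array I compares a slice of old_data against new_data. When the
 * slices compare equal, any later exact match can only be shorter (the
 * rationale quoted in the in-diff comment below), so the search can stop
 * here instead of recursing further. */
x = st + (en - st) / 2;
cmp = memcmp(old + I[x], new, MIN(oldsize - I[x], newsize));
if (cmp == 0) {
        *pos = I[x];                          /* offset of the match in old_data */
        return MIN(oldsize - I[x], newsize);  /* length of the exact match */
}
if (cmp < 0)
        return search(I, old, oldsize, new, newsize, x, en, pos);
return search(I, old, oldsize, new, newsize, st, x, pos);

The review snippet further down shows the actual patch assigning *pos = I[en]; the sketch uses the probe index x only to keep the example self-contained.
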
phmccarty mentioned this pull request Jun 7, 2017
@phmccarty
Contributor Author

This change will need a lot more testing, but it fixes the performance issue reported in #2.

Feedback welcome!

return search(I, old, oldsize, new, newsize, st, x, pos);
} else {

This else is not needed.

Contributor Author

Technically, yes, but this is a style preference. I think having the else block keeps the code in a more maintainable state, since it is linked (conceptually) to the memcmp result.
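
Purely for illustration (placeholder condition, not the actual diff), the style question is the usual else-after-a-return one:

/* Both forms behave identically, because each branch returns. */
if (cmp < 0) {
        return search(I, old, oldsize, new, newsize, x, en, pos);
} else {
        return search(I, old, oldsize, new, newsize, st, x, pos);
}

/* Without the else: */
if (cmp < 0)
        return search(I, old, oldsize, new, newsize, x, en, pos);
return search(I, old, oldsize, new, newsize, st, x, pos);

Keeping the else makes the pairing with the memcmp result explicit, at the cost of an extra branch keyword; dropping it flattens the control flow.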

/* As a special case, short circuit for the first exact match
* between old_data and new_data, since future exact matches
* will have shorter length. */
*pos = I[en];

Before assigning a value to pos, you should validate that it is not NULL.

Contributor Author

The function that populates the I array will never add NULL values because the values are offsets into old_data (so, greater than or equal to 0).
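
For reference, a hedged sketch of where I comes from, assuming the upstream bsdiff setup (this repository may differ in details):

/* Sketch only: upstream bsdiff allocates I and fills it with a suffix sort,
 * so every entry is a byte offset into old_data, i.e. a value >= 0. */
off_t *I = malloc((oldsize + 1) * sizeof(off_t));  /* one entry per suffix of old_data */
off_t *V = malloc((oldsize + 1) * sizeof(off_t));
qsufsort(I, V, old, oldsize);   /* suffix sort: I[k] is an offset in [0, oldsize] */
free(V);                        /* V is only needed during the sort */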

devimc commented Jun 8, 2017

@phmccarty do you have metrics/numbers showing how much this change improves performance?

@phmccarty
Contributor Author

@devimc So far, I only have the numbers for creating diffs between the files posted in issue #2.

With the v1.0.2 release, run time is 64 minutes, and with this patch, it is 3 seconds. So, a big improvement for this case :-)

tmarcu commented Jun 8, 2017

+1 to the code changes, but I think we now need to add some more heuristics and testing to ensure that the algorithm is not too greedy.

tpepper commented Jun 8, 2017

Looks good to me in general, but as @tmarcu mentions, it could be too greedy. Perhaps instead of the unconditional } else { that was added, the short circuit could be conditioned on the absolute length of the match, since you now have that in a variable? And maybe start with something big (500k?), since it's the big files where the issue has been noticed?
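
A hedged sketch of that suggestion (the threshold and the names are placeholders, nothing settled in this thread):

/* Sketch only: gate the short circuit on the length of the exact match, so
 * the greedy early exit fires only for large matches. 500 * 1024 mirrors
 * the 500k figure above and is purely a placeholder. */
#define SHORT_CIRCUIT_MIN_LEN (500 * 1024)

len = MIN(oldsize - I[x], newsize);
if (cmp == 0 && len >= SHORT_CIRCUIT_MIN_LEN) {
        *pos = I[x];
        return len;
}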

@phmccarty
Contributor Author

I started looking into this issue again recently, and I want to take a very different approach to solving it. I'll open a new PR with the proposed changes.

phmccarty closed this Dec 19, 2017
zxnmhd commented Mar 30, 2018

@phmccarty Can you share the link to the other PR you mentioned above that solves this with the different approach?

@phmccarty
Contributor Author

@zxnmhd Yes, I implemented the solution in #6
