Commit fd89cc6

Find the longest match in the suffix-sorted array
Fixes #2

The current implementation performs a binary search through the suffix-sorted array to find byte sequence matches between old and new files. However, the algorithm is not optimal when it repeatedly matches very short strings, leading to performance issues as reported in that issue.

This commit changes the algorithm to consider *all* matches encountered during the binary search and choose the longest of these matches.

The overall performance impact is yet to be determined, but it appears to yield a small percentage increase in diff creation time (expected) and a large percentage *decrease* for the case of diffing the files from issue #2. In my testing, the creation time there decreases from approx 64 minutes to 4.5 seconds.

Signed-off-by: Patrick McCarty <patrick.mccarty@intel.com>
1 parent d039492 commit fd89cc6

1 file changed: +38 -7 lines changed

src/diff.c

@@ -88,27 +88,58 @@ static int64_t matchlen(u_char *old, int64_t oldsize, u_char *new,
 	return i;
 }
 
+int64_t max_len = 0;
+
+/**
+ * Finds the longest matching array of bytes between the OLD and NEW file. The
+ * old file is suffix-sorted; the suffix-sorted array is stored at I, and
+ * indices to search between are indicated by ST (start) and EN (end). Returns
+ * the length of the match, and POS is updated to the position of the match
+ * within OLD.
+ */
 static int64_t search(int64_t *I, u_char *old, int64_t oldsize,
 		      u_char *new, int64_t newsize, int64_t st, int64_t en,
 		      int64_t *pos)
 {
 	int64_t x, y;
 
+	/* Initialize max_len for the binary search */
+	if (st == 0 && en == oldsize) {
+		max_len = matchlen(old, oldsize, new, newsize);
+		*pos = I[st];
+	}
+
+	/* The binary search terminates here when "en" and "st" are adjacent
+	 * indices in the suffix-sorted array. */
 	if (en - st < 2) {
 		x = matchlen(old + I[st], oldsize - I[st], new, newsize);
-		y = matchlen(old + I[en], oldsize - I[en], new, newsize);
-
-		if (x > y) {
+		if (x > max_len) {
+			max_len = x;
 			*pos = I[st];
-			return x;
-		} else {
+		}
+		y = matchlen(old + I[en], oldsize - I[en], new, newsize);
+		if (y > max_len) {
+			max_len = y;
 			*pos = I[en];
-			return y;
 		}
+
+		return max_len;
 	}
 
 	x = st + (en - st) / 2;
-	if (memcmp(old + I[x], new, MIN(oldsize - I[x], newsize)) < 0) {
+
+	int64_t length = MIN(oldsize - I[x], newsize);
+	u_char *oldoffset = old + I[x];
+
+	/* This match *could* be the longest one, so check for that here */
+	int64_t tmp = matchlen(oldoffset, length, new, length);
+	if (tmp > max_len) {
+		max_len = tmp;
+		*pos = I[x];
+	}
+
+	/* Determine how to continue the binary search */
+	if (memcmp(oldoffset, new, length) < 0) {
 		return search(I, old, oldsize, new, newsize, x, en, pos);
 	} else {
 		return search(I, old, oldsize, new, newsize, st, x, pos);
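
The idea in the commit message, keeping the longest match seen at every probe of the binary search rather than only comparing the two final candidates, can be illustrated with a small standalone sketch. This is not the code from src/diff.c: it is an iterative variant that carries the best match in a local struct instead of the global max_len, and it builds the suffix array with a naive qsort-based sort purely for illustration. The names common_prefix, build_suffix_array, longest_probe_match and struct match are hypothetical and do not appear in the project.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type: length of the match and its offset in the old buffer. */
struct match {
	int64_t len;
	int64_t pos;
};

/* Length of the common prefix of two byte ranges. */
static int64_t common_prefix(const unsigned char *a, int64_t alen,
			     const unsigned char *b, int64_t blen)
{
	int64_t i = 0;

	while (i < alen && i < blen && a[i] == b[i])
		i++;
	return i;
}

/* Comparator context for qsort; file-scope state keeps the sketch short
 * (illustration only, not thread-safe). */
static const unsigned char *sort_text;
static int64_t sort_len;

static int cmp_suffix(const void *pa, const void *pb)
{
	int64_t a = *(const int64_t *)pa, b = *(const int64_t *)pb;
	int64_t la = sort_len - a, lb = sort_len - b;
	int c = memcmp(sort_text + a, sort_text + b, (size_t)(la < lb ? la : lb));

	if (c != 0)
		return c;
	return (la > lb) - (la < lb);	/* a suffix that is a prefix of the other sorts first */
}

/* Naive suffix sort: fine for tiny inputs, far too slow for real files. */
void build_suffix_array(const unsigned char *text, int64_t n, int64_t *sa)
{
	for (int64_t i = 0; i < n; i++)
		sa[i] = i;
	sort_text = text;
	sort_len = n;
	qsort(sa, (size_t)n, sizeof(*sa), cmp_suffix);
}

/* Binary search over the suffix array, remembering the longest common
 * prefix seen at any probed suffix. */
struct match longest_probe_match(const int64_t *sa,
				 const unsigned char *oldbuf, int64_t oldsize,
				 const unsigned char *newbuf, int64_t newsize)
{
	struct match best = { 0, 0 };
	int64_t st = 0, en = oldsize - 1;

	assert(oldsize > 0);

	while (st <= en) {
		int64_t mid = st + (en - st) / 2;
		const unsigned char *suffix = oldbuf + sa[mid];
		int64_t suffix_len = oldsize - sa[mid];
		int64_t len = common_prefix(suffix, suffix_len, newbuf, newsize);

		/* Every probe is a candidate for the longest match. */
		if (len > best.len) {
			best.len = len;
			best.pos = sa[mid];
		}
		if (len == newsize)
			break;	/* the whole new buffer matched; cannot do better */

		/* The first differing byte (or running out of suffix) decides the direction. */
		if (len == suffix_len || suffix[len] < newbuf[len])
			st = mid + 1;
		else
			en = mid - 1;
	}
	return best;
}
```

With the old buffer "banana" and the new buffer "nanx", the probes visit the suffixes "anana", "na" and "nana" in turn, and the sketch reports a match of length 3 at offset 2. Carrying the running maximum in a struct rather than a file-scope variable is a design choice made here so the search keeps no state between calls; the commit itself uses a global.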
