Improve BoundedBreakIteratorScanner fragmentation algorithm (#73785)
The current approach begins by finding the nearest preceding and following boundaries, then greedily expands the following boundary for as long as the fragment stays within the maximum length. This is fine asymptotically, but the BreakIterator used to find each boundary is sometimes expensive, and the expansion calls it once per boundary.

The new approach maximizes the end boundary directly by scanning for the last boundary preceding the position at which the length restriction would be violated (i.e. given the start boundary and the offset, the number of characters left before the resulting length reaches the fragment size). If this scan returns the start boundary, the restriction cannot be satisfied, and we take the first boundary following the offset instead (or better, since [offset, targetEndOffset] has already been scanned, the search starts from targetEndOffset + 1).
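
The new approach can be sketched in isolation as follows. This is a minimal illustration, not the code from this commit: the class `FragmentSketch` and the method `fragment` are hypothetical names, it uses a plain `java.text.BreakIterator` instead of the scanner's internal state, it assumes `offset` lies within the text, and it ignores whatever the scanner does afterwards when the fragment is still longer than `maxLen`.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

public class FragmentSketch {

    /**
     * Hypothetical helper: returns {start, end} of the fragment around offset.
     * The fragment is at most maxLen long whenever some boundary allows it.
     * Assumes 0 <= offset <= text.length().
     */
    static int[] fragment(BreakIterator breaks, String text, int offset, int maxLen) {
        breaks.setText(text);

        // Nearest boundary before offset; preceding() returns DONE (-1) at the text start.
        int start = Math.max(breaks.preceding(offset), 0);

        // Furthest position the fragment may reach while staying within maxLen.
        int targetEnd = offset + Math.max(0, maxLen - (offset - start));

        int end;
        if (targetEnd + 1 > text.length()) {
            // The rest of the text fits, so no scan is needed.
            end = text.length();
        } else {
            // One backward scan: last boundary at or before targetEnd.
            end = breaks.preceding(targetEnd + 1);
        }

        // The scan fell back to the start boundary, so nothing fits within maxLen:
        // take the first boundary after the range that was already scanned.
        if (end == start) {
            end = breaks.following(targetEnd);
        }
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        String text = "This is the first test sentence. Here is the second one.";
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        int offset = text.indexOf("first");

        // Limit larger than the text, limit that fits one sentence, limit that fits none.
        for (int maxLen : new int[] {1000, 40, 10}) {
            System.out.println(maxLen + " -> " + Arrays.toString(fragment(sentences, text, offset, maxLen)));
        }
    }
}
```

In this sketch each call makes at most three `BreakIterator` lookups, regardless of how many sentence boundaries lie between the offset and the length limit, which is the point of replacing the greedy expansion loop.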
thelink2012 authored Jul 5, 2021
1 parent 1db75f7 commit 1fc09f9
Showing 2 changed files with 38 additions and 9 deletions.
@@ -89,16 +89,29 @@ public int preceding(int offset) {
             innerStart = innerEnd;
             innerEnd = windowEnd;
         } else {
-            windowStart = innerStart = mainBreak.preceding(offset);
-            windowEnd = innerEnd = mainBreak.following(offset - 1);
-            // expand to next break until we reach maxLen
-            while (innerEnd - innerStart < maxLen) {
-                int newEnd = mainBreak.following(innerEnd);
-                if (newEnd == DONE || (newEnd - innerStart) > maxLen) {
-                    break;
-                }
-                windowEnd = innerEnd = newEnd;
+            innerStart = Math.max(mainBreak.preceding(offset), 0);
+
+            final int targetEndOffset = offset + Math.max(0, maxLen - (offset - innerStart));
+            final int textEndIndex = getText().getEndIndex();
+
+            if (targetEndOffset + 1 > textEndIndex) {
+                innerEnd = textEndIndex;
+            } else {
+                innerEnd = mainBreak.preceding(targetEndOffset + 1);
+            }
+
+            assert innerEnd != DONE && innerEnd >= innerStart;
+
+            // in case no break was found up to maxLen, find one afterwards.
+            if (innerStart == innerEnd) {
+                innerEnd = mainBreak.following(targetEndOffset);
+                assert innerEnd - innerStart > maxLen;
+            } else {
+                assert innerEnd - innerStart <= maxLen;
             }
+
+            windowStart = innerStart;
+            windowEnd = innerEnd;
         }

         if (innerEnd - innerStart > maxLen) {
@@ -124,4 +124,20 @@ public void testBoundedSentence() {
             );
         }
     }
+
+    public void testTextThatEndsBeforeMaxLen() {
+        BreakIterator bi = BoundedBreakIteratorScanner.getSentence(Locale.ROOT, 1000);
+
+        final String text = "This is the first test sentence. Here is the second one.";
+
+        int offset = text.indexOf("first");
+        bi.setText(text);
+        assertEquals(0, bi.preceding(offset));
+        assertEquals(text.length(), bi.following(offset - 1));
+
+        offset = text.indexOf("second");
+        bi.setText(text);
+        assertEquals(33, bi.preceding(offset));
+        assertEquals(text.length(), bi.following(offset - 1));
+    }
 }
