Skip to content

Commit b3ea20f

Browse files
committed
perf: Avoid re-canonicalizing the entire IntervalSet on push
Canonicalize is taking up a significant amount due to a regex with a huge amount of character ranges (generated by [lalrpop](https://github.com/lalrpop/lalrpop)'s lexer expanding multiple `\w` in a token). While this could perhaps be fixed in lalrpop I did notice the TODO in the code and after addressing this so we automatically union and compress on each push instead of re-canonicalizing on every push and that fixed the performance problem. I did see the earlier attempt at this #1051 and it seems like that was reverted and regression tests were added so I hope that and the existing tests are enough (I don't have a clear idea on what tests might be missing).
1 parent 5ea3eb1 commit b3ea20f

File tree

1 file changed

+37
-9
lines changed

1 file changed

+37
-9
lines changed

regex-syntax/src/hir/interval.rs

Lines changed: 37 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -80,15 +80,43 @@ impl<I: Interval> IntervalSet<I> {
8080
}
8181

8282
/// Add a new interval to this set.
83-
pub fn push(&mut self, interval: I) {
84-
// TODO: This could be faster. e.g., Push the interval such that
85-
// it preserves canonicalization.
86-
self.ranges.push(interval);
87-
self.canonicalize();
88-
// We don't know whether the new interval added here is considered
89-
// case folded, so we conservatively assume that the entire set is
90-
// no longer case folded if it was previously.
91-
self.folded = false;
83+
pub fn push(&mut self, mut interval: I) {
84+
match self.ranges.binary_search(&interval) {
85+
// Interval is an exact duplicate of one that exists
86+
Ok(_) => {}
87+
Err(i) => {
88+
// The search finds us the first index where the previous interval start is less
89+
// than the new interval start. Since the existing intervals are non-overlapping
90+
// we only need to try to union this single preceding interval
91+
let mut start = i;
92+
if let Some(before_i) = i.checked_sub(1) {
93+
let before = &self.ranges[before_i];
94+
if let Some(union) = before.union(&interval) {
95+
interval = union;
96+
start = before_i;
97+
}
98+
}
99+
// `interval` may overlap any number of intervals following the insertion point
100+
// so will union each of them until we reach the first non-overlapping interval
101+
let mut end = i;
102+
for after_i in i..self.ranges.len() {
103+
let after = &self.ranges[after_i];
104+
if let Some(union) = interval.union(after) {
105+
interval = union;
106+
end = after_i + 1;
107+
} else {
108+
break;
109+
}
110+
}
111+
112+
self.ranges.splice(start..end, core::iter::once(interval));
113+
114+
// We don't know whether the new interval added here is considered
115+
// case folded, so we conservatively assume that the entire set is
116+
// no longer case folded if it was previously.
117+
self.folded = false;
118+
}
119+
}
92120
}
93121

94122
/// Return an iterator over all intervals in this set.

0 commit comments

Comments
 (0)