-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bug in CompressionUtils that fails with certain delimiters #154
Conversation
} | ||
} | ||
|
||
@Test public void testMultiline() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the annotation and method definition are on the same line. is it a new formatting rule?
@@ -217,10 +217,19 @@ private static int getTotalByteSize(List<List<byte[]>> nestedEvents, byte[] deli | |||
for (int i = 0; i < line.length(); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to this comment. perhaps there is a way to achieve this using regex or something.
Thanks for catching and fixing this, really great! |
assertEquals(event2, result.get(1).getEventAsString()); | ||
assertEquals(event3, result.get(2).getEventAsString()); | ||
|
||
} catch (IOException ex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's necessary to catch this exception here. Just make the test method throw exception, and the test will simply fail in case of IOE. Same in the other test.
assertEquals("Delimiter: '" + delimiter + "'", result.size(), 3); | ||
assertEquals(event1, result.get(0).getEventAsString()); | ||
assertEquals(event2, result.get(1).getEventAsString()); | ||
assertEquals(event3, result.get(2).getEventAsString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertEquals("Delimiter: '" + delimiter + "'", result.size(), 3); | |
assertEquals(event1, result.get(0).getEventAsString()); | |
assertEquals(event2, result.get(1).getEventAsString()); | |
assertEquals(event3, result.get(2).getEventAsString()); | |
List<String> actual = result.stream().map(e -> e.getEventAsString()).collect(Collectors.toList()); | |
assertEquals("Delimiter: '" + delimiter + "'", Arrays.asList(event1,event2,event3), actual); |
Less duplicate code. Also if the test fails, we know the result
value.
Delimiter: 'ccd' expected:<[abc, def, ghi]> but was:<[abcccddef, ghi]>
Expected :[abc, def, ghi]
Actual :[abcccddef, ghi]
@@ -217,10 +217,19 @@ private static int getTotalByteSize(List<List<byte[]>> nestedEvents, byte[] deli | |||
for (int i = 0; i < line.length(); i++) { | |||
if (line.charAt(i) != delimiterArray[delimiterCount]) { | |||
if (delimiterCount > 0) { | |||
for (int j = 0; j < delimiterCount; ++j) { | |||
sb.append(delimiterArray[j]); | |||
boolean prefixMatch = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole method is too fancy to me. I may fail your interview if the question is to fix this bug.
Could anyone tell me why this one liner is not sufficient to replace the whole method?
return Arrays.stream(bf.lines().collect(Collectors.joining("")).split(Pattern.quote(delimiter))).map(e -> new MantisServerSentEvent(e)).collect(Collectors.toList());
(of course please make that multiple lines 🤣 )
It passes all your tests (new and old).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess is memory. I think this was meant to check lines as they are read in BufferedReader, so never need to pull the entire content into a String.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe Calvin is correct -- performance and memory was the reason for this. Most events are pretty small but some people are sending monstrous (many megabyte +) events through Mantis. In the PlayAPI req/resp body logging use case batches can reach several gigabytes every few milliseconds. The large strings and the G1 collector don't get along very well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not aware of the large input, but that's a fair reason. I do not know if we can refactor the code in a way that is much simpler based on something like above, but we can tackle that separately I believe.
If we want to optimize memory/cpu for large input strings, there are also some other ideas.
- If the output strings could be large, we can let the
StringBuilder
have some initial capacity instead of letting it grow from beginning, e.g.sb = new StringBuilder(4096);
. - If every single output String can be too small, we can reuse the StringBuilder and just
setLength(0)
instead ofnew StringBuilder()
.
@@ -217,10 +217,19 @@ private static int getTotalByteSize(List<List<byte[]>> nestedEvents, byte[] deli | |||
for (int i = 0; i < line.length(); i++) { | |||
if (line.charAt(i) != delimiterArray[delimiterCount]) { | |||
if (delimiterCount > 0) { | |||
for (int j = 0; j < delimiterCount; ++j) { | |||
sb.append(delimiterArray[j]); | |||
boolean prefixMatch = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe Calvin is correct -- performance and memory was the reason for this. Most events are pretty small but some people are sending monstrous (many megabyte +) events through Mantis. In the PlayAPI req/resp body logging use case batches can reach several gigabytes every few milliseconds. The large strings and the G1 collector don't get along very well.
On the note of the last commit to fix spotless. I've actually started adding a git
Makes sure that I don't push code that doesn't pass / won't build. |
I think I finally understand the patch. It's merged so it's great. FWIW, in the future if we want to revisit this, the reason of not processing the whole input string all at once (e.g. the one-liner) is the memory overhead. One optimization on top of the oneliner is buffered read. Sample code (passing all tests but not really tested). As the current approach is already well known in the team and I have limited knowledge about the payload, I'm not filing a PR this time. static List<MantisServerSentEvent> tokenize(BufferedReader bf, String delimiter) throws IOException {
final List<MantisServerSentEvent> ret = new ArrayList<>();
final StringBuilder sb = new StringBuilder();
final char[] buf = new char[4096];
while(bf.ready()) {
int count = bf.read(buf);
if (count == -1) break;
sb.append(buf, 0, count);
int idx = sb.lastIndexOf(delimiter);
if (idx == -1) continue;
ret.addAll(tokenize(sb.substring(0, idx), delimiter));
sb.delete(0, idx);
}
ret.addAll(tokenize(sb.toString(), delimiter));
return ret;
}
// one liner
static List<MantisServerSentEvent> tokenize(String str, String delimiter) throws IOException {
return Arrays.stream(str.replaceAll("\\n","").split(Pattern.quote(delimiter))).filter(s -> !s.isEmpty()).map(e -> new MantisServerSentEvent(e)).collect(Collectors.toList());
} |
Context
There is a bug in CompressionUtils if the delimiter has a prefix that matches the end of a segment, the data is not properly tokenized.
Checklist
./gradlew build
compiles code correctly./gradlew test
passes all testsCONTRIBUTING.md