Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in CompressionUtils that fails with certain delimiters #154

Merged
merged 6 commits into from
Mar 25, 2022

Conversation

calvin681
Copy link
Collaborator

@calvin681 calvin681 commented Mar 25, 2022

Context

There is a bug in CompressionUtils if the delimiter has a prefix that matches the end of a segment, the data is not properly tokenized.

Checklist

  • ./gradlew build compiles code correctly
  • Added new tests where applicable
  • ./gradlew test passes all tests
  • Extended README or added javadocs where applicable
  • Added copyright headers for new files from CONTRIBUTING.md

}
}

@Test public void testMultiline() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the annotation and method definition are on the same line. is it a new formatting rule?

@@ -217,10 +217,19 @@ private static int getTotalByteSize(List<List<byte[]>> nestedEvents, byte[] deli
for (int i = 0; i < line.length(); i++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to this comment. perhaps there is a way to achieve this using regex or something.

@hmitnflx
Copy link
Collaborator

hmitnflx commented Mar 25, 2022

Thanks for catching and fixing this, really great!

assertEquals(event2, result.get(1).getEventAsString());
assertEquals(event3, result.get(2).getEventAsString());

} catch (IOException ex) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary to catch this exception here. Just make the test method throw exception, and the test will simply fail in case of IOE. Same in the other test.

Comment on lines 85 to 88
assertEquals("Delimiter: '" + delimiter + "'", result.size(), 3);
assertEquals(event1, result.get(0).getEventAsString());
assertEquals(event2, result.get(1).getEventAsString());
assertEquals(event3, result.get(2).getEventAsString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assertEquals("Delimiter: '" + delimiter + "'", result.size(), 3);
assertEquals(event1, result.get(0).getEventAsString());
assertEquals(event2, result.get(1).getEventAsString());
assertEquals(event3, result.get(2).getEventAsString());
List<String> actual = result.stream().map(e -> e.getEventAsString()).collect(Collectors.toList());
assertEquals("Delimiter: '" + delimiter + "'", Arrays.asList(event1,event2,event3), actual);

Less duplicate code. Also if the test fails, we know the result value.

Delimiter: 'ccd' expected:<[abc, def, ghi]> but was:<[abcccddef, ghi]>
Expected :[abc, def, ghi]
Actual   :[abcccddef, ghi]

@@ -217,10 +217,19 @@ private static int getTotalByteSize(List<List<byte[]>> nestedEvents, byte[] deli
for (int i = 0; i < line.length(); i++) {
if (line.charAt(i) != delimiterArray[delimiterCount]) {
if (delimiterCount > 0) {
for (int j = 0; j < delimiterCount; ++j) {
sb.append(delimiterArray[j]);
boolean prefixMatch = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole method is too fancy to me. I may fail your interview if the question is to fix this bug.

Could anyone tell me why this one liner is not sufficient to replace the whole method?

return Arrays.stream(bf.lines().collect(Collectors.joining("")).split(Pattern.quote(delimiter))).map(e -> new MantisServerSentEvent(e)).collect(Collectors.toList());

(of course please make that multiple lines 🤣 )

It passes all your tests (new and old).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is memory. I think this was meant to check lines as they are read in BufferedReader, so never need to pull the entire content into a String.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe Calvin is correct -- performance and memory was the reason for this. Most events are pretty small but some people are sending monstrous (many megabyte +) events through Mantis. In the PlayAPI req/resp body logging use case batches can reach several gigabytes every few milliseconds. The large strings and the G1 collector don't get along very well.

Copy link
Contributor

@liuml07 liuml07 Mar 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not aware of the large input, but that's a fair reason. I do not know if we can refactor the code in a way that is much simpler based on something like above, but we can tackle that separately I believe.

If we want to optimize memory/cpu for large input strings, there are also some other ideas.

  1. If the output strings could be large, we can let the StringBuilder have some initial capacity instead of letting it grow from beginning, e.g. sb = new StringBuilder(4096);.
  2. If every single output String can be too small, we can reuse the StringBuilder and just setLength(0) instead of new StringBuilder().

@@ -217,10 +217,19 @@ private static int getTotalByteSize(List<List<byte[]>> nestedEvents, byte[] deli
for (int i = 0; i < line.length(); i++) {
if (line.charAt(i) != delimiterArray[delimiterCount]) {
if (delimiterCount > 0) {
for (int j = 0; j < delimiterCount; ++j) {
sb.append(delimiterArray[j]);
boolean prefixMatch = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe Calvin is correct -- performance and memory was the reason for this. Most events are pretty small but some people are sending monstrous (many megabyte +) events through Mantis. In the PlayAPI req/resp body logging use case batches can reach several gigabytes every few milliseconds. The large strings and the G1 collector don't get along very well.

@codyrioux
Copy link
Contributor

On the note of the last commit to fix spotless. I've actually started adding a git pre-push hook in repos I work in:

#!/usr/bin/env bash
gw :spotlessApply
gw clean build

Makes sure that I don't push code that doesn't pass / won't build.

@calvin681 calvin681 merged commit da92977 into Netflix:master Mar 25, 2022
@hmitnflx hmitnflx mentioned this pull request Mar 26, 2022
5 tasks
@liuml07
Copy link
Contributor

liuml07 commented Mar 26, 2022

I think I finally understand the patch. It's merged so it's great.

FWIW, in the future if we want to revisit this, the reason of not processing the whole input string all at once (e.g. the one-liner) is the memory overhead.

One optimization on top of the oneliner is buffered read. Sample code (passing all tests but not really tested). As the current approach is already well known in the team and I have limited knowledge about the payload, I'm not filing a PR this time.

    static List<MantisServerSentEvent> tokenize(BufferedReader bf, String delimiter) throws IOException {
        final List<MantisServerSentEvent> ret = new ArrayList<>();
        final StringBuilder sb = new StringBuilder();
        final char[] buf = new char[4096];
        while(bf.ready()) {
            int count = bf.read(buf);
            if (count == -1) break;
            sb.append(buf, 0, count);
            int idx = sb.lastIndexOf(delimiter);
            if (idx == -1) continue;
            ret.addAll(tokenize(sb.substring(0, idx), delimiter));
            sb.delete(0, idx);
        }
        ret.addAll(tokenize(sb.toString(), delimiter));
        return ret;
    }

    // one liner
    static List<MantisServerSentEvent> tokenize(String str, String delimiter) throws IOException {
        return Arrays.stream(str.replaceAll("\\n","").split(Pattern.quote(delimiter))).filter(s -> !s.isEmpty()).map(e -> new MantisServerSentEvent(e)).collect(Collectors.toList());
    }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants