Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/stanza/fileconsumer] Fix long line parsing #32100

Merged
merged 2 commits into from
Apr 23, 2024

Conversation

OverOrion
Copy link
Contributor

@OverOrion OverOrion commented Apr 2, 2024

Description:
Flush could have sent partial input before EOF was reached, this PR fixes it.

Link to tracking Issue: #31512, #32170

Testing: Added unit test TestFlushPeriodEOF

Documentation: Added a note to force_flush_period option

@OverOrion OverOrion requested a review from djaglowski as a code owner April 2, 2024 12:44
@OverOrion OverOrion requested a review from a team April 2, 2024 12:44
@OverOrion OverOrion force-pushed the stanza-fix-scanner branch 2 times, most recently from 9683fd8 to 954249f Compare April 2, 2024 13:11
Copy link
Member

@djaglowski djaglowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @OverOrion and @ChrsMark for finding this. Unfortunately, I do not believe this implementation works.

The underlying array may point to data that will be overwritten by a subsequent call to Scan.

This certainly seems to be a problem. To articulate the issue a bit more, we are not calling Scan again until we emit the token. However, because the token is emitted as a slice which directly references the scanner's buffer, it's contents may change later.

Possible solutions then would seem to be:

  1. Copy the token into a new slice to ensure the underlying contents will not change.
  2. Clearly advise emit funcs that they may need to do this, depending on their need for correctness vs performance. (And then handle the copy in our emit funcs)

I think it's fair of us to prioritize correctness but it will be interesting to see the performance impact associated with copying every token. To that point, we will certainly need unit tests for this as well as benchmark results to understand the tradeoff we are introducing.

Longer term, we may want to look at replacing the scanner altogether, as it's not necessarily the most performant solution. It does however provide a high degree of robustness that will be difficult to reproduce.

pkg/stanza/fileconsumer/internal/reader/reader.go Outdated Show resolved Hide resolved
@OverOrion OverOrion marked this pull request as draft April 4, 2024 07:19
@OverOrion OverOrion force-pushed the stanza-fix-scanner branch 6 times, most recently from 224b323 to 36c8f1d Compare April 5, 2024 14:37
@OverOrion OverOrion requested a review from djaglowski April 5, 2024 14:38
@OverOrion OverOrion marked this pull request as ready for review April 5, 2024 14:39
Copy link
Member

@djaglowski djaglowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, finally starting to catch up.

receiver/filelogreceiver/README.md Outdated Show resolved Hide resolved
pkg/stanza/fileconsumer/internal/reader/factory_test.go Outdated Show resolved Hide resolved
pkg/stanza/flush/flush.go Outdated Show resolved Hide resolved
@OverOrion
Copy link
Contributor Author

Thanks @djaglowski @ChrsMark, will address this today!

Flush should only happen if the scanner reached EOF

Signed-off-by: Szilard Parrag <szilard.parrag@axoflow.com>
@OverOrion OverOrion force-pushed the stanza-fix-scanner branch from 36c8f1d to c8e274b Compare April 22, 2024 06:55
Copy link
Member

@djaglowski djaglowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just correcting the description a bit, since we update the flush timeout whenever new data is found, even if we do not emit a token.

receiver/filelogreceiver/README.md Outdated Show resolved Hide resolved
Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>
@djaglowski djaglowski merged commit ba22e43 into open-telemetry:main Apr 23, 2024
156 checks passed
@github-actions github-actions bot added this to the next release milestone Apr 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants