Protect new implementation of BufferedTokenizer against OOM #17275
@andsel Did you check in the inputs (tcp/file) that use the BufferedTokenizer whether this behaviour can occur? e.g. (from @jsvd):
It also happens with the TCP input, using the following pipeline with test data originating from logstash-plugins/logstash-codec-json_lines#43, and streaming the big JSON with netcat:
TCP test logs
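For context, a minimal sketch of such a reproduction (the port, codec choice, and file name are illustrative assumptions, not the exact pipeline from the linked report):

```
input {
  tcp {
    port => 3333
    codec => json_lines
  }
}
output {
  stdout { codec => rubydebug }
}
```

and then stream a very large single-line JSON document without a trailing newline, for example `cat big_payload.json | nc localhost 3333`.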
With the file input, Logstash will fail because https://github.com/logstash-plugins/logstash-input-file/blob/55a4a7099f05f29351672417036c1342850c7adc/lib/filewatch/watched_file.rb#L250 expects an array, so the iterable has to be converted to a Ruby array, and the proposed PR implementation throws an exception there.
In #17229 an implementation change was applied to the BufferedTokenizer. It switched from `RubyArray` and JRuby classes to pure Java, so that the `extract` method now returns an `Iterable` instead of a `RubyArray`. The implementation is split in two parts:

- the `extract` method invocation;
- the `sizeLimit` verification, which throws an exception when it receives a token longer than `sizeLimit`.

Although the implementation seems more linear, it reintroduces a problem that surfaced in logstash-plugins/logstash-codec-json_lines#43.
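For reference, a rough sketch of that two-part shape (my own simplification with invented names, not the actual code from #17229; the real implementation returns a lazy iterator rather than an eager list):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch: extract() only dices the accumulated data into tokens,
// while a separate check enforces sizeLimit on each token that is consumed.
class SketchTokenizer {
    private final StringBuilder accumulator = new StringBuilder();
    private final String separator;
    private final int sizeLimit;

    SketchTokenizer(String separator, int sizeLimit) {
        this.separator = separator;
        this.sizeLimit = sizeLimit;
    }

    // Part 1: tokenization only; note that nothing caps the accumulator here.
    Iterable<String> extract(String data) {
        accumulator.append(data);
        List<String> tokens = new ArrayList<>();
        int idx;
        while ((idx = accumulator.indexOf(separator)) >= 0) {
            tokens.add(accumulator.substring(0, idx));
            accumulator.delete(0, idx + separator.length());
        }
        return tokens;
    }

    // Part 2: sizeLimit verification, applied to each token as it is iterated.
    String checkSizeLimit(String token) {
        if (token.length() > sizeLimit) {
            throw new IllegalStateException("token exceeds sizeLimit");
        }
        return token;
    }
}
```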
The inner iterator uses a `StringBuilder` as an accumulator and, on every call of `extract`, dices the tokens by the separator. Now, if a very big payload is pushed down to a codec that uses this BufferedTokenizer, and that payload requires multiple invocations of `extract` before a tokenization happens (which could also never happen, suppose an offending client that never sends the separator), the memory gets filled with data that is useless and would be rejected by the next iterator anyway, but it never reaches that check because the process goes OOM first. In the previous implementation, once the tokenization part detected an overrun of the `sizeLimit`, it dropped every fragment until the next separator was presented, protecting against OOM conditions like this (see the sketch below).
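To make the contrast concrete, here is a hedged sketch (again my own simplification, not the historical code) of that drop-until-next-separator behaviour: once a fragment overruns `sizeLimit`, everything is discarded until a separator is seen, so the accumulator can never grow without bound:

```java
// Sketch of the earlier protection: after an overrun of sizeLimit, discard
// incoming fragments until the next separator arrives, then resume normally.
class DroppingAccumulator {
    private final StringBuilder accumulator = new StringBuilder();
    private final String separator;
    private final int sizeLimit;
    private boolean dropping = false;

    DroppingAccumulator(String separator, int sizeLimit) {
        this.separator = separator;
        this.sizeLimit = sizeLimit;
    }

    void append(String data) {
        if (dropping) {
            int idx = data.indexOf(separator);
            if (idx < 0) {
                return;                       // still no separator: drop the whole fragment
            }
            data = data.substring(idx + separator.length());
            dropping = false;                 // separator seen: resume accumulating
        }
        accumulator.append(data);
        if (accumulator.length() > sizeLimit) {
            accumulator.setLength(0);         // overrun: clear and drop until the next separator
            dropping = true;
        }
    }
}
```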
Now, my questions are: