-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OOM when using the json_lines codec #210
Comments
We already know this. It is double boundary detection, once in the file input and again in any Maybe it is time now to revisit this problem of boundary detection, multiline reassembly and codec mismatching. |
@guyboertje I hear you - but OTOH why wouldn't the file input behave similarly to the stdin input in that respect? "input line" + delimiter -> Why wouldn't this work with the |
I don't think the stdin input consumes the newline. |
no it doesn't, it's the codec's job to do so hense the in #211 you mentioned "Add support to bypass the delimiter boundary detection" - it would probably work for this scenario? From a conceptual perspective and in today's architecture reality I am not sure why the file input should behave differently and not use a line codec for example to deal with delimited lines, or any other codec in that respect for decoding. Isn't the current identity map logic in the file input good for mapping codecs with each stream identity? |
This is exactly what sparked the Mills mini pipeline inside an input convo. Philosophically, the line codec is not a codec, it is a boundary detector and the multiline codec is a boundary detector that assembles lines subject to some boundary being detected. Take the graphite codec - it is not a true codec at all, it is a line boundary detector with a mutate split and date filter equivalent. I guess it was conceived to remove the need to know to use line + mutate + date or line + graphite filter (if one was made). The collectd codec is a true codec but it expects that the input is supplying logical "lines" of data. The cef codec can optionally do boundary detection if a delimiter is specified. Such conditional operation means if we are to have a codec advertise whether it will handle the boundary detection, this needs to be dynamic. The multiline codec can be changed to accept chunks and not lines. I suppose the question is to what degree are inputs supposed to supply transactional units of data? Even if those units contain multiple "lines" they should not have it that some of this unit's data "belongs" to the next unit in the stream (as would happen if the file input made no attempt to boundary detect - by, say, holding back the data that follows the last delimiter in a chunk and then appending the next chunk to it). |
@guyboertje I get what you are saying and I am +1 on reopening this discussion but I don't think here is the best best place to have it. With what we have today in terms of input/codec architecture, I still don't see why we couldn't use a |
To test this consider the test plan explained here also without limitation on Change the input section, eliminating the
|
The
json_lines
codec used in thefile
input crashes logstash with an OOM.To reproduce:
The text was updated successfully, but these errors were encountered: