Lumberjack to Lumberjack data loss under congestion #3691
There is some possible cause of this behavior that I will investigate:
|
|
Edited to make mention of the |
From discussion on hipchat with @jordansissel and @jsvd Summary:
|
@ph haha you are too fast ;) |
@jordansissel It's because of Zoom for iPhone. |
I've tried a few of the things we have discussed,
begin
  @circuit_breaker.execute { @buffered_queue << event }
rescue LogStash::CircuitBreaker::OpenBreaker,
       LogStash::CircuitBreaker::HalfOpenBreaker => e
  logger.error("!!!!Forcing a close on the connection")
  connection.close
end
I am revisiting my conditions to check if I can narrow the problem down further. |
Also, looking at the code, I don't explicitly need to rescue it, since the |
@ph yeah, the idea is that we trend towards "at least once" delivery. A failure in a batch that causes the connection to be terminated will cause LSF to retransmit the whole batch (the full window size, likely). This causes duplicates, but it is better than loss. Long term (filebeats) we'll likely do partial acks (like log-courier) or something similar to reduce duplicates while still achieving at-least-once delivery. |
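To illustrate the trade-off described above, here is a toy Ruby model of whole-window retransmission. It is only a sketch and not the LSF or lumberjack code: `send_with_retransmit`, the window size of 4, and the rule that a failed attempt delivers half of the window before the connection closes are all assumptions made for the example.

```ruby
# Toy model, not the real LSF or lumberjack code: the sender retransmits the
# whole window whenever the connection drops before an ACK arrives.
# `fail_on_attempt` lists the (1-based) attempts that should fail.
def send_with_retransmit(events, window_size, fail_on_attempt)
  received = [] # what the simulated server pushed down its pipeline
  attempt = 0
  events.each_slice(window_size) do |window|
    loop do
      attempt += 1
      if fail_on_attempt.include?(attempt)
        # Assumption: the server accepted half of the window, then closed
        # the connection without sending an ACK.
        received.concat(window.first(window.size / 2))
        next # no ACK seen, so the full window is sent again
      end
      received.concat(window)
      break # ACK seen, move on to the next window
    end
  end
  received
end

received = send_with_retransmit((1..8).to_a, 4, [1])
p received      # => [1, 2, 1, 2, 3, 4, 5, 6, 7, 8]
p received.uniq # => [1, 2, 3, 4, 5, 6, 7, 8]
```

Events 1 and 2 arrive twice, but nothing is lost, which is the "duplicates rather than loss" behavior expected from at-least-once delivery.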
I was expecting duplicates, actually I saw some on the log and I was expecting this behavior. |
hmm, ok. Let me know what I can help test. |
Some missing and some duplicates:
{"message":"20","@version":"1","@timestamp":"2015-08-04T19:09:01.333Z","file":"/tmp/log","host":"sashimi","offset":"50"}
{"message":"21","@version":"1","@timestamp":"2015-08-04T19:09:01.335Z","file":"/tmp/log","host":"sashimi","offset":"53"}
{"message":"22","@version":"1","@timestamp":"2015-08-04T19:09:01.335Z","file":"/tmp/log","host":"sashimi","offset":"56"}
{"message":"23","@version":"1","@timestamp":"2015-08-04T19:09:01.335Z","file":"/tmp/log","host":"sashimi","offset":"59"}
{"message":"20","@version":"1","@timestamp":"2015-08-04T19:09:41.830Z","file":"/tmp/log","host":"sashimi","offset":"50"}
{"message":"22","@version":"1","@timestamp":"2015-08-04T19:09:43.838Z","file":"/tmp/log","host":"sashimi","offset":"56"}
{"message":"25","@version":"1","@timestamp":"2015-08-04T19:10:22.233Z","file":"/tmp/log","host":"sashimi","offset":"65"}
{"message":"29","@version":"1","@timestamp":"2015-08-04T19:10:28.251Z","file":"/tmp/log","host":"sashimi","offset":"77"}
{"message":"30","@version":"1","@timestamp":"2015-08-04T19:10:26.297Z","file":"/tmp/log","host":"sashimi","offset":"80"}
{"message":"30","@version":"1","@timestamp":"2015-08-04T19:10:28.282Z","file":"/tmp/log","host":"sashimi","offset":"80"}
{"message":"31","@version":"1","@timestamp":"2015-08-04T19:10:28.939Z","file":"/tmp/log","host":"sashimi","offset":"83"} |
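A quick way to spot gaps and duplicates in output like the excerpt above is to count the `message` values. The following Ruby sketch assumes each `message` field holds an integer counter, as in this test; `audit_sequence` is a hypothetical helper, not part of Logstash, and the sample lines are trimmed to the fields that matter.

```ruby
require "json"

# Hypothetical helper (not part of Logstash): given JSON lines written by the
# server output, report which message numbers are duplicated or missing.
# Assumes the "message" field is an integer counter.
def audit_sequence(lines, expected_range)
  counts = Hash.new(0)
  lines.each do |line|
    event = JSON.parse(line)
    counts[Integer(event["message"])] += 1
  end
  duplicates = counts.select { |_, n| n > 1 }.keys.sort
  missing = expected_range.reject { |n| counts.key?(n) }
  { duplicates: duplicates, missing: missing }
end

# Message numbers in the same order as the excerpt above.
sample = %w[20 21 22 23 20 22 25 29 30 30 31].map do |n|
  JSON.generate("message" => n, "file" => "/tmp/log")
end

result = audit_sequence(sample, 20..31)
p result # => {:duplicates=>[20, 22, 30], :missing=>[24, 26, 27, 28]} (Ruby 3.4+ prints {duplicates: [...], missing: [...]})
```

Only the range covered by the excerpt (20 to 31) is checked here; for a full run, pass the whole range of counters written to the source file.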
I am using this https://gist.github.com/jsvd/cbb2371f9da1e3a733cc to run my test. |
I took a few steps back. Okay, I have fixed some exception handling issues (blocks and threads are sometimes no fun).
This is what I have found so far: we still lose events. I have added the following logs at different parts of the event lifecycle in the plugins.
Event that raised an exception: https://gist.github.com/322b0e9a1fee3ce4061f
Look at document 25:
So something is fishy on the ack/retry side. |
Added the thread number to the debug files. |
Event that raised an exception: https://gist.github.com/d955b79b547ab45ec300 |
Adding another log to know which thread successfully sent an event down the pipeline. |
I've removed the thread reuse logic, and the event that raised an exception is easier to follow. |
If we look at the events, we can see LSF retrying to send some of them. We fail at event 22, so LSF tries to resend from 17, its last ack point, and fails at 18. See.
Note: I didn't see any connection timeout on the LSF side, so all the reconnections originated from the server. |
Events received and acked; look around 24
|
After a lot of logging in different places of the application and pairing with @jordansissel on this issue, we have found the problem, but let's take a few notes here:
There was an issue on the server side in deciding when to actually send the ACK: on retry, instead of sending the sequence number of the last event, we were sending the sequence number of the first event received. Since we aren't doing any verification of the sequence number, LSF thinks that the whole payload was correctly received by the server. After this situation, the sequence numbers on the LSF side and the server side are also out of sync.
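The difference between the two ACK values can be sketched in a few lines of Ruby. This is a toy model with hypothetical names (`Window`, `correct_ack`, `buggy_ack`, `window_delivered?`), not the actual lumberjack protocol code, and the sender-side check is only one possible form of the sequence number verification mentioned above. The sequence numbers 17 to 22 are borrowed from the retry seen earlier in the thread.

```ruby
# Toy model of the ACK handling described above; all names are hypothetical
# and this is not the actual lumberjack protocol implementation.
Window = Struct.new(:first_seq, :last_seq)

# Intended behavior: acknowledge the sequence number of the last event that
# the server actually accepted.
def correct_ack(accepted_seqs)
  accepted_seqs.last
end

# Faulty behavior on retry: acknowledge the first event received.
def buggy_ack(accepted_seqs)
  accepted_seqs.first
end

# A sender that verifies the ACK treats the window as delivered only when the
# acknowledged sequence number reaches the end of the window.
def window_delivered?(window, ack_seq)
  ack_seq >= window.last_seq
end

window = Window.new(17, 22)
accepted = (17..22).to_a

p window_delivered?(window, correct_ack(accepted)) # => true
p window_delivered?(window, buggy_ack(accepted))   # => false
```

A sender that accepts any ACK as confirmation of the whole window cannot tell these two cases apart, which is how events can go missing without any error on the sending side.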
|
Since this issue also talks about the
The Ruby client uses a window size that is statically defined at the beginning of the connection, assumes the window to be 5000, and doesn't allow you to send events in bulk. So the Ruby client checks for the ACK every 5000 events. This explains the difference in behavior in the congestion scenario between LSF and the lumberjack output. |
Setting window_size to 1 on the lumberjack output could be a temporary solution. |
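As a rough illustration of why the window size matters, the sketch below lists the points at which a client that waits for an ACK once per window would stop and check. `ack_checkpoints` is a hypothetical helper written for this example and does not exist in the Ruby client.

```ruby
# Hypothetical sketch: a client that only waits for an ACK after every
# `window_size` events. Returns the 1-based event counts at which it would
# block and wait for that ACK.
def ack_checkpoints(total_events, window_size)
  (1..total_events).select { |i| (i % window_size).zero? }
end

p ack_checkpoints(12_000, 5000) # => [5000, 10000]
p ack_checkpoints(3, 1)         # => [1, 2, 3]
```

With a window of 5000, thousands of events can be written before the client learns whether any of them were accepted; with a window of 1, every event is confirmed before the next one is sent, at the cost of throughput.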
I've reverted the |
I have also used this PR elastic/logstash-forwarder#508 to debug the issue. |
Fixed in logstash 1.5.4 |
Sending from Logstash A to Logstash B using the lumberjack input/output plugins results in data loss if Logstash B suffers back pressure from its filters/outputs. (The Forwarder shows the same behavior.)
My test using two logstash instances (Client and Server):
Client config:
Server config:
Starting the server, then the client, and allowing both to run for a while produces server output such as: