"Stuck" Kinesis Shards #185
I don't believe this is a bug in the core Java KCL, but it could be a bug in the MultiLang component.
You shouldn't need to do this, since each record processor is single threaded. If your record processor does its work on the main thread, adding more threading capacity will be wasted. What you're seeing is the lack of a response after the records were dispatched to the language component. The Java record processor waits for a response from the language component before returning from the processRecords call. The MultiLang Daemon implements the Java record processor interface and acts as a forwarder to the language-specific record processor. The dispatch behavior of the Java record processor is to make a call to processRecords and pause activity until that call returns. The MultiLang Daemon implements the same behavior, but across process boundaries. The MultiLang Daemon, acting as a forwarder, needs to wait for a response from the native-language record processor. If for some reason the response never comes, it will hang indefinitely. There isn't a good way to see what is actually causing the problem; it could be any of the three components involved.
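To make the failure mode concrete, here is a minimal, hypothetical sketch of the dispatch-and-wait pattern described above (the class and method names are illustrative, not the actual MultiLang Daemon types): the Java side writes the batch to the child process and then blocks on the acknowledgement, so a reply that never arrives stalls the shard indefinitely.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.util.List;

// Illustrative sketch only; not the real MultiLang Daemon classes.
public class ForwardingProcessorSketch {

    private final BufferedWriter toChild;    // child process stdin
    private final BufferedReader fromChild;  // child process stdout

    public ForwardingProcessorSketch(BufferedWriter toChild, BufferedReader fromChild) {
        this.toChild = toChild;
        this.fromChild = fromChild;
    }

    // Mirrors the Java record processor contract: this call does not return
    // until the language-specific processor acknowledges the batch.
    public void processRecords(List<String> serializedRecords) throws Exception {
        for (String record : serializedRecords) {
            toChild.write(record);
            toChild.newLine();
        }
        toChild.flush();

        // Blocks until the child writes a status line. If the child never
        // responds, this read blocks forever and the shard appears "stuck".
        String status = fromChild.readLine();
        if (status == null) {
            throw new IllegalStateException("Child process closed its output stream");
        }
    }
}
```

The short-term fix discussed below bounds that blocking wait with a timeout.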
I think, as a short-term fix, adding a configuration option that limits the maximum time the MultiLang Daemon will wait for a response could reduce the impact of these sorts of issues. This fix would terminate the record processor if a response isn't received within the configured time. The longer-term fix will require adding additional logging to understand the state of the components on both sides of the process boundary, which would help us detect which component is losing track of the current state. For everyone else: please comment, or add a reaction, to help us know how many people are affected by this issue.
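One way such a bounded wait could be implemented (a sketch, not the actual change that shipped; `BoundedWait` and `awaitStatus` are illustrative names): run the blocking read on a worker thread and give up after a configurable timeout instead of waiting forever.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch of a bounded wait around the child-process acknowledgement.
public class BoundedWait {

    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    // Runs the blocking read on a worker thread and gives up after
    // timeoutSeconds instead of hanging indefinitely.
    public String awaitStatus(Callable<String> blockingRead, long timeoutSeconds) throws Exception {
        Future<String> reply = executor.submit(blockingRead);
        try {
            return reply.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            reply.cancel(true);
            // In the real fix the parent process is terminated so a supervisor can restart it.
            throw new IllegalStateException(
                    "No response from the record processor within " + timeoutSeconds + "s", e);
        }
    }
}
```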
@pfifer thanks for looking into this. Let me know if you need anything from me to help you debug this. That short-term fix would be pretty helpful as well, so if you get that implemented, please let me know. Thanks!
@pfifer any update on this? Even implementing some sort of short-term fix would be extremely helpful.
+1
We're currently working on the short-term fix for this.
To help us troubleshoot the source of this issue, can you enable more detailed logging using these steps? If you don't mind using Apache Maven, this is a pom.xml that will provide additional logging, which can help diagnose issues. To create the launcher script:
To provide the configuration file for logging, follow these steps:
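For reference, a minimal logging configuration along these lines might look like the following. This is an illustrative sketch only, not the exact file from the original comment: it assumes the Logback route (with the KCL's commons-logging bridged via jcl-over-slf4j) and uses `com.amazonaws.services.kinesis` as the logger name; adjust the file path and logger names to your setup.

```xml
<!-- Illustrative logback.xml: raise KCL logging to DEBUG and write it to a file. -->
<configuration>
  <appender name="FILE" class="ch.qos.logback.core.FileAppender">
    <file>kcl-debug.log</file>
    <encoder>
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <!-- DEBUG for the KCL packages, INFO for everything else. -->
  <logger name="com.amazonaws.services.kinesis" level="DEBUG"/>

  <root level="INFO">
    <appender-ref ref="FILE"/>
  </root>
</configuration>
```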
@pfifer, I've been working with Alois on this issue as well, and I'm just getting back with some more information. We now have the debug logs streaming to our ELK stack, so if we see a case where the Kinesis stream seems to be stuck we can hopefully get logs from the process that handles the stream. We were also investigating DynamoDB, and we can see that when the Kinesis stream is no longer getting processed, our tables show elevated latency. We have seen this on all the DynamoDB tables that are created by our processors. We have 10 Kinesis streams that have gotten "stuck" in the last day, and on all their DynamoDB tables we see high latency at the same time. Some of us are wondering if a request to DynamoDB is getting dropped, and the caller waits for a response indefinitely, pausing the process. We've also noticed that these incidents seem to happen together in a one- to two-hour window. Last night (7/28/2017) at around 1:30 AM, five of our Kinesis streams became "stuck" within 5 minutes of each other. Then two other streams became "stuck" at 4:50 AM and 5:50 AM respectively. It took a bit of wrangling to get the logs from the KCL since we use the
@pfifer We finally got some logs for ya! You can see them at https://gist.github.com/chosenken/e79f3eda5eefb5bd6a910ae6e5875d4a#file-kcl-log-L699. Specifically, on line 699 I see the
Please let me know if you need any more information, thank you!
Support timeouts for calls to the MultiLang Daemon. This adds support for setting a timeout when dispatching records to the client record processor. If the record processor doesn't respond within the timeout, the parent Java process will be terminated. This is a temporary fix to handle cases where the KCL becomes blocked while waiting for a client record processor. The timeout can be set by adding `timeoutInSeconds = <timeout value>`. The default is no timeout. Setting this can cause the KCL to exit suddenly; before using it, ensure that you have an automated restart for your application. Related awslabs#195 Related awslabs#185
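If this setting behaves like the other MultiLang daemon settings, it would go in the same `.properties` file the daemon already reads (the property name is taken from the commit message above; the value is only an example):

```properties
# Terminate the parent Java process if the client record processor
# does not respond within 30 seconds (default: no timeout).
timeoutInSeconds = 30
```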
I'm not sure if this is the same case, but running the KCL sample project with a stream of 4 shards, the first shard, shardId-00000000, is initialized, but processRecords is never called for that shard. In DynamoDB, the checkpoint column always stays as LATEST. Re-running the application doesn't seem to work, and I get no errors. Is this issue related to what you are trying to fix? Thank you! Is there a workaround in the meantime?
@oscarjiv91 I don't think your issue is related to this one, but to be sure I would need some additional information. Could you provide the link to the sample that you are using? Another possibility is that no records are being pushed to your stream; in that case the checkpointer will never advance, because there is no record to checkpoint against. You can check that by enabling callProcessRecordsEvenForEmptyRecordList in the properties file. This enables calling the record processor even if there are no records on the shard. If you need further assistance, I would recommend you open up a thread on the AWS Kinesis Forum.
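For reference, enabling the property mentioned above would look roughly like this in the application's `.properties` file (the property name comes from the comment above; where exactly it lives depends on your setup):

```properties
# Invoke processRecords even when GetRecords returns an empty batch,
# so the record processor still gets a chance to run and checkpoint.
callProcessRecordsEvenForEmptyRecordList = true
```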
This is the link: https://github.com/awslabs/amazon-kinesis-client-nodejs/tree/master/samples/click_stream_sample I'm pushing records with the same sample project, and I set 4 shards. If I use 3 shards, I have the same problem (shardId-00000000 is stuck), but not with 2 shards, which works well. Thank you, I will open up a thread on the Kinesis Forum.
We have a KCL app working (not in production yet), but it showed similar behaviour. Two weeks ago it just stopped fetching records; after a restart it would come back up. I had to set up a cron to restart the process every day in order to avoid this.
@juanstiza Are you using the Java KCL or the MultiLang?
@sahilpalvia We are using Java.
+1, Java KCL
Update: In the end, the source of the problem was that all workers had the same
In Release 1.8.8 we added the ability to log a message when a processing task is blocked for a long time. For those affected by it, can you enable the log message? Additionally, if you see the log message, could you retrieve a stack trace from the Java application? On Linux and macOS you can do this with jstack, using a command like
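The standard jstack invocation (not necessarily the exact command from the original comment, but equivalent) looks like this; `<pid>` is the id of the KCL's Java process:

```sh
# Find the Java process id, then dump all thread stacks to a file.
jps -l                           # or: pgrep -f MultiLangDaemon
jstack <pid> > kcl-stack-dump.txt
```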
@pfifer we're seeing similar behavior on every new deployment of our KCL cluster. When a new AWS KCL instance comes up and takes the lease, it starts processing records from
It then processes the records and the delay goes down. FYI, we're using Java KCL v1.7.5
@shaharck The high iterator age can occur for a number of reasons. If it's caused by processing getting stuck, you should see the metrics for the stuck shard stop being produced. If the processing is actually stuck, it would help if you could get a stack trace via `jstack`. If the metrics are continuous, it indicates that processing is occurring. The common cause of iterator-age spikes in this case is not having a recent checkpoint. This can occur for a number of reasons, but commonly the causes are the checkpoint strategy, or a stream receiving no data. When there is no data on the stream, your application will be unable to checkpoint at its current position in time. So if the lease is lost, processing for the shard will restart at the last checkpoint, which will be the last time there was data available on the stream. There is a third possibility: your record processor is getting stuck. The processing of a shard is synchronous. The KCL waits for the record processor to finish processing a batch of records before it will retrieve and dispatch the next batch. If your record processor gets stuck for any reason, this can cause processing on the affected shard to stop. To help detect these situations we added a new log message in Release 1.8.8. The message is emitted whenever processing is paused for an extended period of time waiting for the record processor to return.
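To illustrate the checkpoint-strategy point, here is a sketch of a record processor that checkpoints after every handled batch, written against the KCL 1.x `v2` record-processor interface (class names from memory, so double-check them against your KCL version). In practice you would usually checkpoint on an interval rather than every batch to avoid DynamoDB throttling, and you would also checkpoint on a TERMINATE shutdown.

```java
import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.types.InitializationInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ProcessRecordsInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownInput;

// Checkpoints after every successfully handled batch so that a lost lease
// resumes close to the tip of the shard instead of hours behind it.
public class CheckpointingProcessor implements IRecordProcessor {

    @Override
    public void initialize(InitializationInput initializationInput) { }

    @Override
    public void processRecords(ProcessRecordsInput processRecordsInput) {
        // ... handle processRecordsInput.getRecords() here ...
        try {
            processRecordsInput.getCheckpointer().checkpoint();
        } catch (Exception e) {
            // Throttling or a lost lease; the next batch will try again.
        }
    }

    @Override
    public void shutdown(ShutdownInput shutdownInput) {
        // On a TERMINATE shutdown you would checkpoint here as well.
    }
}
```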
This issue has only been reported, and confirmed, on the Node.js multi-lang daemon. It's possible that this may occur with other multi-lang daemon clients, but we haven't been able to confirm that. We have no indications that this affects the Java KCL. The most common cause of this issue in the Java KCL is the record processor blocking. The KCL dispatches batches of records serially, and must wait for the call to processRecords to return before dispatching the next batch. In release 1.8.8 we added a log message that is emitted when a processing task is blocked for an extended period. If your application appears to be stuck, one of the first options is to retrieve thread stack traces from the JVM. The simplest way is to use jstack. Here is an example of a blocked thread retrieved using jstack:
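If I recall correctly, the 1.8.8 warning mentioned above is controlled by a `logWarningForTaskAfterMillis` setting; treat that name as an assumption and verify it against the 1.8.8 release notes before relying on it. In a properties-file-based setup it would look roughly like:

```properties
# Assumed property name from the 1.8.8 release: emit a warning when a task
# (such as waiting on processRecords) runs longer than 30 seconds.
logWarningForTaskAfterMillis = 30000
```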
PutRecord/PutRecords isn't related to the KCL. You appear to have run into an issue with the AWS Node SDK. I would recommend opening an issue on the aws-sdk-js repository |
@pfifer we are running a production instance of Kinesis Streams + PySpark Streaming. From time to time, and recently more often (twice a day), the application stops processing records indefinitely. It can go on like this for hours, until we restart the receiver. We now have some alarms to alert us about this situation. We couldn't find anything in the logs, but the way we recover the Spark application is by manually killing the running receiver. Spark spawns a new receiver and the records start coming along just fine. Our issue seems to be related to this thread. We are still looking for a solution, or at least a workaround to avoid the manual restart. Any help here would be gratefully accepted! UPDATE 2018-09-15:
@pfifer We are using Java KCL version 1.8.7 in a Spring Boot application (with Docker) in testing and staging environments (going to production soon). We are using a Kinesis stream with 2 shards. We have been noticing behavior similar to what is described in this thread: the KCL application is not reading messages from the Kinesis stream even though there are new messages in the stream. We terminated the EC2 instance to spin up a new one, but that didn't help. However, after restarting the Docker container it started processing messages from the Kinesis stream. That is the temporary workaround for us for now.
@pfifer I'm facing the same issue as other folks here. I'm using the Java implementation of the KCL to read stream records from DynamoDB. Currently we have 1 shard and 2 workers with unique workerIds running on separate instances. Once in a while, I see that the 1 shard that we have reaches its end (the shard expires in 4 hours) and a new shard gets created (verified using the DynamoDB describe-stream API), but the KCL doesn't create a new lease for the new shard. As a result the application doesn't process any data. The only way to fix the problem is to bounce the instances; the KCL will then create a new lease and start processing the records. Before bouncing the instances this time, I ran jstack and here is what I got. Java KCL version: 1.7.6 https://gist.github.com/htarevern/e7429e4d8e27658bc1404d80d09011a9
@htarevern If you're using the DynamoDB Streams plugin, your issue may be related to awslabs/dynamodb-streams-kinesis-adapter#20
@ManojGH I would make sure that your record processor isn't blocking. The KCL will wait for the record processor to return before retrieving the next batch of records.
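One low-tech way to check for this is to time your own batch handling and log when it runs suspiciously long; a long-running batch would explain why the KCL never fetches the next one. This is an illustrative sketch: `timedHandle` and the 30-second threshold are made up, and `handleBatch` stands in for your existing per-batch logic.

```java
import java.util.List;
import java.util.function.Consumer;

public class BatchTimer {

    // Wraps your batch handling and reports when it runs suspiciously long,
    // which would explain the KCL not retrieving the next batch.
    public static <T> void timedHandle(List<T> records, Consumer<List<T>> handleBatch) {
        long start = System.nanoTime();
        try {
            handleBatch.accept(records);
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMs > 30_000) {
                System.err.println("processRecords body took " + elapsedMs
                        + " ms; check for a blocking downstream call");
            }
        }
    }
}
```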
@gonzalodiaz Your exception indicates something is timing out in Spark. If it's the KCL, we would need logs from the KCL. For the Spark adapter, the KCL runs on a single node, dispatching work from that single node.
@pfifer with client v1.8.1 we are having the below issue.
Any clue if this is my issue? I see that sometimes the checkpoint gets updated, and sometimes it throws the above error and delivers those messages back to the consumer again. Appreciate your quick response.
+1
I am consuming records from a Kinesis stream via the Spark Kinesis receiver library and encountering the same issue described here. However, in my case, record processing does occur once I send a SIGTERM to the Spark process, but only then. I was able to confirm using
Was there any resolution for this issue? We are running into it on KCL
adding a
Hello, it looks like this issue is still open, so I am assuming there is no fix yet? I also cannot find the DynamoDB table used as the Kinesis metadata store.
@pfifer I've been using Kinesis as my real time data streaming pipeline for over a year now, and I am consistently running into an issue where Kinesis shards seem to randomly stop processing data. I am using the Node.js KCL library to consume the stream.
Here are some screenshots to show what I mean.
You can see here that from roughly 21:00 to 6:00, shards 98 and 100 stop emitting data for DataBytesProcessed. I restarted the application at 6:00 and the consumer immediately started processing data again.
Now here is a graph of the IteratorAgeMilliseconds for the same time period.
The shards are still emitting IteratorAge from 21:00 to 6:00 and show that the IteratorAgeMilliseconds is still at 0, so it seems like the KCL was still running during that time, but it wasn't consuming more data from the stream. You can see that when I restarted the application at 6:00, the KCL realized it was very far behind and the IteratorAge spiked up to 32.0M instantly. I can confirm that the KCL process was still running during that time period and continuously logged the usual messages:
There are 2 other shards in the same stream and they continued to process data as normal. I have been working on debugging this on my end for several months now, but have come up empty. I am catching all exceptions thrown in my application, and I have set timeouts on any network requests or async I/O operations, so I don't think the process is crashing or getting hung up on anything.
I also thought that maybe I didn't have enough resources allocated to the application and something weird was happening there, and that's why the process got hung up. So I made 4 instances of my application (1 for each shard), each with 224GB of RAM and 32 CPUs, but I still run into this issue.
I cannot seem to reliably replicate this issue; it seems to just happen randomly across different shards and can range from happening a few times a week to a few times a day. FYI, I have several other Node applications using the KCL that also experience the same issue.
I have seen this issue, where the reporter seems to have the same problem with the Python library (which he solved by just restarting the applications every 2 hours), and also this issue, which seems to be the same problem I have as well.
Is this a known bug with the KCL? And if the problem is on my end, do you have any pointers on how to track down this problem?