Lots of warnings #399

Closed
emlyn opened this issue Sep 14, 2018 · 8 comments
Labels: bug, enhancement, fix in progress, v2.x (Issues related to the 2.x version)

Comments

emlyn commented Sep 14, 2018

I'm seeing lots of warning messages like the following:

18-09-13 13:32:52 6ad8398803c3 WARN [s.a.k.retrieval.fanout.FanOutRecordsPublisher:128] - shardId-000000000152: [SubscriptionLifetime] - (FanOutRecordsPublisher#errorOccurred) @ 2018-09-13T13:32:19.886Z id: shardId-000000000152-81 -- ReadTimeout
           io.netty.handler.timeout.ReadTimeoutException:
software.amazon.awssdk.core.exception.SdkClientException:
                java.util.concurrent.CompletionException: software.amazon.awssdk.core.exception.SdkClientException

18-09-13 13:32:59 6ad8398803c3 WARN [s.a.k.retrieval.fanout.FanOutRecordsPublisher:312] - shardId-000000000152: (FanOutRecordsPublisher/Subscription#request) - Rejected an attempt to request(6), because subscribers don't match.

18-09-13 13:32:59 6ad8398803c3 WARN [s.a.kinesis.lifecycle.ShardConsumer:165] - shardId-000000000152: onError().  Cancelling subscription, and marking self as failed.
           io.netty.handler.timeout.ReadTimeoutException:
software.amazon.awssdk.core.exception.SdkClientException:
                java.util.concurrent.CompletionException: software.amazon.awssdk.core.exception.SdkClientException

18-09-13 13:33:00 6ad8398803c3 WARN [s.a.kinesis.lifecycle.ShardConsumer:244] - shardId-000000000152: Failure occurred in retrieval.  Restarting data requests
           io.netty.handler.timeout.ReadTimeoutException:
software.amazon.awssdk.core.exception.SdkClientException:
                java.util.concurrent.CompletionException: software.amazon.awssdk.core.exception.SdkClientException

These repeat every few minutes. Other than this, everything appears to be running correctly. Is this anything to be worried about?
I'm not sure whether this is related to #391, but the errors look different, so I filed a separate issue.

@geovaneferreira

I have the same problem.

pfifer (Contributor) commented Sep 17, 2018

This can occur if the record processor takes more than 30 seconds to complete processRecords. The KCL will automatically reconnect and start processing data again. In the future we will not surface these exceptions and will just reconnect automatically.
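
As an illustration, a minimal sketch of keeping processRecords well under that 30-second window by handing the heavy per-record work to a separate pool and returning quickly. This targets the KCL 2.x ShardRecordProcessor interface; the heavyWork pool and handle() helper are illustrative, not part of KCL, and checkpointing of asynchronously processed records needs extra bookkeeping that is not shown here:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import software.amazon.kinesis.lifecycle.events.InitializationInput;
import software.amazon.kinesis.lifecycle.events.LeaseLostInput;
import software.amazon.kinesis.lifecycle.events.ProcessRecordsInput;
import software.amazon.kinesis.lifecycle.events.ShardEndedInput;
import software.amazon.kinesis.lifecycle.events.ShutdownRequestedInput;
import software.amazon.kinesis.processor.ShardRecordProcessor;
import software.amazon.kinesis.retrieval.KinesisClientRecord;

public class FastShardRecordProcessor implements ShardRecordProcessor {

    // Hypothetical worker pool for the slow part of processing (e.g. downstream writes).
    private final ExecutorService heavyWork = Executors.newFixedThreadPool(4);

    @Override
    public void initialize(InitializationInput initializationInput) { }

    @Override
    public void processRecords(ProcessRecordsInput processRecordsInput) {
        // Keep this method fast: hand records to a separate pool and return,
        // so the subscription is not starved for more than ~30 seconds.
        for (KinesisClientRecord record : processRecordsInput.records()) {
            heavyWork.submit(() -> handle(record));
        }
        // Note: only checkpoint sequence numbers whose asynchronous work has finished.
    }

    private void handle(KinesisClientRecord record) {
        // Application-specific processing goes here.
    }

    @Override
    public void leaseLost(LeaseLostInput leaseLostInput) { }

    @Override
    public void shardEnded(ShardEndedInput shardEndedInput) {
        try {
            shardEndedInput.checkpointer().checkpoint();
        } catch (Exception e) {
            // Log and decide whether to retry in a real application.
        }
    }

    @Override
    public void shutdownRequested(ShutdownRequestedInput shutdownRequestedInput) { }
}
```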

pfifer (Contributor) commented Oct 8, 2018

We just released version 2.0.3 of the Amazon Kinesis Client Library which should fix this issue.

Feel free to reopen this if you're still seeing the issue.

pfifer closed this as completed Oct 8, 2018
majidfn commented Nov 16, 2018

Hi @pfifer, I'm on 2.0.5 and I still get lots of these warnings in my tests.
I have 4 shards in my stream, and my application runs on 2 ECS instances using enhanced fan-out.

{
    "timestamp": "2018-11-16T21:15:11.016Z",
    "logger": "software.amazon.kinesis.lifecycle.ShardConsumer",
    "exceptionClass": "software.amazon.kinesis.retrieval.RetryableRetrievalException",
    "stackTrace": "software.amazon.kinesis.retrieval.RetryableRetrievalException: ReadTimeout
	at software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher.errorOccurred(FanOutRecordsPublisher.java: 142)
	at software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher.access$700(FanOutRecordsPublisher.java: 51)
	at software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher$RecordFlow.exceptionOccurred(FanOutRecordsPublisher.java: 516)
	at software.amazon.awssdk.services.kinesis.DefaultKinesisAsyncClient.lambda$subscribeToShard$1(DefaultKinesisAsyncClient.java: 2102)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java: 760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java: 736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java: 474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java: 1977)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.handle(AsyncRetryableStage.java: 155)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.lambda$execute$0(AsyncRetryableStage.java: 121)
	at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java: 822)
	at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java: 797)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java: 474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java: 1977)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$Completable.lambda$completeExceptionally$1(MakeAsyncHttpRequestStage.java: 208)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java: 1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java: 624)
	at java.lang.Thread.run(Thread.java: 748)
Caused by: software.amazon.awssdk.core.exception.SdkClientException
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java: 97)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.handle(AsyncRetryableStage.java: 143)
	... 9 more
Caused by: io.netty.handler.timeout.ReadTimeoutException",
    "thread": "ShardRecordProcessor-0002",
    "exceptionMessage": "ReadTimeout",
    "message": "ShardConsumer: shardId-000000000003: onError().  Cancelling subscription, and marking self as failed.",
    "logLevel": "WARN"
}

and also

{
    "timestamp": "2018-11-16T21:19:03.243Z",
    "logger": "software.amazon.awssdk.http.nio.netty.internal.RunnableRequest",
    "exceptionClass": "io.netty.channel.ConnectTimeoutException",
    "stackTrace": "io.netty.channel.ConnectTimeoutException: connection timed out: monitoring.us-east-1.amazonaws.com/ip.ip.ip.ip:443
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267)
	at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:127)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:464)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
	at java.lang.Thread.run(Thread.java:748)",
    "thread": "aws-java-sdk-NettyEventLoop-2-0",
    "exceptionMessage": "connection timed out: monitoring.us-east-1.amazonaws.com/ip.ip.ip.ip:443",
    "message": "RunnableRequest: Failed to create connection to https://monitoring.us-east-1.amazonaws.com",
    "logLevel": "ERROR"
}

and finally:

{
	"timestamp": "2018-11-16T21:16:08.397Z",
	"logger": "software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher",
	"thread": "ShardRecordProcessor-0001",
	"message": "FanOutRecordsPublisher: shardId-000000000003: (FanOutRecordsPublisher/Subscription#request) - Rejected an attempt to request(6), because subscribers don't match.",
	"logLevel": "WARN"
}

saeedansari commented Feb 8, 2019

@pfifer Hi Justin, we are on KCL 2.0.4 and getting the same sequence of warning logs, but it is not a read timeout. The following is a shortened exception stack for a specific request ID.

[SubscriptionLifetime] - (FanOutRecordsPublisher#errorOccurred)
.kinesis.model.InternalFailureException: Internal Service Error (Service: kinesis, Status Code: 500, Request ID: 3386c68ecb4a)
AwsJsonErrorResponseHandler.handle(AwsJsonErrorResponseHandler.java:86)
.....


[ShardRecordProcessor-0043] WARN s.a.kinesis.lifecycle.ShardConsumer.onError - shardId-000000000001: onError().  Cancelling subscription, and marking self as failed.
.....


s.a.kinesis.lifecycle.ShardConsumer.healthCheck - shardId-000000000001: Failure occurred in retrieval.  Restarting data requests 

Under what circumstances do we get this error?

@clocklear

Hi @pfifer, like other folks here, we're seeing this behavior on 2.1.2, with lots of warnings like this:

20:39:32.646 [ShardRecordProcessor-0072] WARN s.a.k.l.ShardConsumerSubscriber - shardId-000000001088: onError(). Cancelling subscription, and marking self as failed.
software.amazon.kinesis.retrieval.RetryableRetrievalException: ReadTimeout
	at software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher.errorOccurred(FanOutRecordsPublisher.java:163)
	at software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher.access$700(FanOutRecordsPublisher.java:54)
	at software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher$RecordFlow.exceptionOccurred(FanOutRecordsPublisher.java:553)
	at software.amazon.awssdk.services.kinesis.DefaultKinesisAsyncClient.lambda$subscribeToShard$1(DefaultKinesisAsyncClient.java:2238)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncApiCallTimeoutTrackingStage.lambda$execute$1(AsyncApiCallTimeoutTrackingStage.java:62)
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
...
<omitted for brevity>
...
Caused by: software.amazon.awssdk.core.exception.SdkClientException: null
	at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:97)
	at software.amazon.awssdk.core.internal.util.ThrowableUtils.asSdkException(ThrowableUtils.java:98)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryExecutor.retryIfNeeded(AsyncRetryableStage.java:118)
	... 37 common frames omitted
Caused by: io.netty.handler.timeout.ReadTimeoutException: null

It is unclear to me whether this is something we have misconfigured, something that just isn't working properly, or something we're not handling properly. Additionally, we're seeing periodic failures in checkpointing because access to DynamoDB is being throttled (but that feels like a red herring to me; it feels like a symptom rather than the actual problem).

Do you have any insight on what is going on, or what we can do differently?

Thanks!

eoliphan commented Aug 9, 2019

We're seeing something similar on 2.2.1, and it appears to cause some records to be lost during processing. Per the earlier comment, is the 30-seconds-to-process issue still a thing? I was a bit confused, since I thought the 2.0.3 update was supposed to keep this from happening.

software.amazon.kinesis.retrieval.RetryableRetrievalException: ReadTimeout
at software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher.errorOccurred(FanOutRecordsPublisher.java:164) ~[app.jar:na]
at software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher.access$700(FanOutRecordsPublisher.java:54) ~[app.jar:na]
at software.amazon.kinesis.retrieval.fanout.FanOutRecordsPublisher$RecordFlow.exceptionOccurred(FanOutRecordsPublisher.java:568) ~[app.jar:na]
at software.amazon.awssdk.services.kinesis.DefaultKinesisAsyncClient.lambda$subscribeToShard$1(DefaultKinesisAsyncClient.java:2288) ~[app.jar:na]
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
at software.amazon.awssdk.utils.CompletableFutureUtils.lambda$forwardExceptionTo$0(CompletableFutureUtils.java:75) ~[app.jar:na]
at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[na:na]
at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[na:na]
at software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncApiCallTimeoutTrackingStage.lambda$execute$1(AsyncApiCallTimeoutTrackingStage.java:64) ~[app.jar:na]

ashwing (Contributor) commented Aug 22, 2019

What to expect from KCL release 2.2.2?

Case 1: If you are consuming a KCL 2.x version (prior to KCL 2.2.2), you replaced or 
modified the thread/queue configuration of SchedulerThreadPoolExecutor, and you faced situations 
like data loss, a shard getting stuck, or intermittent pauses in data consumption, please 
continue reading; otherwise go to the next case. These issues are potentially due to undelivered 
executor service events.
 	a. 	This release ensures data loss does not happen, by preventing subsequent events 
 		from getting delivered after an event delivery failure and later restarting the subscription. 
 		This may, however, lead to an intermittent pause in data consumption, as KCL checks for 
 		exceptionally completed subscriptions every 35 seconds. Refer to section d for mitigation. 
 		This scenario is identified by the log messages,
 			- ERROR s.a.k.r.f.FanOutRecordsPublisher [NONE] - shardId-000000000035: Received unexpected ack for the active subscription shardId-000000000077-005. Throwing.
 			- ERROR s.a.k.l.ShardConsumerSubscriber [NONE] - shardId-000000000035: Last request was dispatched at 2019-08-21T01:16:22.340Z, but no response as of 2019-08-21T01:16:57.902Z (PT35.562S).  Cancelling subscription, and restarting.
 	b. 	This release prevents the shard-stuck situations that happened due to undelivered 
 		control-plane messages. [NOTE: Control-plane messages are responsible for fetching 
 		further events from the KDS service. This includes the subscribeToShard() API call and the 
 		reactive layer's request(N) calls.]
 	c. 	This release has improved logging that captures undelivered control-plane 
 		messages for troubleshooting. Any undelivered control-plane message might still lead 
 		to a temporary pause in data consumption. Refer to section d for mitigation. This is identified 
 		by the log message,
 			- ERROR s.a.k.l.ShardConsumerSubscriber [NONE] - shardId-000000000035: Last request was dispatched at 2019-08-21T01:16:22.340Z, but no response as of 2019-08-21T01:16:57.902Z (PT35.562S).  Cancelling subscription, and restarting.
 	d. 	Mitigation for undelivered messages: The undelivered messages are primarily due to 
 		reduced SchedulerThreadPoolExecutor capacity. Customers are expected to assess 
 		the state of the SchedulerThreadPoolExecutor using the following diagnostic log messages 
 		and take appropriate actions such as reducing the RecordProcessor's processRecords time, 
 		scaling out the application, or increasing the number of threads in the executor 
 		(a plain-Java executor sketch follows this list).
 			i. 	ExecutorStateEvent INFO log emitted every 30 seconds to check if activeThreads and 
 				queueSize are consistently high.
 					- INFO s.a.k.c.DiagnosticEventLogger - Current thread pool executor state: ExecutorStateEvent(executorName=SchedulerThreadPoolExecutor, currentQueueSize=0, activeThreads=100, coreThreads=0, leasesOwned=40, largestPoolSize=2, maximumPoolSize=2147483647)
 			ii.	RejectedTaskEvent ERROR log emitted when SchedulerThreadPoolExecutor fails to 
 				execute any event
 					- ERROR s.a.k.c.DiagnosticEventLogger [NONE] - Review your thread configuration to prevent task rejections. Task rejections will slow down your application and some shards may stop processing. Current thread pool executor state: ExecutorStateEvent(executorName=SchedulerThreadPoolExecutor, currentQueueSize=0, activeThreads=0, coreThreads=0, leasesOwned=0, largestPoolSize=0, maximumPoolSize=2147483647) - io.reactivex.exceptions.UndeliverableException: java.util.concurrent.RejectedExecutionException: Test exception.
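
To make Case 1 concrete: the numbers in the ExecutorStateEvent log above (coreThreads=0, maximumPoolSize=2147483647) describe a cached-style pool that grows on demand and does not reject tasks, whereas a replacement with a small maximum pool and a bounded queue is the kind of configuration that can produce the RejectedExecutionException/UndeliverableException shown. A plain java.util.concurrent sketch of the two shapes (illustrative only, not KCL configuration code; how the executor is wired into the Scheduler is not shown):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExecutorExamples {

    // Roughly the shape described by the ExecutorStateEvent log above:
    // coreThreads=0, maximumPoolSize=Integer.MAX_VALUE, i.e. a cached-style pool
    // that grows on demand and therefore does not reject tasks.
    static ThreadPoolExecutor cachedStylePool() {
        return new ThreadPoolExecutor(0, Integer.MAX_VALUE,
                60L, TimeUnit.SECONDS, new SynchronousQueue<>());
    }

    // A replacement like this (small max pool, bounded queue) is the kind of
    // configuration that can reject tasks under load, per Case 1 above.
    static ThreadPoolExecutor constrainedPool() {
        return new ThreadPoolExecutor(2, 8,
                60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>(100));
    }
}
```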

Case 2: If you are consuming a KCL 2.x version (prior to KCL 2.2.2), you did NOT modify 
or replace SchedulerThreadPoolExecutor, and you faced shard-stuck situations or frequent 
intermittent pauses in data consumption, please continue reading; otherwise go to the next case. 
These issues are potentially due to submitting more tasks to SchedulerThreadPoolExecutor 
than it can handle, leading to delayed execution of submitted tasks.
	a. 	The common symptom of this situation is the following log message,
			- ERROR s.a.k.l.ShardConsumerSubscriber [NONE] - shardId-000000000077: Last request was dispatched at 2019-08-21T01:16:22.340Z, but no response as of 2019-08-21T01:16:57.902Z (PT35.562S).  Cancelling subscription, and restarting.
	b. 	This release has more diagnostic log messages to identify issues around a 
		congested SchedulerThreadPoolExecutor:
			i. 	FanOutRecordsPublisher WARN log indicating high time (over 11 seconds) taken 
				by SchedulerThreadPoolExecutor to deliver an event to ShardConsumer. 
				Ideally the delivery time should be less than a second. If this is consistently 
				high, refer to section c.
					- WARN  s.a.k.r.f.FanOutRecordsPublisher [NONE] - shardId-000000000077: Record delivery time to shard consumer is high at 14534 millis. Check the ExecutorStateEvent logs to see the state of the executor service. Also check if the RecordProcessor's processing time is high.
			ii. FanOutRecordsPublisher DEBUG log to check the current event delivery time 
				by SchedulerThreadPoolExecutor. Ideally this should be less than a second. 
				If this is consistently high, refer to section c.
					- DEBUG  s.a.k.r.f.FanOutRecordsPublisher [NONE] - shardId-000000000077: Record delivery time to shard consumer is at 401 millis
			iii. ExecutorStateEvent INFO log emitted every 30 seconds to check if activeThreads 
				is consistently high. activeThreads is considered high if it is more than 2X 
				the number of worker leases. If this is consistently high, refer to section c.
 					- INFO s.a.k.c.DiagnosticEventLogger - Current thread pool executor state: ExecutorStateEvent(executorName=SchedulerThreadPoolExecutor, currentQueueSize=0, activeThreads=450, coreThreads=0, leasesOwned=64, largestPoolSize=520, maximumPoolSize=2147483647)
	c. 	Customers are expected to assess the state of the SchedulerThreadPoolExecutor using 
		the above diagnostic log messages and take appropriate mitigations such as reducing the 
		RecordProcessor's processRecords time or scaling out the application (see the timing 
		sketch after this list).
		
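As a small aid to section c, a sketch of timing your batch processing so slow processRecords calls can be correlated with the record-delivery-time warnings above. The class name is hypothetical, the logger is assumed to be SLF4J, and the Runnable stands in for your existing per-batch logic:

```java
import java.util.List;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import software.amazon.kinesis.retrieval.KinesisClientRecord;

// Hypothetical helper: wrap existing per-batch logic with a timer so slow
// processRecords calls show up next to the KCL delivery-time warnings.
public final class TimedBatchHandler {
    private static final Logger log = LoggerFactory.getLogger(TimedBatchHandler.class);

    public static void process(List<KinesisClientRecord> records, Runnable existingLogic) {
        long start = System.nanoTime();
        try {
            existingLogic.run(); // your current record handling
        } finally {
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMillis > 1_000) {
                // Consistently high values point at the record processor, not
                // the SchedulerThreadPoolExecutor, as the bottleneck.
                log.warn("processRecords took {} ms for {} records", elapsedMillis, records.size());
            }
        }
    }
}
```

Usage would be a call such as TimedBatchHandler.process(processRecordsInput.records(), () -> handleBatch(processRecordsInput.records())) from inside processRecords, where handleBatch is whatever your processor already does.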

Case 3: All customers of KCL 2.x prior to the 2.2.2 release were blind to throttling or 
any other exception from CloudWatch metrics publish calls. These exceptions are now made visible, 
and we expect customers to take appropriate actions, such as increasing the CloudWatch Put API TPS 
to fix the throttling issue, or increasing the concurrent connections of the CloudWatch client 
to fix the limited-connections issue.
	a. 	Increasing the concurrency of the client (see the sketch below) - CloudWatchAsyncClient.builder().region(region).httpClientBuilder(NettyNioAsyncHttpClient.builder().maxConcurrency(Integer.MAX_VALUE)).build();
	b. 	CloudWatch limit increase - https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch_limits.html
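
Expanding item (a) above into a compilable sketch; the 500 cap and the helper name are illustrative choices rather than recommended values, and the resulting client is the one you would hand to the KCL (for example via ConfigsBuilder) when constructing the Scheduler:

```java
import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.cloudwatch.CloudWatchAsyncClient;

public final class MetricsClients {

    // Raise the Netty client's connection cap so metric publishing is not starved
    // for connections; tune the number to your workload rather than defaulting to
    // Integer.MAX_VALUE.
    public static CloudWatchAsyncClient highConcurrencyCloudWatch(Region region) {
        return CloudWatchAsyncClient.builder()
                .region(region)
                .httpClientBuilder(NettyNioAsyncHttpClient.builder()
                        .maxConcurrency(500))
                .build();
    }
}
```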
