MillisBehindLatest metric across _all_ shards #249

Open
usrenmae opened this issue Oct 19, 2017 · 21 comments

Comments

@usrenmae

Currently, several metrics, including MillisBehindLatest, are reported to CloudWatch with the shard ID as one of the dimensions. We find it very convenient to set CloudWatch alarms on this metric so that we can react if any shard starts to lag behind. However, it is not possible to set up an alarm without specifying the exact shard ID. This is limiting: when shards are added and removed frequently, the shard IDs change constantly, and the alarms have to be updated every time, which is frustrating. Since one generally wants to react to any shard lagging behind, it would be very nice to have a global MillisBehindLatest metric that has no shard in its dimensions. It could be the maximum across all shards, for example MaxMillisBehindLatest.

@sahilpalvia
Contributor

Kinesis does emit a stream-level metric for iterator age, called GetRecords.IteratorAgeMilliseconds. You should be able to set up an alarm on that metric, which can be found under the Kinesis namespace in CloudWatch. If you set the statistic for that metric to Maximum, it will report the maximum millisBehindLatest across all the shards for the given period. Please feel free to reopen the issue if you still have questions.
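As a rough sketch of that suggestion, the alarm parameters could be assembled as shown here. The stream name, alarm name, evaluation window, and five-minute threshold are placeholders rather than values from this thread. The dict keys follow the CloudWatch `PutMetricAlarm` API, and the boto3 call appears only in a comment so the snippet runs without AWS credentials.

```python
def iterator_age_alarm(stream_name, threshold_ms):
    """Build PutMetricAlarm parameters for a stream-level iterator-age alarm."""
    return {
        "AlarmName": f"{stream_name}-iterator-age-high",  # hypothetical naming scheme
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Maximum",  # worst value seen in each period
        "Period": 60,  # seconds
        "EvaluationPeriods": 5,  # assumed: alarm after 5 consecutive breaching periods
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
    }


params = iterator_age_alarm("my-stream", 5 * 60 * 1000)
# With credentials configured, this could be submitted as:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Because the only dimension is the stream name, the alarm does not need to change when shards are split or merged.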

@usrenmae
Author

Thanks for pointing out the GetRecords.IteratorAgeMilliseconds metric; I wasn't aware of it. After a closer look, I found that it is a global, per-stream metric of the Kinesis service. What I'm interested in is a per-consumer metric. We have multiple consumers running on the same stream; some of them may keep up with the event feed perfectly, while others may lag. My idea was to have a metric that tells you which particular consumer is lagging behind. This information cannot be derived from the stream's own GetRecords.IteratorAgeMilliseconds metric, but KCL could provide it in the same way it provides MillisBehindLatest, just without the shardId dimension.
In fact, building automation around shard-specific metrics is not convenient at all: shards are very dynamic and may change over time, and an alarm on a metric with dimensions cannot be created without specifying the dimension value. Monitoring on a per-consumer basis is much more useful: one can set up permanent alarms on it, and only when an incident occurs trace it back to the particular shard using the existing shard-specific metrics.
Please reopen the issue as suggested above.
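Until something like this exists in KCL, one possible workaround is for the application to publish such a metric itself. The sketch below is only an illustration of the aggregation being requested: `consumer_lag_datum`, the `Application` dimension, and the `MyApp/KCL` namespace are hypothetical names, and `MaxMillisBehindLatest` is the name proposed in this issue, not an existing KCL metric. The per-shard values are assumed to come from wherever the application already reads `millisBehindLatest`.

```python
def consumer_lag_datum(application_name, millis_behind_by_shard):
    """Collapse per-shard lag into one PutMetricData datum with no ShardId dimension."""
    if not millis_behind_by_shard:
        raise ValueError("no shard measurements to aggregate")
    return {
        "MetricName": "MaxMillisBehindLatest",  # proposed name, not an existing KCL metric
        "Dimensions": [{"Name": "Application", "Value": application_name}],  # hypothetical dimension
        "Value": float(max(millis_behind_by_shard.values())),  # worst shard wins
        "Unit": "Milliseconds",
    }


datum = consumer_lag_datum(
    "orders-consumer",
    {"shardId-000000000000": 120, "shardId-000000000001": 45000},
)
# Could be published with:
#   boto3.client("cloudwatch").put_metric_data(Namespace="MyApp/KCL", MetricData=[datum])
```

Each worker only sees the shards it currently leases, so every worker would publish its own maximum under the same dimension. Alarming on the `Maximum` statistic of that metric then gives the worst lag across the whole consumer application.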

@sahilpalvia
Contributor

Thank you for the feedback. We agree with the change you have suggested, and will prioritize it accordingly against the other customer requests we receive.

@sahilpalvia reopened this Oct 20, 2017
@StevenYCChou

@sahilpalvia I have the same use case: we want to scale up or down based on how fast the KCL application consumes. This metric would be helpful.

@ghost

ghost commented Mar 15, 2018

We have a similar use case and would like this metric as well. We have two KCL consumers on the same Kinesis stream. One has a low latency threshold, while the other tolerates much higher latency.

We've set the alarm on the stream at the lower threshold, but it fires once or twice a day because of the higher-latency KCL consumer. We have to treat it as an alarm situation each time, which obviously wastes a lot of time.

We've considered using the shard-level metric; however, we are at the limit of allowed alarms and have a 60-shard stream, so that is not currently possible.

@akumariiit

@sahilpalvia we also have the exact same use case. Can you provide an update on this?

@pfifer
Contributor

pfifer commented Oct 8, 2018

We don't have an update at this time. This is a feature we are interested in adding, and we will prioritize it along with all customer requests.

For all of those interested, please post a reaction on the parent post; this will assist us in prioritizing customer requests.

@waffleshop

+1

@vinujan59

+1

@vik7

vik7 commented Oct 9, 2018

+1

@akumariiit

+1

@rkass

rkass commented Nov 28, 2018

+1

@winty56

winty56 commented Mar 21, 2019

+1
We have more than 500 shards in Kinesis and more than four KCL applications using the same stream. In the AWS CloudWatch console we cannot search across all shards, because the console's search results are limited to 500, so we do not use the KCL metrics. In addition, the number of metrics we can graph at one time in the console is limited to 100. This feature is essential for me to check the lag of each KCL application.

@kaisermario

+1

@kaisermario

@pfifer Any update?

@MeisterMasi

+1

@CCBow-501

+1

@yasemin-amzn

Hello,

There are service-side metrics emitted for monitoring lag at the stream level. For consumers using GetRecords, the "GetRecords.IteratorAgeMilliseconds" metric is emitted, and all consumer applications contribute to it. Consumer applications using enhanced fan-out emit the "SubscribeToShardEvent.MillisBehindLatest" metric along with the consumer name, so the status of each consumer can be monitored individually.

Consider using these metrics as an alternative to client-side metrics for monitoring application health.

For more details please refer to: https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-cloudwatch.html
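For enhanced fan-out consumers, per-consumer alarms with different thresholds might be sketched as follows. The consumer names, thresholds, and alarm naming scheme are made up for illustration, and the `StreamName` and `ConsumerName` dimension names are taken from my reading of the monitoring page linked above, so they are worth verifying against the metrics actually visible in your account.

```python
def fanout_lag_alarms(stream_name, thresholds_ms):
    """Build one PutMetricAlarm parameter dict per enhanced fan-out consumer."""
    alarms = []
    for consumer_name, threshold_ms in sorted(thresholds_ms.items()):
        alarms.append({
            "AlarmName": f"{stream_name}-{consumer_name}-behind-latest",  # hypothetical naming scheme
            "Namespace": "AWS/Kinesis",
            "MetricName": "SubscribeToShardEvent.MillisBehindLatest",
            "Dimensions": [
                {"Name": "StreamName", "Value": stream_name},
                {"Name": "ConsumerName", "Value": consumer_name},
            ],
            "Statistic": "Maximum",
            "Period": 60,  # seconds
            "EvaluationPeriods": 5,  # assumed evaluation window
            "Threshold": float(threshold_ms),
            "ComparisonOperator": "GreaterThanThreshold",
        })
    return alarms


# Two consumers on one stream, each with its own tolerance for lag.
alarms = fanout_lag_alarms(
    "my-stream",
    {"fast-consumer": 30_000, "slow-consumer": 600_000},
)
# Each entry could be submitted with:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

This yields one alarm per consumer rather than one per shard, which keeps the alarm count independent of the number of shards.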

@kaisermario

Hello @yasemin-amzn,
"SubscribeToShardEvent.MillisBehindLatest" is a basic (stream-level) metric according to: https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-cloudwatch.html

Stream-level data is sent automatically every minute at no charge.

Unfortunately we can't see this metric in our account.

@leifbladt

+1

@QwertV2

QwertV2 commented Nov 30, 2021

+1
