Limited threads resiliency fix durability nonblock #573
Conversation
…scriber and FanoutPublisher. The SDK threads are made to block-wait on the ack from the ShardConsumerSubscriber
*
* @return sequenceNumber to checkpoint
*/
default String batchSequenceNumber() {
Let's reuse the same term from the API (i.e. continuationSequenceNumber), unless you have a specific reason to introduce a new name here.
We don't use the term continuationSequenceNumber for the polling API. For S2S it is the contSeqNum explicitly set, and for Get it is the last successfully processed SeqNum in that record batch. Hence I wanted this to be a seqNum that uniquely identifies the record batch that was processed.
As far as I can see, these interfaces are not integrated with the Get path. This attribute is used as a continuationSequenceNumber or a checkpointSequenceNumber.
The new batchSeqNum naming seems to be just another way of saying continuationSequenceNumber, and we should reuse the same terms instead of introducing new ones.
Was referring to the following RecordsRetrieved Impl.
@Accessors(fluent = true)
@Data
static class PrefetchRecordsRetrieved implements RecordsRetrieved {
final ProcessRecordsInput processRecordsInput;
final String lastBatchSequenceNumber;
final String shardIterator;
PrefetchRecordsRetrieved prepareForPublish() {
return new PrefetchRecordsRetrieved(processRecordsInput.toBuilder().cacheExitTime(Instant.now()).build(),
lastBatchSequenceNumber, shardIterator);
}
}
Looking into it again, it makes sense to keep it as continuationSequenceNumber as RecordsRetrieved might still have this as a property for pagination.
* Sequence Number of the record batch that was delivered to the Subscriber/Observer.
* @return deliveredSequenceNumber
*/
String deliveredSequenceNumber();
This also seems to be the checkpoint sequence number.
From the Acker's perspective this is just the seqnum that was successfully delivered. Using it for checkpointing is up to the caller.
That comment makes me think that all we need is to send the UUID back with the Ack. The batch details are already in the queue, the purpose of the Ack is to match with the UUID.
Yeah, removing the seqnum from the ack.
} catch (Throwable t) {
errorOccurred(triggeringFlowFuture.getNow(null), t);
}
// TODO : debug log on isNextEventScheduled
Let's take care of the TODOs. Except for logging purposes, the isNextEventScheduled variable doesn't seem to be used. Can the logging be taken care of within the evictAckedEventAndScheduleNextEvent implementation?
isNextEventScheduled was used earlier, but I don't see a purpose in returning it. Will remove it.
// TODO : toString implementation for recordsRetrievedAck
log.error("{}: KCL BUG: Found mismatched payload {} in the delivery queue for the ack {} ", shardId,
recordsRetrievedContext.getRecordsRetrieved(), recordsRetrievedAck);
throw new IllegalStateException("KCL BUG: Record delivery ack mismatch");
What happens upon this exception? Does the queue get reset and a new subscription started?
We should make sure the application is not stuck and retries to make progress.
This will trigger errorOccurred; everything will be reset, and the health check will take care of initializing the new subscription.
}

@Override
public void notify(RecordsRetrievedAck recordsRetrievedAck) {
nitpick: This is an ack for record delivery to the ShardConsumerSubscriber. Should we rename? i.e. RecordDeliveryAck
Sure
}
// TODO : debug log on isNextEventScheduled
final RecordFlow triggeringFlow = triggeringFlowFuture.getNow(null);
if (triggeringFlow != null) {
What's the case where this will not be completed?
Looking at the evictAckedEventAndScheduleNextEvent implementation, the completable future either gets completed or throws.
You are right. I just used a completable future as a way to get the triggering flow info out of evictAckedEventAndScheduleNextEvent. :) We can return the triggering flow info as part of the API return type as well.
Then let's keep this simple and have evictAckedEventAndScheduleNextEvent return the RecordFlow. There is nothing async going on here that needs a Future.
Also, should we keep the method name simpler, i.e. handleNotify or handleNotification?
Okay, will have it return the record flow, but I feel the method name can still be descriptive enough to convey its role in the ack mechanism.
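For illustration, a minimal sketch of that simplification, assuming the queue, subscriber, and context types shown elsewhere in this diff (including a batchUniqueIdentifier accessor on RecordsRetrieved); the actual PR code may differ:

// Hedged sketch, not the PR implementation: evict the acknowledged event and schedule
// the next one, returning the triggering RecordFlow directly instead of completing a Future.
private RecordFlow evictAckedEventAndScheduleNextEvent(RecordsDeliveryAck recordsDeliveryAck) {
    // Peek rather than poll, so a mismatched ack never silently drops the head of the queue.
    final RecordsRetrievedContext ackedContext = recordsDeliveryQueue.peek();
    if (ackedContext == null || !ackedContext.getRecordsRetrieved().batchUniqueIdentifier()
            .equals(recordsDeliveryAck.batchUniqueIdentifier())) {
        throw new IllegalStateException("Record delivery ack mismatch");
    }
    recordsDeliveryQueue.poll();                               // evict the acknowledged event
    final RecordsRetrievedContext nextContext = recordsDeliveryQueue.peek();
    if (nextContext != null) {
        subscriber.onNext(nextContext.getRecordsRetrieved());  // line up the next delivery
    }
    return ackedContext.getRecordFlow();                       // caller can log or track this flow
}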
shardId, recordsDeliveryQueue.remainingCapacity());
throw e;
} catch (Throwable t) {
recordsDeliveryQueue.remove(recordsRetrievedContext);
Why only remove this event; is the throwable acceptable here? I expect the flow to get restarted from the beginning upon an exception. We should also log an error in this throw block.
In the context of this method, it failed to schedule the event that was already queued, hence we remove it from the queue. But when we rethrow the throwable, it will be caught by errorOccurred and the queue will be cleared.
We can decide on whether we need to log an error.
Logging an error here will be good. Probably no need to bother with the removal; the exception should reset the flow.
triggeringFlowFuture.complete(recordsRetrievedContext.getRecordFlow());
// Try scheduling the next event in the queue, if available.
if (recordsDeliveryQueue.peek() != null) {
subscriber.onNext(recordsDeliveryQueue.peek().getRecordsRetrieved());
This seems to introduce a race at record delivery. This method can find the event in the queue as the next event and deliver it to the Subscriber. But the event that is found could be the only event (added right after removal of the one acknowledged by the notification handled here), and another onNext would already be scheduled by bufferCurrentEventAndScheduleIfRequired.
Hmmm, bufferCurrentEventAndScheduleIfRequired would block-wait on the lockObject until this is completed. After delivery is scheduled by evictAckedEventAndScheduleNextEvent, bufferCurrentEventAndScheduleIfRequired would acquire the lock, enqueue the next event, find that there are 2 events, and leave it to evictAckedEventAndScheduleNextEvent to schedule the next event upon the ack for the previous event.
I think that's right; all existing paths seem to get hold of the lock. However, I'm not sure if this diff is showing all the new code changes introduced in this PR. I left a separate comment on that. Let's sync up offline.
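For readers following the thread, a rough sketch of the hand-off being described, with both paths guarded by the same monitor (field and method names are taken from the discussion above and are assumptions about the final code):

// Hedged sketch of the synchronization, not the actual FanOutRecordsPublisher source.
private final Object lockObject = new Object();

// SDK-thread path: buffer the freshly received batch; deliver immediately only when it is
// the sole queued event, otherwise the ack handler below schedules it later.
void bufferCurrentEventAndScheduleIfRequired(RecordsRetrieved recordsRetrieved, RecordFlow triggeringFlow) {
    synchronized (lockObject) {
        recordsDeliveryQueue.add(new RecordsRetrievedContext(recordsRetrieved, triggeringFlow));
        if (recordsDeliveryQueue.size() == 1) {
            subscriber.onNext(recordsRetrieved);
        }
    }
}

// Ack path: runs under the same lock, so it cannot race with the buffering path when
// deciding who calls onNext for the next queued event.
@Override
public void notify(RecordsDeliveryAck recordsDeliveryAck) {
    synchronized (lockObject) {
        evictAckedEventAndScheduleNextEvent(recordsDeliveryAck);
    }
}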
}

@Data
private static class RecordsRetrievedContext {
final?
Thank you Ashwin, looking good to me at a high level. Leaving minor comments.
* Return the publisher to be notified
* @return RecordsPublisher to be notified.
*/
RecordsPublisher getWaitingRecordsPublisher();
Can "waiting" omitted here? getRecordsPublisher() sounds sufficient.
Yeah, sounds good.
* @param recordsRetrieved for which we need the ack.
* @return RecordsRetrievedAck
*/
RecordsDeliveryAck getRecordsRetrievedAck(RecordsRetrieved recordsRetrieved);
Let's keep it in line with the return type: getRecordsDeliveryAck?
Yep, will change it.
if (flow != null && recordsDeliveryAck.batchUniqueIdentifier().getFlowIdentifier()
.equals(flow.getSubscribeToShardId())) {
log.error(
"{}: KCL BUG: Publisher found mismatched ack for subscription {} ",
Logging "KCL BUG" is not necessarily helpful to customers. Instead, let's give more details on what happens under this condition, i.e. "Unexpected event received. Restarting subscription."
Yeah, we can add more context. I chose "KCL BUG" so that customers can reach out to us in this case. Anyway, we can convey this using "Unexpected exception" as well.
}
// Otherwise publisher received a stale ack.
else {
log.info("{}: Publisher received duplicate ack or an ack for stale subscription {}. Ignoring.", shardId,
Can't this be an ack for the next record in the queue? That would mean a missing ack, and we should not ignore it.
A null flow would be the only case where ignoring is fine.
We should not be getting an ack for an event in the queue until the event before it in the queue has received one.
That's right, if things work the way we expect them to. We should plan for the unexpected and have safeguards in place.
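One hypothetical way to add that safeguard (matchesQueueHead is an illustrative helper, not something from this PR): ignore only acks that clearly belong to a stale subscription, and treat any other mismatch as an error that restarts the flow.

// Hypothetical guard, sketched for the discussion above; the PR's actual branching differs.
if (flow == null || !recordsDeliveryAck.batchUniqueIdentifier().getFlowIdentifier()
        .equals(flow.getSubscribeToShardId())) {
    // Ack from a previous (stale) subscription, or no active flow: safe to ignore.
    log.info("{}: Ignoring ack for a stale or already-cancelled subscription.", shardId);
} else if (!matchesQueueHead(recordsDeliveryAck)) {
    // A live-subscription ack that does not match the expected event implies a missed ack;
    // fail fast so the subscription is restarted and the consumer keeps making progress.
    errorOccurred(flow, new IllegalStateException("Unexpected record delivery ack"));
} else {
    evictAckedEventAndScheduleNextEvent(recordsDeliveryAck);
}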
}

@VisibleForTesting
void setSubscriberForTesting(Subscriber<RecordsRetrieved> s) {
Let's not make the class mutable for tests. The same applies to setFlow.
Why not call subscribe to set the subscriber?
It required mocking some other dependencies, hence I resorted to a simple workaround. Will check if this can be avoided.
Looks good to me, left a few minor comments. Two general comments:
- Can we add a @KinesisClientInternalApi annotation to all the new classes in this PR? That gives us more freedom to modify them without impacting customers (without warning).
- As this change adds resiliency to rejected tasks, can we update the messaging in https://github.com/awslabs/amazon-kinesis-client/blob/master/amazon-kinesis-client/src/main/java/software/amazon/kinesis/coordinator/RejectedTaskEvent.java#L29
if (durationBetweenEnqueueAndAckInMillis > MAX_TIME_BETWEEN_REQUEST_RESPONSE / 3) {
// The above condition logs the warn msg if the delivery time exceeds 11 seconds.
log.warn(
"{}: Record delivery time to shard consumer is high at {} millis. Check the ExecutorStateEvent logs"
Minor: Can we specify what to look for in the ExecutorStateEvent logs and what actions to take? Maybe something like "Check the ExecutorStateEvent logs to see if many threads are running concurrently. Consider using the default configuration."
Also, can we specify where to check the RecordProcessor running time and what to do if it's too high?
Hmmm, there are different executor states which might lead to this situation. This can happen even with the default executor. I feel it is better to leave it to the customer to evaluate from the state information available.
package software.amazon.kinesis.retrieval;

public interface RecordsDeliveryAck {
Can we add a @FunctionalInterface annotation here to make the usage in ShardConsumerNotifyingSubscriber a bit clearer?
Also, a javadoc comment on this class would be helpful.
We might add more state information to this interface in the future, which might need more than one abstract method. Added a javadoc comment.
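For context, a self-contained sketch of what the annotated interface could look like; the BatchUniqueIdentifier value type and the exact javadoc wording are assumptions, not necessarily what was merged:

package software.amazon.kinesis.retrieval;

/**
 * Acknowledgement sent back by the subscriber (e.g. the ShardConsumerNotifyingSubscriber)
 * to the RecordsPublisher once a record batch has been delivered. The publisher uses it to
 * evict the delivered batch from its delivery queue and schedule the next delivery.
 * Deliberately not annotated with @FunctionalInterface, since more state (and hence more
 * abstract methods) may be added later.
 */
public interface RecordsDeliveryAck {

    /**
     * @return identifier that uniquely names the record batch (and its owning flow)
     *         whose delivery is being acknowledged
     */
    BatchUniqueIdentifier batchUniqueIdentifier();
}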
* @param ack acknowledgement received from the subscriber.
*/
default void notify(RecordsDeliveryAck ack) {
throw new UnsupportedOperationException("RecordsPublisher does not support acknowledgement from Subscriber");
For my understanding: why is the desired default behavior to throw an exception here?
For my understanding: why is the desired default behavior to throw an exception here?
I think this is because the logic for evictAckedEventAndScheduleNextEvent is implemented in FanOutRecordsPublisher, which extends this class. FanOutRecordsPublisher is the publisher that we are using and requires the Ack mechanism.
This is to inform the NotifyingSubscriber that the Publisher it is subscribing to has not implemented the notify() method. Rather than allowing it to be a no-op, we throw an exception back.
@@ -16,6 +16,8 @@

import software.amazon.kinesis.lifecycle.events.ProcessRecordsInput;

import java.util.UUID;
It doesn't look like you use this import in this file; maybe a remnant of an earlier commit?
Thanks. Removed.
shardId, recordsDeliveryQueue.remainingCapacity());
throw e;
} catch (Throwable t) {
log.error("{}: Unable to deliver event to the shard consumer.", shardId, t);
Would it be helpful to add some info from the RecordsRetrievedContext to this log and the one above? Seems like when debugging we might want to know more than just the shardId
Do you have any use case where the record context might be helpful? I feel these two exceptions will be thrown due to a capacity constraint rather than due to the record itself. Let me know if you feel otherwise.
return;
}

// Clear the delivery buffer so that the next subscription doesn't yield duplicate records.
clearRecordsDeliveryQueue();
Confirming that clearRecordsDeliveryQueue is a blocking operation, right? We won't start fetching new records until the queue is empty?
Right. This would just empty the local queue.
@@ -489,11 +612,18 @@ public void onComplete() {

private final ProcessRecordsInput processRecordsInput;
private final String continuationSequenceNumber;
private final String flowIdentifier;
private final String batchUniqueIdentifier = UUID.randomUUID().toString();
Is there any risk of generating duplicate batchUniqueIdentifiers by using random instead of sequential generation? If this does happen, will it break anything?
I liked this analogy from Quora: "If every person on the planet generates a new UUID4 every second, we'd expect a collision to occur after about 10 years." Source: https://www.quora.com/Has-there-ever-been-a-UUID-collision
The only risk I can see is if two identical UUIDs are generated one after another. In that case, a malicious subscriber sending more than one ack for the same record batch would delete the next event in the queue. But this is extremely unlikely to happen.
Overall LGTM. Just a minor nit and a question.
@@ -77,7 +77,7 @@ void startSubscriptions() {
recordsPublisher.restartFrom(lastAccepted);
}
Flowable.fromPublisher(recordsPublisher).subscribeOn(scheduler).observeOn(scheduler, true, bufferSize)
.subscribe(this);
.subscribe(new ShardConsumerNotifyingSubscriber(this, recordsPublisher));
Minor comment: Maybe add a comment here to explain that ShardConsumerNotifyingSubscriber is the new subscriber that we introduced for the Ack mechanism.
This is not a new subscriber, but just a wrapper on top of the existing one.
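For readers without the full diff, a simplified sketch of such a wrapper; how the ack is built from the batch is left abstract here because that detail is not visible in this conversation:

import org.reactivestreams.Subscriber;
import org.reactivestreams.Subscription;

// Hedged sketch of a notifying wrapper, not the actual ShardConsumerNotifyingSubscriber:
// it forwards every signal to the original ShardConsumer subscriber and, on each onNext,
// sends a delivery ack back to the RecordsPublisher before delegating.
abstract class NotifyingSubscriberSketch implements Subscriber<RecordsRetrieved> {
    private final Subscriber<RecordsRetrieved> delegate;
    private final RecordsPublisher recordsPublisher;

    NotifyingSubscriberSketch(Subscriber<RecordsRetrieved> delegate, RecordsPublisher recordsPublisher) {
        this.delegate = delegate;
        this.recordsPublisher = recordsPublisher;
    }

    // Hypothetical factory: derive the ack (e.g. the batch's unique identifier) from the batch.
    abstract RecordsDeliveryAck getRecordsDeliveryAck(RecordsRetrieved recordsRetrieved);

    @Override
    public void onSubscribe(Subscription subscription) {
        delegate.onSubscribe(subscription);
    }

    @Override
    public void onNext(RecordsRetrieved recordsRetrieved) {
        // Ack first so the publisher can evict this batch and line up the next delivery.
        recordsPublisher.notify(getRecordsDeliveryAck(recordsRetrieved));
        delegate.onNext(recordsRetrieved);
    }

    @Override
    public void onError(Throwable throwable) {
        delegate.onError(throwable);
    }

    @Override
    public void onComplete() {
        delegate.onComplete();
    }
}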
import software.amazon.kinesis.retrieval.RecordsPublisher;
import software.amazon.kinesis.retrieval.RecordsRetrieved;
import software.amazon.kinesis.retrieval.RetryableRetrievalException;
import software.amazon.kinesis.retrieval.kpl.ExtendedSequenceNumber;

import static software.amazon.kinesis.common.DiagnosticUtils.takeDelayedDeliveryActionIfRequired;
Minor nit: Can we just import the class and use the static method in the notify method?
We prefer to avoid wildcard imports.
// Note: This does not block wait to enqueue.
recordsDeliveryQueue.add(recordsRetrievedContext);
// If the current batch is the only element in the queue, then try scheduling the event delivery.
if (recordsDeliveryQueue.size() == 1) {
A question about this code path: recordsReceived seems to be another code path where we receive the records without the ack? Are we still allowing the subscriber to schedule another record in this case?
The SDK thread dispatches the records to our Publisher in a blocking manner. That is, event(T) will be dispatched only when the SDK thread sees that the Future for event(T-1) has completed successfully.
Approved
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.