Fix Orphaned Container Edge Case In Paused State of Container Proxy #5326

Merged

Conversation

bdoyle0182
Contributor

Description

You can read a more detailed series of events to hit this case in the corresponding issue, but here's the tldr:

  1. The container has no activations, so it transitions to the paused state.
  2. The container then times out while paused and is ready to be deleted.
  3. Before deleting a paused container, the proxy must check the live container count to determine whether the container should actually be deleted.
  4. That etcd request fails and the failed future is piped to the FSM. The paused state doesn't handle this message type, so it stashes it until a state transition, and the container proxy sits in this corrupted state until a new activation is received.
  5. A new activation is received, the proxy attempts to unpause the container, and the FSM transitions back to Running while it waits for the unpause future to complete.
  6. When the FSM transitions, it unstashes the failed future message from step 4, which the Running state handles by destroying the container.
  7. The container is destroyed, but the unpause future from step 5 succeeds, which has the side effect of rewriting the container key to etcd. That key is now orphaned forever since the container was actually destroyed.
  8. The scheduler sees the container through the watch endpoint it's listening to, so the queue for the action is stuck forever thinking a container exists that actually doesn't. If activations are infrequent enough that the scheduling decision maker only ever needs one container, the action can never run again unless the system is restarted.

I've reproduced this case in my test environment many times, and this change now handles everything gracefully. I've also added a unit test that simulates this case to verify the container proxy is gracefully torn down after a failed request to etcd.
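
To make steps 4–7 concrete, here is a minimal, self-contained Akka FSM sketch of the stash/unstash mechanism. This is not the actual container proxy code; the state, message, and class names (ProxyState, Unpause, StashingProxySketch) are illustrative stand-ins. It only shows how a failure that the paused state doesn't handle gets stashed and then replayed after the transition back to Running, where it destroys the container that was just unpaused.

```scala
import akka.actor.{ActorLogging, ActorSystem, FSM, Props, Stash, Status}

// Illustrative names only; the real container proxy's states and messages differ.
sealed trait ProxyState
case object Running extends ProxyState
case object Paused extends ProxyState

case object Unpause // stands in for the Initialize/unpause path

class StashingProxySketch extends FSM[ProxyState, Unit] with Stash with ActorLogging {

  startWith(Paused, ())

  when(Paused) {
    case Event(Unpause, _) =>
      // A new activation arrives: transition back to Running (step 5).
      goto(Running)
  }

  when(Running) {
    case Event(Status.Failure(t), _) =>
      // The stale etcd failure, replayed after the transition (step 6),
      // is handled here and destroys the container that was just unpaused.
      log.error(s"destroying container due to: ${t.getMessage}")
      stop()
  }

  whenUnhandled {
    case Event(_, _) =>
      // Paused has no handler for Status.Failure (step 4): stash it and wait
      // for the next state transition.
      stash()
      stay()
  }

  onTransition {
    case Paused -> Running => unstashAll()
  }

  initialize()
}

object StashingProxyDemo extends App {
  val system = ActorSystem("sketch")
  val proxy = system.actorOf(Props[StashingProxySketch]())
  proxy ! Status.Failure(new RuntimeException("etcd request failed")) // stashed while Paused
  proxy ! Unpause // unstashed in Running, container destroyed
  Thread.sleep(1000)
  system.terminate()
}
```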

Related issue and scope

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes:

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@@ -158,7 +158,7 @@ class ShardingContainerPoolBalancer(
AkkaManagement(actorSystem).start()
ClusterBootstrap(actorSystem).start()
Some(Cluster(actorSystem))
- } else if (loadConfigOrThrow[Seq[String]]("akka.cluster.seed-nodes").nonEmpty) {
+ } else if (loadConfigOrThrow[Seq[String]]("akka.cluster.seed-nodes").nonEmpty) {
bdoyle0182 (Contributor, Author)

I think this is just a missed scalafmt


codecov-commenter commented Sep 21, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.36%. Comparing base (a1639f0) to head (880ee61).
Report is 86 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5326      +/-   ##
==========================================
- Coverage   81.31%   76.36%   -4.96%     
==========================================
  Files         239      239              
  Lines       14249    14259      +10     
  Branches      602      575      -27     
==========================================
- Hits        11587    10889     -698     
- Misses       2662     3370     +708     

☔ View full report in Codecov by Sentry.

@style95 (Member) left a comment

Great work!
LGTM with a minor nit.

logging.error(
  this,
  s"Failed to determine whether to keep or remove container on pause timeout for ${data.container.containerId}, retrying. Caused by: $t")
startSingleTimer(DetermineKeepContainer.toString, DetermineKeepContainer, 1.second)

Member

Is there any reason to start the timer after 1 second?
Since it delays the deletion of the ETCD key for the problematic container, another request heading to this container can still come in during that time.
The request would be rescheduled, but it would also delay container creation.

bdoyle0182 (Contributor, Author)

Hmm, I don't think that should be the case. If a request comes in for the container, it will receive the Initialize message to unpause the container and go back to Running. The one-second DetermineKeepContainer retry message is cancelled at the beginning of the Initialize event along with the other timers, so the proxy just gracefully unpauses and goes back to Running; the latency is the duration of a normal unpause operation.

bdoyle0182 (Contributor, Author)

And if a new request comes in and the container can't be unpaused because it's now broken for whatever reason, the activation will get rescheduled. There is latency in that case, but it's already possible with any broken paused container and isn't additional latency from the 1-second retry. The unpause failure case is also handled gracefully now, so the broken container is correctly deleted in all cases.
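
For illustration, here's a rough sketch of the ordering being described, with stand-in names (PausedRetrySketch, a plain Initialize message) rather than the proxy's real types: the failed count check schedules a one-second retry, and an incoming activation cancels that timer before unpausing, so no stale retry fires once the proxy is back in Running.

```scala
import scala.concurrent.duration._
import akka.actor.{ActorLogging, FSM, Status}

sealed trait ProxyState
case object Running extends ProxyState
case object Paused extends ProxyState

case object DetermineKeepContainer
case object Initialize // stands in for the real Initialize message carrying the activation

class PausedRetrySketch extends FSM[ProxyState, Unit] with ActorLogging {

  startWith(Paused, ())

  when(Paused) {
    case Event(Status.Failure(t), _) =>
      // The etcd container-count check failed: retry in one second rather
      // than leaving the failure unhandled.
      log.error(s"count check failed (${t.getMessage}), retrying in 1s")
      startSingleTimer(DetermineKeepContainer.toString, DetermineKeepContainer, 1.second)
      stay()

    case Event(DetermineKeepContainer, _) =>
      // In the real proxy this re-issues the etcd count request and pipes
      // the result back to the FSM; elided here.
      stay()

    case Event(Initialize, _) =>
      // A new activation arrives before the retry fires: cancel the pending
      // retry (the real proxy also cancels its other pause timers) and
      // unpause normally, so the only latency is a normal unpause.
      cancelTimer(DetermineKeepContainer.toString)
      goto(Running)
  }

  when(Running) {
    case Event(_, _) => stay()
  }

  initialize()
}
```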

Member

OK, I was confused; I have another question.
If there is a problem with the connection to ETCD and getLiveContainerCount keeps failing, won't this container proxy run forever until the ETCD connection is restored?

bdoyle0182 (Contributor, Author)

Yes, that's correct; that's the current behavior I've set up. I guess in theory you could remove the container if the request fails, but you'd then break the guarantee of the keepingWarm count for the namespace.

Member

Then how about introducing a maximum retry value?
After retrying x times to get the container count, if it still fails, we can remove the container.
I feel it would be meaningless to keep problematic containers around for the keepingWarm count if the issue persists.

bdoyle0182 (Contributor, Author)

I just changed it to retry a maximum of 5 times and added a test verifying that container deletion still occurs gracefully if the retries are exhausted. I've been running this code for a few days without an issue, so I'm confident now that this fixes the problem. If it looks good to you, I'll merge in the morning.
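
A sketch of what that capped-retry shape can look like, again with stand-in names (BoundedRetrySketch, RetryData) rather than the proxy's actual data types: the failed-attempt count is carried in the FSM state data, the count check is retried up to the cap, and the container is torn down gracefully once the retries are exhausted.

```scala
import scala.concurrent.duration._
import akka.actor.{ActorLogging, FSM, Status}

sealed trait ProxyState
case object Paused extends ProxyState
case object Removing extends ProxyState

case object DetermineKeepContainer

// Retry budget carried in the FSM state data (stand-in for the proxy's data class).
final case class RetryData(failedCountChecks: Int)

class BoundedRetrySketch(maxRetries: Int = 5) extends FSM[ProxyState, RetryData] with ActorLogging {

  startWith(Paused, RetryData(0))

  when(Paused) {
    case Event(Status.Failure(t), data) if data.failedCountChecks < maxRetries =>
      // Count check failed: schedule another attempt and bump the counter.
      log.error(s"count check failed (${t.getMessage}), retry ${data.failedCountChecks + 1} of $maxRetries")
      startSingleTimer(DetermineKeepContainer.toString, DetermineKeepContainer, 1.second)
      stay() using data.copy(failedCountChecks = data.failedCountChecks + 1)

    case Event(Status.Failure(t), _) =>
      // Retries exhausted: stop keeping the container warm and tear it down
      // gracefully instead of leaving the proxy stuck on a prolonged etcd outage.
      log.error(s"count check still failing after $maxRetries retries, removing container")
      goto(Removing)

    case Event(DetermineKeepContainer, _) =>
      // In the real proxy this re-issues the etcd count request; elided here.
      stay()
  }

  onTransition {
    case Paused -> Removing =>
      // Here the real proxy would destroy the container and clean up its etcd key.
      log.info("tearing down container")
  }

  when(Removing) {
    case Event(_, _) => stop()
  }

  initialize()
}
```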

@style95 (Member) left a comment

LGTM!

@bdoyle0182 bdoyle0182 merged commit 625c5f2 into apache:master Sep 23, 2022
msciabarra pushed a commit to nuvolaris/openwhisk that referenced this pull request Nov 23, 2022
…pache#5326)

* fix orphaned container edge case in proxy paused state

* enhance test

* feedback

Co-authored-by: Brendan Doyle <brendand@qualtrics.com>