Fix Orphaned Container Edge Case In Paused State of Container Proxy #5326

Merged

Conversation

bdoyle0182
Contributor

Description

You can read a more detailed series of events to hit this case in the corresponding issue, but here's the tldr:

  1. The container has no activations, so it transitions to the paused state.
  2. The container then times out while paused and is ready to be deleted.
  3. Before deleting a paused container, the proxy must check the live container count to determine whether the container should actually be deleted.
  4. That etcd request fails and the failed future is piped to the FSM. The paused state doesn't handle this message type, so it stashes it until a state transition, and the container proxy sits in this corrupted state until a new activation is received.
  5. A new activation is received, the proxy attempts to unpause the container, and the FSM transitions back to Running while it waits for the unpause future to complete.
  6. When the FSM transitions, it unstashes the failed future message from step 4, which the Running state handles by destroying the container.
  7. The container is destroyed, but the unpause future from step 5 succeeds, which has the side effect of rewriting the container key to etcd. That key is now orphaned forever since the container was actually destroyed.
  8. The scheduler sees the container through the watch endpoint it's listening to, so the queue for the action is stuck forever thinking a container exists that actually doesn't. If activations are infrequent enough that the scheduling decision maker only ever needs one container, the action can never run again unless the system is restarted.

I've reproduced this case in my test environment many times, and this change now handles everything gracefully. I've also added a unit test that simulates this case to verify the container proxy is gracefully torn down after a failed request to etcd.
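
To make steps 4–7 concrete, here is a minimal, self-contained Akka FSM sketch of the stash/unstash mechanism. This is not the actual container proxy code; the state, message, and class names (ProxyState, Unpause, StashingProxySketch) are illustrative stand-ins. It only shows how a failure that the paused state doesn't handle gets stashed and then replayed after the transition back to Running, where it destroys the container that was just unpaused.

```scala
import akka.actor.{ActorLogging, ActorSystem, FSM, Props, Stash, Status}

// Illustrative names only; the real container proxy's states and messages differ.
sealed trait ProxyState
case object Running extends ProxyState
case object Paused extends ProxyState

case object Unpause // stands in for the Initialize/unpause path

class StashingProxySketch extends FSM[ProxyState, Unit] with Stash with ActorLogging {

  startWith(Paused, ())

  when(Paused) {
    case Event(Unpause, _) =>
      // A new activation arrives: transition back to Running (step 5).
      goto(Running)
  }

  when(Running) {
    case Event(Status.Failure(t), _) =>
      // The stale etcd failure, replayed after the transition (step 6),
      // is handled here and destroys the container that was just unpaused.
      log.error(s"destroying container due to: ${t.getMessage}")
      stop()
  }

  whenUnhandled {
    case Event(_, _) =>
      // Paused has no handler for Status.Failure (step 4): stash it and wait
      // for the next state transition.
      stash()
      stay()
  }

  onTransition {
    case Paused -> Running => unstashAll()
  }

  initialize()
}

object StashingProxyDemo extends App {
  val system = ActorSystem("sketch")
  val proxy = system.actorOf(Props[StashingProxySketch]())
  proxy ! Status.Failure(new RuntimeException("etcd request failed")) // stashed while Paused
  proxy ! Unpause // unstashed in Running, container destroyed
  Thread.sleep(1000)
  system.terminate()
}
```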

Related issue and scope

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes:

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@@ -158,7 +158,7 @@ class ShardingContainerPoolBalancer(
AkkaManagement(actorSystem).start()
ClusterBootstrap(actorSystem).start()
Some(Cluster(actorSystem))
- } else if (loadConfigOrThrow[Seq[String]]("akka.cluster.seed-nodes").nonEmpty) {
+ } else if (loadConfigOrThrow[Seq[String]]("akka.cluster.seed-nodes").nonEmpty) {
bdoyle0182 (Contributor, Author)

I think this is just a missed scalafmt


codecov-commenter commented Sep 21, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.36%. Comparing base (a1639f0) to head (880ee61).
Report is 86 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5326      +/-   ##
==========================================
- Coverage   81.31%   76.36%   -4.96%     
==========================================
  Files         239      239              
  Lines       14249    14259      +10     
  Branches      602      575      -27     
==========================================
- Hits        11587    10889     -698     
- Misses       2662     3370     +708     

☔ View full report in Codecov by Sentry.

@style95 (Member) left a comment

Great work!
LGTM with a minor nit.

logging.error(
  this,
  s"Failed to determine whether to keep or remove container on pause timeout for ${data.container.containerId}, retrying. Caused by: $t")
startSingleTimer(DetermineKeepContainer.toString, DetermineKeepContainer, 1.second)

Member

Is there any reason to start the timer after 1 second?
Since it delays the deletion of the ETCD key for the problematic container, another request heading to this container can still come in during that time.
The request would be rescheduled, but it would also delay container creation.

bdoyle0182 (Contributor, Author)

Hmm, I don't think that should be the case. If a request comes in for the container, it will receive the Initialize message to unpause the container and go back to Running. The one-second DetermineKeepContainer retry message is cancelled at the beginning of the Initialize event along with the other timers, so the proxy just gracefully unpauses and goes back to Running; the latency is the duration of a normal unpause operation.

bdoyle0182 (Contributor, Author)

And if a new request comes in and the container can't be unpaused because it's now broken for whatever reason, the activation will get rescheduled. There is latency in that case, but it's already possible with any broken paused container and isn't additional latency from the 1-second retry. The unpause failure case is also handled gracefully now, so the broken container is correctly deleted in all cases.
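
For illustration, here's a rough sketch of the ordering being described, with stand-in names (PausedRetrySketch, a plain Initialize message) rather than the proxy's real types: the failed count check schedules a one-second retry, and an incoming activation cancels that timer before unpausing, so no stale retry fires once the proxy is back in Running.

```scala
import scala.concurrent.duration._
import akka.actor.{ActorLogging, FSM, Status}

sealed trait ProxyState
case object Running extends ProxyState
case object Paused extends ProxyState

case object DetermineKeepContainer
case object Initialize // stands in for the real Initialize message carrying the activation

class PausedRetrySketch extends FSM[ProxyState, Unit] with ActorLogging {

  startWith(Paused, ())

  when(Paused) {
    case Event(Status.Failure(t), _) =>
      // The etcd container-count check failed: retry in one second rather
      // than leaving the failure unhandled.
      log.error(s"count check failed (${t.getMessage}), retrying in 1s")
      startSingleTimer(DetermineKeepContainer.toString, DetermineKeepContainer, 1.second)
      stay()

    case Event(DetermineKeepContainer, _) =>
      // In the real proxy this re-issues the etcd count request and pipes
      // the result back to the FSM; elided here.
      stay()

    case Event(Initialize, _) =>
      // A new activation arrives before the retry fires: cancel the pending
      // retry (the real proxy also cancels its other pause timers) and
      // unpause normally, so the only latency is a normal unpause.
      cancelTimer(DetermineKeepContainer.toString)
      goto(Running)
  }

  when(Running) {
    case Event(_, _) => stay()
  }

  initialize()
}
```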

Member

OK, I was confused; I have another question.
If there is a problem with the connection to ETCD and getLiveContainerCount keeps failing, won't this container proxy run forever until the ETCD connection is restored?

bdoyle0182 (Contributor, Author)

Yes, that's correct; that's the current behavior I've set up. I guess in theory you could remove the container if the request fails, but you'd then break the guarantee of the keepingWarm count for the namespace.

Member

Then how about introducing a maximum retry value?
After retrying x times to get the container count, if it still fails, we can remove the container.
I feel it would be meaningless to keep problematic containers around for the keepingWarm count if the issue persists.

bdoyle0182 (Contributor, Author)

I just changed it to retry a maximum of 5 times and added a test verifying that container deletion still occurs gracefully if the retries are exhausted. I've been running this code for a few days without an issue, so I'm confident now that this fixes the problem. If it looks good to you, I'll merge in the morning.
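
A sketch of what that capped-retry shape can look like, again with stand-in names (BoundedRetrySketch, RetryData) rather than the proxy's actual data types: the failed-attempt count is carried in the FSM state data, the count check is retried up to the cap, and the container is torn down gracefully once the retries are exhausted.

```scala
import scala.concurrent.duration._
import akka.actor.{ActorLogging, FSM, Status}

sealed trait ProxyState
case object Paused extends ProxyState
case object Removing extends ProxyState

case object DetermineKeepContainer

// Retry budget carried in the FSM state data (stand-in for the proxy's data class).
final case class RetryData(failedCountChecks: Int)

class BoundedRetrySketch(maxRetries: Int = 5) extends FSM[ProxyState, RetryData] with ActorLogging {

  startWith(Paused, RetryData(0))

  when(Paused) {
    case Event(Status.Failure(t), data) if data.failedCountChecks < maxRetries =>
      // Count check failed: schedule another attempt and bump the counter.
      log.error(s"count check failed (${t.getMessage}), retry ${data.failedCountChecks + 1} of $maxRetries")
      startSingleTimer(DetermineKeepContainer.toString, DetermineKeepContainer, 1.second)
      stay() using data.copy(failedCountChecks = data.failedCountChecks + 1)

    case Event(Status.Failure(t), _) =>
      // Retries exhausted: stop keeping the container warm and tear it down
      // gracefully instead of leaving the proxy stuck on a prolonged etcd outage.
      log.error(s"count check still failing after $maxRetries retries, removing container")
      goto(Removing)

    case Event(DetermineKeepContainer, _) =>
      // In the real proxy this re-issues the etcd count request; elided here.
      stay()
  }

  onTransition {
    case Paused -> Removing =>
      // Here the real proxy would destroy the container and clean up its etcd key.
      log.info("tearing down container")
  }

  when(Removing) {
    case Event(_, _) => stop()
  }

  initialize()
}
```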

@style95 (Member) left a comment

LGTM!

@bdoyle0182 bdoyle0182 merged commit 625c5f2 into apache:master Sep 23, 2022
msciabarra pushed a commit to nuvolaris/openwhisk that referenced this pull request Nov 23, 2022
…pache#5326)

* fix orphaned container edge case in proxy paused state

* enhance test

* feedback

Co-authored-by: Brendan Doyle <brendand@qualtrics.com>