Fix Orphaned Container Edge Case In Paused State of Container Proxy #5326
Conversation
```diff
@@ -158,7 +158,7 @@ class ShardingContainerPoolBalancer(
       AkkaManagement(actorSystem).start()
       ClusterBootstrap(actorSystem).start()
       Some(Cluster(actorSystem))
-    } else if (loadConfigOrThrow[Seq[String]]("akka.cluster.seed-nodes").nonEmpty) {
+    } else if (loadConfigOrThrow[Seq[String]]("akka.cluster.seed-nodes").nonEmpty) {
```
I think this is just a missed scalafmt
Codecov Report: all modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##           master    #5326      +/-   ##
==========================================
- Coverage   81.31%   76.36%   -4.96%
==========================================
  Files         239      239
  Lines       14249    14259      +10
  Branches      602      575      -27
==========================================
- Hits        11587    10889     -698
- Misses       2662     3370     +708
```

☔ View full report in Codecov by Sentry.
Great work!
LGTM with a minor nit.
```scala
logging.error(
  this,
  s"Failed to determine whether to keep or remove container on pause timeout for ${data.container.containerId}, retrying. Caused by: $t")
startSingleTimer(DetermineKeepContainer.toString, DetermineKeepContainer, 1.second)
```
Is there any reason to start the timer after 1 second?
Since it delays deletion of the etcd key for the problematic container, another request heading to this container can still arrive during that window. The request will be rescheduled, but that also delays container creation.
Hmm, I don't think that's the case. If a request comes in for the container, the proxy receives the `Initialize` message to unpause the container and go back to running. The one-second `DetermineKeepContainer` retry message is cancelled at the beginning of the `Initialize` event along with the other timers, so the container just gracefully unpauses and goes back to running; the latency is that of a normal unpause operation.
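To make the cancellation argument concrete, here is a minimal, self-contained sketch (the names and structure are illustrative, not the actual OpenWhisk actor code) of how a named single-shot timer like `DetermineKeepContainer` is replaced or cancelled, mirroring Akka's keyed-timer semantics where a new request cancels the pending retry before it can fire:

```scala
import scala.collection.mutable

// Hypothetical stand-in for an Akka TimerScheduler: timers are keyed,
// and starting a timer under an existing key replaces the old one.
final class TimerRegistry {
  private val timers = mutable.Set.empty[String]

  def startSingleTimer(key: String): Unit = timers += key
  def cancelTimer(key: String): Unit = timers -= key
  def isActive(key: String): Boolean = timers.contains(key)
}

object PausedProxySketch {
  val DetermineKeepContainer = "DetermineKeepContainer"

  def main(args: Array[String]): Unit = {
    val reg = new TimerRegistry

    // The etcd call failed, so the proxy schedules a retry in 1 second.
    reg.startSingleTimer(DetermineKeepContainer)
    assert(reg.isActive(DetermineKeepContainer))

    // A new request arrives: the Initialize handler cancels pending
    // timers first, so the retry never fires and the container simply
    // unpauses and returns to running.
    reg.cancelTimer(DetermineKeepContainer)
    assert(!reg.isActive(DetermineKeepContainer))
  }
}
```

The key point is that the retry is a keyed timer, so any state transition that cancels timers up front (as `Initialize` does) makes the pending retry moot.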
And if for whatever reason a new request comes in and the container can't be unpaused because it is now broken, the activation gets rescheduled, so there would be latency in that case. But that is already possible with any broken paused container; it isn't additional latency from the 1-second retry, and the unpause failure path will now gracefully delete the broken container in all cases.
OK, I was confused earlier, and I have another question.
If there is a problem with the etcd connection so that `getLiveContainerCount` keeps failing, won't this container proxy run forever until the etcd connection is restored?
Yes, that's correct; that's the current behavior I've set up. In theory you could remove the container when the request fails, but then you'd break the guarantee of the keepingWarm count for the namespace.
Then how about introducing a maximum retry count?
After retrying x times to get the container count, if it still fails, we can remove the container.
It seems pointless to keep problematic containers around for the keepingWarm count if the issue persists.
I just changed it to retry a maximum of 5 times and added a test verifying that container deletion still occurs gracefully once the retries are exhausted. I've been running this code for a few days without an issue, so I'm now confident this fixes the problem. If it looks good to you, I'll merge in the morning.
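The bounded-retry behavior described above can be sketched as follows. This is a hypothetical, self-contained model (the names `determineKeepContainer`, `getLiveContainerCount`, and the keep/remove rule are illustrative, not the PR's actual code): retry the etcd lookup up to a maximum, then fall back to removing the container so a persistent etcd outage cannot keep a broken proxy alive forever.

```scala
object BoundedRetrySketch {
  val MaxRetries = 5

  sealed trait Decision
  case object KeepContainer extends Decision
  case object RemoveContainer extends Decision

  // getLiveContainerCount stands in for the etcd call; Left is a failure.
  // In the real actor, each failed attempt would reschedule itself via
  // startSingleTimer(..., 1.second) instead of recursing directly.
  def determineKeepContainer(getLiveContainerCount: () => Either[Throwable, Int],
                             keepingWarmCount: Int,
                             retries: Int = 0): Decision =
    getLiveContainerCount() match {
      // Illustrative rule: keep the container while the live count does
      // not exceed the configured warm count for the namespace.
      case Right(count) if count <= keepingWarmCount => KeepContainer
      case Right(_)                                  => RemoveContainer
      case Left(_) if retries < MaxRetries =>
        determineKeepContainer(getLiveContainerCount, keepingWarmCount, retries + 1)
      case Left(_) =>
        // Retries exhausted: tear the container down gracefully rather
        // than letting the proxy run forever on a broken etcd connection.
        RemoveContainer
    }
}
```

With this shape, a permanently failing etcd call always resolves to `RemoveContainer` after the retry budget is spent, which is the graceful-teardown property the added unit test checks.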
LGTM!
…pache#5326)

* fix orphaned container edge case in proxy paused state
* enhance test
* feedback

Co-authored-by: Brendan Doyle <brendand@qualtrics.com>
Description
You can read a more detailed series of events that hit this case in the corresponding issue, but here's the tl;dr:
I've reproduced this case in my test environment many times, and this change now handles everything gracefully. I also added a unit test simulating this case to verify the container proxy is gracefully torn down after a failed request to etcd.
Related issue and scope
My changes affect the following components
Types of changes:
Checklist: