
[New Scheduler] Implement FunctionPullingContainerPool #5102

Merged

Conversation

@ningyougang ningyougang commented Apr 14, 2021

Description

Manage container pool.

Design document: https://cwiki.apache.org/confluence/display/OPENWHISK/FunctionPullingContainerPool

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@ningyougang ningyougang changed the title [wip][scheduler] Implement FunctionPullingContainerPool [wip][New Scheduler] Implement FunctionPullingContainerPool Apr 14, 2021
import scala.util.{Random, Try}
import scala.collection.immutable.Queue

case class Creation(creationMessage: ContainerCreationMessage, action: WhiskAction)
Contributor:

Suggested change
case class Creation(creationMessage: ContainerCreationMessage, action: WhiskAction)
case class CreateContainer(creationMessage: ContainerCreationMessage, action: WhiskAction)

Contributor Author:
Updated accordingly.

import scala.collection.immutable.Queue

case class Creation(creationMessage: ContainerCreationMessage, action: WhiskAction)
case class Deletion(deletionMessage: ContainerDeletionMessage)
Contributor:

Suggested change
case class Deletion(deletionMessage: ContainerDeletionMessage)
case class DeleteContainer(deletionMessage: ContainerDeletionMessage)

Contributor Author:
Updated accordingly.

logging.warn(this, message)
sendAckToScheduler(create.rootSchedulerIndex, ack)
} else {
logging.info(this, s"received a container creation message: ${create.creationId}")
Contributor:
I still think logs like this should be debug

Contributor Author:
Yes, normally it should be debug.
But I think using info is better for now; after all PRs of the scheduler are merged and things become stable, we can submit a separate PR to change all the related log levels.

Member:
We leave info logs when an activation flows through the system.
Similarly, we can track the container creation flow with this kind of log.
I think we can keep this as info.

sendAckToScheduler(msg.rootSchedulerIndex, ack)
}

// if a warmed container fails to resume, we should try to use another container or create a new one
Contributor:
Should we attempt to remove the container that failed to resume as well?

Contributor Author (@ningyougang, Apr 28, 2021):
Yes, you are right; the subsequent PR, FunctionPullingContainerProxy, will remove the container first.
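
As an illustration of that fallback (the message, pool, and helper names below are hypothetical, not the actual FunctionPullingContainerProxy API): when resume fails, the warmed container is dropped from the pool and stopped, and the creation is re-dispatched so another warmed container or a fresh one is used.

import akka.actor.{Actor, ActorRef}

final case class RetryCreation(creationId: String)   // simplified stand-in for Creation
final case class ResumeFailed(container: ActorRef)   // hypothetical failure notification

trait ResumeFallbackSketch { this: Actor =>
  // warmed containers keyed by their proxy actor; value is the creation they were resumed for
  protected var warmedPool: Map[ActorRef, RetryCreation]

  def handleResumeFailure: Receive = {
    case ResumeFailed(container) =>
      warmedPool.get(container).foreach { creation =>
        // drop the container that failed to resume so it is not selected again
        warmedPool -= container
        // stop it (the follow-up container-proxy PR removes the container first)
        context.stop(container)
        // re-dispatch the creation so another warmed container or a new one is used
        self ! creation
      }
  }
}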

@ningyougang ningyougang changed the title [wip][New Scheduler] Implement FunctionPullingContainerPool [New Scheduler] Implement FunctionPullingContainerPool Apr 28, 2021
codecov-commenter commented Apr 28, 2021

Codecov Report

Merging #5102 (7a2aca4) into master (aa7e6e2) will decrease coverage by 6.42%.
The diff coverage is 75.77%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5102      +/-   ##
==========================================
- Coverage   81.51%   75.08%   -6.43%     
==========================================
  Files         220      221       +1     
  Lines       10731    11178     +447     
  Branches      444      473      +29     
==========================================
- Hits         8747     8393     -354     
- Misses       1984     2785     +801     
Impacted Files Coverage Δ
...k/core/containerpool/v2/InvokerHealthManager.scala 75.37% <ø> (ø)
.../org/apache/openwhisk/core/connector/Message.scala 67.35% <36.36%> (-12.36%) ⬇️
...in/scala/org/apache/openwhisk/common/Logging.scala 77.11% <50.00%> (-8.82%) ⬇️
...ontainerpool/v2/FunctionPullingContainerPool.scala 80.53% <80.53%> (ø)
...penwhisk/core/containerpool/ContainerFactory.scala 88.88% <100.00%> (+1.38%) ⬆️
...ala/org/apache/openwhisk/core/entity/DocInfo.scala 95.12% <100.00%> (+1.18%) ⬆️
.../scala/org/apache/openwhisk/core/entity/Size.scala 79.10% <100.00%> (ø)
...e/openwhisk/core/containerpool/ContainerPool.scala 96.86% <100.00%> (+0.05%) ⬆️
...ntainerpool/v2/FunctionPullingContainerProxy.scala 14.28% <100.00%> (+14.28%) ⬆️
...core/database/cosmosdb/RxObservableImplicits.scala 0.00% <0.00%> (-100.00%) ⬇️
... and 26 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update aa7e6e2...7a2aca4.

@@ -91,7 +91,13 @@ class ContainerPool(childFactory: ActorRefFactory => ActorRef,
.nextInt(v.toSeconds.toInt))
.getOrElse(0)
.seconds
context.system.scheduler.schedule(2.seconds, interval, self, AdjustPrewarmedContainer)
if (prewarmConfig.exists(!_.reactive.isEmpty)) {
Contributor Author:
There is no need to backfill the prewarm containers periodically if the reactive configuration is not included in runtimes.json, because without a reactive configuration the prewarm containers never expire.

FYI, the reactive configuration in runtimes.json looks like the following:

...
                "stemCells": [
                    {
                        "initialCount": 2,
                        "memory": "256 MB",
                        "reactive": {
                            "minCount": 1,
                            "maxCount": 4,
                            "ttl": "2 minutes",
                            "threshold": 1,
                            "increment": 1
                        }
                    }
                ]
...
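
As a rough sketch of the gate described above (with simplified PrewarmConfig/ReactiveConfig stand-ins for the stemCells structure in runtimes.json, not the PR's exact types), the periodic AdjustPrewarmedContainer message is only scheduled when at least one stem cell carries a reactive section:

import scala.concurrent.duration._
import akka.actor.{Actor, Cancellable}

// Simplified stand-ins for the stemCells/reactive structure shown above.
final case class ReactiveConfig(minCount: Int, maxCount: Int, ttl: FiniteDuration, threshold: Int, increment: Int)
final case class PrewarmConfig(initialCount: Int, memoryMB: Int, reactive: Option[ReactiveConfig])

trait PrewarmBackfillSketch { this: Actor =>
  import context.dispatcher

  case object AdjustPrewarmedContainer

  // Schedule the periodic backfill only when a reactive section exists; without a TTL,
  // prewarm containers never expire, so there is nothing to backfill.
  def schedulePrewarmAdjustment(prewarmConfig: List[PrewarmConfig],
                                interval: FiniteDuration): Option[Cancellable] =
    if (prewarmConfig.exists(_.reactive.nonEmpty))
      Some(context.system.scheduler.schedule(2.seconds, interval, self, AdjustPrewarmedContainer))
    else None
}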

if (expiredPrewarmedContainer.nonEmpty) {
// emit expired container counter metric with memory + kind
MetricEmitter.emitCounterMetric(LoggingMarkers.CONTAINER_POOL_PREWARM_EXPIRED(memory.toString, kind))
Contributor Author:
If expiredPrewarmedContainer is empty, there is no need to emit this counter metric.


private var preWarmScheduler: Option[Cancellable] = None
private var prewarmConfigQueue = Queue.empty[(CodeExec[_], ByteSize, Option[FiniteDuration])]
private val prewarmCreateFailedCount = new AtomicInteger(0)
Contributor Author (@ningyougang, Apr 29, 2021):
This is for the retry logic; a minimal sketch of the counter behaviour follows this list.

Let's assume the max retry limit is 5:

  • If the 1st, 2nd, and 3rd prewarm creations fail but the 4th succeeds, prewarmCreateFailedCount is reset to 0 (the count is reinitialized whenever a creation succeeds, at any time).
  • If the 1st through 5th prewarm creations all fail, all creation stops until the next round.
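
A minimal sketch of that counter behaviour, using a hypothetical helper name (onPrewarmCreationResult) rather than the PR's actual method:

import java.util.concurrent.atomic.AtomicInteger

object PrewarmRetrySketch {
  val prewarmMaxRetryLimit = 5
  val prewarmCreateFailedCount = new AtomicInteger(0)

  // Returns true if further prewarm creations may proceed in this round.
  def onPrewarmCreationResult(succeeded: Boolean): Boolean =
    if (succeeded) {
      // any successful creation reinitializes the counter
      prewarmCreateFailedCount.set(0)
      true
    } else {
      // stop creating once the subsequent-failure limit is reached; wait for the next round
      prewarmCreateFailedCount.incrementAndGet() < prewarmMaxRetryLimit
    }
}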

}

// Key is ColdStartKey, value is the number of cold starts in a minute
var coldStartCount = immutable.Map.empty[ColdStartKey, Int]
Contributor Author:
This coldStartCount logic already exists upstream; I just picked up the relevant logic into this PR.
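
For reference, a simplified sketch of that per-minute counting (ColdStartKey's fields and the helper names here are illustrative, not the upstream code):

final case class ColdStartKey(kind: String, memoryMB: Long)

object ColdStartCountSketch {
  // Key is ColdStartKey, value is the number of cold starts in the current one-minute window.
  var coldStartCount = Map.empty[ColdStartKey, Int]

  def countColdStart(kind: String, memoryMB: Long): Unit = {
    val key = ColdStartKey(kind, memoryMB)
    coldStartCount = coldStartCount + (key -> (coldStartCount.getOrElse(key, 0) + 1))
  }

  // Reset at the end of each window, after the counts have been used to adjust prewarms.
  def resetColdStartCounts(): Unit =
    coldStartCount = Map.empty
}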


/** Install prewarm containers up to the configured requirements for each kind/memory combination or specified kind/memory */
private def adjustPrewarmedContainer(init: Boolean, scheduled: Boolean): Unit = {
if (!shuttingDown) {
Contributor Author (@ningyougang, Apr 29, 2021):
If the invoker is disabled, there is no need to backfill the prewarm containers.

@ningyougang ningyougang removed the wip label May 6, 2021
case _ => false
}
val startingCount = prewarmStartingPool.count(p => p._2._1 == kind && p._2._2 == memory)
val queuingCount = prewarmQueue.count(p => p._1.kind == kind && p._2 == memory)
Contributor Author:
This is to avoid creating a lot of prewarm containers in a very short time.
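
A hedged sketch of that accounting, with illustrative parameter names (the real code derives these counts from the pools shown above): containers that are already running, currently starting, or still queued all count toward the target, so repeated adjustments do not over-create prewarms.

object PrewarmBackfillCountSketch {
  // How many new prewarm containers to enqueue for one kind/memory combination.
  def containersToCreate(targetCount: Int,
                         currentCount: Int,
                         startingCount: Int,
                         queuingCount: Int): Int =
    math.max(targetCount - (currentCount + startingCount + queuingCount), 0)
}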

ningyougang commented May 6, 2021

@bdoyle0182 @style95 it is ready for review again now.

prewarm-expiration-check-interval-variance: 10 seconds # varies expiration across invokers to avoid many concurrent expirations
prewarm-expiration-limit: 100 # number of prewarms to expire in one expiration cycle (remaining expired will be considered for expiration in next cycle)
prewarm-max-retry-limit: 5 # max retry limit for create prewarm
Member:
Basically, this max limit is reached when 5 subsequent retries have failed.
How about changing this to max subsequent retry limit to create prewarm containers?

It would be worth mentioning that the count is reinitialized whenever a creation succeeds.

Contributor Author:
I already changed the comment to max subsequent retry limit to create prewarm containers.

prewarm-expiration-check-interval-variance: 10 seconds # varies expiration across invokers to avoid many concurrent expirations
prewarm-expiration-limit: 100 # number of prewarms to expire in one expiration cycle (remaining expired will be considered for expiration in next cycle)
prewarm-max-retry-limit: 5 # max retry limit for create prewarm
prewarm-promotion: false # if true, action can take prewarm container which has bigger memory
Member:
Are the following two configurations used?

Contributor Author:
Yes. prewarm-promotion is used only in FunctionPullingContainerPool, and the configuration is false, which means an action does not take a prewarm container that has bigger memory.
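
A minimal sketch of what that flag means for selection, assuming a simplified prewarm pool shape (PrewarmedData and takePrewarmContainer are illustrative names, not the PR's exact code):

final case class PrewarmedData(kind: String, memoryMB: Int)

object PrewarmPromotionSketch {
  def takePrewarmContainer(pool: List[PrewarmedData],
                           kind: String,
                           memoryMB: Int,
                           prewarmPromotion: Boolean): Option[PrewarmedData] = {
    val exact = pool.find(p => p.kind == kind && p.memoryMB == memoryMB)
    if (prewarmPromotion)
      // promotion enabled: fall back to a same-kind prewarm with bigger memory
      exact.orElse(pool.filter(p => p.kind == kind && p.memoryMB > memoryMB).sortBy(_.memoryMB).headOption)
    else
      // promotion disabled (false here): only an exact kind/memory match is taken
      exact
  }
}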

@ningyougang ningyougang force-pushed the FunctionPullingContainerPool branch from 273e183 to 7a2aca4 Compare May 7, 2021 01:19
@style95 style95 (Member) left a comment:
LGTM

@jiangpengcheng jiangpengcheng merged commit 3802374 into apache:master May 12, 2021