Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[New Scheduler] Initial commit for the scheduler component #4983

Merged
merged 5 commits into from
Nov 11, 2020

Conversation

style95
Copy link
Member

@style95 style95 commented Sep 24, 2020

Description

This is the first change to add the new scheduler component.
A series of subsequent PRs would be opened by me and my team members.
We would start from modules with fewer dependencies to modules that are highly dependent on the others.

We will add the "scheduler" label and [New scheduler] prefix in the title of PRs.

Please refer to the issue for more information and discussion history.
You may find this useful as well: new scheduler design
JFYI, this scheduler is running in production in Naver.

There are many "TODO"s in this PR.
As we agreed to incrementally merge PRs along with temporal commits, I left many parts of the original codes as "TODO".
I wanted to give some hints to reviewers about how it would be working.

Related issue and scope

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.


private val etcdWorkerFactory = "" // TODO: TBD

val dataManagementService = "" // TODO: TBD
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This component is in charge of storing data to ETCD.


val creationJobManagerFactory = "" // TODO: TBD

val containerManager = "" // TODO: TBD
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This component is responsible for creating containers for a given action.
It relies on the creationJobManager to manage the container creation job.


val containerManager = "" // TODO: TBD

val memoryQueueFactory = "" // TODO: TBD
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a factory to create memory queues.
In the new architecture, each action is given its own dedicated queue.


implicit val trasnid = TransactionId.containerCreation

val queueManager = "" // TODO: TBD
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the major components which take charge of managing queues and coordinating requests among the scheduler, controllers, and invokers.

Map(
servicePort -> 8080.toString,
schedulerHost -> null,
schedulerAkkaPort -> null,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scheduler has two ports, one for akka-remote and the other for akka-grpc.


Seq(
("scheduler" + instanceId.asString, "actions", Some(ActivationEntityLimit.MAX_ACTIVATION_LIMIT)),
("creationAck" + instanceId.asString, "creationAck", Some(ActivationEntityLimit.MAX_ACTIVATION_LIMIT)))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The final goal is to remove Kafka from the critical path, but it still relies on Kafka as of now.
Now activation messages are sent to the scheduler via schedulerN topic and container creation messages are sent to invoker via invokerN topic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

controller is posting to schedulerN and scheduler is posting to invokerN correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

@style95 style95 mentioned this pull request Sep 24, 2020
21 tasks
@style95 style95 changed the title [New scheduler] Initial commit for the scheduler component [New Scheduler] Initial commit for the scheduler component Sep 25, 2020
@bdoyle0182
Copy link
Contributor

From what I understand so far, LGTM for an initial commit. Can always come back and add comments if I learn something later

Copy link
Member

@rabbah rabbah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you @bdoyle0182 for the review as well.
@style95 a lot of the comments you're adding in the PR would make good comments in the code.

@@ -426,3 +426,16 @@ object EventMessage extends DefaultJsonProtocol {

def parse(msg: String) = Try(format.read(msg.parseJson))
}

object StatusQuery
case class StatusData(invocationNamespace: String, fqn: String, waitingActivation: Int, status: String, data: String)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is waiting activation?
what is status / why is it a string?
what is data?
are these basic (string & int) types for performance reasons?

Some comments or examples would be helpful here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to query the queue status from the scheduler.

The following is an example.
When a MemoryQueue(AkkaFSM) is running, there can be a different combination of status and data.
This would be useful to figure out the status when any issue happens in a queue or scheduler.

[
...
  {
    "data": "RunningData",
    "fqn": "whisk.system/elasticsearch/status-alarm@0.0.2",
    "invocationNamespace": "style95",
    "status": "Running",
    "waitingActivation": 1
  }
...
]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The waitingActivation is the number of waiting activations in a queue.
Generally, this value would be a small number or zero as most activation messages would be properly handled in time.
But if there is any issue; containers are not provisioned in time, any disconnection with other components happens, etc, there can be many activations waiting for processing.

Regarding the basic types, we just used the string as it is simple.
If we need to change it to certain types with proper serde, how about adding them when we add a feature to use this data class?

@style95
Copy link
Member Author

style95 commented Oct 12, 2020

It seems there are some issues in the travis job.

@style95 style95 closed this Oct 12, 2020
@style95 style95 reopened this Oct 12, 2020
@codecov-io
Copy link

codecov-io commented Oct 12, 2020

Codecov Report

Merging #4983 into master will decrease coverage by 6.85%.
The diff coverage is 46.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4983      +/-   ##
==========================================
- Coverage   83.70%   76.85%   -6.86%     
==========================================
  Files         202      202              
  Lines        9808     9823      +15     
  Branches      413      420       +7     
==========================================
- Hits         8210     7549     -661     
- Misses       1598     2274     +676     
Impacted Files Coverage Δ
.../org/apache/openwhisk/core/connector/Message.scala 78.12% <0.00%> (-0.45%) ⬇️
.../org/apache/openwhisk/core/entity/InstanceId.scala 80.70% <0.00%> (-9.50%) ⬇️
...la/org/apache/openwhisk/common/TransactionId.scala 95.18% <100.00%> (+0.05%) ⬆️
.../scala/org/apache/openwhisk/core/WhiskConfig.scala 95.27% <100.00%> (+0.19%) ⬆️
...core/database/cosmosdb/RxObservableImplicits.scala 0.00% <0.00%> (-100.00%) ⬇️
...ore/database/cosmosdb/cache/CacheInvalidator.scala 0.00% <0.00%> (-100.00%) ⬇️
...e/database/cosmosdb/cache/ChangeFeedConsumer.scala 0.00% <0.00%> (-100.00%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0.00% <0.00%> (-95.85%) ⬇️
...sk/core/database/cosmosdb/CosmosDBViewMapper.scala 0.00% <0.00%> (-93.90%) ⬇️
...tabase/cosmosdb/cache/CacheInvalidatorConfig.scala 0.00% <0.00%> (-92.31%) ⬇️
... and 14 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9134a03...4bc1f21. Read the comment docs.

@style95
Copy link
Member Author

style95 commented Oct 14, 2020

Let's postpone merging this PR until we release the OpenWhisk core 1.0

@style95 style95 merged commit 7b99af9 into apache:master Nov 11, 2020
@style95
Copy link
Member Author

style95 commented Nov 11, 2020

I merged this as the core 1.0.0 is released.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants