Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[New Scheduler] Add duration checker #4984

Merged
merged 9 commits into from
Jan 18, 2021

Conversation

style95
Copy link
Member

@style95 style95 commented Sep 24, 2020

Description

This is a subsequent PR of #4983, once #4983 is merged, I would rebase this again.

Major changes are:

This is to add a duration checker for ElasticSearch.
With a new scheduler, it is important to decide when and how many containers to add.
The scheduler will calculate the average duration for the recent N activations and compute the processing power of one container, e.g. how many activations can be handled by one container in a given time. Factoring in the average duration, the number of incoming activations, and the number of activations in a queue, the scheduler can add more containers to handle the given activations.

Related issue and scope

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

}
}

trait DurationCheckerProvider extends Spi {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is based on the SPI.


actionMetaData.binding match {
case Some(binding) =>
client
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the Some and None cases can call a helper function since the only difference in the query is the List to match on and pass that List as a param.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated accordingly


import scala.concurrent.Future

object NoopDurationCheckerProvider extends DurationCheckerProvider {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this means you operate the schedule without using the average activation duration for an action heuristic. How important is using the heuristic for the performance of the scheduler?

Copy link
Member Author

@style95 style95 Sep 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just for other DBs such as CouchDB or CosmosDB in case the scheduler is used with other than ES.
(Even if it is highly recommended to use with ES.)

Regarding the average duration, it is important to improve the accuracy of calculation but the queue can still work without it. When an action is newly created, there is no activation accordingly no average duration.
In such a case, it assumes one container can handle one activation for the given time.
So even if one container can handle multiple activations for a given period, it assumes a container can handle only one activation, so schedulers would tend to overprovision containers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of couchdb or cosmosdb, there is no average activation duration calculation since it uses this correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.
If required, anyone can create it as it is based on SPI.

And one thing I forgot to tell you is, after this duration checker is landed, we introduced one more optimization.
Initially, the average duration was always calculated based on this module, but now(in our downstream), this is only used when a queue is newly created. After then, the queue uses the duration passed from containers.
As per POEM2, each container autonomously pulls an activation by sending a fetch-request. So when they send the fetch-request, we added one more field lastDuration. So the queue can keep the recent N duration in the circular queue and calculate the average duration without any external API call.

But when a new queue is created or an action is newly created, there is no data in the circular queue and the duration checker is used in such cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah okay cool so if I understand what you just said correctly once the invoker is running, we still will get to use the average activation duration heuristic since we track it in memory. The elasticsearch spi is just for startup. That's good to know the optimization you described sounds more important we'll still get the benefits of tracking activation duration

@bdoyle0182
Copy link
Contributor

LGTM

@codecov-io
Copy link

codecov-io commented Nov 19, 2020

Codecov Report

Merging #4984 (307527e) into master (2d0c8a7) will increase coverage by 46.80%.
The diff coverage is 77.27%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #4984       +/-   ##
===========================================
+ Coverage   29.09%   75.89%   +46.80%     
===========================================
  Files         195      206       +11     
  Lines        9553    10122      +569     
  Branches      413      450       +37     
===========================================
+ Hits         2779     7682     +4903     
+ Misses       6774     2440     -4334     
Impacted Files Coverage Δ
...isk/core/scheduler/queue/NoopDurationChecker.scala 0.00% <0.00%> (ø)
...scheduler/queue/ElasticSearchDurationChecker.scala 78.84% <78.84%> (ø)
.../scala/org/apache/openwhisk/core/WhiskConfig.scala 95.30% <100.00%> (+8.13%) ⬆️
...e/elasticsearch/ElasticSearchActivationStore.scala 84.66% <100.00%> (+84.66%) ⬆️
...e/openwhisk/core/containerpool/ContainerPool.scala 89.51% <0.00%> (-2.02%) ⬇️
.../core/monitoring/metrics/PrometheusEventsApi.scala 90.90% <0.00%> (ø)
...nwhisk/core/monitoring/metrics/KamonRecorder.scala 82.45% <0.00%> (ø)
...nwhisk/core/monitoring/metrics/EventConsumer.scala 89.23% <0.00%> (ø)
...rg/apache/openwhisk/core/scheduler/Scheduler.scala 0.00% <0.00%> (ø)
...penwhisk/core/monitoring/metrics/MetricNames.scala 100.00% <0.00%> (ø)
... and 150 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2d0c8a7...307527e. Read the comment docs.

@style95
Copy link
Member Author

style95 commented Nov 21, 2020

It seems sometimes the build is failed even if it passed all tests because it failed to upload the file.

All checks passed.
Compressing logs dir...
tar: Removing leading `/' from member names
.
.
.
Uploading to Box...
Posting result <Response [500]>
{
  "error": "File upload failed."
}

Is this to upload the log files?
If yes, where can I refer to?

@style95
Copy link
Member Author

style95 commented Nov 22, 2020

It keeps failing.

Uploading to Box...
Posting result <Response [500]>
{
  "error": "File upload failed."
}

@style95
Copy link
Member Author

style95 commented Nov 22, 2020

It seems this endpoint no longer accepts any file.

$ curl -XPOST http://DamCYhF8.mybluemix.net/upload?name=system -H 'Content-Type: application/gzip' -F "file1=@log-upload-test.zip" -v
Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 169.62.254.79...
* TCP_NODELAY set
* Connected to DamCYhF8.mybluemix.net (169.62.254.79) port 80 (#0)
> POST /upload?name=system HTTP/1.1
> Host: DamCYhF8.mybluemix.net
> User-Agent: curl/7.54.0
> Accept: */*
> Content-Length: 538
> Expect: 100-continue
> Content-Type: application/gzip; boundary=------------------------5207d7f4ef7c0247
>
< HTTP/1.1 100 Continue
< X-Note: Gateway Ack
< HTTP/1.1 500 Internal Server Error
< X-Backside-Transport: FAIL FAIL
< Connection: Keep-Alive
< Transfer-Encoding: chunked
< Content-Type: application/json
< Date: Sun, 22 Nov 2020 12:48:34 GMT
< X-Global-Transaction-ID: cb47d0745fba5e1fd6a964c3
* HTTP error before end of send, stop sending
<
{
  "error": "File upload failed."
}

Could this be because the disk space is fully used?

@dgrove-oss @rabbah
Do you have any idea about this?
I have no idea to whom the endpoint belongs.

@dgrove-oss
Copy link
Member

We failed to upload the logs to Box. This happens because we don't have a way to automatically remove old logs and the Box folder fills up. However, the upload to Box is optional (will not cause the travisci job to fail). There is a real test failure earlier in the log: https://travis-ci.org/github/apache/openwhisk/jobs/744844186#L7457

@rabbah
Copy link
Member

rabbah commented Nov 26, 2020 via email

@style95
Copy link
Member Author

style95 commented Jan 18, 2021

Finally, it has passed all tests cases :)

@style95 style95 merged commit a6ad9e4 into apache:master Jan 18, 2021
falkzoll added a commit to falkzoll/runtime-nodejs that referenced this pull request Jan 19, 2021
- The scheduler PR apache/openwhisk#4984 introduced
  changes that required adaption in the setup of the openwhisk environment used for the automated tests.
falkzoll added a commit to falkzoll/runtime-python that referenced this pull request Jan 19, 2021
- The scheduler PR apache/openwhisk#4984 introduced
  changes that required adaption in the setup of the openwhisk environment used for the automated tests.
falkzoll added a commit to falkzoll/runtime-swift that referenced this pull request Jan 19, 2021
- The scheduler PR apache/openwhisk#4984 introduced
  changes that required adaption in the setup of the openwhisk environment used for the automated tests.
somaya10 pushed a commit to ibm-functions/runtime-nodejs that referenced this pull request Jan 20, 2021
- The scheduler PR apache/openwhisk#4984 introduced
  changes that required adaption in the setup of the openwhisk environment used for the automated tests.
somaya10 pushed a commit to ibm-functions/runtime-python that referenced this pull request Jan 20, 2021
- The scheduler PR apache/openwhisk#4984 introduced
  changes that required adaption in the setup of the openwhisk environment used for the automated tests.
somaya10 pushed a commit to ibm-functions/runtime-swift that referenced this pull request Jan 20, 2021
- The scheduler PR apache/openwhisk#4984 introduced
  changes that required adaption in the setup of the openwhisk environment used for the automated tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants