Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade to Akka 2.6.12 #5065

Merged
merged 1 commit into from
Jun 4, 2021
Merged

Upgrade to Akka 2.6.12 #5065

merged 1 commit into from
Jun 4, 2021

Conversation

vrann
Copy link
Contributor

@vrann vrann commented Feb 13, 2021

Implement #5060

For any new features which uses Akka actors it would makes sense to start using Behavior-based construction of actors with the Akka Typed. This is the way to construct actors encouraged by Akka for any new functionality. In order to use Akka Typed, it should be supported by the Akka version used in OpenWhisk. At the same time no changes needed to the Akka Classic actors. Upgrading to the newest Akka version would allow to use all latest features of the Akka, while keeping it compatible with almost everything of the Akka 2.5.x that is been used in OpenWhisk so far.

Description

  • got rid of implicit custom ActorMaterializer
  • upgraded to Akka Http 10.2.3
  • changed the way how Https set up when host name verification is disabled
  • changed RestartSink.withBackoff to accept RestartSettings
  • changed akka.actor.FSM.setTimer => akka.actor.FSM.startTimerAtFixedRate
  • changed akka.actor.Scheduler.schedule => akka.actor.Scheduler.scheduleAtFixedRate
  • updated to the new signature of Source.actorRef
  • replaced deprecated HttpResponse.copy
  • replaced Gzip with Coders.Gzip

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@vrann vrann changed the title Ow 5060 Upgrade to Akka 2.6.11 Feb 13, 2021
@vrann vrann closed this Feb 16, 2021
@vrann vrann reopened this Feb 16, 2021
@vrann vrann closed this Feb 17, 2021
@vrann vrann reopened this Feb 17, 2021
@rabbah
Copy link
Member

rabbah commented Feb 17, 2021

This is failing due to scalafmt error:

Execution failed for task ':tests:checkTestScalafmt'.

> Files incorrectly formatted: /home/travis/build/apache/openwhisk/tests/src/test/scala/common/rest/WskRestOperations.scala,/home/travis/build/apache/openwhisk/tests/src/test/scala/org/apache/openwhisk/common/PrometheusTests.scala,/home/travis/build/apache/openwhisk/tests/src/test/scala/org/apache/openwhisk/test/http/RESTProxy.scala,/home/travis/build/apache/openwhisk/tests/src/test/scala/org/apache/openwhisk/core/containerpool/logging/test/DockerToActivationFileLogStoreTests.scala,/home/travis/build/apache/openwhisk/tests/src/test/scala/org/apache/openwhisk/core/database/ArtifactWithFileStorageActivationStoreTests.scala,/home/travis/build/apache/openwhisk/tests/src/test/scala/org/apache/openwhisk/core/database/test/AttachmentSupportTests.scala

@bdoyle0182
Copy link
Contributor

bdoyle0182 commented Feb 19, 2021

Thanks for the contribution. I have a couple of thoughts / experiences with akka 2.6. We eventually undoubtedly need to upgrade this project to akka 2.6.

  1. There are a lot of hidden breaking changes or changes in behavior when upgrading live clusters. For example akka http needs an intermediary upgrade to a specific patch version of 2.6.x before you can upgrade to the latest 2.6.x on a live cluster.
  2. This is less for this mr, but in akka 2.6 the untyped fsm is deprectated and will be removed after the 2.6 branch. So much of this project is built on the untyped fsm. We're going to have to make a dedicated effort at some point to migrate the fsm's to typed.
  3. Naver is in the process of open sourcing the new scheduler, which is a major rearchitecting of the project. These changes need to be grandfathered in to untyped / we should wait to upgrade to 2.6 until we finish the scheduler.

From my own team's experience with our own services, upgrading to akka 2.6 ended up being quite the headache that was a month's long process of learning little nuances and changes in behavior. I can try to collect all of those things and communicate that knowledge with the community, hopefully that can make the process smooth. The upgrade was valuable with the new features and improvements akka 2.6 offers even with the problems we faced.

We need to do this upgrade at some point to keep up with akka, but there is going to need to be more community involvement and discussion on how we want to make this happen rather than just merging the one mr.

@bdoyle0182
Copy link
Contributor

Oh also akka 2.6.11 has a critical memory leak in akka streams so it's unusable.

@vrann
Copy link
Contributor Author

vrann commented Feb 20, 2021

@bdoyle0182 Thanks for the feedback!
I would definitely support more community discussion around the upgrade. I'm new to this community and with this PR wanted to put something which can help to bring the discussion forward. What do you see can be the next steps?

From our side, we are working on the new feature built with akka typed, which we would like eventually contribute back to this project as well. So right now there are few choices, either a) to downgrade it to akka classic, or b) to upgrade openwhisk to akka 2.6 or c) to keep our own fork. I can tell that the option b) is my favorite and I would be glad to help to move it forward as far as I can.

Regarding your comments:

  1. It would be very valuable to learn the problems with upgrade you faced. That should probably be part of the documentation which goes with this PR?
  2. I was planning to volunteer some time to an effort of FSM upgrade. If there is an interest in community for this work I can collaborate. I believe it is as valuable as just the akka upgrade.
  3. To the best of my knowledge, everything written with the classic akka would still work with the akka typed. Would it be logistically easier to keep these two activities separate and then upgrade scheduler on typed at some point? Are there any other changes in scheduler which relies on 2.5? Or is it just a matter in which order we merge these two changes?
  4. Regarding 2.6.11 noted. I will start upgrading to 2.6.12

@ddragosd
Copy link
Contributor

@bdoyle0182 thanks for the insights you provided regarding the next Akka version. I wasn’t aware of all these details and I’m glad you pointed them out !

To the 3rd point we’re also planning to contribute an add-on to the existing system that improves Web Actions invocation considerably. This PR is a prerequisite for that.

3.a. Is there a reference to the work you mentioned that we could include in this conversation ?

3.b. I’d love to find a way to not block other contributions from the community. The more we iterate on ideas, and the faster we test them out, the faster we’ll learn how to improve the current solution.

@vrann vrann closed this Feb 22, 2021
@vrann vrann reopened this Feb 22, 2021
@codecov-io
Copy link

codecov-io commented Feb 22, 2021

Codecov Report

Merging #5065 (eb23c2a) into master (1753946) will decrease coverage by 7.63%.
The diff coverage is 69.41%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5065      +/-   ##
==========================================
- Coverage   82.80%   75.17%   -7.64%     
==========================================
  Files         207      211       +4     
  Lines       10034    10211     +177     
  Branches      444      435       -9     
==========================================
- Hits         8309     7676     -633     
- Misses       1725     2535     +810     
Impacted Files Coverage Δ
.../containerpool/logging/ElasticSearchLogStore.scala 92.85% <ø> (-0.25%) ⬇️
...sk/core/containerpool/logging/SplunkLogStore.scala 81.81% <0.00%> (-3.37%) ⬇️
...ache/openwhisk/core/database/ActivationStore.scala 92.30% <ø> (ø)
...penwhisk/core/database/ArtifactStoreProvider.scala 66.66% <ø> (ø)
...he/openwhisk/core/database/AttachmentSupport.scala 100.00% <ø> (ø)
...che/openwhisk/core/database/CouchDbRestStore.scala 71.21% <ø> (ø)
...ore/database/azblob/AzureBlobAttachmentStore.scala 11.53% <ø> (-60.58%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0.00% <ø> (-95.85%) ⬇️
...abase/cosmosdb/CosmosDBArtifactStoreProvider.scala 7.69% <ø> (-50.00%) ⬇️
...e/elasticsearch/ElasticSearchActivationStore.scala 84.66% <ø> (ø)
... and 73 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1753946...eb23c2a. Read the comment docs.

@vrann vrann closed this Mar 3, 2021
@vrann vrann reopened this Mar 3, 2021
@vrann vrann closed this Mar 3, 2021
@vrann vrann reopened this Mar 3, 2021
Copy link
Member

@rabbah rabbah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks for taking this on.
Couple of small nits/questions.
If this is getting stuck in travis ping me I'll check the failed stage.

val engine = context.createSSLEngine(host, port)
engine.setUseClientMode(true)
// WARNING: this creates an SSL Engine without enabling endpoint identification/verification procedures
// Disabling host name verification is a very bad idea, please don't unless you have a very good reason to.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment why it is ok here, given the warning in the comment? ie when is withDisableHostnameVerification set and ok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment and the implementation is taken from the akka 2.6.x documentation on how to disable host name verification for ssl. This is what openwhisk was doing in couple of places including tests, but having the old SSLConfig class deprecated it had to be re-implemented. So I figured it, I'd rather keep comment to discourage using this approach anywhere in production.

Do you suggest adding more explanation on why is it needed to the code? Sure, I can do that.

Copy link
Member

@rabbah rabbah Mar 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - thanks for the added clarification. I think this was/is needed for testing with self-signed certs.

@vrann
Copy link
Contributor Author

vrann commented Mar 4, 2021

@rabbah Thank you for the review! I do have problems with Travis, right now there are couple of tests failing randomly. One in the standalone test suite, one in system tests and one in the unit tests. I'm not sure how common is it for this build, it doesn't look like the reason of the failures in the PR. But if it was stable before then definitely worth investigating more.

@vrann vrann changed the title Upgrade to Akka 2.6.11 Upgrade to Akka 2.6.12 Mar 4, 2021
@vrann
Copy link
Contributor Author

vrann commented Mar 4, 2021

@bdoyle0182 @rabbah I have few more questions regarding the upgrade.

  1. Upgrade to 2.6.x brings also the usage of artery for the networking. Per akka documentation, when upgrading live systems, if you can't stop the cluster, upgrade should happen in two stages: first -- upgrade to 2.6 with Artery disabled and 2nd -- enable artery. I think this is what you @bdoyle0182 also mentioned? How do we want to proceed with it in this PR? I can see one option is to have first PR with artery disabled released in the minor version and then another PR which enables artery to release in patch version. In this way you can perform two-step upgrade on your end. But looking into release history of the openwhisk I'm not sure what is the general strategy towards releases, would it be really something community would use.
  2. I have strange issue with the Java Flight Recorder. It complains for the missing class in classpath -- NoClassDefFoundError: jdk.jfr.Event. I disabled flight recording in the akka and now it works, but would appreciate any comment on the better fix. I understand that JFR being proprietary tool might not exist on my image, but why would it then be expected to be present in the classpath by akka?

@vrann vrann marked this pull request as ready for review March 9, 2021 15:38
@vrann vrann force-pushed the ow-5060 branch 3 times, most recently from 5abf9d6 to 8efcab3 Compare April 20, 2021 12:05
@vrann vrann closed this Apr 20, 2021
@vrann vrann reopened this Apr 20, 2021
@codecov-commenter
Copy link

codecov-commenter commented Apr 21, 2021

Codecov Report

Merging #5065 (923af49) into master (4ec5d96) will increase coverage by 34.95%.
The diff coverage is 72.04%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #5065       +/-   ##
===========================================
+ Coverage   41.02%   75.97%   +34.95%     
===========================================
  Files         226      226               
  Lines       11470    11463        -7     
  Branches      491      496        +5     
===========================================
+ Hits         4705     8709     +4004     
+ Misses       6765     2754     -4011     
Impacted Files Coverage Δ
.../containerpool/logging/ElasticSearchLogStore.scala 92.85% <ø> (+92.85%) ⬆️
...sk/core/containerpool/logging/SplunkLogStore.scala 81.81% <0.00%> (+81.81%) ⬆️
...ache/openwhisk/core/database/ActivationStore.scala 92.30% <ø> (ø)
...penwhisk/core/database/ArtifactStoreProvider.scala 66.66% <ø> (ø)
...he/openwhisk/core/database/AttachmentSupport.scala 100.00% <ø> (+26.31%) ⬆️
...che/openwhisk/core/database/CouchDbRestStore.scala 71.21% <ø> (+18.68%) ⬆️
...ore/database/azblob/AzureBlobAttachmentStore.scala 11.53% <ø> (+11.53%) ⬆️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0.00% <ø> (ø)
...abase/cosmosdb/CosmosDBArtifactStoreProvider.scala 7.69% <ø> (+7.69%) ⬆️
...e/elasticsearch/ElasticSearchActivationStore.scala 84.66% <ø> (+23.92%) ⬆️
... and 181 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4ec5d96...923af49. Read the comment docs.

@vrann vrann closed this Apr 21, 2021
@vrann vrann reopened this Apr 21, 2021
@vrann vrann closed this Apr 23, 2021
@vrann vrann reopened this Apr 23, 2021
@bdoyle0182
Copy link
Contributor

@vrann

  1. If you're upgrading to akka 2.6, you first need to upgrade to akka 2.6.2-2.6.4 to perform a rolling restart upgrade for cluster remoting serialization otherwise you need to do a new cluster to go directly to 2.6.5 or later. This is independent of the artery change from classic remoting. The artery stuff shouldn't be blocking we can upgrade that once we're on 2.6. You can then upgrade to latest patch version from there. Explained here:
    "Rolling updates are possible without shutting down all nodes of the Akka Cluster, but will require configuration adjustments as described in the Remoting section of this migration guide. Due to the changed serialization of the Cluster messages in Akka 2.6.2 a rolling update from 2.5.x must first be made to Akka 2.6.2 and then a second rolling update can change to Akka 2.6.3 or later."
    https://doc.akka.io/docs/akka/current/project/rolling-update.html#2-6-2-clustermessageserializer-manifests-change
  2. However, there is an infinite loop tls bug in earlier versions of akka 2.6 that would crash our service. So we should have a commit for 2.6.2-2.6.4 and a commit with latest so people can immediately upgrade to latest once they've done the rolling restart to 2.6.2.
  3. There's a new automatic passivation feature that runs by default that has caused us trouble where an fsm is stopped if it doesn't receive a message after some time. This caused us some problems so we had to update the config so the timeout was large enough it would never passivate. However, I think openwhisk does not use cluster sharding anywhere for its fsm's so I don't think we have to worry about this

@vrann vrann force-pushed the ow-5060 branch 3 times, most recently from e2584e5 to b611036 Compare May 28, 2021 15:08
@vrann vrann closed this May 28, 2021
@vrann vrann reopened this May 28, 2021
- removed usage implicit custom ActorMaterializer
- upgraded to Akka Http 10.2.3
- changed the way how Https set up when host name verification is disabled
- changed RestartSink.withBackoff to accept RestartSettings
- changed akka.actor.FSM.setTimer => akka.actor.FSM.startTimerAtFixedRate
- changed akka.actor.Scheduler.schedule => akka.actor.Scheduler.scheduleAtFixedRate
- updated to the new signature of Source.actorRef
- replaced deprecated HttpResponse.copy
- replaced Gzip with Coders.Gzip
- fixed the scalafmt in tests
- removed the implicit Materializer shutdown considering its lifecycle is managed by akka
- increased the ansible wait time
- switched to classic networking
- disabled JFR
- merging with New Scheduler
- increased wait time for FSM due to intermittent failures in build
@rabbah
Copy link
Member

rabbah commented Jun 3, 2021

@bdoyle0182 Are you suggesting a version of this PR with akka version 2.6.2 instead and a subsequent PR that bumps the version to 2.6.12?

@ddragosd
Copy link
Contributor

ddragosd commented Jun 3, 2021

@rabbah I think the suggestion is dependent on the way OpenWhisk is being operated / upgraded. I'm worried that if we wait longer, we're gonna have more conflicts to fix in this PR, depending on changes in master.

@rabbah
Copy link
Member

rabbah commented Jun 3, 2021

Thanks Dragos - I agree. I think one approach for operators is to change the version in the operator's fork or we can create a branch that downgrade the akka version. I don't expect that we'll cut a release with the transition version and then another with the new version.

I reviewed the PR again since it was a while already since I last looked at these changes. I don't have any reservations.

@ddragosd
Copy link
Contributor

ddragosd commented Jun 4, 2021

@rabbah thanks for reviewing this PR again !

@bdoyle0182 I'm merging this with the assumption that if you need to transition through intermediate Akka versions you can branch this, change the Akka version, rebuild, deploy, and repeat until you reach the Akka version in this PR. Please keep us posted on how this goes and if you encounter any issues along the way.

@ddragosd ddragosd merged commit 7791e4b into apache:master Jun 4, 2021
@ddragosd ddragosd mentioned this pull request Jun 4, 2021
@ningyougang ningyougang mentioned this pull request Jun 7, 2021
22 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants