
Refactor nodes orchestration to rely on StatefulSets #1463

Merged
merged 43 commits into from
Aug 7, 2019

Conversation

sebgl
Contributor

@sebgl sebgl commented Aug 2, 2019

Merge the statefulset-refactoring feature branch into master.

Change the way Elasticsearch nodes are orchestrated to rely on StatefulSets instead of managing pods.

The implementation is not complete yet. In particular:

  • it ignores any change budget considerations
  • if a StatefulSet needs to be replaced, the entire new sset will be added (with all replicas) before the old one is shut down (budget ignored)
  • it does not support grow and shrink, but does rolling upgrades by default
  • it does not remove PVCs, which can be a problem when creating and deleting the same stack multiple times: data volumes from the first stack will be reused by the second one
  • it probably has some orchestration bugs :)
  • it is missing a bit of refactoring and unit testing

However, it seems to be good enough to spawn stacks, apply any mutation, and run E2E tests.

Relates #1299.

nkvoll and others added 25 commits July 5, 2019 16:15
This facilitates a move to StatefulSets, where the mounted secrets must be the same across
all the Pods in the same StatefulSet.
Store transport keys and certificates in a single shared secret.
…1205)

Build a set of StatefulSets matching the ES CRD NodeSpec, and reconcile them instead of reconciling pods.

This is a very first iteration.

Bootstrapping an ES cluster seems to work fine. Upgrading this cluster (and everything related to zen1/zen2) does not.

Main bits of this change:

- Require the name field in the NodeSpec, to derive the StatefulSet name
- Build a StatefulSet per NodeSpec, along with a headless service
- Label the StatefulSet with a hash of its template, for comparison purposes
- Create a single Config secret per StatefulSet (instead of per pod)
- Label the pod template with a hash of the config (before zen2 initial master nodes), to rotate pods on config changes
- Reconcile StatefulSet, service and config in the default driver, commenting out most of the existing pod reconciliation logic

The code is left in a rather dirty state: I had to make a few changes deep down in functions that build the pod spec and config. These changes bubble up pretty high, to changes calculations and the default driver.
Most of the default driver code and the mutation package is probably doomed. I chose to keep it commented out instead of deleting it for now, since I think it may help in the short-term refactoring ("where does that logic live in the new code? here? OK, I can remove it."). Let me know if it deserves more cleanup.

To be refactored and improved in follow-up commits.
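The template-hash labeling idea above can be sketched as follows; this is a minimal illustration, and `templateHash` is a hypothetical helper, not the operator's actual code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// templateHash returns a deterministic hash of a serialized pod template,
// usable as a label value: if the hash stored on the StatefulSet differs
// from the hash of the desired template, the sset needs an update.
// Hypothetical helper for illustration only.
func templateHash(serializedTemplate string) string {
	h := fnv.New32a()
	h.Write([]byte(serializedTemplate)) // Write on a hash never returns an error
	return fmt.Sprint(h.Sum32())
}

func main() {
	current := templateHash(`{"image":"elasticsearch:7.2.0"}`)
	desired := templateHash(`{"image":"elasticsearch:7.3.0"}`)
	fmt.Println(current != desired) // true: template changed, reconcile needed
}
```

Comparing label hashes avoids doing a deep semantic diff of two StatefulSet specs on every reconciliation.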
Add support for scaling a StatefulSet up (just update sset replicas) or
down (migrate data away, then remove nodes when ready). This also adds
support for renaming a StatefulSet, by creating the new sset, then
slowly removing the existing one with data migration.

It ignores any zen1/zen2 consideration for now (works fine with zen2 in most
cases).
It ignores any change budget consideration for now.
Unit tests and E2E tests are missing, but a refactoring is to be
expected to handle zen1/zen2 and the changeBudget. I'd prefer to wait
for this refactoring to happen before dealing with large unit testing.
Consider this still a work in progress.

The PR also modifies the way pods and ssets are watched by the Elasticsearch controller.
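The up/down decision described above can be sketched roughly like this; `scaleDecision` and its return values are hypothetical names for illustration, not the operator's actual API:

```go
package main

import "fmt"

// scaleDecision says what to do with a StatefulSet given its actual vs
// desired replica counts. Scaling up is just a replicas bump; scaling
// down first migrates data away from the nodes to remove.
// Hypothetical sketch of the logic described above.
func scaleDecision(actual, desired int32) string {
	switch {
	case desired > actual:
		return "scale-up" // just update sset replicas
	case desired < actual:
		return "migrate-then-scale-down" // migrate data away, then remove nodes
	default:
		return "noop"
	}
}

func main() {
	fmt.Println(scaleDecision(3, 5)) // scale-up
	fmt.Println(scaleDecision(5, 3)) // migrate-then-scale-down
	fmt.Println(scaleDecision(3, 3)) // noop
}
```

Renaming an sset composes the two cases: the new sset is scaled up while the old one is slowly scaled down to zero.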
* Support for APM server configuration (#1181)

* Add a config section to the APM server configuration

* APM: Add support for keystore

* Factorize ElasticsearchAuthSettings

* Update dev setup doc + fix GKE bootstrap script (#1203)

* Update dev setup doc + fix GKE bootstrap script

* Update wording of container registry authentication

* Ensure disks removal after removing cluster in GKE (#1163)

* Update gke-cluster.sh

* Implement cleanup for unused disks in GCP

* Update Makefile

* Update CI jobs to do proper cleanup

* Normalize the raw config when creating canonical configs (#1208)

This aims at counteracting the difference between JSON-centric serialization and the use of YAML as the serialization format in the canonical config. Without normalization, numeric values
like 1 will differ when comparing configs, as JSON deserializes integer numbers to float64 and YAML to uint64.
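The normalization idea can be sketched with a JSON round-trip, which coerces every numeric value to float64 regardless of the parser it came from; this is a minimal sketch of the concept, not the operator's actual implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// normalize round-trips a config map through JSON so that all numeric
// values end up as float64, whether they were produced by a YAML parser
// (uint64) or a JSON parser (float64). Sketch for illustration only.
func normalize(cfg map[string]interface{}) (map[string]interface{}, error) {
	raw, err := json.Marshal(cfg)
	if err != nil {
		return nil, err
	}
	var out map[string]interface{}
	err = json.Unmarshal(raw, &out)
	return out, err
}

func main() {
	fromYAML := map[string]interface{}{"replicas": uint64(1)}  // YAML-style integer
	fromJSON := map[string]interface{}{"replicas": float64(1)} // JSON-style number
	a, _ := normalize(fromYAML)
	b, _ := normalize(fromJSON)
	fmt.Println(a["replicas"] == b["replicas"]) // true after normalization
}
```

Without the round-trip, `uint64(1) == float64(1)` compares as unequal interface values, so semantically identical configs would be flagged as different.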

* Homogenize logs (#1168)

* Don't run tests if only docs are changed (#1216)

* Update Jenkinsfile

* Simplify notOnlyDocs()

* Update Jenkinsfile

* Push snapshot ECK release on successful PR build (#1184)

* Update makefile's to support snapshots

* Add snapshot releases to Jenkins pipelines

* Cleanup

* Rename RELEASE to USE_ELASTIC_DOCKER_REGISTRY

* Update Jenkinsfile

* Add a note on EKS inbound traffic & validating webhook (#1211)

EKS users must explicitly allow inbound traffic from the k8s control
plane to the nodes on port 443 in order for the control plane to reach the
validating webhook.

Should help with #896.

* Update PodSpec with Hostname from PVC when re-using (#1204)

* Bind the Debug HTTP server to localhost by default (#1220)

* Run e2e tests against custom Docker image (#1135)

* Add implementation

* Update makefile's

* Update Makefile

* Rename Jenkisnfile

* Fix review comments

* Update e2e-custom.yml

* Update e2e-custom.yml

* Return deploy-all-in-one to normal

* Delete GKE cluster only if changes not in docs (#1223)

* Add operator version to resources (#1224)

* Warn if unsupported distribution (#1228)

The operator only works with the official ES distributions, which enable the security
features available with the basic (free), gold and platinum licenses, in order to ensure that
all clusters launched are secured by default.

A check is done in the prepare-fs script by looking for the existence of the
Elastic License. If it is not present, the script exits with a custom exit code.

The ES reconciliation loop then sends an event of type warning if it detects that
a prepare-fs init container terminated with this exit code.

* Document Elasticsearch update strategy change budget & groups (#1210)

Add documentation for the `updateStrategy` section of the Elasticsearch
spec.

It documents how (and why) `changeBudget` and `groups` are used by ECK,
and how both settings can be specified by the user.
Handle cluster rolling upgrades with StatefulSets.

Relies on manually updating the rollingUpdate.Partition index of a sset when we're ready to restart the ES node.
For now, it only handles one node restart at a time, mostly because we rely on cluster health green checks.
It also introduces 2 additional concepts:

* StatefulSet update expectations, based on UID + Generation: there are certain ES calls we don't want to make if our StatefulSet cache is out of date.
* "Lazy" ES client calls: some ES calls we make to handle the upgrade cannot rely on out-of-date information from observers. We need fresh information on the state of the cluster before e.g. re-enabling shard allocations. However, we don't want to perform these calls at every single reconciliation. I see this as a stop-gap implementation before we go on with a larger refactoring of sync/async ES calls.

This is still WIP since many things are still missing:

* changeBudget considerations.
* zen1/zen2 handling.
* Unit tests, everywhere. I still think a large portion of this code may be refactored in the short term to account for the change budget & zen1/zen2. Keeping unit tests for a later PR.
* E2E tests, kept for a later PR.
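The partition-driven, one-node-at-a-time mechanism can be sketched as follows; `nextPartition` is a hypothetical helper illustrating the idea, not the operator's actual code:

```go
package main

import "fmt"

// nextPartition returns the rollingUpdate.Partition value to set once the
// pod at the current partition ordinal has been upgraded and the cluster
// is green again. Pods with an ordinal >= Partition get the new template,
// so decrementing by one allows exactly one more pod to restart.
// Hypothetical helper for illustration only.
func nextPartition(current int32, clusterGreen bool) int32 {
	if !clusterGreen || current == 0 {
		return current // hold position, or rollout already complete
	}
	return current - 1
}

func main() {
	fmt.Println(nextPartition(3, true))  // 2: allow the next pod to restart
	fmt.Println(nextPartition(3, false)) // 3: wait for the cluster to be green
	fmt.Println(nextPartition(0, true))  // 0: all pods upgraded
}
```

Starting the partition at the replica count and walking it down to zero is what turns the StatefulSet's rolling update into a controlled, health-gated restart.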
Merge master - with conflicts fixed
Merge master into statefulset-refactoring
* Use the setvmmaxmapcount initcontainer by default in E2E tests (#1300)

Let's keep our default defaults :)

The setting is disabled explicitly for E2E tests where we enable a
restricted security context.

* Add docs for plugins, custom configuration files and secure settings (#1298)

* Allow license secret webhook to fail (#1301)

Webhooks on core k8s objects are just too debilitating if our
webhook service fails. This sets the failure policy for the secret
webhook to ignore, to strike a balance between UX (immediate feedback)
and keeping the user's k8s cluster in a working state. We also run
additional validation at the controller level, so this does not allow
circumventing our validation logic.

* Revert "Use the setvmmaxmapcount initcontainer by default in E2E tests (#1300)" (#1302)

This reverts commit fff1526.
This commit is breaking our E2E tests chain, which deploys a
PodSecurityPolicy by default. Any privileged init container will not
work.

I'll open an issue for a longer-term fix to properly handle this.

* Update quickstart (#1307)

* Update the name of the secret for the elastic user
* Bump the Elastic Stack version from 7.1.0 to 7.2.0

* Change Kibana readiness endpoint to return a 200 OK (#1309)

The previous endpoint returned HTTP code 302. While this is fine for
Kubernetes, some derived systems like GCP LoadBalancers mimic the
container readiness check for their own readiness check, and GCP
LoadBalancers only work with status 200.

It's not up to us to adapt GCP LoadBalancers to K8s, but this is a
fairly trivial fix.

* Fix pod_forwarder to support two part DNS names, adjust e2e http_client (#1297)

* Fix pod_forwarder to support two part DNS names, adjust e2e http_client url

* Revert removing .svc in e2e http_client

* [DOC] Resources management and volume claim template (#1252)

* Add resources and persistent volume templates documentation

* Ignore resources reconciled by older controllers (#1286)

* Document PodDisruptionBudget section of the ES spec (#1306)

* Document PodDisruptionBudget section of the ES spec

I suspect this might slightly change in the future depending on how we
handle the readiness check, so I'm keeping this doc minimal for now:

* what is a PDB, briefly (with a link)
* default PDB we apply
* how to set a different PDB
* how to disable the default PDB

* Move version out from Makefile (#1312)

* Add release note generation tool (#1314)

* no external dependencies
* inspects PRs by version label
* generates structured release notes in asciidoc grouped by type label

* Add console output to standalone apm sample (#1321)

* Update Quickstart to 0.9.0 (#1317)

* Update doc (#1319)

* Update persistent storage section
* Update kibana localhost url to use https
* Update k8s resources names in accessing-services doc
* Mention SSL browser warning
* Fix bulleted list

* Add CI job for nightly builds (#1248)

* Move version to a file

* Add CI implementation

* Update VERSION

* Depend on another PR for moving out version from Makefile

* Update Jenkinsfile

* Don't build and push operator image in bootstrap-gke (#1332)

We don't need to do that anymore, since we don't use an init container
based on the operator image.

* Remove Docker image publishing from devops-ci (#1339)

* Suppress output of certain commands from Makefile (#1342)

* Document how to disable TLS (#1341)

* Use new credentials for Docker registry (#1346)

* Workaround controller-runtime webhook upsert bug (#1337)

* Fix docs build on PR job (#1351)

* Fix docs build on PR job

* Cleanup workspace before doing other steps

* APM: remove "output" element and add elasticsearchRef (#1345)

* Don't rely on buggy metaObject Kind (#1324)

* Don't rely on buggy metaObject Kind

A bug in our client implementation may clear the object's Kind on
certain scenarios. See
kubernetes-sigs/controller-runtime#406.

Let's avoid that by fixing a constant Kind returned by a method call on
the resource.
Merge master into statefulset-refactoring
Change the way we setup zen1 and zen2 settings to adapt to
StatefulSets upscale and downscale.
Also improve StatefulSet downscale code, which is moved to its own file,
and add a few helper functions around pods and master nodes.

In a nutshell:

* set minimum_master_nodes (zen1) and initial_master_nodes (zen2) in the configuration of new nodes to create, corresponding to the expected set of nodes. The setting is automagically updated in the configuration of existing nodes, where it is ignored until those nodes restart.
* update minimum_master_nodes (zen1) and voting_config_exclusions (zen2) whenever we remove a node.

For now, we don't (yet) make any change on rolling upgrades, which are limited to a single node at a time at the moment. This only impacts sset scale up/down.
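For reference, the zen1 minimum_master_nodes value is a strict majority of master-eligible nodes, n/2 + 1. A tiny sketch of the computation (the helper name is hypothetical):

```go
package main

import "fmt"

// minimumMasterNodes computes the zen1 quorum for a given number of
// master-eligible nodes: a strict majority, n/2 + 1. Setting this
// correctly on scale up/down is what prevents split-brain in zen1.
func minimumMasterNodes(masterEligible int) int {
	return masterEligible/2 + 1
}

func main() {
	fmt.Println(minimumMasterNodes(3)) // 2
	fmt.Println(minimumMasterNodes(5)) // 3
}
```

zen2 (ES 7+) manages the voting configuration itself, which is why the downscale path only needs voting_config_exclusions instead.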
* Add mandatory NodeSpec name

* Rename apm samples resources to something unique

Otherwise, volumes of resources from other tests are reused for this
one. Should be solved once we remove PVCs in
#1288.
Refactor the main driver and ES pod creation logic, towards simplification of the codebase.

Main changes:

* StatefulSet and PodSpec creation logic lives in the NodeSpec package
* No more catch-all version/common.go. The only difference between versions for now is explicitly zen1 vs. zen2.
* No more pod package (with conflicting variable/package name)
* No more nested indirection from driver -> defaultDriver -> NewPodSpecParams -> podSpec(func, func, func, func) -> buildPodTemplateSpec(func, func, func, func)
* Default driver code attempts to be minimalistic (while still being a sequential list of function calls)
* No more "generic resources", since that was only used for the external service anyway
* Remove deprecated code in the mutation, pvc and pod packages (a lot of the comparison/CalculateChanges/PVC reuse/PodWithConfig/PodSpecContext mechanism we had before)


* Move supported version check from driver to version pkg

* Simplify driver creation

* Extract driver nodes reconciliation to its own file

* Move expectations to the common pkg and add unit tests

* Move configmap reconciliation to its own file

* Move DynamicEnvVars to defaults pkg and use it in more places

* Extract es env vars away from the version pkg

* Move keystore init container params away from the main driver

* Move sset build logic into the nodespec pkg

* Move readiness probe in the nodespec pkg

* Move some es pod consts into the nodespec pkg

* Refactor pod template build

* Remove concept of generic resources since used for a single service

* Remove deprecated pod creation + PVC reuse code

* Remove most of the mutation package (RIP)

* Remove the PVC package (RIP)

* Remove PodWithConfig and PodSpecContext (yay!)

* Get rid of the pod pkg (yay!)

* Fix supported versions initialization

* Add missing license header

* Remove deprecated pv reuse e2e test

* Remove a bunch of unused functions

* Make linter happy with appendAssign

* Fix downward env defaults in apm unit tests

* Move PodsByName to the utils pkg

* Fix typo

* Fix reference to the keystore initcontainer and add a unit test

* Remove dead code

* Address PR review
@sebgl sebgl mentioned this pull request Aug 2, 2019
@sebgl sebgl added >enhancement Enhancement of existing functionality >refactoring labels Aug 2, 2019
@artemnikitin
Member

@elasticmachine, run elasticsearch-ci/docs

sebgl added 4 commits August 5, 2019 09:43
To cleanup resources and avoid any PVC reuse for different E2E stacks,
let's remove PVCs at the end of E2E tests.
This should probably be done by the operator directly, see
#1288.
@sebgl
Contributor Author

sebgl commented Aug 5, 2019

@anyasabo I think the samples close to the code (config/samples) are updated already. Are you referring to examples in our documentation?

Contributor

@david-kow david-kow left a comment


Few comments and questions.

@sebgl sebgl merged commit ab310df into master Aug 7, 2019
@pebrc pebrc removed >enhancement Enhancement of existing functionality >refactoring labels Oct 9, 2019
@sebgl sebgl deleted the statefulset-refactoring branch February 12, 2020 08:46