Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add NetworkUnavailable as node condition #543

Merged

Conversation

rewiko
Copy link
Contributor

@rewiko rewiko commented Oct 22, 2020

What this PR does / why we need it:

Machine-controller-manager should also check for the NetworkUnavailable nodeCondition, some CNI will update this condition depending on the state of the CNI or the network availability.

Which issue(s) this PR fixes:
Fixes ##469

Special notes for your reviewer:

Release note:

NetworkUnavailable nodeCondition added to the example, some CNI will update this condition depending on the state of the CNI or the network availability.
NetworkUnavailable node condition is also considered by default while considering the machine's to be unhealthy.

@rewiko rewiko requested review from ggaurav10 and a team as code owners October 22, 2020 15:35
@gardener-robot
Copy link

@rewiko Thank you for your contribution.

@CLAassistant
Copy link

CLAassistant commented Oct 22, 2020

CLA assistant check
All committers have signed the CLA.

@gardener-robot gardener-robot added needs/review Needs review size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Oct 22, 2020
@gardener-robot-ci-3
Copy link
Contributor

Thank you @rewiko for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.

Copy link
Contributor

@prashanth26 prashanth26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the changes.

/ok-to-test

@gardener-robot gardener-robot added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/review Needs review labels Oct 22, 2020
@hardikdr hardikdr added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 23, 2020
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Oct 23, 2020
@prashanth26
Copy link
Contributor

Can you please fix the unit tests which are failing by running make test-unit?

Below are truncated test results.

coverage: 0.0% of statements
[1603453479] Machine Controller Suite - 97/97 specs E1023 11:45:57.626688   24279 machine_safety.go:159] SafetyController: Unable to GET on node objects APIServer is Not Reachable
•••E1023 11:45:57.937542   24279 machine_safety.go:151] SafetyController: Unable to GET on machine objects APIServer is Not Reachable
••••••••E1023 11:45:58.860629   24279 drain.go:178] Drain Error: Cordoning of node failed with error: Failed to update node
••E1023 11:45:58.963529   24279 machine_safety.go:151] SafetyController: Unable to GET on machine objects APIServer is Not Reachable
••E1023 11:45:59.178833   24279 machine_safety.go:159] SafetyController: Unable to GET on node objects APIServer is Not Reachable
••E1023 11:45:59.282346   24279 machine_safety.go:151] SafetyController: Unable to GET on machine objects APIServer is Not Reachable
•••••••••••E1023 11:46:00.817126   24279 machine_safety.go:151] SafetyController: Unable to GET on machine objects APIServer is Not Reachable
•••••••E1023 11:46:01.634524   24279 machine_safety.go:151] SafetyController: Unable to GET on machine objects APIServer is Not Reachable
•E1023 11:46:01.737122   24279 drain.go:178] Drain Error: Cordoning of node failed with error: Failed to update node
•••••••••••E1023 11:46:02.964922   24279 machine_safety.go:151] SafetyController: Unable to GET on machine objects APIServer is Not Reachable
•••E1023 11:46:03.275550   24279 machine_safety.go:151] SafetyController: Unable to GET on machine objects APIServer is Not Reachable
[BeforeEach] #isHealthy
  /tmp/build/a94a8fe5/pull-request-gardener_machine-controller-manager-pr_master/pkg/util/provider/machinecontroller/machine_test.go:50
[It] with NodeNetworkUnavailable is Unknown
  /tmp/build/a94a8fe5/pull-request-gardener_machine-controller-manager-pr_master/vendor/github.com/onsi/ginkgo/extensions/table/table_entry.go:46

------------------------------
• Failure [0.000 seconds]
machine #isHealthy Checking health of the machine [It] with NodeNetworkUnavailable is Unknown 
/tmp/build/a94a8fe5/pull-request-gardener_machine-controller-manager-pr_master/vendor/github.com/onsi/ginkgo/extensions/table/table_entry.go:46

  Expected
      <bool>: false
  to be identical to
      <bool>: true

  /tmp/build/a94a8fe5/pull-request-gardener_machine-controller-manager-pr_master/pkg/util/provider/machinecontroller/machine_test.go:99
------------------------------
•E1023 11:46:03.581375   24279 machine_util.go:898] Drain failed due to failure in update of node conditions: failed to create update conditions for node "fakeNode-0": Failed to update node
••E1023 11:46:03.683923   24279 machine_safety.go:151] SafetyController: Unable to GET on machine objects APIServer is Not Reachable
••••[BeforeEach] #isHealthy
  /tmp/build/a94a8fe5/pull-request-gardener_machine-controller-manager-pr_master/pkg/util/provider/machinecontroller/machine_test.go:50
[It] with NodeNetworkUnavailable is True
  /tmp/build/a94a8fe5/pull-request-gardener_machine-controller-manager-pr_master/vendor/github.com/onsi/ginkgo/extensions/table/table_entry.go:46

------------------------------
• Failure [0.000 seconds]
machine #isHealthy Checking health of the machine [It] with NodeNetworkUnavailable is True 
/tmp/build/a94a8fe5/pull-request-gardener_machine-controller-manager-pr_master/vendor/github.com/onsi/ginkgo/extensions/table/table_entry.go:46

  Expected
      <bool>: false
  to be identical to
      <bool>: true

  /tmp/build/a94a8fe5/pull-request-gardener_machine-controller-manager-pr_master/pkg/util/provider/machinecontroller/machine_test.go:99
------------------------------
••••E1023 11:46:04.704045   24279 machine_safety.go:151] SafetyController: Unable to GET on machine objects APIServer is Not Reachable
•E1023 11:46:04.806615   24279 machine_safety.go:151] SafetyController: Unable to GET on machine objects APIServer is Not Reachable
•E1023 11:46:04.909206   24279 machine_safety.go:159] SafetyController: Unable to GET on node objects APIServer is Not Reachable
•••••E1023 11:46:05.621702   24279 machine_safety.go:151] SafetyController: Unable to GET on machine objects APIServer is Not Reachable
••••••••••••••••E1023 11:46:07.553256   24279 machine_safety.go:159] SafetyController: Unable to GET on node objects APIServer is Not Reachable
•••E1023 11:46:08.164128   24279 machine_safety.go:159] SafetyController: Unable to GET on node objects APIServer is Not Reachable
••••••E1023 11:46:08.676796   24279 machine_safety.go:151] SafetyController: Unable to GET on machine objects APIServer is Not Reachable
E1023 11:46:08.779353   24279 machine_safety.go:159] SafetyController: Unable to GET on node objects APIServer is Not Reachable
••

Summarizing 2 Failures:

[Fail] machine #isHealthy Checking health of the machine [It] with NodeNetworkUnavailable is Unknown 
/tmp/build/a94a8fe5/pull-request-gardener_machine-controller-manager-pr_master/pkg/util/provider/machinecontroller/machine_test.go:99

[Fail] machine #isHealthy Checking health of the machine [It] with NodeNetworkUnavailable is True 
/tmp/build/a94a8fe5/pull-request-gardener_machine-controller-manager-pr_master/pkg/util/provider/machinecontroller/machine_test.go:99

Ran 97 of 97 Specs in 11.458 seconds
FAIL! -- 95 Passed | 2 Failed | 0 Pending | 0 Skipped
--- FAIL: TestMachineControllerSuite (11.46s)
FAIL
coverage: 42.4% of statements
path is /tmp/build/a94a8fe5/pull-request-gardener_machine-controller-manager-pr_master/test/output
Unable to read coverage file  to combine, open : no such file or directory

Ginkgo ran 6 suites in 1m29.475338853s
Test Suite Failed

@gardener-robot gardener-robot added size/s Size of pull request is small (see gardener-robot robot/bots/size.py) and removed size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Oct 27, 2020
@rewiko
Copy link
Contributor Author

rewiko commented Oct 27, 2020

@prashanth26 done :)

@prashanth26
Copy link
Contributor

/lgtm
@hardikdr @ggaurav10 - want to have a look?

@gardener-robot gardener-robot added the reviewed/lgtm Has approval for merging label Oct 27, 2020
@prashanth26 prashanth26 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Oct 27, 2020
@gardener-robot-ci-3 gardener-robot-ci-3 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Oct 27, 2020
@prashanth26 prashanth26 merged commit 75a321f into gardener:master Oct 28, 2020
einfachnuralex added a commit to stackitcloud/machine-controller-manager that referenced this pull request Aug 20, 2021
* Prepare Next Dev Cycle v0.35.0-dev

* Wait also for CSI PVs during serial eviction

Signed-off-by: ialidzhikov <i.alidjikov@gmail.com>

* add servergroup option

* [azure] Verify tags

Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>

* Allow deletion of machines where modify instance fails on deletion

* [OOT] Re-enabled support for bootstrap token injection

Signed-off-by: Andrey Klimentyev <andrey.klimentyev@flant.com>

* Azure MachineSet: VM ScaleSet or AvailabilitySet

Add generic MachineSet configuration in AzureMachineClass.
The machineSet field can be used to configure either a VMO or an AvailabilitySet for a MachineClass.

The availabilitySets field will be kept for a while but is now deprecated in favour of the machineset field.
The Azure compute sdk is also updated to support VMSS/VMO.

* Introduce constant backoff while enqueuing machine objects on creation and deletion failures

Co-authored-by: Prashanth <prashanth@sap.com>

* Bugfix: Finalizers are added by default on MC

* Added event handlers for MachCls

Fixed minor typo

* Removed multiple machineClass adds when pointing to same class.kind

* Fix typo in documentation

The documentation was pointing to the wrong branch. This change fixes it.

* Improve documentation with FAQ

Co-authored-by: Prashanth <prashanth@sap.com>
Co-authored-by: Samarth Deyagond <samarth.deyagond@sap.com>

* Enable NetworkUnavailable as node condition for machine health (gardener#543)

Add NetworkUnavailable as node condition to check while the machine is unhealthy

* Removing excessive spaces in FAQ

ToC links in the "Basics" section were broken due to this.

* OOT: Fixes drain failure timeout issues (gardener#548)

* OpenStack driver support to create VMs in a given subnet.

* Adapt integration tests to handle possibly orphaned resources

* Fix the OpenAPI validation of the machine-deployment crd

* Reduced default drain timeout to 2hours

* Bugfix: maxEvictRetries derived from drainTimeout

* OOT: Enqueue machine only when conditions change

* Split retry into Short Medium & Long periods

* Delete the orphan resources

* Backoff on machine creation/deletion faliures.

- Introdcued CrashLoopBackOff on machine creation failure

* Use listers to get the machine-object during reconciliation

* Allow migration to continue when machineClass with same name as ProviderMachineClass is found (gardener#559)

* Release v0.35.0

* Prepare Next Dev Cycle v0.36.0-dev

* Fix Azure machine deletion if resource group is gone

* Add FAQ for paused field

* Set Machine Phase to Terminating & make drain calculations based on deletionTimestamp (gardener#564)

* Fixes issues with machines being force drained

The machines were being force drained during race conditions due to PR gardener#492.
This commit fixes that by
	- Make the drain calculation based on deletion timestamp
	- Reverted logic from PR gardener#492 to first set machine Termination phase

* Update pkg/util/time/time_test.go

Fixed typo

Co-authored-by: Hardik Dodiya <hardik.dodiya@sap.com>

* Update docker images to use latest gcr copy

  - Updated from golang:1.13.5 to eu.gcr.io/gardener-project/3rd/golang:1.15.5
  - Updated from alpine:3.11.2 to eu.gcr.io/gardener-project/3rd/alpine:3.12.1

* Update docs/FAQ.md

Co-authored-by: Dieter Guendisch <dieter.guendisch@sap.com>

* Prevent panic when machine template hash length is < 5

Signed-off-by: ialidzhikov <i.alidjikov@gmail.com>

* In-tree drivers work with credentials data map

They are no longer working with the whole secret because they don't need it. All of them are evaluating the data map only.

* Introduce `.spec.credentialsSecretRef` for in-tree machine classes

* Introduce `.spec.credentialsSecretRef` for out-of-tree machine class

* Support Gardener credentials data keys for Alicloud, AWS, Azure, GCP

* bump aws sdk version to v1.23.13

* Migration note for OOT providers (gardener#581)

* Updated Readme.md

* Release v0.36.0

* Prepare Next Dev Cycle v0.37.0-dev

* Update pull_request_template.md (gardener#582)

Signed-off-by: ialidzhikov <i.alidjikov@gmail.com>

* Correcting the variable substitution

* docforge manifest (gardener#587)

* Check for misconfigured PDBs on Node drain and set proper error description

Signed-off-by: ialidzhikov <i.alidjikov@gmail.com>

* Fix for issue gardener#583 with UT

* Release v0.37.0

* Prepare next Dev Cycle v0.38.0-dev

* add component-descriptor-trait to head-update job

* Update eu.gcr.io/gardener-project/cc/job-image-golang to 0.12.0

* Update eu.gcr.io/gardener-project/cc/job-image-golang to 0.10.0

* Update eu.gcr.io/gardener-project/cc/job-image-golang to 0.11.0

* Run update before fetching on CI image (gardener#596)

* Improved NIC creation and deletion logic

- NIC creation now, adopts any existing NICs with matching name
- NIC deletion confirms deletion by performing GET on deletion
- Improved creation logs to give more details
- Revendored additional libraries

* Release v0.38.0

* Prepare next Dev Cycle v0.39.0-dev

* Updated links to various MCM OOT providers

* Fixed minor typo

* Updated link for metal stack

* skip drain readOnlyFileSystem logic and test case added (gardener#605)

Co-authored-by: Samarth Deyagond <samarth.deyagond@sap.com>

* Improved log details to include node name and provider ID (gardener#607)

Co-authored-by: Prashanth <prashanth@sap.com>

* Fixed formatting directive error

* Fix panic when secretRef cannot be found (gardener#609)

It would assign nil to secretRef, and this would panic when trying
to log out the original namespace/name.

* Fixes issue adding on finalizers on existing mach (gardener#611)

* Release v0.39.0

* Feature openstack volume type & dualstack customs

Fix podNetworkCidrs

* add openstack volumeType for machine

Co-authored-by: Maximilian Geberl <maximilian.geberl@stackit.de>

updated VERSION after rebase

updated driver_openstack.go

changed order of applied services

added allowed address pairs

Change make to einfachnuralex

Split podnetwork and run update port n times

changed version and repository

added more v6 stuff

* VERSION v0.39.0-ske-1

Co-authored-by: gardener-robot-ci-3 <gardener.ci.user3@gmail.com>
Co-authored-by: ialidzhikov <i.alidjikov@gmail.com>
Co-authored-by: Hardik Dodiya <hardik.dodiya@sap.com>
Co-authored-by: Konstantinos Angelopoulos <konstantinos.angelopoulos@sap.com>
Co-authored-by: Andrey Klimentyev <andrey.klimentyev@flant.com>
Co-authored-by: prashanth26 <prashanth@sap.com>
Co-authored-by: dkistner <dominickistner@yahoo.de>
Co-authored-by: Samarth Deyagond <samarth.deyagond@sap.com>
Co-authored-by: Anthony Comtois <anthony.comtois@sky.uk>
Co-authored-by: Dmitry Shurupov <dmitry.shurupov@flant.com>
Co-authored-by: Thomas Buchner <thomas.buchner@sap.com>
Co-authored-by: Kistner, Dominic <dominic.kistner@sap.com>
Co-authored-by: Dieter Guendisch <dieter.guendisch@sap.com>
Co-authored-by: Rafael Franzke <rafael.franzke@sap.com>
Co-authored-by: 郑佳金 <zhengjiajin@pingcap.com>
Co-authored-by: gardener-robot-ci-2 <gardener.ci.user2@gmail.com>
Co-authored-by: Samarth Deyagond <deyagondsamarth@gmail.com>
Co-authored-by: Georgi Pavlov <georgi.pavlov@sap.com>
Co-authored-by: Andreas Burger <andreas.burger@sap.com>
Co-authored-by: gardener-robot-ci-1 <gardener.ci.user@gmail.com>
Co-authored-by: Himanshu Sharma <79965161+himanshu-kun@users.noreply.github.com>
Co-authored-by: James Ravn <james@r-vn.org>
Co-authored-by: Manuel Ganter <manuel.ganter@hotmail.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) reviewed/lgtm Has approval for merging size/s Size of pull request is small (see gardener-robot robot/bots/size.py)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants