Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic Error: Invalid Memory Address or Nil Pointer Dereference during Taint Deletion in AWSManagedMachinePool #5021

Open
nueavv opened this issue Jun 13, 2024 · 4 comments · May be fixed by #5022
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@nueavv
Copy link

nueavv commented Jun 13, 2024

/kind bug

What steps did you take and what happened:
While managing an EKS cluster using AWSManagedMachinePool, a panic error occurs when trying to delete a taint after adding it. Below are the related log messages.

Initial Error Log
E0613 05:58:09.596065 1 controller.go:329] "Reconciler error" err="panic: runtime error: invalid memory address or nil pointer dereference [recovered]" controller="awsmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSManagedMachinePool" AWSManagedMachinePool="capi-managed-cluster/test-cluster" namespace="capi-managed-cluster" name="test-cluster" reconcileID="12345678-1234-1234-1234-123456789012"
I0613 05:58:09.917851 1 awsmanagedmachinepool_controller.go:200] "Reconciling AWSManagedMachinePool"
I0613 05:58:09.918011 1 launchtemplate.go:73] "checking for existing launch template"
E0613 05:58:10.373797 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 464 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1234560?, 0x1234560})
/go/pkg/mod/k8s.io/apimachinery@v0.29.3/pkg/util/runtime/runtime.go:75 +0x85
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:108 +0xb2
panic({0x1234560?, 0x1234560})
/usr/local/go/src/runtime/panic.go:914 +0x21f
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/converters.TaintsFromSDK({0xc001604d60, 0x1, 0x25})
/workspace/pkg/cloud/converters/eks.go:115 +0x258
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/services/eks.(*NodegroupService).createTaintsUpdate(0xc002522d20, {0x0, 0x0, 0x0}, 0xc00075c5a0)
/workspace/pkg/cloud/services/eks/nodegroup.go:420 +0x21d
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/services/eks.(*NodegroupService).reconcileNodegroupConfig(0xc002522d20, 0xc00075c5a0)
/workspace/pkg/cloud/services/eks/nodegroup.go:469 +0x36e
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/services/eks.(*NodegroupService).reconcileNodegroup(0xc002522d20, {0x1234560, 0xc002528570})
/workspace/pkg/cloud/services/eks/nodegroup.go:571 +0x72c
sigs.k8s.io/cluster-api-provider-aws/v2/pkg/cloud/services/eks.(*NodegroupService).ReconcilePool(0xc002522d20, {0x1234560, 0xc002528570})
/workspace/pkg/cloud/services/eks/eks.go:109 +0x11a
sigs.k8s.io/cluster-api-provider-aws/v2/exp/controllers.(*AWSManagedMachinePoolReconciler).reconcileNormal(0xc001541980, {0x1234560, 0xc002528570}, 0xc0010baf00, {0x1234560, 0xc002138100})
/workspace/exp/controllers/awsmanagedmachinepool_controller.go:239 +0x4da
sigs.k8s.io/cluster-api-provider-aws/v2/exp/controllers.(*AWSManagedMachinePoolReconciler).Reconcile(0xc001541980, {0x1234560, 0xc002528570}, {{{0xc000913320?, 0x0?}, {0xc000913380?, 0xc000d75d08?}}})
/workspace/exp/controllers/awsmanagedmachinepool_controller.go:192 +0x8b7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1234560?, {0x1234560?, 0xc002528570?}, {{{0xc000913320?, 0xb?}, {0xc000913380?, 0x0?}}})
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:119 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0014f4500, {0x1234560, 0xc0014264b0}, {0x1234560?, 0xc002360ea0?})
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:316 +0x3cc
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0014f4500, {0x1234560, 0xc0014264b0})
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:266 +0x1af
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:227 +0x79
created by sigs.k8s.io/controller-runtime/pkg.internal.controller.(*Controller).Start.func2 in goroutine 321
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.17.3/pkg/internal/controller/controller.go:223 +0x565
Updated Error Log
E0613 07:06:08.244153 1 controller.go:329] "Reconciler error" err=<
failed to reconcile machine pool for AWSManagedMachinePool capi-managed-cluster/test-cluster: failed to reconcile nodegroup config: creating taints update payload: converting taints: taint has nil fields: {
Effect: "PREFER_NO_SCHEDULE",
Key: "test"
}

What did you expect to happen:
Taints should be deleted without causing a panic error.

Anything else you would like to add:

  • The panic error occurs when trying to delete a taint after adding it.
  • The error message indicates that some fields of the taint are nil, but the provided taints are complete.

Current TaintsFromSDK function implementation:

import (
	"fmt"
	"log"
)

// TaintsFromSDK is used to convert an array of AWS SDK taints to CAPA Taints.
func TaintsFromSDK(taints []*eks.Taint) (expinfrav1.Taints, error) {
	converted := expinfrav1.Taints{}

	// If taints array is empty, log the event
	if len(taints) == 0 {
		log.Println("Taints array is empty")
		return converted, nil
	}

	for _, taint := range taints {
		if taint.Effect == nil || taint.Key == nil {
			return nil, fmt.Errorf("taint has nil fields: %+v", taint)
		}

		// Handle nil Value field by setting it to an empty string
		value := ""
		if taint.Value != nil {
			value = *taint.Value
		}

		convertedEffect, err := TaintEffectFromSDK(*taint.Effect)
		if err != nil {
			return nil, fmt.Errorf("converting taint effect %s: %w", *taint.Effect, err)
		}
		converted = append(converted, expinfrav1.Taint{
			Effect: convertedEffect,
			Key:    *taint.Key,
			Value:  value,
		})
	}

	return converted, nil
}

Suggestions:

Resolve the panic error that occurs during taint deletion.
When the taint array is empty, ensure that existing taints are deleted.
Review the logic handling empty taint arrays and nil fields.

Environment:

Cluster-api-provider-aws version: [version] 2.5.0
Kubernetes version: (use kubectl version): [version] 1.25.5 on local kubernetes
OS (e.g. from /etc/os-release): [version] ubuntu 20.04

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-priority labels Jun 13, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If CAPA/CAPI contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 13, 2024
@nueavv
Copy link
Author

nueavv commented Jun 13, 2024

The issue appears to be caused by the value field in the taint having an empty string. The taint configuration is as follows:
image
This empty string in the value field might be leading to the observed behavior.

nueavv added a commit to nueavv/cluster-api-provider-aws that referenced this issue Jun 13, 2024
This commit addresses a panic error that occurs when attempting to delete a taint in AWSManagedMachinePool. The issue was caused by a nil pointer dereference when the taint's value was nil. By handling nil values for taint fields, this fix ensures that taints with nil values are correctly processed, preventing runtime errors.

Fixes [kubernetes-sigs#5021](kubernetes-sigs#5021)

Signed-off-by: nueavv <nuguni@kakao.com>
nueavv added a commit to nueavv/cluster-api-provider-aws that referenced this issue Aug 29, 2024
This commit addresses a panic error that occurs when attempting to delete a taint in AWSManagedMachinePool. The issue was caused by a nil pointer dereference when the taint's value was nil. By handling nil values for taint fields, this fix ensures that taints with nil values are correctly processed, preventing runtime errors.

Fixes [kubernetes-sigs#5021](kubernetes-sigs#5021)

Signed-off-by: nueavv <nuguni@kakao.com>
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2024
nueavv added a commit to nueavv/cluster-api-provider-aws that referenced this issue Sep 30, 2024
This commit addresses a panic error that occurs when attempting to delete a taint in AWSManagedMachinePool. The issue was caused by a nil pointer dereference when the taint's value was nil. By handling nil values for taint fields, this fix ensures that taints with nil values are correctly processed, preventing runtime errors.

Fixes [kubernetes-sigs#5021](kubernetes-sigs#5021)

Signed-off-by: nueavv <nuguni@kakao.com>
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.
Projects
None yet
3 participants