OCPBUGS-44698: Create AWS clients on every reconcile instead of at initialization #5179

csrwng · 2024-11-23T00:25:20Z

What this PR does / why we need it:
Moves client creation for the private link endpoint controller from reconciler registration into the reconcile loop. This allows the controller to recover when it initially fails to assume a shared VPC role.

Adds additional error logging when AWS cloud API calls fail.
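
As a rough illustration of the idea (not the code in this PR), the sketch below has the reconciler hold a factory and ask it for a client on every reconcile, so a failure to assume the role on one pass does not prevent a later pass from succeeding. The names `ec2API`, `fakeClient`, `clientFactory`, `reconciler`, and `flakyFactory` are all hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
)

// ec2API is a hypothetical stand-in for the subset of the AWS client the controller uses.
type ec2API interface {
	Describe() string
}

type fakeClient struct{}

func (fakeClient) Describe() string { return "ok" }

// clientFactory builds a client on demand. It may fail, for example
// when a shared VPC role cannot be assumed yet.
type clientFactory func() (ec2API, error)

type reconciler struct {
	newClient clientFactory
}

// reconcile requests a client on every call instead of reusing one
// that was created when the reconciler was registered.
func (r *reconciler) reconcile() (string, error) {
	client, err := r.newClient()
	if err != nil {
		return "", fmt.Errorf("failed to get AWS client: %w", err)
	}
	return client.Describe(), nil
}

// flakyFactory fails the first `failures` calls and succeeds afterwards.
func flakyFactory(failures int) clientFactory {
	calls := 0
	return func() (ec2API, error) {
		calls++
		if calls <= failures {
			return nil, errors.New("cannot assume shared VPC role")
		}
		return fakeClient{}, nil
	}
}

func main() {
	r := &reconciler{newClient: flakyFactory(1)}
	_, err := r.reconcile()
	fmt.Println("first reconcile error:", err)
	out, err := r.reconcile()
	fmt.Println("second reconcile:", out, err)
}
```

With a client built once at startup, the first failure would be permanent until the process restarts; here the second reconcile succeeds on its own.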

Which issue(s) this PR fixes (optional, use fixes #<issue_number>(, fixes #<issue_number>, ...) format, where issue_number might be a GitHub issue, or a Jira story:
Fixes #OCPBUGS-44698

Checklist

Subject and description added to both, commit and PR.
Relevant issues have been referenced.
This change includes docs.
This change includes unit tests.

openshift-ci · 2024-11-23T00:25:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [csrwng]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2024-11-23T00:26:58Z

@csrwng: This pull request references Jira Issue OCPBUGS-44698, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.18.0) matches configured target version for branch (4.18.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (jiezhao@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

enxebre · 2024-11-25T09:17:12Z

control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go

 			GroupId:       aws.String(sgID),
 			IpPermissions: ingressPermissions,
 		}); err != nil {
+			log.Error(err, "failed to sect security group ingress rules", "id", sgID)


typo sect?
Why are we logging unconditionally here and then logging again right below via

	if supportawsutil.AWSErrorCode(err) != "InvalidPermission.Duplicate" {
		return fmt.Errorf("failed to set security group ingress rules, code: %s", supportawsutil.AWSErrorCode(err))
	}
	log.Info("WARNING: got duplicate permissions error when setting security group ingress permissions", "sgID", sgID)

typo sect?
yes

Will also fix the unconditional logging

enxebre · 2024-11-25T09:18:45Z

control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go

@@ -756,14 +861,17 @@ func (r *AWSEndpointServiceReconciler) createSecurityGroup(ctx context.Context,
 	describeSGInput := &ec2.DescribeSecurityGroupsInput{
 		GroupIds: []*string{aws.String(sgID)},
 	}
-	if err = r.ec2Client.WaitUntilSecurityGroupExistsWithContext(ctx, describeSGInput); err != nil {
+	if err = ec2Client.WaitUntilSecurityGroupExistsWithContext(ctx, describeSGInput); err != nil {
+		log.Error(err, "failed to wait for security group to exist", "id", sgID)


Why are we logging in addition to returning the error, and with a different message?

The difference is that the returned error only contains the error code (it goes in the status of the resource) while the log line includes the full error returned from AWS.

I fixed the different error messages

So for example here we'll log
"failed to wait for security group to exist", "id", sgID
and the controller runtime will log
"failed to wait for security group to exist (id: %s), code: %s", sgID, supportawsutil.AWSErrorCode(err)

Seems unnecessary?

The difference is that the latter will also go into the status of the AWSEndpointService
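
To make the distinction in this exchange concrete, here is a minimal sketch under stated assumptions: the log line carries the full AWS error, while the returned error carries only the code and ends up in a status message. `awsError`, `codeOf`, `waitForSecurityGroup`, and `status` are hypothetical, and the error detail is made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// awsError is a hypothetical stand-in for an AWS SDK error with a code and a long detail.
type awsError struct{ code, detail string }

func (e *awsError) Error() string { return e.code + ": " + e.detail }

// codeOf extracts only the error code, or "Unknown" for other errors.
func codeOf(err error) string {
	var ae *awsError
	if errors.As(err, &ae) {
		return ae.code
	}
	return "Unknown"
}

// waitForSecurityGroup appends the full error to logs and returns a
// shorter error that contains only the error code.
func waitForSecurityGroup(sgID string, wait func() error, logs *[]string) error {
	if err := wait(); err != nil {
		*logs = append(*logs, fmt.Sprintf("failed to wait for security group to exist id=%s err=%v", sgID, err))
		return fmt.Errorf("failed to wait for security group to exist (id: %s), code: %s", sgID, codeOf(err))
	}
	return nil
}

// status mimics a resource status whose message is set from the returned error.
type status struct{ Message string }

func main() {
	var logs []string
	failing := func() error {
		return &awsError{code: "RequestLimitExceeded", detail: "illustrative long detail from the API"}
	}
	var st status
	if err := waitForSecurityGroup("sg-123", failing, &logs); err != nil {
		st.Message = err.Error()
	}
	fmt.Println("status:", st.Message)
	fmt.Println("log:   ", logs[0])
}
```

The status message stays short and stable, while the log keeps the detail needed for debugging.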

enxebre · 2024-11-25T09:23:53Z

control-plane-operator/main.go

@@ -513,3 +488,7 @@ func NewStartCommand() *cobra.Command {

 	return cmd
 }
+
+func isAWS() bool {
+	return os.Getenv("AWS_REGION") != ""


Should this be:

func isAWSPrivate() bool {
	return os.Getenv("AWS_REGION") != "" && hcp.Spec.Platform.Type == hyperv1.AWSPlatform
}

We are no longer retrieving the hcp on startup, but we're always setting AWS_REGION on the CPO when platform.Type == AWS:

hypershift/hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go

Lines 2739 to 2770 in a3791d6

	switch hc.Spec.Platform.Type {
	case hyperv1.AWSPlatform:
		deployment.Spec.Template.Spec.Volumes = append(deployment.Spec.Template.Spec.Volumes,
			corev1.Volume{
				Name: "cloud-token",
				VolumeSource: corev1.VolumeSource{
					EmptyDir: &corev1.EmptyDirVolumeSource{
						Medium: corev1.StorageMediumMemory,
					},
				},
			},
			corev1.Volume{
				Name: "provider-creds",
				VolumeSource: corev1.VolumeSource{
					Secret: &corev1.SecretVolumeSource{
						SecretName: platformaws.ControlPlaneOperatorCredsSecret("").Name,
					},
				},
			})
		deployment.Spec.Template.Spec.Containers[0].Env = append(deployment.Spec.Template.Spec.Containers[0].Env,
			corev1.EnvVar{
				Name:  "AWS_SHARED_CREDENTIALS_FILE",
				Value: "/etc/provider/credentials",
			},
			corev1.EnvVar{
				Name:  "AWS_REGION",
				Value: hc.Spec.Platform.AWS.Region,
			},
			corev1.EnvVar{
				Name:  "AWS_SDK_LOAD_CONFIG",
				Value: "true",
			})

So just retrieving the hcp on startup for the same check seemed unnecessary

Ah, I see. What I was actually getting at is that we should probably only run this when the cluster is AWS private, i.e. should we check hcp.spec.EndpointAccess?

Yup, we can. Since we need the hcp for that, I'll restore the code that fetched it on startup.
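
A sketch of the check being agreed on here, using simplified stand-in types rather than the real HyperShift API: `hostedControlPlane`, `platformType`, `endpointAccess`, and `needsPrivateLinkController` are hypothetical. Treating `PublicAndPrivate` as private is an assumption of this sketch, not something stated in the thread.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the HyperShift API types.
type platformType string
type endpointAccess string

const (
	platformAWS  platformType = "AWS"
	platformNone platformType = "None"

	accessPublic           endpointAccess = "Public"
	accessPublicAndPrivate endpointAccess = "PublicAndPrivate"
	accessPrivate          endpointAccess = "Private"
)

type hostedControlPlane struct {
	Platform       platformType
	EndpointAccess endpointAccess
}

// needsPrivateLinkController reports whether the private link controller
// should be registered: only on AWS, and only when the endpoint access
// includes a private endpoint (assumption: PublicAndPrivate counts).
func needsPrivateLinkController(hcp hostedControlPlane) bool {
	if hcp.Platform != platformAWS {
		return false
	}
	return hcp.EndpointAccess == accessPrivate || hcp.EndpointAccess == accessPublicAndPrivate
}

func main() {
	cases := []hostedControlPlane{
		{Platform: platformAWS, EndpointAccess: accessPublic},
		{Platform: platformAWS, EndpointAccess: accessPublicAndPrivate},
		{Platform: platformAWS, EndpointAccess: accessPrivate},
		{Platform: platformNone, EndpointAccess: accessPrivate},
	}
	for _, hcp := range cases {
		fmt.Printf("%+v -> %v\n", hcp, needsPrivateLinkController(hcp))
	}
}
```

Unlike the `AWS_REGION` environment check, this needs the hcp object at startup, which is why the fetch has to be restored.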

enxebre · 2024-11-25T09:30:25Z

control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go

+	if err != nil {
+		return nil, nil, err
+	}
+	if b.assumeEndpointRoleARN != "" {


nit:

// When using a shared VPC, we need to assume these additional roles

?

enxebre · 2024-11-25T09:30:53Z

control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go

+type clientBuilder struct {
+	mu                    sync.Mutex
+	initialized           bool
+	assumeEndpointRoleARN string


assumeSharedVPCEndpointRoleARN
assumeSharedVPCRoute53RoleARN
?

enxebre · 2024-11-25T09:32:00Z

control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go

+	b.mu.Lock()
+	defer b.mu.Unlock()
+
+	if !b.initialized {


would it make any difference if this func is just

	b.warnOnDifferentValues(log, hcp)
	b.setFromHCP(hcp)
	b.initialized = true

without the conditional?

we'd warn unnecessarily when initially setting the values
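
The point of the conditional can be seen in a small sketch. This is not the PR's `clientBuilder`: the `hcpRoles` type, the `initializeWithHCP` method, and the choice to overwrite the stored values after warning are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// hcpRoles is a hypothetical, trimmed-down view of the values read from the HCP.
type hcpRoles struct {
	EndpointRoleARN string
	Route53RoleARN  string
}

type clientBuilder struct {
	mu          sync.Mutex
	initialized bool
	roles       hcpRoles
}

// initializeWithHCP stores the values on first use without warning.
// On later calls it warns only if the values changed, then stores the
// new values (an assumption of this sketch).
func (b *clientBuilder) initializeWithHCP(roles hcpRoles, warn func(msg string)) {
	b.mu.Lock()
	defer b.mu.Unlock()

	if !b.initialized {
		// Without this branch the zero values would differ from the
		// first real values and trigger a spurious warning.
		b.roles = roles
		b.initialized = true
		return
	}
	if b.roles != roles {
		warn(fmt.Sprintf("role ARNs changed from %+v to %+v", b.roles, roles))
		b.roles = roles
	}
}

func main() {
	b := &clientBuilder{}
	warn := func(msg string) { fmt.Println("WARNING:", msg) }
	first := hcpRoles{EndpointRoleARN: "arn:example:endpoint", Route53RoleARN: "arn:example:route53"}
	b.initializeWithHCP(first, warn) // no warning: first initialization
	b.initializeWithHCP(first, warn) // no warning: values unchanged
	b.initializeWithHCP(hcpRoles{EndpointRoleARN: "arn:example:other"}, warn) // warns once
}
```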

enxebre · 2024-11-26T08:13:07Z

control-plane-operator/controllers/awsprivatelink/awsprivatelink_controller.go

-		}
-		if !completed {
-			return ctrl.Result{RequeueAfter: endpointServiceDeletionRequeueDuration}, nil
+			log.Error(err, "failed to get AWS client, skipping aws endpoint service cleanup")


Does this introduce a scenario that wasn't possible before, where we remove the finalizer if, say, a transient issue prevents us from getting the client?

The only two possible errors are that the session cannot be created because the env/cfg is incomplete, or that the client builder was not initialized from an hcp. Neither case is recoverable with a retry.

enxebre · 2024-11-26T08:19:23Z

https://github.com/openshift/hypershift/pull/5179/files#r1856712444
#5179 (comment)
#5179 (comment)
/lgtm
/hold
feel free to cancel

enxebre · 2024-11-26T08:19:59Z

/retest

csrwng · 2024-11-26T21:59:41Z

/hold cancel

enxebre · 2024-11-27T08:59:33Z

/lgtm
/test e2e-aks

openshift-ci-robot · 2024-11-27T11:44:00Z

/retest-required

Remaining retests: 0 against base HEAD da24e17 and 2 for PR HEAD fda98ca in total

openshift-ci · 2024-11-27T13:39:03Z

@csrwng: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2024-11-27T16:05:25Z

/retest-required

Remaining retests: 0 against base HEAD f2d9716 and 2 for PR HEAD fda98ca in total

openshift-ci bot added the do-not-merge/needs-area label Nov 23, 2024

openshift-ci bot requested review from enxebre and isco-rodriguez November 23, 2024 00:25

openshift-ci bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label Nov 23, 2024

openshift-ci bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Nov 23, 2024

csrwng changed the title ~~Create AWS clients on every reconcile instead of at initialization~~ OCPBUGS-44698: Create AWS clients on every reconcile instead of at initialization Nov 23, 2024

enxebre reviewed Nov 25, 2024

csrwng force-pushed the shared_vpc_client_fix branch from 2890402 to 92b6418 Compare November 26, 2024 02:43

enxebre reviewed Nov 26, 2024

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 26, 2024

openshift-ci bot assigned enxebre Nov 26, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 26, 2024

csrwng force-pushed the shared_vpc_client_fix branch from 92b6418 to fda98ca Compare November 26, 2024 19:22

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Nov 26, 2024

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 26, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 27, 2024
