Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-32776: Fix IBM Public Cloud DNS Provider Update Logic #1133

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

gcs278
Copy link
Contributor

@gcs278 gcs278 commented Aug 22, 2024

The IBM Public Cloud DNS provider (cis_provider.go) had a bug in createOrUpdateDNSRecord where it checked for the existence of a DNS record by filtering both DNS name and target. If the target was updated (e.g., due to a load balancer recreation), the logic would not match the existing DNS record. As a result, the function would attempt to create a new record, but fail because a record with that name already existed, as multiple DNS records with the same name are not allowed.

The fix is to remove the filtering by target and rely solely on filtering by name, as the name is the only attribute that needs to be unique.

Additionally, the IBM DNS logic doesn't work for multiple targets and this creates unexpected and problematic results. The logic has been refactored to only create and delete using the first target. It warns the user when multiple targets are set.

This PR also includes some unit test fix up and missing unit test coverage for the IBM CIS Provider.

This resolves the same DNS issues for public PowerVS cloud as it uses the same logic.

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Aug 22, 2024
@openshift-ci-robot
Copy link
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-32776, which is invalid:

  • expected the bug to target the "4.18.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

The IBM Public Cloud DNS provider (cis_provider.go) had a bug in createOrUpdateDNSRecord where it checked for the existence of a DNS record by filtering both DNS name and target. If the target was updated (e.g., due to a load balancer recreation), the logic would not match the existing DNS record. As a result, the function would attempt to create a new record, but fail because a record with that name already existed, as multiple DNS records with the same name are not allowed.

The fix is to remove the filtering by target and rely solely on filtering by name, as the name is the only attribute that needs to be unique.

This PR also includes some unit test fix up and missing coverage for the IBM CIS Provider.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Aug 22, 2024
@openshift-ci openshift-ci bot requested review from alebedev87 and frobware August 22, 2024 22:46
@gcs278 gcs278 force-pushed the ibm-dns-createOrUpdateDNSRecord-bug branch from b387d07 to b07d602 Compare August 22, 2024 22:50
@gcs278
Copy link
Contributor Author

gcs278 commented Aug 22, 2024

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Aug 22, 2024
@openshift-ci-robot
Copy link
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-32776, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Aug 22, 2024
@openshift-ci openshift-ci bot requested a review from lihongan August 22, 2024 22:57
@openshift-ci-robot
Copy link
Contributor

@gcs278: This pull request references Jira Issue OCPBUGS-32776, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

In response to this:

The IBM Public Cloud DNS provider (cis_provider.go) had a bug in createOrUpdateDNSRecord where it checked for the existence of a DNS record by filtering both DNS name and target. If the target was updated (e.g., due to a load balancer recreation), the logic would not match the existing DNS record. As a result, the function would attempt to create a new record, but fail because a record with that name already existed, as multiple DNS records with the same name are not allowed.

The fix is to remove the filtering by target and rely solely on filtering by name, as the name is the only attribute that needs to be unique.

This PR also includes some unit test fix up and missing coverage for the IBM CIS Provider.

This also resolves the same DNS issues for public PowerVS cloud as it uses the same logic.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@gcs278
Copy link
Contributor Author

gcs278 commented Aug 26, 2024

I missed one detail of the bug, missing instructions in the Progressing status condition for PowerVS when you change scope. Looking into it.
/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 26, 2024
@gcs278 gcs278 force-pushed the ibm-dns-createOrUpdateDNSRecord-bug branch from a057f1a to 0716651 Compare August 26, 2024 19:28
@gcs278
Copy link
Contributor Author

gcs278 commented Aug 26, 2024

@lihongan Added the missing scope change instructions for the PowerVS type.

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 26, 2024
@gcs278
Copy link
Contributor Author

gcs278 commented Aug 27, 2024

infra failures
/retest

@gcs278 gcs278 force-pushed the ibm-dns-createOrUpdateDNSRecord-bug branch from 0716651 to f51b1d3 Compare August 27, 2024 14:31
@gcs278
Copy link
Contributor Author

gcs278 commented Aug 28, 2024

Infra issues
/retest

@lihongan
Copy link
Contributor

pre-merge tested on OpenStack and looks good now

$ oc -n openshift-ingress-operator patch ingresscontroller/intlb --type=merge --patch='{"spec":{"endpointPublishingStrategy": {"type":"LoadBalancerService", "loadBalancer": {"scope":"External"}}}}'
ingresscontroller.operator.openshift.io/intlb patched

$ oc get co/ingress
NAME      VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.17.0-0.ci.test-2024-08-28-040831-ci-ln-13rxbgt-latest   True        True          False      23m     ingresscontroller "intlb" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: LoadBalancerProgressing=True (OperandsProgressing: One or more managed resources are progressing: The IngressController scope was changed from "Internal" to "External".  To effectuate this change, you must delete the service: `oc -n openshift-ingress delete svc/router-intlb`; the service load-balancer will then be deprovisioned and a new one created.  This will most likely cause the new load-balancer to have a different host name and IP address from the old one's.  Alternatively, you can revert the change to the IngressController: `oc -n openshift-ingress-operator patch ingresscontrollers/intlb --type=merge --patch='{"spec":{"endpointPublishingStrategy":{"loadBalancer":{"scope":"Internal"}}}}'`).

$ oc -n openshift-ingress delete svc/router-intlb
service "router-intlb" deleted

$ oc -n openshift-ingress get svc/router-intlb
NAME           TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                      AGE
router-intlb   LoadBalancer   172.30.110.89   38.102.83.53   80:32114/TCP,443:30834/TCP   4m23s

will test on IBMCloud later

@lihongan
Copy link
Contributor

pre-merge tested on IBMCloud and also looks good

$ oc -n openshift-ingress get svc/router-intlb
NAME           TYPE           CLUSTER-IP      EXTERNAL-IP                         PORT(S)                      AGE
router-intlb   LoadBalancer   172.30.111.21   53084b0d-eu-de.lb.appdomain.cloud   80:31026/TCP,443:32398/TCP   78s

$ oc -n openshift-ingress-operator patch ingresscontroller/intlb --type=merge --patch='{"spec":{"endpointPublishingStrategy": {"type":"LoadBalancerService", "loadBalancer": {"scope":"External"}}}}'
ingresscontroller.operator.openshift.io/intlb patched

$ oc -n openshift-ingress delete svc/router-intlb
service "router-intlb" deleted

$ oc -n openshift-ingress get svc/router-intlb
NAME           TYPE           CLUSTER-IP      EXTERNAL-IP                         PORT(S)                      AGE
router-intlb   LoadBalancer   172.30.242.68   72736508-eu-de.lb.appdomain.cloud   80:32041/TCP,443:30579/TCP   65s

$ oc -n openshift-ingress-operator get ingresscontroller/intlb -oyaml
<......>
  - lastTransitionTime: "2024-08-28T09:24:02Z"
    message: The record is provisioned in all reported zones.
    reason: NoFailedZones
    status: "True"
    type: DNSReady

@Miciah
Copy link
Contributor

Miciah commented Aug 28, 2024

/assign

@Miciah
Copy link
Contributor

Miciah commented Aug 28, 2024

@SzucsAti, this is a follow-up to #796. Are you available to review the changes to the IBM DNS provider?

@gcs278 gcs278 changed the title OCPBUGS-32776: Fix IBM Public Cloud DNS Provider Update Logic OCPBUGS-32776: Fix IBM Public Cloud DNS Provider Update Logic and Add Missing Instructions to the Progressing Condition Aug 28, 2024
@gcs278 gcs278 force-pushed the ibm-dns-createOrUpdateDNSRecord-bug branch from f51b1d3 to 7a97b9a Compare August 28, 2024 15:29
Copy link
Contributor

@SzucsAti SzucsAti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you for the fix!

@Miciah Miciah added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Nov 13, 2024
@gcs278
Copy link
Contributor Author

gcs278 commented Dec 5, 2024

/retest

pkg/dns/ibm/private/dnssvcs_provider.go Outdated Show resolved Hide resolved
pkg/dns/ibm/private/dnssvcs_provider.go Outdated Show resolved Hide resolved
case iov1.CNAMERecordType:
inputRData, err := p.dnsService.NewResourceRecordInputRdataRdataCnameRecord(target)

// TODO DNS record update should handle the case where we have an A record and want a CNAME record or vice versa
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this is a carried-over TODO, but should it be done now in order to fix the bug? Or is it out of scope?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's out of scope for this bug since it can be fixed independently.

This TODO is for when someone changes the DNSRecord type from CNAME to A, or visa versa, that it won't update the dns record type. That's not a big deal for OpenShift actually, because IBMCloud only creates CNAME records, but I can see it as a low priority "completeness" TODO.

pkg/dns/ibm/public/cis_provider.go Outdated Show resolved Hide resolved
}
}
if result == nil || result.Result == nil {
return fmt.Errorf("createOrUpdateDNSRecord: ListAllDnsRecords returned nil as result")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens for the very first record created? We can't find any to update, but we need to go on to create it instead of returning an error, don't we?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's a good question. I found that result and result.Result are both initialized when creating the DNS record for the first time. The results gets filtered for the record.Spec.DNSName, so whether it's the very first DNS record in the hosted zone isn't a factor.

I agree it's a bit odd to check for result.Result, but the original author wrote this logic, I haven't update it as works as intended so far, and it's possible there is a subtle reason why it's here.

createDnsRecordInputOutput: dnsclient.CreateDnsRecordInputOutput{
InputId: "testCreate",
OutputError: nil,
OutputResponse: &core.DetailedResponse{StatusCode: http.StatusOK},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to find a way to check that the list was updated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above, there is no state to check.

OutputError: errors.New("error in CreateDnsRecord"),
OutputResponse: &core.DetailedResponse{StatusCode: http.StatusRequestTimeout},
},
expectErrorContains: "error in CreateDnsRecord",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't really seem to be testing any error input if we're just matching the supplied error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I view OutputError as mocking a "fake" Upstream DNS Server as if it was providing an error. This is confirming that we are getting the error back that we told the "fake" Upstream DNS Server to generate, just like what would happen if there was a real DNS Server error.

We have other test cases that test error messages produced by our own provider code, but this error checking is still valuable because it proves the functions are returning the error we provided when the fake response is StatusRequestTimeout.

Comment on lines 57 to 59
func (fdc FakeDnsClient) ListResourceRecords(listResourceRecordsOptions *dnssvcsv1.ListResourceRecordsOptions) (result *dnssvcsv1.ListResourceRecords, response *core.DetailedResponse, err error) {
fakeListDnsrecordsResp := &dnssvcsv1.ListResourceRecords{}
recordType := string(iov1.ARecordType)
rData := map[string]interface{}{"ip": fdc.ListAllDnsRecordsInputOutput.RecordTarget}

fakeListDnsrecordsResp.ResourceRecords = append(fakeListDnsrecordsResp.ResourceRecords, dnssvcsv1.ResourceRecord{ID: &fdc.ListAllDnsRecordsInputOutput.RecordName, Name: &fdc.ListAllDnsRecordsInputOutput.RecordName, Type: &recordType, Rdata: rData})

resp := &core.DetailedResponse{
StatusCode: fdc.ListAllDnsRecordsInputOutput.OutputStatusCode,
Headers: map[string][]string{},
Result: result,
RawResult: []byte{},
}

return fakeListDnsrecordsResp, resp, fdc.ListAllDnsRecordsInputOutput.OutputError
return fdc.ListAllDnsRecordsInputOutput.OutputResult, fdc.ListAllDnsRecordsInputOutput.OutputResponse, fdc.ListAllDnsRecordsInputOutput.OutputError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

listResourceRecordsOptions is unused. Also, can you add a comment about what is actually being returned here? We haven't changed anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, listResourceRecordsOptions was always unused. The options would only really be helpful if we were able to test the IBM Cloud DNS Provider, as it's an input struct for their API calls.

We build a ListAllDnsRecordsInputOutput object as an input to each of the unit test cases, and it has the OutputResult, OutputResponse, and OutputError. In this function, we just return them so we can continue testing our functions (delete and createOrUpdateDNSRecord)

I'll add a comment.

Comment on lines 13 to 16
"k8s.io/utils/pointer"

"github.com/IBM/go-sdk-core/v5/core"
"github.com/IBM/networking-go-sdk/dnssvcsv1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment for imports not in order.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll review this further after learning more about my questions in cis_provider_test.go

@gcs278 gcs278 force-pushed the ibm-dns-createOrUpdateDNSRecord-bug branch from d93c10b to e486a51 Compare December 23, 2024 17:13
Copy link
Contributor

openshift-ci bot commented Dec 23, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from candita. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor Author

@gcs278 gcs278 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@candita thanks for the detailed review. Should be ready for your review again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like they are implementing two different DNSClient interfaces (public and private).

Comment on lines 57 to 59
func (fdc FakeDnsClient) ListResourceRecords(listResourceRecordsOptions *dnssvcsv1.ListResourceRecordsOptions) (result *dnssvcsv1.ListResourceRecords, response *core.DetailedResponse, err error) {
fakeListDnsrecordsResp := &dnssvcsv1.ListResourceRecords{}
recordType := string(iov1.ARecordType)
rData := map[string]interface{}{"ip": fdc.ListAllDnsRecordsInputOutput.RecordTarget}

fakeListDnsrecordsResp.ResourceRecords = append(fakeListDnsrecordsResp.ResourceRecords, dnssvcsv1.ResourceRecord{ID: &fdc.ListAllDnsRecordsInputOutput.RecordName, Name: &fdc.ListAllDnsRecordsInputOutput.RecordName, Type: &recordType, Rdata: rData})

resp := &core.DetailedResponse{
StatusCode: fdc.ListAllDnsRecordsInputOutput.OutputStatusCode,
Headers: map[string][]string{},
Result: result,
RawResult: []byte{},
}

return fakeListDnsrecordsResp, resp, fdc.ListAllDnsRecordsInputOutput.OutputError
return fdc.ListAllDnsRecordsInputOutput.OutputResult, fdc.ListAllDnsRecordsInputOutput.OutputResponse, fdc.ListAllDnsRecordsInputOutput.OutputError
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, listResourceRecordsOptions was always unused. The options would only really be helpful if we were able to test the IBM Cloud DNS Provider, as it's an input struct for their API calls.

We build a ListAllDnsRecordsInputOutput object as an input to each of the unit test cases, and it has the OutputResult, OutputResponse, and OutputError. In this function, we just return them so we can continue testing our functions (delete and createOrUpdateDNSRecord)

I'll add a comment.

func (FakeDnsClient) CreateResourceRecord(createResourceRecordOptions *dnssvcsv1.CreateResourceRecordOptions) (result *dnssvcsv1.ResourceRecord, response *core.DetailedResponse, err error) {
return nil, nil, nil
func (fdc FakeDnsClient) CreateResourceRecord(createResourceRecordOptions *dnssvcsv1.CreateResourceRecordOptions) (result *dnssvcsv1.ResourceRecord, response *core.DetailedResponse, err error) {
if fdc.CreateDnsRecordInputOutput.InputId != *createResourceRecordOptions.Name {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this mainly to be consistent with the existing patterns, but it ensures that the
incoming createResourceRecordOptions.Name parameter is getting configured correctly. InputID is equivalent to the name, it's just named generically to keep all update/create/delete structs similar.

I'll add a comment.

dnsclient "github.com/openshift/cluster-ingress-operator/pkg/dns/ibm/public/client"

"k8s.io/utils/pointer"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do that, but what about https://github.com/go-imports-organizer/goio/blob/main/examples/openshift_kubernetes.yaml? Do we have an official style guide of how we are supposed to order go imports?

pkg/dns/ibm/public/cis_provider_test.go Outdated Show resolved Hide resolved
for _, resourceRecord := range listResult.ResourceRecords {
if *resourceRecord.Name == dnsName {
if resourceRecord.ID == nil {
return fmt.Errorf("createOrUpdateDNSRecord: record id is nil")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added because if the resourceRecord.ID is nil, you'll get a nil pointer dereference later when going to delete it, so no value in continuing. I added it because I saw cis_provider.go had it, but this provider didn't have it:

return fmt.Errorf("delete: record id is nil")

case iov1.CNAMERecordType:
inputRData, err := p.dnsService.NewResourceRecordInputRdataRdataCnameRecord(target)

// TODO DNS record update should handle the case where we have an A record and want a CNAME record or vice versa
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's out of scope for this bug since it can be fixed independently.

This TODO is for when someone changes the DNSRecord type from CNAME to A, or visa versa, that it won't update the dns record type. That's not a big deal for OpenShift actually, because IBMCloud only creates CNAME records, but I can see it as a low priority "completeness" TODO.

pkg/dns/ibm/public/cis_provider.go Outdated Show resolved Hide resolved
pkg/dns/ibm/public/cis_provider.go Show resolved Hide resolved
}
}
if result == nil || result.Result == nil {
return fmt.Errorf("createOrUpdateDNSRecord: ListAllDnsRecords returned nil as result")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's a good question. I found that result and result.Result are both initialized when creating the DNS record for the first time. The results gets filtered for the record.Spec.DNSName, so whether it's the very first DNS record in the hosted zone isn't a factor.

I agree it's a bit odd to check for result.Result, but the original author wrote this logic, I haven't update it as works as intended so far, and it's possible there is a subtle reason why it's here.

@Miciah Miciah added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Jan 9, 2025
@candita
Copy link
Contributor

candita commented Jan 13, 2025

@gcs278 I didn't re-review this because it was on hold. I assume that was an oversight.
/cancel hold

@gcs278
Copy link
Contributor Author

gcs278 commented Jan 14, 2025

@gcs278 I didn't re-review this because it was on hold. I assume that was an oversight. /cancel hold

@candita Yea I forgot to take the hold off this time, though I often will put holds to ensure every gets a chance to review, or I'm waiting for final CI results, but still want code reviews. I could be wrong. I use WIP as an in indicator it's not ready for review.

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 14, 2025
@candita
Copy link
Contributor

candita commented Jan 16, 2025

/retest

@@ -190,9 +194,15 @@ func (p *Provider) createOrUpdateDNSRecord(record *iov1.DNSRecord, zone configv1
log.Info("Warning: TTL must be one of [1 60 120 300 600 900 1800 3600 7200 18000 43200]. RecordTTL set to default", "default DSNSVCS record TTL", defaultDNSSVCSRecordTTL)
record.Spec.RecordTTL = defaultDNSSVCSRecordTTL
}
// We only support one target, warn the user.
if len(record.Spec.Targets) > 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add this to ValidateInputDNSData? Shouldn't this error stop the addition of the record, rather than just logging it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add this to ValidateInputDNSData?

We could, but I only added the warnings on creates, and ValidateInputDNSData is also called on deletes. And it feels odd to warn on deletes, because we actually do want to allow deleting with multiple targets for compatibility.

It's nuanced, but in the extremely rare case that someone was unsupportedly using multiple targets previously, the delete function will iterate through them to ensure all records are deleted to prevent an orphaned DNS record:

// While creating DNS records with multiple targets is unsupported, we still
// iterate through all targets during deletion to be extra cautious.
for _, target := range record.Spec.Targets {

Shouldn't this error stop the addition of the record, rather than just logging it?

I used CIO's AWS DNS provider as precedent, it doesn't break or error out, or provide a warning:

// TODO: handle >0 targets
domain, target := record.Spec.DNSName, record.Spec.Targets[0]

But I suppose erroring out would prevent a user from getting into a partially functioning state, requiring them to fix the targets upfront, to only provide what is supported. Maybe the AWS logic is bad precedent. That does seem like a better signal than a warning. I can update. But unfortunately still can't go into ValidateInputDNSData because of what I mentioned above (we still want the delete to continue with multiple targets).

Comment on lines -163 to -164
for _, target := range record.Spec.Targets {
listOpt.SetContent(target)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, how did anything work with this here? But do we want to add just one target to the listOpt, or is that the issue you're fixing - we don't want to match on name + target?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I was surprised too! Another reason I pushed for IBM Cloud E2E (which is unfortunately still in the works).

the issue you're fixing - we don't want to match on name + target?

Exactly. DNS Name is unique already, so that should be our only "key" for our search. The problem with filtering on name + target is that it works the first time, but when the target get updated, we will never find a result when the load balancer get recreated (the target changes and we only have the new target). That results in this code always attempting to create (new) the DNS Record, which fails when the DNS record actually exists.

Hence QE finding that you can create an IngressController in IBM just fine, but you cannot delete the load balancer-type service without DNS failing. That's a problem when our standard procedures (like a scope change) require a service deletion.

return fmt.Errorf("createOrUpdateDNSRecord: ListAllDnsRecords returned nil as result")
log.Info("created DNS record", "record", record.Spec, "zone", zone, "target", target)
} else {
if result.Result[0].ID == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's possible for len(result.Result) > 1, then you'd need to iterate over the results, wouldn't you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's safe to assume it would only ever been 1 results (DNS record) being returned. DNS Name is unique. So our search for DNS name should only ever return 1 record at most. Also, this assumption is already in place here, so I'm not introducing something new, just adding validation:

updateOpt := dnsService.NewUpdateDnsRecordOptions(*result.Result[0].ID)

Both IBM public and private cloud unit tests were missing unit test
coverage. This update extends test coverage for the Delete and
CreateOrUpdateRecord functions. This commit provides an important point
of reference for future commits that may preturb the existing
functionality.

Both Test_createOrUpdateDNSRecord functions previously only tested the
update logic. In order to test the create logic, `CreateDNSRecord` and
`CreateResourceRecord` needed to be implemented in the public and
private `fake_client.go` respectively.

The new test cases required the ability to control the response and results
of `ListAllDnsRecords`, which were previously hardcoded in both public
and private IBM cloud unit tests. Both public and private unit tests
were updated to use the new OutputResults field specified in
the `ListAllDnsRecordsInputOutput` struct, allowing the new test
cases to specify no result (indicating no existing DNS record) so we
can trigger the create logic.

Lastly, various test cases were added to cover untested scenarios, such
as testing CNAME Records, mismatching targets, missing record types or
IDs, handling nil results, etc.
The IBM Public Cloud DNS provider (`cis_provider.go`) had a bug in
`createOrUpdateDNSRecord` where it checked for the existence of a
DNS record by filtering both DNS name and target. If the target
was updated (e.g., due to a load balancer recreation), the logic
would not match the existing DNS record. As a result, the function
would attempt to create a new record, but fail because a record with
that name already existed, as multiple DNS records with the same
name are not allowed in IBM Cloud DNS providers.

The fix is to remove the filtering by target and rely solely on
filtering by name, as the name is the only attribute that needs
to be unique.

Additionally, the IBM DNS logic doesn't work for multiple targets and
this creates unexpected and problematic results. The logic has been
refactored to only create using the first target and it warns the user
when multiple targets are set. This change is low risk since the Ingress
Operator will never create a DNSRecord with multiple targets in
`desiredDNSRecord`.

Modify TestScopeChange to check for `Available=True` after deleting the
service to add test coverage for this issue.
@gcs278 gcs278 force-pushed the ibm-dns-createOrUpdateDNSRecord-bug branch from e486a51 to b0b91bb Compare January 17, 2025 02:01
Copy link
Contributor Author

@gcs278 gcs278 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for another code review @candita! Comments should be addressed.

@@ -190,9 +194,15 @@ func (p *Provider) createOrUpdateDNSRecord(record *iov1.DNSRecord, zone configv1
log.Info("Warning: TTL must be one of [1 60 120 300 600 900 1800 3600 7200 18000 43200]. RecordTTL set to default", "default DSNSVCS record TTL", defaultDNSSVCSRecordTTL)
record.Spec.RecordTTL = defaultDNSSVCSRecordTTL
}
// We only support one target, warn the user.
if len(record.Spec.Targets) > 1 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add this to ValidateInputDNSData?

We could, but I only added the warnings on creates, and ValidateInputDNSData is also called on deletes. And it feels odd to warn on deletes, because we actually do want to allow deleting with multiple targets for compatibility.

It's nuanced, but in the extremely rare case that someone was unsupportedly using multiple targets previously, the delete function will iterate through them to ensure all records are deleted to prevent an orphaned DNS record:

// While creating DNS records with multiple targets is unsupported, we still
// iterate through all targets during deletion to be extra cautious.
for _, target := range record.Spec.Targets {

Shouldn't this error stop the addition of the record, rather than just logging it?

I used CIO's AWS DNS provider as precedent, it doesn't break or error out, or provide a warning:

// TODO: handle >0 targets
domain, target := record.Spec.DNSName, record.Spec.Targets[0]

But I suppose erroring out would prevent a user from getting into a partially functioning state, requiring them to fix the targets upfront, to only provide what is supported. Maybe the AWS logic is bad precedent. That does seem like a better signal than a warning. I can update. But unfortunately still can't go into ValidateInputDNSData because of what I mentioned above (we still want the delete to continue with multiple targets).

for _, resourceRecord := range listResult.ResourceRecords {
if *resourceRecord.Name == dnsName {
if resourceRecord.ID == nil {
return fmt.Errorf("createOrUpdateDNSRecord: record id is nil")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resourceRecord.Name must be unique in IBM Cloud DNS, and at this point in time we've already found a match with dnsName, that's the only DNS name. No point in continuing.

pkg/dns/ibm/private/dnssvcs_provider_test.go Outdated Show resolved Hide resolved
pkg/dns/ibm/private/dnssvcs_provider_test.go Outdated Show resolved Hide resolved
pkg/dns/ibm/public/cis_provider.go Outdated Show resolved Hide resolved
pkg/dns/ibm/public/cis_provider_test.go Outdated Show resolved Hide resolved
pkg/dns/ibm/public/cis_provider_test.go Outdated Show resolved Hide resolved
pkg/dns/ibm/public/cis_provider_test.go Outdated Show resolved Hide resolved
pkg/dns/ibm/public/cis_provider_test.go Show resolved Hide resolved
pkg/dns/ibm/public/cis_provider_test.go Show resolved Hide resolved
@candita
Copy link
Contributor

candita commented Jan 17, 2025

/retest

Copy link
Contributor

openshift-ci bot commented Jan 18, 2025

@gcs278: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-ibmcloud-operator 2223383 link false /test e2e-ibmcloud-operator
ci/prow/okd-scos-e2e-aws-ovn b0b91bb link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-ovn-serial b0b91bb link true /test e2e-aws-ovn-serial

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@gcs278
Copy link
Contributor Author

gcs278 commented Jan 18, 2025

e2e-aws-ovn-serial has issue that openshift/origin#29447 is attempting to fix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants