[occm] Make sure we don't mask LB tests failures and fix what was failing #2360

dulek · 2023-09-06T16:19:34Z

What this PR does / why we need it:
In test-lb-service.sh we do trap "delete_resources" EXIT to make sure we cleanup resources on a test failure. In there, we only fetched the $? after making a check for ${AUTO_CLEAN_UP}, which itself alters the code to 0, so function always returns success. This means tests can never really fail.

This commit fixes it by making sure $ERROR_CODE is fetched at the very beginning of the cleanup function.

Some additional fixes needed to be made to make tests passing again. In particular:

Deleting listeners when the secondary shared Service is deleted is fixed now by making sure we're comparing the correct tag.
Tests no longer create Services as internal in order to make shared LB tests work after [occm] Don't allow internal Services to share an LB #2190.
FIP is added to the user LB in the sharing test in order for CPO to not complain about not being able to create it for a secondary Service.

Which issue this PR fixes(if applicable):
fixes #2540

Special notes for reviewers:

Release note:

NONE

dulek · 2023-09-06T16:25:43Z

mdulko:cloud-provider-openstack/ (fix-test-failures*) $ bash tests/e2e/cloudprovider/test-lb-service.sh                                                                                                              
+ TIMEOUT=300                                                                                                                                                                                                        
+ FLOATING_IP=                                                                                                                                                                                                       
+ NAMESPACE=octavia-lb-test                                                                                                                                                                                          
+ GATEWAY_IP=                                                                                                                                                                                                        
+ DEVSTACK_OS_RC=/home/zuul/devstack/openrc                                                                                                                                                                          
+ CLUSTER_TENANT=demo                                                                                                                                                                                                
+ CLUSTER_USER=demo                                                                                                                                                                                                  
+ LB_SUBNET_NAME=private-subnet                                                                                                                                                                                      
+ AUTO_CLEAN_UP=true                                                                                                                                                                                                 
+ trap delete_resources EXIT                                                                                                                                                                                         
+ test_basic                                                                                                                                                                                                         
+ exit 1                                                                                                                                                                                                             
+ delete_resources                                                                                                                                                                                                   
+ [[ true != \t\r\u\e ]]                                                                                                                                                                                             
+ ERROR_CODE=0
+ exit 0

Here's how I tested this.

jichenjc · 2023-09-07T01:08:39Z

/test openstack-cloud-controller-manager-e2e-test

dulek · 2023-09-07T07:40:34Z

>>>>>>> FAIL: Timeout when waiting for the Service test-shared-2 created

Well, now we finally start to see failures.

/retest

dulek · 2023-09-07T07:43:08Z

I0906 16:55:52.107153      11 event.go:307] "Event occurred" object="octavia-lb-test/test-shared-2" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: internal Service cannot share a load balancer"

Okay, so this got broken when I started to limit LB sharing, but I never noticed because tests were busted.

mdbooth · 2023-09-07T09:24:01Z

@dulek Is it worth reverting the previous change to land this quickly? I'm guessing we wouldn't have merged it if we knew it broke a test?

dulek · 2023-09-07T10:44:07Z

@dulek Is it worth reverting the previous change to land this quickly? I'm guessing we wouldn't have merged it if we knew it broke a test?

mdulko:cloud-provider-openstack/ (provider-testing) $ git revert 788ea08fcf5714af8be3b0010bc6d6e5eb1d3a59
Auto-merging pkg/openstack/loadbalancer.go
CONFLICT (content): Merge conflict in pkg/openstack/loadbalancer.go
error: could not revert 788ea08fc... Don't allow internal Services to share an LB (#2190)

It's not that trivial anymore. Anyway I plan to look more closely at these tests, my current understanding is that adding service.beta.kubernetes.io/openstack-internal-load-balancer: "true" to the LB is just cargo cult from the previous test that did so to avoid unnecessary FIP allocation. I'm pretty sure this is all fixable.

dulek · 2023-09-07T10:46:21Z

@lingxiankong, do you know why we have this in place in tests?

dulek · 2023-09-07T13:29:27Z

+ tests/e2e/cloudprovider/test-lb-service.sh:test_shared_lb:564 :   printf '\n>>>>>>> Checking the listener number for the load balancer 9e56a8cc-4687-45a1-8c46-e5dbbb3f388b\n'
++ tests/e2e/cloudprovider/test-lb-service.sh:test_shared_lb:565 :   openstack loadbalancer status show 9e56a8cc-4687-45a1-8c46-e5dbbb3f388b
++ tests/e2e/cloudprovider/test-lb-service.sh:test_shared_lb:565 :   jq '.loadbalancer.listeners | length'
+ tests/e2e/cloudprovider/test-lb-service.sh:test_shared_lb:565 :   listenerNum=2
+ tests/e2e/cloudprovider/test-lb-service.sh:test_shared_lb:566 :   [[ 2 != 1 ]]
+ tests/e2e/cloudprovider/test-lb-service.sh:test_shared_lb:567 :   printf '\n>>>>>>> FAIL: The listener number should be 1 for load balancer 9e56a8cc-4687-45a1-8c46-e5dbbb3f388b, actual: 2\n'

Going forward…

mdbooth · 2023-09-07T13:42:59Z

@dulek Is it worth reverting the previous change to land this quickly? I'm guessing we wouldn't have merged it if we knew it broke a test?
mdulko:cloud-provider-openstack/ (provider-testing) $ git revert 788ea08fcf5714af8be3b0010bc6d6e5eb1d3a59
Auto-merging pkg/openstack/loadbalancer.go
CONFLICT (content): Merge conflict in pkg/openstack/loadbalancer.go
error: could not revert 788ea08fc... Don't allow internal Services to share an LB (#2190)
It's not that trivial anymore. Anyway I plan to look more closely at these tests, my current understanding is that adding service.beta.kubernetes.io/openstack-internal-load-balancer: "true" to the LB is just cargo cult from the previous test that did so to avoid unnecessary FIP allocation. I'm pretty sure this is all fixable.

Probably only worth it if the fix drags on a bit, then. In the meantime we probably shouldn't merge LB-related changes

dulek · 2023-09-07T15:31:46Z

/retest

I can see in the logs that listener was removed and LB was ACTIVE again. It couldn't be the ACTIVE-PENDING_UPDATE dance, as we only allow Service to be deleted once it's processed and kubectl waits for that.

Let's try it once again.

With shared LBs we distinguish the elements by tagging them with the proper name of the LB that would be created for a Service if it wasn't created as shared. This commit fixes that comparison for listener deletion as code was always comparing the name of the primary LB.

dulek · 2023-09-08T14:20:45Z

/retest

I can see in the logs that listener was removed and LB was ACTIVE again. It couldn't be the ACTIVE-PENDING_UPDATE dance, as we only allow Service to be deleted once it's processed and kubectl waits for that.

Let's try it once again.

Okay, there was a bug, fix is added.

dulek · 2023-09-08T15:15:04Z

>>>>>>> Waiting for the Service test-shared-user-lb creation finished

>>>>>>> FAIL: Timeout when waiting for the Service test-shared-user-lb created

That's another one.

dulek · 2023-09-08T15:17:19Z

Okay, this one was simple, I left some internal annotations by mistake.

PR kubernetes#2190 prohibited sharing an LB that is internal for security reasons. This commit fixes the shared LBs tests to not create internal LBs.

In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make sure we cleanup resources on a test failure. In there, we only fetched the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself alters the code to 0, so function always returns success. This means tests can never really fail. This commit fixes it by making sure `$ERROR_CODE` is fetched at the very beginning of the cleanup function.

dulek · 2023-09-11T17:07:34Z

Tests are actually passing again! 🥳

@jichenjc, @kayrus, @mdbooth, @zetaab, this is ready for reviews now.

jichenjc · 2023-09-12T02:12:02Z

/lgtm

@dulek given you actually create additional commit for other purpose (e.g compare the LB name) ,can you help
put that into #2360 (comment)
as well? Thanks

mdbooth

Second and third commits lgtm.

I don't understand the first commit yet.

mdbooth · 2023-09-12T10:26:12Z

pkg/openstack/loadbalancer.go

@@ -2390,7 +2390,7 @@ func (lbaas *LbaasV2) deleteLoadBalancer(loadbalancer *loadbalancers.LoadBalance
 					Protocol: proto,
 					Port:     int(port.Port),
 				}]
-				if isPresent && cpoutil.Contains(listener.Tags, loadbalancer.Name) {
+				if isPresent && cpoutil.Contains(listener.Tags, svcConf.lbName) {


Can you explain why this is different to loadbalancer.Name? I've read this function and the calling function, in which you're now setting svcConf.lbName to the return value of lbaas.GetLoadBalancerName. Does this imply that the loadbalancer name (retrieved by id?) is different to GetLoadBalancerName.

Yes, so this all boils down to the support for shared LBs that I believe is implemented in a way that is very problematic codewise.

How does it work? You add an annotation to the Service that features the LB ID you want that Service to use. The assumption is that the LB exists already and CPO will just add listener, pool and members (members should be the same really¹) to that LB. The secondary LB resources will be tagged with the LB name that would be used for the secondary Service so that we can distinguish them later on.

The bug here is that we looked up listeners comparing their tags with the name of the LB taken from OpenStack, which is based on the primary Service name. In that case we should use the secondary service name and this is the fix.

Footnotes

Now this makes me think that we might have a bug that 2 Services sharing an LB can fight for their view of LB members. Fixed on reconciliation, but not great. ↩

This feels very fragile, but fixing that is definitely outside of the scope of this PR.

/lgtm

It totally is and you'll probably be happy to see it disabled by default in OpenShift: openshift/cluster-cloud-controller-manager-operator#263.

dulek · 2023-09-12T15:58:45Z

/lgtm

@dulek given you actually create additional commit for other purpose (e.g compare the LB name) ,can you help put that into #2360 (comment) as well? Thanks

Done!

dulek · 2023-09-14T14:27:21Z

@jichenjc: Anything else to move this forward?

jichenjc · 2023-09-15T00:41:27Z

no, all good ~

/approve

k8s-ci-robot · 2023-09-15T00:41:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jichenjc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [jichenjc]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

dulek · 2023-09-19T12:18:56Z

/cherry-pick release-1.27

k8s-infra-cherrypick-robot · 2023-09-19T12:19:31Z

@dulek: #2360 failed to apply on top of branch "release-1.27":

Applying: Compare proper LB name for shared LBs
Using index info to reconstruct a base tree...
M	pkg/openstack/loadbalancer.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/openstack/loadbalancer.go
CONFLICT (content): Merge conflict in pkg/openstack/loadbalancer.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Compare proper LB name for shared LBs
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.27

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

…ling (kubernetes#2360) * Fix shared LBs tests PR kubernetes#2190 prohibited sharing an LB that is internal for security reasons. This commit fixes the shared LBs tests to not create internal LBs. * Make sure we don't mask LB tests failures In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make sure we cleanup resources on a test failure. In there, we only fetched the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself alters the code to 0, so function always returns success. This means tests can never really fail. This commit fixes it by making sure `$ERROR_CODE` is fetched at the very beginning of the cleanup function.

…ling (#2360) (#2367) * Fix shared LBs tests PR #2190 prohibited sharing an LB that is internal for security reasons. This commit fixes the shared LBs tests to not create internal LBs. * Make sure we don't mask LB tests failures In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make sure we cleanup resources on a test failure. In there, we only fetched the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself alters the code to 0, so function always returns success. This means tests can never really fail. This commit fixes it by making sure `$ERROR_CODE` is fetched at the very beginning of the cleanup function.

…ubernetes#2360) * Fix shared LBs tests PR kubernetes#2190 prohibited sharing an LB that is internal for security reasons. This commit fixes the shared LBs tests to not create internal LBs. * Make sure we don't mask LB tests failures In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make sure we cleanup resources on a test failure. In there, we only fetched the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself alters the code to 0, so function always returns success. This means tests can never really fail. This commit fixes it by making sure `$ERROR_CODE` is fetched at the very beginning of the cleanup function.

…2360) (#2430) * Fix shared LBs tests PR #2190 prohibited sharing an LB that is internal for security reasons. This commit fixes the shared LBs tests to not create internal LBs. * Make sure we don't mask LB tests failures In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make sure we cleanup resources on a test failure. In there, we only fetched the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself alters the code to 0, so function always returns success. This means tests can never really fail. This commit fixes it by making sure `$ERROR_CODE` is fetched at the very beginning of the cleanup function. Co-authored-by: Michal Dulko <mdulko@redhat.com>

…ling (kubernetes#2360) * Compare proper LB name for shared LBs With shared LBs we distinguish the elements by tagging them with the proper name of the LB that would be created for a Service if it wasn't created as shared. This commit fixes that comparison for listener deletion as code was always comparing the name of the primary LB. * Fix shared LBs tests PR kubernetes#2190 prohibited sharing an LB that is internal for security reasons. This commit fixes the shared LBs tests to not create internal LBs. * Make sure we don't mask LB tests failures In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make sure we cleanup resources on a test failure. In there, we only fetched the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself alters the code to 0, so function always returns success. This means tests can never really fail. This commit fixes it by making sure `$ERROR_CODE` is fetched at the very beginning of the cleanup function.

dulek · 2024-02-07T11:38:27Z

/cherry-pick release-1.28

For some reason this haven't made it to 1.28 branch, which is pretty bad. Trying to fix that now.

k8s-infra-cherrypick-robot · 2024-02-07T11:39:01Z

@dulek: #2360 failed to apply on top of branch "release-1.28":

Applying: Compare proper LB name for shared LBs
Using index info to reconstruct a base tree...
M	pkg/openstack/loadbalancer.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/openstack/loadbalancer.go
CONFLICT (content): Merge conflict in pkg/openstack/loadbalancer.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Compare proper LB name for shared LBs
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.28

For some reason this haven't made it to 1.28 branch, which is pretty bad. Trying to fix that now.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

…ling (kubernetes#2360) * Compare proper LB name for shared LBs With shared LBs we distinguish the elements by tagging them with the proper name of the LB that would be created for a Service if it wasn't created as shared. This commit fixes that comparison for listener deletion as code was always comparing the name of the primary LB. * Fix shared LBs tests PR kubernetes#2190 prohibited sharing an LB that is internal for security reasons. This commit fixes the shared LBs tests to not create internal LBs. * Make sure we don't mask LB tests failures In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make sure we cleanup resources on a test failure. In there, we only fetched the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself alters the code to 0, so function always returns success. This means tests can never really fail. This commit fixes it by making sure `$ERROR_CODE` is fetched at the very beginning of the cleanup function.

…ling (#2360) (#2537) * Compare proper LB name for shared LBs With shared LBs we distinguish the elements by tagging them with the proper name of the LB that would be created for a Service if it wasn't created as shared. This commit fixes that comparison for listener deletion as code was always comparing the name of the primary LB. * Fix shared LBs tests PR #2190 prohibited sharing an LB that is internal for security reasons. This commit fixes the shared LBs tests to not create internal LBs. * Make sure we don't mask LB tests failures In `test-lb-service.sh` we do `trap "delete_resources" EXIT` to make sure we cleanup resources on a test failure. In there, we only fetched the `$?` after making a check for `${AUTO_CLEAN_UP}`, which itself alters the code to 0, so function always returns success. This means tests can never really fail. This commit fixes it by making sure `$ERROR_CODE` is fetched at the very beginning of the cleanup function.

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 6, 2023

k8s-ci-robot requested review from jichenjc and kayrus September 6, 2023 16:19

k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Sep 6, 2023

dulek force-pushed the fix-test-failures branch from f0e9984 to 12c6e3d Compare September 7, 2023 12:18

dulek force-pushed the fix-test-failures branch from 12c6e3d to 71dd1df Compare September 8, 2023 14:20

k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 8, 2023

dulek force-pushed the fix-test-failures branch from 71dd1df to a118d1f Compare September 8, 2023 15:16

dulek added 2 commits September 11, 2023 17:58

Fix shared LBs tests

403c54d

PR kubernetes#2190 prohibited sharing an LB that is internal for security reasons. This commit fixes the shared LBs tests to not create internal LBs.

dulek force-pushed the fix-test-failures branch from a118d1f to 22a186a Compare September 11, 2023 15:58

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 11, 2023

k8s-ci-robot assigned jichenjc Sep 12, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 12, 2023

mdbooth reviewed Sep 12, 2023

View reviewed changes

k8s-ci-robot assigned mdbooth Sep 12, 2023

dulek changed the title ~~[occm] Make sure we don't mask LB tests failures~~ [occm] Make sure we don't mask LB tests failures and fix what was failing Sep 12, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 15, 2023

k8s-ci-robot merged commit f8fceaf into kubernetes:master Sep 15, 2023

dulek mentioned this pull request Feb 7, 2024

[release-1.28] [occm] Make sure we don't mask LB tests failures and fix what was failing #2537

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[occm] Make sure we don't mask LB tests failures and fix what was failing #2360

[occm] Make sure we don't mask LB tests failures and fix what was failing #2360

dulek commented Sep 6, 2023 •

edited

Loading

dulek commented Sep 6, 2023

jichenjc commented Sep 7, 2023

dulek commented Sep 7, 2023

dulek commented Sep 7, 2023

mdbooth commented Sep 7, 2023

dulek commented Sep 7, 2023

dulek commented Sep 7, 2023

dulek commented Sep 7, 2023

mdbooth commented Sep 7, 2023

dulek commented Sep 7, 2023

dulek commented Sep 8, 2023

dulek commented Sep 8, 2023

dulek commented Sep 8, 2023

dulek commented Sep 11, 2023

jichenjc commented Sep 12, 2023

mdbooth left a comment

mdbooth Sep 12, 2023

dulek Sep 12, 2023 •

edited

Loading

mdbooth Sep 12, 2023

dulek Sep 12, 2023

dulek commented Sep 12, 2023

dulek commented Sep 14, 2023

jichenjc commented Sep 15, 2023

k8s-ci-robot commented Sep 15, 2023

dulek commented Sep 19, 2023

k8s-infra-cherrypick-robot commented Sep 19, 2023

dulek commented Feb 7, 2024

k8s-infra-cherrypick-robot commented Feb 7, 2024

[occm] Make sure we don't mask LB tests failures and fix what was failing #2360

[occm] Make sure we don't mask LB tests failures and fix what was failing #2360

Conversation

dulek commented Sep 6, 2023 • edited Loading

dulek commented Sep 6, 2023

jichenjc commented Sep 7, 2023

dulek commented Sep 7, 2023

dulek commented Sep 7, 2023

mdbooth commented Sep 7, 2023

dulek commented Sep 7, 2023

dulek commented Sep 7, 2023

dulek commented Sep 7, 2023

mdbooth commented Sep 7, 2023

dulek commented Sep 7, 2023

dulek commented Sep 8, 2023

dulek commented Sep 8, 2023

dulek commented Sep 8, 2023

dulek commented Sep 11, 2023

jichenjc commented Sep 12, 2023

mdbooth left a comment

Choose a reason for hiding this comment

mdbooth Sep 12, 2023

Choose a reason for hiding this comment

dulek Sep 12, 2023 • edited Loading

Choose a reason for hiding this comment

Footnotes

mdbooth Sep 12, 2023

Choose a reason for hiding this comment

dulek Sep 12, 2023

Choose a reason for hiding this comment

dulek commented Sep 12, 2023

dulek commented Sep 14, 2023

jichenjc commented Sep 15, 2023

k8s-ci-robot commented Sep 15, 2023

dulek commented Sep 19, 2023

k8s-infra-cherrypick-robot commented Sep 19, 2023

dulek commented Feb 7, 2024

k8s-infra-cherrypick-robot commented Feb 7, 2024

dulek commented Sep 6, 2023 •

edited

Loading

dulek Sep 12, 2023 •

edited

Loading