Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix e2e conformance test predicates conflict hostport #96627

Merged
merged 1 commit into from
Nov 20, 2020

Conversation

aojea
Copy link
Member

@aojea aojea commented Nov 17, 2020

What type of PR is this?

/kind cleanup
/kind failing-test

What this PR does / why we need it:

The e2e test, included as part of Conformance,
"validates that there is no conflict between
pods with same hostPort but different hostIP and protocol"
was only testing that the pods were scheduled without conflict
but was never testing the functionality.

The test should check that pods with containers forwarding the same
hostPort can be scheduled without conflict, and that those exposed
HostPort are forwarding the ports to the corresponding pods.

Which issue(s) this PR fixes:

Special notes for your reviewer:

The Conformance test "validates that there is no conflict between pods with same hostPort but different hostIP and protocol" now validates the connectivity to each hostPort, in addition to the functionality.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 17, 2020
@k8s-ci-robot
Copy link
Contributor

@aojea: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test labels Nov 17, 2020
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 17, 2020
@aojea
Copy link
Member Author

aojea commented Nov 17, 2020

/assign @BenTheElder @spiffxp @johnbelamaric

@aojea aojea changed the title correct e2e test predicates conflict hostport correct e2e conformance test predicates conflict hostport Nov 17, 2020
@aojea aojea force-pushed the hostport branch 2 times, most recently from 2b540f7 to e4d1618 Compare November 17, 2020 12:03
@aojea
Copy link
Member Author

aojea commented Nov 17, 2020

/retest

@aojea
Copy link
Member Author

aojea commented Nov 17, 2020

/retest
🤔 this only touches e2e

The e2e test, included as part of Conformance,
"validates that there is no conflict between
 pods with same hostPort but different hostIP and protocol"
was only testing that the pods were scheduled without conflict
but was never testing the functionality.

The test should check that pods with containers forwarding the same
hostPort can be scheduled without conflict, and that those exposed
HostPort are forwarding the ports to the corresponding pods.

the predicate tests were using loopback addresses for the the
hostPort test, however, those have different semantics depending
on the IP family, i.e. you can not bind to ::1 and ::2 simultanously,
in addition, IP forwarding from localhost to localhost in IPv6 is
not working since it doesn't have the kernel route_localnet hack.
@aojea
Copy link
Member Author

aojea commented Nov 18, 2020

for context, the actual test is wrong because it succeeds despite the pods are not doing portmapping, we hit that in Openshift CI

@aojea aojea changed the title correct e2e conformance test predicates conflict hostport fix e2e conformance test predicates conflict hostport Nov 18, 2020
@spiffxp
Copy link
Member

spiffxp commented Nov 20, 2020

/area conformance
/approve
/lgtm
/milestone v1.20
FYI @kubernetes/ci-signal

I'm allowing this in because we're not at test freeze, but if we see an increase in flakiness in this test over the weekend, or in PR's next week, we should revert this rather than try to fix-forward.

@k8s-ci-robot k8s-ci-robot added the area/conformance Issues or PRs related to kubernetes conformance tests label Nov 20, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Nov 20, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 20, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 20, 2020
@k8s-ci-robot k8s-ci-robot merged commit 8095565 into kubernetes:master Nov 20, 2020
@ravisantoshgudimetla
Copy link
Contributor

ravisantoshgudimetla commented Nov 25, 2020

I think this is causing a failure in case of Windows as hostNetwork is not supported on Windows hosts.

Ref: https://testgrid.k8s.io/sig-windows-releases#aks-engine-azure-windows-master-staging-serial-slow

@aojea - I think the job of scheduler is done when the pod has properly assigned a node, actual testing of networking to me should have been a networking test in test/e2e/network instead of where we have it currentlytest/e2e/scheduler. I think a decent compromise would be to skip running the ping/curl tests in case of Windows for now. Let me know if you want me to handle it.

This actually brings another interesting point. This is another scenario where running Windows tests earlier (atleast a subset) would have caught the problem, can we move to having the Windows tests as optional tests as a starting point, considering we have been deflaking the CI in the past few weeks.

cc @jsturtevant @marosset @m2 @BenTheElder

@aojea
Copy link
Member Author

aojea commented Nov 25, 2020

@aojea - I think the job of scheduler is done when the pod has properly assigned a node, actual testing of networking to me should have been a networking test in test/e2e/network instead of where we have it currentlytest/e2e/scheduler. I think a decent compromise would be to skip running the ping/curl tests in case of Windows for now. Let me know if you want me to handle it.

This is an e2e and conformance test, I personally think that it should verify that ALL works, not only the scheduler job. If the scheduler is able to schedule the pods, but once they are scheduled, they are not working or not exposing the ports, it is not useful for any user :)

I'm not familiar with windows at all but that testgrid link that you pasted only shows jobs running 56 tests, and not being able to run hostnetwork pods makes me think that it will fail in a lot of e2e and conformance test.

I personally think that we should not modify a conformance test to have a different behaviour per platform, and I would suggest to create new scheduler test with the previous behavior if this is really important for windows, but this is my opinion, I have to defer to the sig-testing and conformance people Ben, @spiffxp , @dims, @johnbelamaric ,.

@ravisantoshgudimetla
Copy link
Contributor

This is an e2e and conformance test, I personally think that it should verify that ALL works, not only the scheduler job. If the scheduler is able to schedule the pods, but once they are scheduled, they are not working or not exposing the ports, it is not useful for any user :)

Right, I am not saying we shouldn't test if the exposed ports are working or not. I am just saying that it's not part of scheduler job and perhaps this test should live elsewhere.

I'm not familiar with windows at all but that testgrid link that you pasted only shows jobs running 56 tests, and not being able to run hostnetwork pods makes me think that it will fail in a lot of e2e and conformance test.

Yeah, the dashboard link I shared is for serial tests

I personally think that we should not modify a conformance test to have a different behaviour per platform, and I would suggest to create new scheduler test with the previous behavior if this is really important for windows, but this is my opinion, I have to defer to the sig-testing and conformance people Ben, @spiffxp , @dims, @johnbelamaric ,.

I agree but hostNetwork is not clearly supported on Windows, I think the question is what should we do in these type of scenarios where different platforms behave differently in case of conformance tests.

@aojea
Copy link
Member Author

aojea commented Nov 26, 2020

Ahh ok, thanks for the clarification, agree with you on the points and the questions that you are raising 😄

@marosset
Copy link
Contributor

@adelina-t - She was also looking at some hostport issues raised by tests for Windows nodes.

@BenTheElder
Copy link
Member

Firstly: Windows "conformance" does not exist as of yet. Conformance testing targets linux.
If windows cannot support a test, we have existing tags for filtering out linux only tests.

The conformance test should not alter behavior based on platform.
For one thing the existing e2e.test platform detection mechanism is extremely broken / misguided, clusters can be linux + windows (in fact all of them are, you can't run the control plane on windows) but it detects the entire cluster are linux or windows when it's actually nodes that may be windows.

AFAIK hostNetwork is fair game for conformance tests but that would be up to the conformance project, if it is fair game and it can't work on windows then the linux only tag is appropriate rather than altering the behavior.

We are still trying to reduce the reliance on presubmit. AIUI windows tests are already running in post-submit, but people need to monitor the results.

@marosset
Copy link
Contributor

marosset commented Nov 30, 2020

AIUI windows tests are already running in post-submit, but people need to monitor the results.

@BenTheElder we've added a few pre-submit jobs to verify Windows functionality but currently they must be triggered manually.
A majority of the test coverage is covered post-submit tho.

https://github.com/kubernetes/test-infra/blob/750d12f6f39e286c8c590eb0c28d50b92cb33e02/config/jobs/kubernetes-sigs/sig-windows/sig-windows-config.yaml#L22 contains the triggers / job names.

@BenTheElder
Copy link
Member

👍 thanks, I also see some of the periodics have email alerting enabled 🎉
The alerting specifics also have a few knobs (like recency considered) that can be tuned as needed. Unfortunately many jobs in general do not have anyone subscribed to the alerts, so we lean pretty heavily on the release team's CI signal group ..
https://github.com/kubernetes/test-infra/blob/750d12f6f39e286c8c590eb0c28d50b92cb33e02/config/jobs/kubernetes-sigs/sig-windows/sig-windows-config.yaml#L366

@ravisantoshgudimetla
Copy link
Contributor

Firstly: Windows "conformance" does not exist as of yet. Conformance testing targets linux.
If windows cannot support a test, we have existing tags for filtering out linux only tests.

Thank you for the clarification @BenTheElder. I was under the impression that conformance tests are for both Linux and Windows.

For one thing the existing e2e.test platform detection mechanism is extremely broken / misguided, clusters can be linux + windows (in fact all of them are, you can't run the control plane on windows) but it detects the entire cluster are linux or windows when it's actually nodes that may be windows.

To be clear, I was thinking of using --node-os-distro=windows and then use NodeDistroIs flag in our e2e code but I think it violates the point you made earlier that conformance is for linux only. So, I am fine making the test Linux only for now.

I think we can change the ownership of this test to networking in a separate PR

@liggitt
Copy link
Member

liggitt commented Jan 25, 2021

is the mechanism this used to verify connectivity required/guaranteed to work on all conformant clusters?

@aojea
Copy link
Member Author

aojea commented Jan 26, 2021

is the mechanism this used to verify connectivity required/guaranteed to work on all conformant clusters?

we found that this test was passing for weeks despite it must be failing, because one of the pods failed to star with a conflict, so it restarted, cleaning the other hostports and removing the conflict. The result is that the pods are scheduled but they are not working as expected, you can not reach them in the HostPort expected, well, only one of them.

You have to verify that the host ports exposed are really working, so you can be sure that the each hostPort is being forwarded to the corresponding HostIP,

@aojea
Copy link
Member Author

aojea commented Jan 26, 2021

I have a followup PR to move the ownership to sig-network #98299

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/conformance Issues or PRs related to kubernetes conformance tests area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants