
listener: listen socket factory can be cloned from listener draining filter chain #18686

Merged
2 commits merged on Oct 20, 2021

Conversation

@lambdai (Contributor) commented Oct 19, 2021


Commit Message:
Currently, the in-place-updated old listener holds the listen fd during the drain timeout.

If the corresponding active listener is removed and then added back, the re-added listener creates brand-new passive sockets with new inodes.
The kernel then spreads incoming connections across the old listen fd and the new listen fd.

Connections accepted onto the old listen fd will be lost. Even if the kernel/eBPF somehow requeued these sockets,
they would only be handled by Envoy after the drain timeout is triggered.

This PR duplicates the listen fd from the old listener. The new listen fd shares the same inode,
so the newly created listener will consume the already-accepted sockets, if any (see the illustrative sketch after the description fields below).

Signed-off-by: Yuchen Dai silentdai@gmail.com

Additional Description:
Risk Level:
Testing: Unit tests and a set of Istio e2e tests
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
Fix #18616
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]
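
The fix relies on standard file-descriptor duplication semantics rather than creating a new socket. Below is a minimal standalone sketch of that mechanism, not Envoy code: it assumes a POSIX environment, and the port and variable names are arbitrary.

```cpp
// Minimal standalone sketch (not Envoy code): dup() on a listening socket
// returns a second fd that refers to the same underlying kernel socket
// (same inode, same accept queue), so connections already queued on the
// original fd can still be accepted through the duplicate after the
// original fd is closed.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int old_fd = socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(8080);  // arbitrary port, for illustration only
  if (bind(old_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
      listen(old_fd, 128) != 0) {
    std::perror("bind/listen");
    return 1;
  }

  // "Clone" the listen socket: new_fd shares the inode and the accept queue.
  int new_fd = dup(old_fd);

  // Closing the old fd does not destroy the socket while new_fd stays open;
  // connections sitting in the accept queue remain acceptable.
  close(old_fd);

  sockaddr_in peer{};
  socklen_t len = sizeof(peer);
  // Blocks until a client connects (or succeeds immediately if a connection
  // was already queued before old_fd was closed).
  int conn = accept(new_fd, reinterpret_cast<sockaddr*>(&peer), &len);
  if (conn >= 0) {
    std::printf("accepted a connection through the duplicated fd\n");
    close(conn);
  }
  close(new_fd);
  return 0;
}
```

Because both descriptors reference the same kernel socket object, the accept queue is shared; this is what lets the re-added listener pick up connections that arrived while the old listener was draining.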

First commit: listener: listen socket factory can be cloned from listener draining filter chain
Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@lambdai (Contributor, Author) commented Oct 19, 2021

/assign @mattklein123

@lambdai (Contributor, Author) commented Oct 19, 2021

I would argue that #18677 is also needed.

In the test case, we add the listener 0.0.0.0:8080. It is fine that requests which arrived prior to the add are queued.

However, the timing of the add-back is largely unpredictable.
I would expect that stopping the listener on 0.0.0.0:8080 means Envoy rejects new requests as soon as possible.

@lambdai (Contributor, Author) commented Oct 20, 2021

/retest

@repokitteh-read-only (bot):

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #18686 (comment) was created by @lambdai.


@mattklein123 (Member) commented:

> I would argue that #18677 is also needed.

Can you explain this more, please? From our lengthy offline discussion I still don't understand how #18677 helps. As long as the sockets are duplicated/cloned, there should be no issue?

@mattklein123 (Member) left a comment:


Thanks, makes sense to me, with a small comment.

/wait

Review comment on source/server/listener_impl.cc (outdated, resolved)
Second commit: Signed-off-by: Yuchen Dai <silentdai@gmail.com>
@mattklein123 (Member) left a comment:


Thanks! We should backport this obviously.

@lambdai (Contributor, Author) commented Oct 20, 2021

> I would argue that #18677 is also needed.
>
> Can you explain this more, please? From our lengthy offline discussion I still don't understand how #18677 helps. As long as the sockets are duplicated/cloned, there should be no issue?

There is no issue if we add a fresh new listener after stopping the listener on that address.

Think about why adding back a fresh listener can end up duplicating the draining-filter-chain listener's socket:
it is because the old in-place-updated listener doesn't close the listen socket during the drain_timeout after it is told to be removed.

The state of the system is described here:
https://github.com/envoyproxy/envoy/pull/18686/files#diff-f8e00b71778057a3747a30c4fc3f3ad13e551cadebaf0eca9bdb476b42cadbcaR5112-R5117
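
To make the failure mode concrete, here is an illustration of the kernel behavior described above. This is not Envoy code; it assumes Linux listen sockets bound with SO_REUSEPORT (which is what produces the load-balancing between the old and new listen fds), and the port and helper names are arbitrary.

```cpp
// Sketch of the failure mode (not Envoy code): two independently bound
// SO_REUSEPORT sockets on the same address each get their own accept queue,
// unlike a dup()'d fd, and the kernel load-balances new connections between
// them.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>

static int make_reuseport_listener(uint16_t port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  int one = 1;
  setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port = htons(port);
  bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(fd, 128);
  return fd;
}

int main() {
  // Old listener: kept open while its filter chains drain.
  int draining_fd = make_reuseport_listener(8080);  // port is arbitrary

  // Listener removed and added back: a brand-new socket with a new inode and
  // a separate accept queue. The kernel now spreads incoming connections
  // between draining_fd and fresh_fd.
  int fresh_fd = make_reuseport_listener(8080);

  // ...nothing accepts on draining_fd anymore...

  // When the drain timeout fires and draining_fd is closed, connections still
  // queued on it are dropped (reset) and never reach the new listener.
  close(draining_fd);
  close(fresh_fd);
  return 0;
}
```

Cloning the old listener's fd instead of binding a fresh socket avoids the second accept queue entirely, which is the point of this PR.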

@lambdai (Contributor, Author) commented Oct 20, 2021

> Thanks! We should backport this obviously.

Thank you for the offline guidance and the quick approval!

@lambdai (Contributor, Author) commented Oct 20, 2021

/backport

Only needs a backport to 1.20.

@repokitteh-read-only (bot) added the backport/review label (Request to backport to stable releases) on Oct 20, 2021
@mattklein123 (Member) commented:

> Think about why adding back a fresh listener can end up duplicating the draining-filter-chain listener's socket:
> it is because the old in-place-updated listener doesn't close the listen socket during the drain_timeout after it is told to be removed.

Sorry, I still don't understand. If you want to continue the discussion on the other PR we can talk there, but I don't see any scenario in which that PR will help. The real fix is cloning, if we want to reuse a socket and not drop anything.

Labels: backport/approved (Approved backports to stable releases)
Projects: none yet
Merging this pull request may close: Envoy v1.20.0 slower to re-initialize dynamic listeners
3 participants