
bug: kubernetes discovery failed without recover #7489

Closed
tangzhenhuang opened this issue Jul 19, 2022 · 6 comments
Labels
bug Something isn't working

Comments

@tangzhenhuang
Contributor

Current Behavior

The Kubernetes discovery does not handle the 410 Gone status code in the list stage, so the continue parameter can never be updated and the list keeps failing with 410 Gone in an infinite loop.

Expected Behavior

When the apiserver returns 410 Gone, update the continue parameter of the request instead of failing outright.
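
A minimal sketch of what that could look like in the discovery's list loop (Lua; list_resource(), handle_items() and the informer fields are assumed names for illustration, not the actual informer_factory.lua API):

    -- Hedged sketch: paginated list that resets the continue token on 410 Gone
    -- instead of replaying the expired one forever.
    local function list_with_continue(informer)
        while true do
            local res, err = list_resource(informer.kind, informer.continue)
            if not res then
                return nil, err
            end

            if res.code == 410 then
                -- The continue token has expired. Take the fresh token from the
                -- Status body (the 410 response carries one) or clear it, so the
                -- next attempt does not replay the stale token.
                informer.continue = (res.body.metadata and res.body.metadata.continue) or ""
                return nil, "continue token expired, relist required"
            end

            handle_items(informer, res.body.items)

            -- Advance pagination; an empty continue means the list is complete.
            informer.continue = res.body.metadata.continue
            if informer.continue == nil or informer.continue == "" then
                return true
            end
        end
    end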

Error Logs

2022/07/18 11:25:49 [error] 390#390: *32338366 [lua] informer_factory.lua:292: list failed, kind: Endpoints, reason: Gone, message : {"kind":"Status","apiVersion":"v1","metadata":{"continue":"eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6LTEsInN0YXJ0IjoiM2Q4OWZlZTQwOTdjYzVhMzVjMDk0MDVmODJjOWNkOTA5NzQ4YTU5Ni1wcm8tcHJvL2htYVx1MDAwMCJ9"},"status":"Failure","message":"The provided continue parameter is too old to display a consistent list result. You can start a new list without the continue parameter, or use the continue token in this response to retrieve the remainder of the results. Continuing with the provided token results in an inconsistent list - objects that were created, modified, or deleted between the time the first chunk was returned and now may show up in the list.","reason":"Expired","code":410}

Steps to Reproduce

The failure occurs occasionally during the Kubernetes discovery list stage and is difficult to reproduce on demand.

Environment

  • APISIX version (run apisix version): 2.14.1
  • Operating system (run uname -a):
  • OpenResty / Nginx version (run openresty -V or nginx -V):
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):
@tangzhenhuang
Contributor Author

@zhixiongdu027 Maybe you can check this?

@tokers added the bug label Jul 19, 2022
@zhixiongdu027
Contributor

@crazyMonkey1995 @tokers
Sorry, I didn't really expect a "Gone" error on "List".
I will try to verify and repair it.

@tangzhenhuang
Contributor Author


However, this problem is difficult to reproduce. You can refer to the implementation of client-go; the way it handles 410 Gone seems to be a relist().
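
For illustration, the relist idea in the same assumed Lua terms as the sketch above (clear_local_cache and the informer fields are hypothetical names, not the real cache API):

    -- Hypothetical relist helper: drop the expired pagination/version state and
    -- list from scratch, roughly what client-go's Reflector does on 410 Gone.
    local function relist(informer)
        informer.continue = ""        -- discard the expired continue token
        informer.version = ""         -- forget the stale resourceVersion
        clear_local_cache(informer)   -- illustrative; the real cache API may differ
        return list_with_continue(informer)
    end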

@zhixiongdu027
Contributor

A given Kubernetes server will only preserve a historical record of changes for a limited time. Clusters using etcd 3 preserve changes in the last 5 minutes by default. When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a new get or list operation, and starting the watch from the resourceVersion that was returned.

Going by that description, the only solution to the Gone error is to clear the local cache and then re-list/watch.
If so, maybe the current behavior isn't a problem.

@crazyMonkey1995

@tangzhenhuang
Contributor Author


Here's the problem: the informer carries a continue parameter that is set during listing, but it is not cleared when the apiserver returns 410, so the stale continue parameter is still there when the list is retried 40 seconds later and the request keeps getting 410.
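
Roughly, the missing piece is something like this (a sketch with assumed field names, not the actual informer_factory.lua code):

    -- Hedged sketch: clear the stale pagination state when a list fails with 410,
    -- so the retry 40 seconds later starts a fresh list instead of replaying the
    -- expired continue token.
    local function reset_on_gone(informer, status_code)
        if status_code == 410 then
            informer.continue = ""
            informer.version = ""
            return true -- caller should relist from scratch
        end
        return false
    end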

@zhixiongdu027
Contributor

@crazyMonkey1995

I will fix it, many thanks.
