
bug: kubernetes discovery failed without recover #7489

Closed
tangzhenhuang opened this issue Jul 19, 2022 · 6 comments
Labels
bug Something isn't working

Comments

@tangzhenhuang
Contributor

Current Behavior

The Kubernetes discovery does not handle the 410 Gone status code in the list stage, so the continue parameter can never be updated and the list keeps failing with 410 Gone in an infinite loop.

Expected Behavior

When the apiserver returns 410 Gone, update the continue parameter of the request instead of failing outright.
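
A minimal sketch of what that could look like in the discovery's list loop (Lua; list_resource(), handle_items() and the informer fields are assumed names for illustration, not the actual informer_factory.lua API):

    -- Hedged sketch: paginated list that resets the continue token on 410 Gone
    -- instead of replaying the expired one forever.
    local function list_with_continue(informer)
        while true do
            local res, err = list_resource(informer.kind, informer.continue)
            if not res then
                return nil, err
            end

            if res.code == 410 then
                -- The continue token has expired. Take the fresh token from the
                -- Status body (the 410 response carries one) or clear it, so the
                -- next attempt does not replay the stale token.
                informer.continue = (res.body.metadata and res.body.metadata.continue) or ""
                return nil, "continue token expired, relist required"
            end

            handle_items(informer, res.body.items)

            -- Advance pagination; an empty continue means the list is complete.
            informer.continue = res.body.metadata.continue
            if informer.continue == nil or informer.continue == "" then
                return true
            end
        end
    end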

Error Logs

2022/07/18 11:25:49 [error] 390#390: *32338366 [lua] informer_factory.lua:292: list failed, kind: Endpoints, reason: Gone, message : {"kind":"Status","apiVersion":"v1","metadata":{"continue":"eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6LTEsInN0YXJ0IjoiM2Q4OWZlZTQwOTdjYzVhMzVjMDk0MDVmODJjOWNkOTA5NzQ4YTU5Ni1wcm8tcHJvL2htYVx1MDAwMCJ9"},"status":"Failure","message":"The provided continue parameter is too old to display a consistent list result. You can start a new list without the continue parameter, or use the continue token in this response to retrieve the remainder of the results. Continuing with the provided token results in an inconsistent list - objects that were created, modified, or deleted between the time the first chunk was returned and now may show up in the list.","reason":"Expired","code":410}

Steps to Reproduce

The failure occurs occasionally during the Kubernetes discovery list stage and is difficult to reproduce on demand.

Environment

  • APISIX version (run apisix version): 2.14.1
  • Operating system (run uname -a):
  • OpenResty / Nginx version (run openresty -V or nginx -V):
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):
@tangzhenhuang
Contributor Author

@zhixiongdu027 Maybe you can check this?

@tokers added the bug label Jul 19, 2022
@zhixiongdu027
Contributor

@crazyMonkey1995 @tokers
Sorry, I didn't really expect a "Gone" error on "List".
I will try to verify and repair it.

@tangzhenhuang
Contributor Author


However, this problem is difficult to reproduce. You can refer to the implementation of client-go; the way it handles 410 Gone seems to be a relist().
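
For illustration, the relist idea in the same assumed Lua terms as the sketch above (clear_local_cache and the informer fields are hypothetical names, not the real cache API):

    -- Hypothetical relist helper: drop the expired pagination/version state and
    -- list from scratch, roughly what client-go's Reflector does on 410 Gone.
    local function relist(informer)
        informer.continue = ""        -- discard the expired continue token
        informer.version = ""         -- forget the stale resourceVersion
        clear_local_cache(informer)   -- illustrative; the real cache API may differ
        return list_with_continue(informer)
    end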

@zhixiongdu027
Contributor

A given Kubernetes server will only preserve a historical record of changes for a limited time. Clusters using etcd 3 preserve changes in the last 5 minutes by default. When the requested watch operations fail because the historical version of that resource is not available, clients must handle the case by recognizing the status code 410 Gone, clearing their local cache, performing a new get or list operation, and starting the watch from the resourceVersion that was returned.

Going by that description, the only solution to the Gone error is to clear the local cache and then re-list/watch.
If so, maybe the current behavior isn't a problem.

@crazyMonkey1995

@tangzhenhuang
Contributor Author


Here's the problem: the informer carries a continue parameter that is set during listing, but it is not cleared when the apiserver returns 410, so the stale continue parameter is still there when the list is retried 40 seconds later and the request keeps getting 410.
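
Roughly, the missing piece is something like this (a sketch with assumed field names, not the actual informer_factory.lua code):

    -- Hedged sketch: clear the stale pagination state when a list fails with 410,
    -- so the retry 40 seconds later starts a fresh list instead of replaying the
    -- expired continue token.
    local function reset_on_gone(informer, status_code)
        if status_code == 410 then
            informer.continue = ""
            informer.version = ""
            return true -- caller should relist from scratch
        end
        return false
    end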

@zhixiongdu027
Contributor

@crazyMonkey1995

I will fix it, many thanks.
