Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: stop discovery errors from bootstrap process #1465

Merged
merged 1 commit into from
Jun 29, 2023

Conversation

drivebyer
Copy link
Contributor

@drivebyer drivebyer commented Feb 24, 2023

In test env, we got error sometimes:
截屏2023-02-24 15 13 07

We should not affect by extension API server during bootstrap process. The ServerPreferredResources() first returned result still contain recource info although err is not nil, see:

for _, apiGroup := range apiGroups.Groups {
	for _, version := range apiGroup.Versions {
		groupVersion := schema.GroupVersion{Group: apiGroup.Name, Version: version.Version}
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer utilruntime.HandleCrash()


			apiResourceList, err := d.ServerResourcesForGroupVersion(groupVersion.String())


			// lock to record results
			resultLock.Lock()
			defer resultLock.Unlock()


			if err != nil {
				// TODO: maybe restrict this to NotFound errors
				failedGroups[groupVersion] = err
			}
			if apiResourceList != nil {
				// even in case of error, some fallback might have been returned
				groupVersionResources[groupVersion] = apiResourceList
			}
		}()
	}
}

https://github.com/kubernetes/client-go/blob/ba2fdddad18cb9d1822c00ddd5ba8cb0a428ebe7/discovery/discovery_client.go#L512-L536

@drivebyer
Copy link
Contributor Author

@vadmeste
Copy link
Member

vadmeste commented Feb 24, 2023

@drivebyer do you know when this happens exactly ? old k8s version, a specific k8s flavor

@drivebyer
Copy link
Contributor Author

do you know when this happens exactly ?

hi @vadmeste ,My k8s version is:

[root@g-master1 ~]# kubectl get nodes
NAME        STATUS   ROLES           AGE   VERSION
g-master1   Ready    control-plane   44h   v1.24.7
g-master2   Ready    control-plane   44h   v1.24.7
g-master3   Ready    control-plane   44h   v1.24.7

I think it is not about to k8s version, but about the ServerPreferredResources() behivior in client-go. You can reproduce it by following step:

  1. create api service without api server behind:
[root@mcamel-dev-master ~]# kubectl apply -f /tmp/test.yaml
apiservice.apiregistration.k8s.io/v1alpha1.example.api.com created
[root@mcamel-dev-master ~]# cat /tmp/test.yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1alpha1.example.api.com
spec:
  group: example.api.com
  groupPriorityMinimum: 100
  versionPriority: 15
  service:
    name: example-service
    namespace: default
  version: v1alpha1
[root@mcamel-dev-master ~]# kubectl get apiservices.apiregistration.k8s.io | grep example
v1alpha1.example.api.com                             default/example-service   False (ServiceNotFound)   21s
  1. restart minio operator:
[root@mcamel-dev-master ~]# kubectl -n mcamel-system delete po minio-operator-79cf7b8996-wgn9d
pod "minio-operator-79cf7b8996-wgn9d" deleted
[root@mcamel-dev-master ~]# kubectl -n mcamel-system delete po minio-operator-79cf7b8996-dp87s
pod "minio-operator-79cf7b8996-dp87s" deleted
[root@mcamel-dev-master ~]# kubectl -n mcamel-system get po | grep minio
minio-operator-79cf7b8996-lpfm8                             0/1     Error       2 (17s ago)     22s
minio-operator-79cf7b8996-v9s4v                             1/1     Running     0               13s

then you can see error message:

[root@mcamel-dev-master ~]# kubectl -n mcamel-system logs minio-operator-79cf7b8996-v9s4v
I0225 06:02:56.002944       1 main.go:70] Starting MinIO Operator
I0225 06:02:56.637687       1 main.go:169] caBundle on CRD updated
I0225 06:02:56.638616       1 main-controller.go:244] Setting up event handlers
I0225 06:02:56.638856       1 leaderelection.go:248] attempting to acquire leader lease mcamel-system/minio-operator-lock...
I0225 06:02:56.647804       1 main-controller.go:496] minio-operator-79cf7b8996-lpfm8: is the leader, removing any leader labels that I 'minio-operator-79cf7b8996-v9s4v' might have
[root@mcamel-dev-master ~]# kubectl -n mcamel-system logs minio-operator-79cf7b8996-lpfm8
I0225 06:03:05.763885       1 main.go:70] Starting MinIO Operator
I0225 06:03:06.412054       1 main.go:169] caBundle on CRD updated
I0225 06:03:06.413340       1 main-controller.go:244] Setting up event handlers
I0225 06:03:06.413608       1 leaderelection.go:248] attempting to acquire leader lease mcamel-system/minio-operator-lock...
I0225 06:03:06.428900       1 leaderelection.go:258] successfully acquired lease mcamel-system/minio-operator-lock
I0225 06:03:06.428985       1 main-controller.go:478] minio-operator-79cf7b8996-lpfm8: I am the leader, applying leader labels on myself
I0225 06:03:06.429174       1 main-controller.go:390] Waiting for API to start
I0225 06:03:06.429638       1 main-controller.go:382] Starting HTTP Upgrade Tenant Image server
panic: unable to retrieve the complete list of server APIs: example.api.com/v1alpha1: the server is currently unable to handle the request

goroutine 99 [running]:
github.com/minio/operator/pkg/controller/cluster/certificates.GetCertificatesAPIVersion.func1()
	github.com/minio/operator/pkg/controller/cluster/certificates/csr.go:100 +0x1e5
sync.(*Once).doSlow(0x0?, 0x0?)
	sync/once.go:68 +0xc2
sync.(*Once).Do(...)
	sync/once.go:59
github.com/minio/operator/pkg/controller/cluster/certificates.GetCertificatesAPIVersion({0x1e97480?, 0xc0007a2780?})
	github.com/minio/operator/pkg/controller/cluster/certificates/csr.go:89 +0x5e
github.com/minio/operator/pkg/controller/cluster.(*Controller).Start.func1.1()
	github.com/minio/operator/pkg/controller/cluster/main-controller.go:350 +0x65
created by github.com/minio/operator/pkg/controller/cluster.(*Controller).Start.func1
	github.com/minio/operator/pkg/controller/cluster/main-controller.go:348 +0x10c

@harshavardhana
Copy link
Member

IMO we shouldn't panic, instead slowdown and retry here. This is a valid scenario, even if things are not running rightfully we must not panic here IMO.

We should keep the operator reconcile this situation and retry for ever.

@drivebyer
Copy link
Contributor Author

drivebyer commented Feb 28, 2023

We should keep the operator reconcile this situation and retry for ever.

If we retry here, the bootstrap process will hang unless the extension API server be healthy. IMO, the minio operator should not affected by the extension API server.

The apiservice v1.certificates.k8s.io is Local. So we only care about the k8s apiserver's health status.

[root@mcamel-dev-master ~]# kubectl get apiservices.apiregistration.k8s.io  | grep certificates
v1.certificates.k8s.io                               Local                    True        225d

@dvaldivia
Copy link
Collaborator

We need to slowdown our rate of requests to the k8s api as well

@drivebyer
Copy link
Contributor Author

We need to slowdown our rate of requests to the k8s api as well

The current call trace is main() -> Controller.Start() -> GetCertificatesAPIVersion(). GetCertificatesAPIVersion() core logic is:

func GetCertificatesAPIVersion(clientSet kubernetes.Interface) CSRVersion {
	// we will calculate which CSR version to use only once to avoid having to discover the
	certificateVersionOnce.Do(func() {
                 ...           
		switch version {
                 ...
		default:
			apiVersions, err := clientSet.Discovery().ServerPreferredResources()
                         ...
			for _, api := range apiVersions {
				if api.GroupVersion == "certificates.k8s.io/v1beta1" {
					csrVersion = CSRV1Beta1 // THE MEMORY VAR for later use
					break
				}
			}
		}
	})
	return csrVersion
}

Call ServerPreferredResources() only once. IMO, there is no place to adjust rate.

@drivebyer
Copy link
Contributor Author

drivebyer commented Mar 7, 2023

Copy link
Member

@vadmeste vadmeste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@drivebyer
Copy link
Contributor Author

drivebyer commented Jun 5, 2023

hi @harshavardhana @dvaldivia @cniackz @pjuarezd @cesnietor is there any suggestion about this pr

@pjuarezd pjuarezd merged commit e1f2a7a into minio:master Jun 29, 2023
@drivebyer drivebyer deleted the fix_discover_client_error branch June 30, 2023 01:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants