
Waiting for a minimum of 8 drives to come online (elapsed 12s) #1913

Closed
sathishkumar-p opened this issue Dec 17, 2023 · 23 comments
Labels: community, question (Further information is requested), triage

sathishkumar-p commented Dec 17, 2023

Hi,
I have deployed MinIO in distributed mode using the MinIO Operator. When I try to expand the pool, I get the following error:
Waiting for a minimum of 8 drives to come online (elapsed 12s)
I waited nearly 3 hours for all pods to sync, but it did not work, so I reverted to the previous pool size.

Expected Behavior

The extra pool should be added and the MinIO cluster should expand in size.

Current Behavior

Expansion failed because the minimum number of drives did not come online.

Steps to Reproduce (for bugs)

  1. Set up a MinIO distributed cluster using the MinIO Operator with 3 pools, 4 servers per pool, and 4 drives of 100Gi each.
  2. Add an extra pool (pool-4); the pods keep restarting and do not come online for 3 hours (see the watch sketch below).
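
For reference, a minimal way to watch the new pool's pods while such an expansion is in progress; the namespace and pod name below are placeholders, not taken from this deployment:

  # Watch the tenant pods while the new pool comes up; adjust the namespace.
  kubectl -n minio-tenant get pods -w
  # Inspect a restarting pod from the new pool for crash and readiness details.
  kubectl -n minio-tenant describe pod minio-pool-4-0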

Context

This is my production environment. We are reaching the size limit and are unable to expand it.

Regression

No (quay.io/minio/operator:v5.0.6)

Your Environment

  • Version used (minio-operator): quay.io/minio/operator:v5.0.6
  • Environment name and version (e.g. kubernetes v1.17.2): v1.26.10
  • Server type and version: Distributed, RELEASE.2023-08-31T15-31-16Z
  • Operating System and version (uname -a): Linux minio-pool-1-0 5.15.0-89-generic #99~20.04.1-Ubuntu SMP Thu Nov 2 15:16:47 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Link to your deployment file:

Complete error log:
"level":"ERROR","errKind":"ALL","time":"2023-12-17T04:43:31.98244083Z","api":{"name":"SYSTEM","args":{"bucket":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","object":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5"}},"remotehost":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","error":{"message":"*fmt.wrapError","source":["internal/logger/logonce.go:118:logger.(*logOnceType).logOnceIf()","internal/logger/logonce.go:149:logger.LogOnceIf()","internal/rest/client.go:319:rest.(*Client).Call()","cmd/storage-rest-client.go:167:cmd.(*storageRESTClient).call()","cmd/storage-rest-client.go:567:cmd.(*storageRESTClient).ReadAll()","cmd/format-erasure.go:391:cmd.loadFormatErasure()","cmd/format-erasure.go:327:cmd.loadFormatErasureAll.func1()","github.com/minio/pkg@v1.7.5/sync/errgroup/errgroup.go:123:errgroup.(*Group).Go.func1()"]}} {"level":"ERROR","errKind":"ALL","time":"2023-12-17T04:43:31.98268901Z","api":{"name":"SYSTEM","args":{"bucket":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","object":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5"}},"remotehost":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","error":{"message":"*fmt.wrapError","source":["internal/logger/logonce.go:118:logger.(*logOnceType).logOnceIf()","internal/logger/logonce.go:149:logger.LogOnceIf()","internal/rest/client.go:319:rest.(*Client).Call()","cmd/storage-rest-client.go:167:cmd.(*storageRESTClient).call()","cmd/storage-rest-client.go:567:cmd.(*storageRESTClient).ReadAll()","cmd/format-erasure.go:391:cmd.loadFormatErasure()","cmd/format-erasure.go:327:cmd.loadFormatErasureAll.func1()","github.com/minio/pkg@v1.7.5/sync/errgroup/errgroup.go:123:errgroup.(*Group).Go.func1()"]}} {"level":"ERROR","errKind":"ALL","time":"2023-12-17T04:43:31.982752021Z","api":{"name":"SYSTEM","args":{"bucket":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","object":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5"}},"remotehost":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","error":{"message":"*fmt.wrapError","source":["internal/logger/logonce.go:118:logger.(*logOnceType).logOnceIf()","internal/logger/logonce.go:149:logger.LogOnceIf()","internal/rest/client.go:319:rest.(*Client).Call()","cmd/storage-rest-client.go:167:cmd.(*storageRESTClient).call()","cmd/storage-rest-client.go:567:cmd.(*storageRESTClient).ReadAll()","cmd/format-erasure.go:391:cmd.loadFormatErasure()","cmd/format-erasure.go:327:cmd.loadFormatErasureAll.func1()","github.com/minio/pkg@v1.7.5/sync/errgroup/errgroup.go:123:errgroup.(*Group).Go.func1()"]}} 
{"level":"ERROR","errKind":"ALL","time":"2023-12-17T04:43:31.98295671Z","api":{"name":"SYSTEM","args":{"bucket":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","object":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5"}},"remotehost":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","error":{"message":"*fmt.wrapError","source":["internal/logger/logonce.go:118:logger.(*logOnceType).logOnceIf()","internal/logger/logonce.go:149:logger.LogOnceIf()","internal/rest/client.go:319:rest.(*Client).Call()","cmd/storage-rest-client.go:167:cmd.(*storageRESTClient).call()","cmd/storage-rest-client.go:567:cmd.(*storageRESTClient).ReadAll()","cmd/format-erasure.go:391:cmd.loadFormatErasure()","cmd/format-erasure.go:327:cmd.loadFormatErasureAll.func1()","github.com/minio/pkg@v1.7.5/sync/errgroup/errgroup.go:123:errgroup.(*Group).Go.func1()"]}} {"level":"ERROR","errKind":"ALL","time":"2023-12-17T04:43:31.982946786Z","api":{"name":"SYSTEM","args":{"bucket":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","object":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5"}},"remotehost":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","error":{"message":"*fmt.wrapError","source":["internal/logger/logonce.go:118:logger.(*logOnceType).logOnceIf()","internal/logger/logonce.go:149:logger.LogOnceIf()","internal/rest/client.go:319:rest.(*Client).Call()","cmd/storage-rest-client.go:167:cmd.(*storageRESTClient).call()","cmd/storage-rest-client.go:567:cmd.(*storageRESTClient).ReadAll()","cmd/format-erasure.go:391:cmd.loadFormatErasure()","cmd/format-erasure.go:327:cmd.loadFormatErasureAll.func1()","github.com/minio/pkg@v1.7.5/sync/errgroup/errgroup.go:123:errgroup.(*Group).Go.func1()"]}} {"level":"ERROR","errKind":"ALL","time":"2023-12-17T04:43:31.990843037Z","api":{"name":"SYSTEM","args":{"bucket":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","object":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5"}},"remotehost":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","error":{"message":"*fmt.wrapError","source":["internal/logger/logonce.go:118:logger.(*logOnceType).logOnceIf()","internal/logger/logonce.go:149:logger.LogOnceIf()","internal/rest/client.go:319:rest.(*Client).Call()","cmd/storage-rest-client.go:167:cmd.(*storageRESTClient).call()","cmd/storage-rest-client.go:567:cmd.(*storageRESTClient).ReadAll()","cmd/format-erasure.go:391:cmd.loadFormatErasure()","cmd/format-erasure.go:327:cmd.loadFormatErasureAll.func1()","github.com/minio/pkg@v1.7.5/sync/errgroup/errgroup.go:123:errgroup.(*Group).Go.func1()"]}} 
{"level":"ERROR","errKind":"ALL","time":"2023-12-17T04:43:31.991756289Z","api":{"name":"SYSTEM","args":{"bucket":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","object":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5"}},"remotehost":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","error":{"message":"*fmt.wrapError","source":["internal/logger/logonce.go:118:logger.(*logOnceType).logOnceIf()","internal/logger/logonce.go:149:logger.LogOnceIf()","internal/rest/client.go:319:rest.(*Client).Call()","cmd/storage-rest-client.go:167:cmd.(*storageRESTClient).call()","cmd/storage-rest-client.go:567:cmd.(*storageRESTClient).ReadAll()","cmd/format-erasure.go:391:cmd.loadFormatErasure()","cmd/format-erasure.go:327:cmd.loadFormatErasureAll.func1()","github.com/minio/pkg@v1.7.5/sync/errgroup/errgroup.go:123:errgroup.(*Group).Go.func1()"]}} {"level":"ERROR","errKind":"ALL","time":"2023-12-17T04:43:31.9971456Z","api":{"name":"SYSTEM","args":{"bucket":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","object":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5"}},"remotehost":"5e76d207cf4ab20866fdc03c83e8a0f4e8f458e880777956ec0bae4e9f23f6c5","error":{"message":"*errors.errorString","source":["internal/logger/logger.go:258:logger.LogIf()","cmd/prepare-storage.go:254:cmd.connectLoadInitFormats()","cmd/prepare-storage.go:312:cmd.waitForFormatErasure()","cmd/erasure-server-pool.go:103:cmd.newErasureServerPools()","cmd/server-main.go:957:cmd.newObjectLayer()","cmd/server-main.go:704:cmd.serverMain.func9()","cmd/server-main.go:423:cmd.bootstrapTrace()","cmd/server-main.go:702:cmd.serverMain()"]}} {"level":"INFO","errKind":"","time":"2023-12-17T04:43:31.997294634Z","message":"Waiting for a minimum of 8 drives to come online (elapsed 12s)\n"}

@harshavardhana
Member

@sathishkumar-p turn off anonymous logging in the tenant deployment spec and share the actual logs.
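
For reference, a minimal sketch of what that change might look like, assuming the Tenant CRD's logging block (anonymous/json/quiet) is still available in the operator version in use; the field names here are an assumption, not quoted from this thread:

  # Hypothetical excerpt of the Tenant spec; only the logging block is shown.
  apiVersion: minio.min.io/v2
  kind: Tenant
  metadata:
    name: minio
  spec:
    logging:
      anonymous: false   # stop anonymizing hostnames and object names in server logs
      json: false
      quiet: false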

@sathishkumar-p
Author

Sure, let me do it

@sathishkumar-p
Author

sathishkumar-p commented Dec 22, 2023

Hello @harshavardhana,
Here are the logs after disabling anonymous logging:
FYI, this issue appears whenever I try to expand the size by adding pools. The MinIO pods restart many times and take a while to come back online, but this time it has been more than 3 hours and they are still not back.
minio-pool-0-3.log
minio-pool-1-2.log

@allanrogerr
Contributor

1.- Was expansion performed using the Console UI, kubectl minio, or some other means?

2.- Was this tested successfully on a lower environment before moving to production?

3.- Please provide the pod logs for one pod from each of the original 3 pools, as well as for one pod from the expansion pool. e.g. kubectl -n <namespace> logs pod/minio-pool-0-3 > minio-pool-0-3.log

4.- The logs provided seem to be replication related. However, replication was not mentioned in the original description. Note that there have been several important fixes to replication since MinIO RELEASE.2023-08-31T15-31-16Z. When do you plan to upgrade? (A log-collection sketch for item 3 follows below.)
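
For reference, a small sketch for gathering one log per pool in a single pass; the namespace and pod names below are placeholders:

  # Collect one log per pool; adjust the namespace and pod list to match the tenant.
  NS=minio-tenant
  for p in minio-pool-0-0 minio-pool-1-0 minio-pool-2-0 minio-pool-3-0; do
    kubectl -n "$NS" logs "pod/$p" > "$p.log"
  done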

allanrogerr added the question (Further information is requested) label Dec 31, 2023
@sathishkumar-p
Author

Hi @allanrogerr,

  1. Replication is performed using the Helm tenant chart. I have added a new pool to the chart and ArgoCD does the sync.
  2. Yes, I have done this many times in both environments without any problem.
  3. Since it is production, I am worried about bringing it down repeatedly. Would it be fine to reproduce the issue in a lower environment instead?
  4. What is the stable version, as opposed to the latest version? Last time I had a garbage collection issue, and MinIO support suggested using the latest version rather than a stable one.

@sathishkumar-p
Author

sathishkumar-p commented Jan 1, 2024

Pool expansion using the 'A Helm chart for MinIO Operator' tenant chart. The new pool entry added to the values is below (an apply sketch follows the snippet):
- servers: 4
  ## custom name for the pool
  name: pool-6
  ## volumesPerServer specifies the number of volumes attached per MinIO Tenant Pod / Server.
  volumesPerServer: 4
  ## size specifies the capacity per volume
  size: 200Mi
  ## storageClass specifies the storage class name to be used for this pool
  storageClassName: cinder-retain
  ## Used to specify annotations for pods
  annotations: { }
  ## Used to specify labels for pods
  labels: { }
  ## Used to specify a toleration for a pod
  tolerations: [ ]
  ## nodeSelector parameters for MinIO Pods. It specifies a map of key-value pairs. For the pod to be
  ## eligible to run on a node, the node must have each of the
  ## indicated key-value pairs as labels.
  ## Read more here: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
  nodeSelector: { }
  ## Affinity settings for MinIO pods. Read more about affinity
  ## here: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity.
  affinity: { }
  ## Configure resource requests and limits for MinIO containers
  resources:
    limits:
      cpu: 2000m
      memory: 1000Mi
    requests:
      cpu: 1500m
      memory: 700Mi
  ## Configure security context
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    runAsNonRoot: true
  ## Configure container security context
  containerSecurityContext:
    runAsUser: 1000
    runAsGroup: 1000
    runAsNonRoot: true
  ## Configure topology constraints
  topologySpreadConstraints: [ ]
  ## Configure Runtime Class
  # runtimeClassName: ""
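
If it helps, a sketch of how such an entry is typically applied once appended under the chart's existing pools: list; the repo URL, release name, and namespace below are assumptions (an ArgoCD sync of the same values achieves the same result):

  # Assumed chart source, release name, and namespace; adjust to the actual deployment.
  helm repo add minio-operator https://operator.min.io
  helm -n minio-tenant upgrade my-tenant minio-operator/tenant -f tenant-values.yaml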

@sathishkumar-p
Author

Hi @harshavardhana and @allanrogerr,
Please let me know if you require additional information.

@allanrogerr
Contributor

@sathishkumar-p
I believe you meant to say that expansion is performed using the Helm tenant chart.
It is best to reproduce the issue in a test environment first.
The latest stable version is RELEASE.2024-01-01T16-36-33Z, as reported at https://min.io/docs/minio/linux/index.html

Still pending is the following:
Please provide the pod logs for one pod from each of the original 3 pools, as well as for one pod from the expansion pool, e.g. kubectl -n <namespace> logs pod/minio-pool-0-3 > minio-pool-0-3.log

@sathishkumar-p
Author

@allanrogerr,
I'm trying to reproduce the issue in a lower environment as well as in prod. I will provide the logs and will also upgrade the MinIO and operator versions.

@jiuker
Contributor

jiuker commented Jan 8, 2024

@sathishkumar-p Does the newly deployed pool keep restarting and getting stuck there?

@sathishkumar-p
Author

Yes, it is restarting @jiuker

@sathishkumar-p
Author

sathishkumar-p commented Jan 8, 2024

@allanrogerr and @harshavardhana
Here I have reproduced the issue in a lower environment. For the last 2 days the pods have kept restarting and have not come online yet.
Here are the logs you asked for, in the requested format:
minio-pool-2-0.log
minio-pool-3-0.log
minio-pool-1-0.log
minio-pool-0-0.log
minio-pool-4-0.log

[screenshot attached]

@jiuker
Contributor

jiuker commented Jan 8, 2024


This was fixed in minio/minio#17979.
Your MinIO version is missing this fix.
Could you retry with the latest version? @sathishkumar-p
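
If it helps, a minimal sketch of bumping the tenant's MinIO image to pick up that fix; the release tag is the stable one mentioned earlier in the thread, and only an excerpt of the Tenant spec is shown:

  # Excerpt of the Tenant spec; only the image field changes for the upgrade.
  apiVersion: minio.min.io/v2
  kind: Tenant
  metadata:
    name: minio
  spec:
    image: quay.io/minio/minio:RELEASE.2024-01-01T16-36-33Z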

@JasperWey

I hit the same problem; the log looks like this:

API: SYSTEM()
Time: 09:15:25 UTC 01/08/2024
Error: Read failed. Insufficient number of drives online (*errors.errorString)
       8: internal/logger/logger.go:258:logger.LogIf()
       7: cmd/prepare-storage.go:254:cmd.connectLoadInitFormats()
       6: cmd/prepare-storage.go:312:cmd.waitForFormatErasure()
       5: cmd/erasure-server-pool.go:104:cmd.newErasureServerPools()
       4: cmd/server-main.go:976:cmd.newObjectLayer()
       3: cmd/server-main.go:718:cmd.serverMain.func9()
       2: cmd/server-main.go:434:cmd.bootstrapTrace()
       1: cmd/server-main.go:716:cmd.serverMain()
Waiting for a minimum of 8 drives to come online (elapsed 21m31s)


API: SYSTEM()
Time: 09:15:26 UTC 01/08/2024
Error: Read failed. Insufficient number of drives online (*errors.errorString)
       8: internal/logger/logger.go:258:logger.LogIf()
       7: cmd/prepare-storage.go:254:cmd.connectLoadInitFormats()
       6: cmd/prepare-storage.go:312:cmd.waitForFormatErasure()
       5: cmd/erasure-server-pool.go:104:cmd.newErasureServerPools()
       4: cmd/server-main.go:976:cmd.newObjectLayer()
       3: cmd/server-main.go:718:cmd.serverMain.func9()
       2: cmd/server-main.go:434:cmd.bootstrapTrace()
       1: cmd/server-main.go:716:cmd.serverMain()
Waiting for a minimum of 8 drives to come online (elapsed 21m32s)

[root@node-136 minio]# k get po -n minio
NAME               READY   STATUS    RESTARTS   AGE
myminio-pool-0-0   2/2     Running   0          17m
myminio-pool-0-1   2/2     Running   0          17m
myminio-pool-0-2   2/2     Running   0          17m
myminio-pool-0-3   2/2     Running   0          17m

Versions:

  1. operator: v5.0.11
  2. minio: RELEASE.2023-11-15T20-43-25Z
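
For reference, one way to confirm the server build actually running inside a tenant pod; the container name (-c minio) is an assumption, so drop it if the pod has a single MinIO container:

  # Print the MinIO server version from a running tenant pod.
  kubectl -n minio exec myminio-pool-0-0 -c minio -- minio --version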

@sathishkumar-p
Author

Yeah, let me try with the latest version of MinIO. Also, should I update the operator version?

@harshavardhana
Member

Yeah, let me try with the latest version of MinIO. Also, should I update the operator version?

Upgrading the operator is important as well.
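
For reference, a sketch of an operator upgrade via Helm, assuming the operator was installed from the official chart; the repo URL, release name, and namespace below are assumptions:

  # Refresh the chart repo and upgrade the operator release in place.
  helm repo add minio-operator https://operator.min.io
  helm repo update
  helm -n minio-operator upgrade minio-operator minio-operator/operator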

@sathishkumar-p
Author

sathishkumar-p commented Jan 10, 2024

Sure, I will upgrade the operator, but is that the issue? Because @JasperWey is also facing the issue with a later MinIO version.

@allanrogerr
Contributor

@JasperWey What log is this? Please provide all pod logs.

@sathishkumar-p
1.- Please provide a complete output of the tenant spec on the lower env. Also, provide the output of:

kubectl -n <tenant-namespace> get sts

Replace <tenant-namespace> where necessary

2.- For each sts, get the output. Replace <sts-name> and <tenant-namespace> where necessary:

kubectl -n <tenant-namespace> logs sts/<sts-name> > <sts-name>.log

Several things could be wrong, e.g. no PVs available; the logs should tell us this. It would be easier to use the Tenant Console to perform the expansion if you are unable to get this working. A few checks along those lines are sketched below.
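
For reference, a few checks for the new pool's volumes; the namespace below is a placeholder:

  # Are the new pool's PVCs bound, and do matching PVs exist?
  kubectl -n minio-tenant get pvc
  kubectl get pv
  # Recent events usually show scheduling or volume-binding failures directly.
  kubectl -n minio-tenant get events --sort-by=.lastTimestamp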

allanrogerr self-assigned this Jan 11, 2024
@sathishkumar-p
Author

Hello @allanrogerr,
Here are the logs from the test env, as you described.
pool-0.log
pool-1.log
pool-2.log
pool-3.log
pool-4.log

@allanrogerr
Contributor

1.- Please provide a complete output of the current tenant spec on the lower env.

From your steps to reproduce:

  1. Set up a MinIO distributed cluster using the MinIO Operator with 3 pools, 4 servers per pool, and 4 drives of 100Gi each.
  2. Add an extra pool (pool-4); the pods keep restarting and do not come online for 3 hours.

Please provide the exact steps on how you did this. I will attempt to reproduce your issue.

@sathishkumar-p
Author

tenant spec:
tenant.txt

Do you think this could be a problem with the PVCs?

@allanrogerr
Contributor

@sathishkumar-p Sorry for the late response. The spec provided gives no clues as to what your issue is. When I say exact steps, I need to know what you're working with.

Otherwise, I can put together a simple walkthrough showing how to achieve what you're attempting using my own methods, probably with a smaller setup.

@harshavardhana
Member

This should be fixed in the latest MinIO release.

Please upgrade the image; a similar issue was addressed in the MinIO server.

Thanks, closing this for now.
