
[benchmark][cluster] rootcoord crash : panic: runtime error: invalid memory address or nil pointer dereference #8803

Closed
wangting0128 opened this issue Sep 28, 2021 · 3 comments · Fixed by #8835
Assignees
Labels
kind/bug Issues or changes related to a bug · priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. · test/benchmark benchmark test · triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@wangting0128
Contributor

wangting0128 commented Sep 28, 2021

Steps/Code to reproduce:

argo task: benchmark-bw8pk

test yaml:

client-configmap: client-index-sift1b-hnsw-ef500-perf
server-configmap: server-cluster-16c64m-indexnode-4

benchmark-bw8pk-1-milvus-rootcoord-b497998d-84k5d.log

server deployment:

NAME: benchmark-bw8pk-1
LAST DEPLOYED: Tue Sep 28 10:10:18 2021
NAMESPACE: qa-milvus
STATUS: deployed
REVISION: 1
TEST SUITE: None
I0928 10:11:26.565434      44 request.go:665] Waited for 1.130818723s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/node.k8s.io/v1?timeout=32s
I0928 10:11:36.764721      44 request.go:665] Waited for 11.329927144s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/eventing.knative.dev/v1beta1?timeout=32s
NAME                                                   READY   STATUS    RESTARTS   AGE
benchmark-bw8pk-1-etcd-0                               1/1     Running            0          11m
benchmark-bw8pk-1-milvus-datacoord-6d79d94688-q69bw    1/1     Running            0          11m
benchmark-bw8pk-1-milvus-datanode-8d9457f9-z8w7g       1/1     Running            0          11m
benchmark-bw8pk-1-milvus-indexcoord-65589c4d7-kh27l    1/1     Running            0          11m
benchmark-bw8pk-1-milvus-indexnode-85c5656b4d-9z6dh    1/1     Running            0          11m
benchmark-bw8pk-1-milvus-indexnode-85c5656b4d-mn8hg    1/1     Running            0          11m
benchmark-bw8pk-1-milvus-indexnode-85c5656b4d-qxk8g    1/1     Running            0          11m
benchmark-bw8pk-1-milvus-indexnode-85c5656b4d-wg2vf    1/1     Running            0          11m
benchmark-bw8pk-1-milvus-proxy-74fcc4487-nzb5w         1/1     Running            0          11m
benchmark-bw8pk-1-milvus-pulsar-7dc6f784-xkwgj         1/1     Running            0          11m
benchmark-bw8pk-1-milvus-querycoord-5f86c9d6bd-t4lfl   1/1     Running            0          11m
benchmark-bw8pk-1-milvus-querynode-56ff9975c8-6d2jb    1/1     Running            1          11m
benchmark-bw8pk-1-milvus-rootcoord-b497998d-84k5d      0/1     CrashLoopBackOff   4          11m
benchmark-69lhl-1-etcd-0                               1/1     Running            0          4h26m   10.97.17.32    qa-node014.zilliz.local   <none>           <none>
benchmark-69lhl-1-milvus-standalone-79ff56fcb6-ztnw4   1/1     Running            0          4h26m   10.97.16.32    qa-node013.zilliz.local   <none>           <none>
benchmark-69lhl-1-minio-6bbb77f459-dvfdv               1/1     Running            0          4h26m   10.97.17.33    qa-node014.zilliz.local   <none>           <none>
benchmark-9fndj-1-etcd-0                               1/1     Running            0          22m     10.97.17.64    qa-node014.zilliz.local   <none>           <none>
benchmark-9fndj-1-milvus-datacoord-ff7cdb497-djqr6     1/1     Running            0          22m     10.97.8.48     qa-node006.zilliz.local   <none>           <none>
benchmark-9fndj-1-milvus-datanode-77cfb94bb6-dfwbd     1/1     Running            0          22m     10.97.16.53    qa-node013.zilliz.local   <none>           <none>
benchmark-9fndj-1-milvus-indexcoord-c9477b5fd-bczq2    1/1     Running            0          22m     10.97.8.47     qa-node006.zilliz.local   <none>           <none>
benchmark-9fndj-1-milvus-indexnode-557bf76f7-pj5n7     1/1     Running            0          22m     10.97.10.102   qa-node008.zilliz.local   <none>           <none>
benchmark-9fndj-1-milvus-indexnode-557bf76f7-vgbcb     1/1     Running            0          22m     10.97.16.54    qa-node013.zilliz.local   <none>           <none>
benchmark-9fndj-1-milvus-indexnode-557bf76f7-wh42f     1/1     Running            0          22m     10.97.13.184   qa-node010.zilliz.local   <none>           <none>
benchmark-9fndj-1-milvus-indexnode-557bf76f7-zlg2h     1/1     Running            0          22m     10.97.17.63    qa-node014.zilliz.local   <none>           <none>
benchmark-9fndj-1-milvus-proxy-5878874768-664wb        1/1     Running            0          22m     10.97.10.101   qa-node008.zilliz.local   <none>           <none>
benchmark-9fndj-1-milvus-pulsar-844f6bcb7d-ms6r4       1/1     Running            0          22m     10.97.11.96    qa-node009.zilliz.local   <none>           <none>
benchmark-9fndj-1-milvus-querycoord-659f94f565-k8np6   1/1     Running            0          22m     10.97.8.46     qa-node006.zilliz.local   <none>           <none>
benchmark-9fndj-1-milvus-querynode-bccd96dbc-czmqh     1/1     Running            0          22m     10.97.14.219   qa-node011.zilliz.local   <none>           <none>
benchmark-9fndj-1-milvus-rootcoord-bdd65b6b5-pn44p     1/1     Running            0          22m     10.97.8.45     qa-node006.zilliz.local   <none>           <none>
benchmark-9fndj-1-minio-64b8c8b95b-sx4tn               1/1     Running            0          22m     10.97.11.97    qa-node009.zilliz.local   <none>           <none>
benchmark-bw8pk-1-etcd-0                               1/1     Running            0          10m     10.97.17.67    qa-node014.zilliz.local   <none>           <none>
benchmark-bw8pk-1-milvus-datacoord-6d79d94688-q69bw    1/1     Running            0          10m     10.97.8.53     qa-node006.zilliz.local   <none>           <none>
benchmark-bw8pk-1-milvus-datanode-8d9457f9-z8w7g       1/1     Running            0          10m     10.97.7.41     qa-node005.zilliz.local   <none>           <none>
benchmark-bw8pk-1-milvus-indexcoord-65589c4d7-kh27l    1/1     Running            0          10m     10.97.7.38     qa-node005.zilliz.local   <none>           <none>
benchmark-bw8pk-1-milvus-indexnode-85c5656b4d-9z6dh    1/1     Running            0          10m     10.97.17.69    qa-node014.zilliz.local   <none>           <none>
benchmark-bw8pk-1-milvus-indexnode-85c5656b4d-mn8hg    1/1     Running            0          10m     10.97.17.68    qa-node014.zilliz.local   <none>           <none>
benchmark-bw8pk-1-milvus-indexnode-85c5656b4d-qxk8g    1/1     Running            0          10m     10.97.14.221   qa-node011.zilliz.local   <none>           <none>
benchmark-bw8pk-1-milvus-indexnode-85c5656b4d-wg2vf    1/1     Running            0          10m     10.97.11.102   qa-node009.zilliz.local   <none>           <none>
benchmark-bw8pk-1-milvus-proxy-74fcc4487-nzb5w         1/1     Running            0          10m     10.97.7.39     qa-node005.zilliz.local   <none>           <none>
benchmark-bw8pk-1-milvus-pulsar-7dc6f784-xkwgj         1/1     Running            0          10m     10.97.17.70    qa-node014.zilliz.local   <none>           <none>
benchmark-bw8pk-1-milvus-querycoord-5f86c9d6bd-t4lfl   1/1     Running            0          10m     10.97.7.37     qa-node005.zilliz.local   <none>           <none>
benchmark-bw8pk-1-milvus-querynode-56ff9975c8-6d2jb    1/1     Running            1          10m     10.97.8.54     qa-node006.zilliz.local   <none>           <none>
benchmark-bw8pk-1-milvus-rootcoord-b497998d-84k5d      0/1     CrashLoopBackOff   3          10m     10.97.7.40     qa-node005.zilliz.local   <none>           <none>
benchmark-bw8pk-1-minio-6d864cb587-xmv8j               1/1     Running            0          10m     10.97.7.42     qa-node005.zilliz.local   <none>           <none>
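
In the listing above only the rootcoord pod is unhealthy (0/1, CrashLoopBackOff, restarting). A small sketch, using plain string parsing rather than a Kubernetes client, of how a benchmark harness could pick crashing or restarting pods out of `kubectl get pods` output (the two sample lines are copied from the listing above):

```python
# Sample lines taken from the pod listing above.
listing = """\
benchmark-bw8pk-1-milvus-querynode-56ff9975c8-6d2jb    1/1     Running            1          11m
benchmark-bw8pk-1-milvus-rootcoord-b497998d-84k5d      0/1     CrashLoopBackOff   4          11m
"""

def crashing_pods(text):
    """Return (name, status, restarts) for pods that are not cleanly Running."""
    bad = []
    for line in text.strip().splitlines():
        fields = line.split()
        name, status, restarts = fields[0], fields[2], int(fields[3])
        if status != "Running" or restarts > 0:
            bad.append((name, status, restarts))
    return bad

for name, status, restarts in crashing_pods(listing):
    print(name, status, restarts)
```

Here both pods are flagged: the querynode because it has already restarted once, and the rootcoord because it is in CrashLoopBackOff.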


client pod: benchmark-bw8pk-1169485829

client log:

[2021-09-28 10:17:40,634] [   DEBUG] - Milvus insert run in 1.73s (milvus_benchmark.client:49)
[2021-09-28 10:17:40,637] [   DEBUG] - Row count: 5850000 in collection: <sift_1b_128_l2> (milvus_benchmark.client:393)
[2021-09-28 10:17:40,637] [   DEBUG] - 5850000 (milvus_benchmark.runners.base:89)
[2021-09-28 10:17:41,847] [   DEBUG] - Start id: 5900000, end id: 5950000 (milvus_benchmark.runners.base:76)
[2021-09-28 10:21:27,251] [   ERROR] - error_code: UnexpectedError
reason: "syncTimestamp Failed:rpc error: code = DeadlineExceeded desc = context deadline exceeded"
 (pymilvus.client.grpc_handler:384)
[2021-09-28 10:21:27,252] [   ERROR] - Error: <DescribeCollectionException: (code=1, message=syncTimestamp Failed:rpc error: code = DeadlineExceeded desc = context deadline exceeded)> (pymilvus.client.grpc_handler:59)
[2021-09-28 10:21:27,252] [   ERROR] - Error: <DescribeCollectionException: (code=1, message=syncTimestamp Failed:rpc error: code = DeadlineExceeded desc = context deadline exceeded)> (pymilvus.client.grpc_handler:59)
[2021-09-28 10:21:27,253] [   ERROR] - <DescribeCollectionException: (code=1, message=syncTimestamp Failed:rpc error: code = DeadlineExceeded desc = context deadline exceeded)> (milvus_benchmark.client:155)
[2021-09-28 10:21:27,253] [   DEBUG] - Milvus insert run in 225.4s (milvus_benchmark.client:49)
[2021-09-28 10:21:27,256] [   ERROR] - Error: <BaseException: (code=1, message=syncTimeStamp Failed:state code = Initializing)> (pymilvus.client.grpc_handler:59)
[2021-09-28 10:21:27,258] [   ERROR] - <BaseException: (code=1, message=syncTimeStamp Failed:state code = Initializing)> (milvus_benchmark.main:114)
[2021-09-28 10:21:27,260] [   ERROR] - Traceback (most recent call last):
  File "main.py", line 83, in run_suite
    runner.prepare(**cases[0])
  File "/src/milvus_benchmark/runners/build.py", line 101, in prepare
    case_param["collection_size"], case_param["ni_per"])
  File "/src/milvus_benchmark/runners/base.py", line 129, in insert
    ni_time = self.insert_core(milvus, info, start_id, vectors)
  File "/src/milvus_benchmark/runners/base.py", line 89, in insert_core
    logger.debug(milvus.count())
  File "/src/milvus_benchmark/client.py", line 392, in count
    row_count = self._milvus.get_collection_stats(collection_name)["row_count"]
  File "/usr/local/lib/python3.6/site-packages/pymilvus/client/stub.py", line 61, in handler
    raise e
  File "/usr/local/lib/python3.6/site-packages/pymilvus/client/stub.py", line 45, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pymilvus/client/stub.py", line 378, in get_collection_stats
    stats = handler.get_collection_stats(collection_name, timeout, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pymilvus/client/grpc_handler.py", line 65, in handler
    raise e
  File "/usr/local/lib/python3.6/site-packages/pymilvus/client/grpc_handler.py", line 57, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/pymilvus/client/grpc_handler.py", line 1035, in get_collection_stats
    raise BaseException(status.error_code, status.reason)
pymilvus.client.exceptions.BaseException: <BaseException: (code=1, message=syncTimeStamp Failed:state code = Initializing)>
 (milvus_benchmark.main:115)
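
The client aborts on the first `syncTimeStamp Failed:state code = Initializing` error, even though rootcoord was merely restarting. A benchmark client could ride out the coordinator's restart window with a retry wrapper; a minimal sketch, where `flaky_count` is a hypothetical stand-in for a call like `get_collection_stats` (not pymilvus API), and treating "Initializing" in the message as transient is an assumption:

```python
import time

def retry_transient(func, attempts=5, base_delay=0.01, transient=("Initializing",)):
    """Call func(); retry with exponential backoff while the error message
    looks transient (e.g. a coordinator that is still starting up)."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:
            last_try = attempt == attempts - 1
            if last_try or not any(t in str(exc) for t in transient):
                raise
            time.sleep(base_delay * (2 ** attempt))

# Illustrative stand-in: fails twice with the error seen in the log, then succeeds.
calls = {"n": 0}
def flaky_count():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("syncTimeStamp Failed:state code = Initializing")
    return {"row_count": 5850000}

print(retry_transient(flaky_count)["row_count"])  # → 5850000
```

Non-transient errors are re-raised immediately, so a genuine failure still surfaces in the benchmark report.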
[2021-09-28 10:21:27,351] [   DEBUG] - {'_version': '0.1', '_type': 'metric', 'run_id': 1632823953, 'mode': 'local', 'server': <milvus_benchmark.metrics.models.server.Server object at 0x7fb751d10ba8>, 'hardware': <milvus_benchmark.metrics.models.hardware.Hardware object at 0x7fb751d10cc0>, 'env': <milvus_benchmark.metrics.models.env.Env object at 0x7fb751d10da0>, 'status': 'RUN_FAILED', 'err_message': '', 'collection': {'dimension': 128, 'metric_type': 'l2', 'dataset_name': 'sift_1b_128_l2', 'collection_size': 1000000000, 'other_fields': None, 'ni_per': 50000}, 'index': {'index_type': 'hnsw', 'index_param': {'M': 16, 'efConstruction': 500}}, 'search': None, 'run_params': None, 'metrics': {'type': 'insert_build_performance', 'value': {}}, 'datetime': '2021-09-28 10:12:33.828201', 'type': 'metric'} (milvus_benchmark.metric.api:29)

Expected result:

Actual results:

Deployed Milvus with 4 indexnodes; rootcoord crashes while data is being inserted.

Environment:

  • Milvus version (e.g. v2.0.0-RC2 or 8b23a93): aea7cc1
  • Deployment mode (standalone or cluster): cluster
  • SDK version (e.g. pymilvus v2.0.0rc2): pymilvus-2.0.0rc7.dev18
  • OS (Ubuntu or CentOS):
  • CPU/Memory:
  • GPU:
  • Others:

Configuration file:

Additional context:

client-index-sift1b-hnsw-ef500-perf:

{
	"config.yaml": "insert_build_performance:
		  collections:
		    -
		      milvus:
		        db_config.primary_path: /test/milvus/distribued/sift_50m_128_l2_ivf_flat
		        wal_enable: true
		      collection_name: sift_1b_128_l2
		      ni_per: 50000
		      build_index: true
		      index_type: hnsw
		      index_param:
		        M: 16
		        efConstruction: 500
		"
}
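
For scale, this suite inserts the 1-billion-vector sift_1b dataset in batches of ni_per = 50000, so the ids in the client log follow directly from the batch arithmetic. A quick sketch of that arithmetic, using the values from the config above and the "Start id: 5900000" line from the client log:

```python
collection_size = 1_000_000_000   # sift_1b_128_l2, from the metric dump above
ni_per = 50_000                   # batch size from the configmap above

batches = collection_size // ni_per
print(batches)                    # total insert batches for the full dataset

# The client log's "Start id: 5900000, end id: 5950000" corresponds to:
start_id = 5_900_000
print(start_id // ni_per)         # zero-based index of the batch in flight at the crash
```

So the run failed roughly 118 batches (about 5.9M rows) into a 20,000-batch insert.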
@wangting0128 wangting0128 added kind/bug Issues or changes related to a bug, priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now., needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one., and test/benchmark benchmark test labels Sep 28, 2021
@congqixia
Contributor

This bug is caused by a known issue in the Pulsar SDK. We will try to apply a workaround until the next Pulsar SDK release ships a fix.

apache/pulsar-client-go#576

@congqixia
Contributor

/reopen
/assign @wangting0128
Could you please verify whether the bug fix works?

@sre-ci-robot
Contributor

@congqixia: Reopened this issue.

In response to this:

/reopen
/assign @wangting0128
Could you please verify whether the bug fix works?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sre-ci-robot sre-ci-robot reopened this Sep 29, 2021
@congqixia congqixia added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 29, 2021