[Bug]: Flush hangs when running hello_milvus.py after pulsar recovered from pod kill chaos #17508

zhuwenxing · 2022-06-13T02:34:43Z

Is there an existing issue for this?

I have searched the existing issues

Environment

- Milvus version: master-20220610-36ad9895
- Deployment mode(standalone or cluster):cluster
- SDK version(e.g. pymilvus v2.0.0rc2):pymilvus==2.1.0.dev69
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The action timed out when running hello_milvus.py.
The proxy pod keeps printing the same pair of log lines:

[2022/06/12 18:57:51.851 +00:00] [INFO] [impl.go:3587] ["received get flush state response"] [response="status:<> "]
[2022/06/12 18:57:52.355 +00:00] [INFO] [impl.go:3573] ["received get flush state request"] [request="segmentIDs:433864038933069825 segmentIDs:433864038933069826 "]
[2022/06/12 18:57:52.356 +00:00] [INFO] [impl.go:3587] ["received get flush state response"] [response="status:<> "]
[2022/06/12 18:57:52.861 +00:00] [INFO] [impl.go:3573] ["received get flush state request"] [request="segmentIDs:433864038933069825 segmentIDs:433864038933069826 "]
[2022/06/12 18:57:52.866 +00:00] [INFO] [impl.go:3587] ["received get flush state response"] [response="status:<> "]
[2022/06/12 18:57:53.370 +00:00] [INFO] [impl.go:3573] ["received get flush state request"] [request="segmentIDs:433864038933069825 segmentIDs:433864038933069826 "]
[2022/06/12 18:57:53.370 +00:00] [INFO] [impl.go:3587] ["received get flush state response"] [response="status:<> "]
[2022/06/12 18:57:53.873 +00:00] [INFO] [impl.go:3573] ["received get flush state request"] [request="segmentIDs:433864038933069825 segmentIDs:433864038933069826 "]
[2022/06/12 18:57:53.877 +00:00] [INFO] [impl.go:3587] ["received get flush state response"] [response="status:<> "]
[2022/06/12 18:57:54.379 +00:00] [INFO] [impl.go:3573] ["received get flush state request"] [request="segmentIDs:433864038933069825 segmentIDs:433864038933069826 "]
[2022/06/12 18:57:54.381 +00:00] [INFO] [impl.go:3587] ["received get flush state response"] [response="status:<> "]
[2022/06/12 18:57:54.884 +00:00] [INFO] [impl.go:3573] ["received get flush state request"] [request="segmentIDs:433864038933069825 segmentIDs:433864038933069826 "]
[2022/06/12 18:57:54.885 +00:00] [INFO] [impl.go:3587] ["received get flush state response"] [response="status:<> "]
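
The request/response pairs above arrive roughly every 0.5 s, which is consistent with a client polling the flush state until every segment reports flushed. Below is a minimal sketch of such a loop with a deadline, so that a stuck segment surfaces as a timeout instead of an indefinite wait. `get_flush_state` here is a hypothetical callable standing in for the real RPC; it is not the pymilvus API, and the timeout and interval values are illustrative assumptions.

```python
import time
from typing import Callable, Sequence


def wait_for_flush(
    get_flush_state: Callable[[Sequence[int]], bool],  # hypothetical stand-in for the RPC
    segment_ids: Sequence[int],
    timeout_s: float = 5.0,
    interval_s: float = 0.5,
) -> bool:
    """Poll until all segments report flushed, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_flush_state(segment_ids):
            return True
        time.sleep(interval_s)
    return False


# Segment IDs taken from the proxy log above.
SEGMENTS = [433864038933069825, 433864038933069826]


def never_flushed(ids: Sequence[int]) -> bool:
    """Models a server that never reports the segments as flushed."""
    return False
```

With `never_flushed`, the loop returns `False` once the deadline passes, which is the bounded version of the hang reported here.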

Expected Behavior

all test cases passed

Steps To Reproduce

see https://github.com/milvus-io/milvus/runs/6851761481?check_suite_focus=true

Milvus Log

failed job: https://github.com/milvus-io/milvus/runs/6851761481?check_suite_focus=true
log: https://github.com/milvus-io/milvus/suites/6898913792/artifacts/267696164

Anything else?

No response

yanliang567 · 2022-06-13T08:00:06Z

@XuanYang-cn one more flush hang issue...
/assign @XuanYang-cn
/unassign

XuanYang-cn · 2022-06-13T08:03:06Z

May be related to #17335; working on it.

XuanYang-cn · 2022-06-13T08:06:49Z

/assign @zhuwenxing
Could you please have another try with the latest master commit?

zhuwenxing · 2022-06-13T09:01:53Z

It still happens in version master-20220613-e6225d92, which includes your PR.

failed job: https://github.com/zhuwenxing/milvus/runs/6857375279?check_suite_focus=true
The log was not collected because the job timed out rather than failed.

I will try again.

zhuwenxing · 2022-06-13T10:15:23Z

failed job: https://github.com/zhuwenxing/milvus/runs/6858068872?check_suite_focus=true
log: https://github.com/zhuwenxing/milvus/suites/6904892315/artifacts/268116140

In GitHub Actions, this issue seems to reproduce reliably.

zhuwenxing · 2022-06-13T10:42:56Z

It also failed in the Jenkins pipeline.
failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/6447/pipeline
log: artifacts-pulsar-pod-kill-6447-server-logs.tar.gz

XuanYang-cn · 2022-06-14T08:59:18Z

15143 time="2022-06-13T10:27:40Z" level=info msg="[Connected consumer]" consumerID=2 name=rmflb subscription=by-dev-dataNode-11-433878722880733185 topic="persistent://public/default/by-dev-rootcoord-dml_2"
15144 time="2022-06-13T10:27:40Z" level=warning msg="Closed connection unable add consumer with id=2" local_addr="10.97.4.243:47140" remote_addr="pulsar://pulsar-pod-kill-6447-proxy:6650"
15145 time="2022-06-13T10:27:40Z" level=info msg="[Reconnected consumer to broker]" consumerID=2 name=rmflb subscription=by-dev-dataNode-11-433878722880733185 topic="persistent://public/default/by-dev-rootcoord-dml_2"
15146 time="2022-06-13T10:27:40Z" level=warning msg="Connection closed unable register listener id=24" local_addr="10.97.4.243:47140" remote_addr="pulsar://pulsar-pod-kill-6447-proxy:6650"

Same cause as #14577; this time it is the DataNode's consumer that did not reconnect to pulsar.

Related issues:

apache/pulsar-client-go#785

apache/pulsar-client-go#733

apache/pulsar-client-go#698

XuanYang-cn · 2022-06-14T09:00:29Z

/assign @xige-16

XuanYang-cn · 2022-06-14T09:07:40Z

failed job: https://github.com/zhuwenxing/milvus/runs/6858068872?check_suite_focus=true
log: https://github.com/zhuwenxing/milvus/suites/6904892315/artifacts/268116140

In GitHub Actions, this issue seems to reproduce reliably.

 3955 time="2022-06-13T09:06:13Z" level=info msg="[Connected consumer]" consumerID=2 name=tiuxt subscription=by-dev-dataNode-10-433877445462920513 topic="persistent://public/default/by-dev-rootcoord-dml_0"
 3956 time="2022-06-13T09:06:13Z" level=warning msg="Closed connection unable add consumer with id=2" local_addr="10.244.0.8:39730" remote_addr="pulsar://test-pulsar-pod-kill-proxy:6650"
 3957 time="2022-06-13T09:06:13Z" level=info msg="[Reconnected consumer to broker]" consumerID=2 name=tiuxt subscription=by-dev-dataNode-10-433877445462920513 topic="persistent://public/default/by-dev-rootcoord-dml_0"
 3958 time="2022-06-13T09:06:13Z" level=info msg="[Connected consumer]" consumerID=46 name=cierj subscription=by-dev-dataNode-10-433877462149496833 topic="persistent://public/default/by-dev-rootcoord-dml_45"
 3959 time="2022-06-13T09:06:13Z" level=warning msg="Closed connection unable add consumer with id=46" local_addr="10.244.0.8:39730" remote_addr="pulsar://test-pulsar-pod-kill-proxy:6650"

XuanYang-cn · 2022-06-14T09:23:38Z

/unassign @xige-16 @XuanYang-cn
/assign @sunby

Please help to fix this

zhuwenxing · 2022-06-15T06:34:43Z

Flush hangs when running verify_all_collections.py after chaos

failed job: https://github.com/zhuwenxing/milvus/runs/6886845770?check_suite_focus=true
log: https://github.com/zhuwenxing/milvus/suites/6932123471/artifacts/269859110

sunby · 2022-06-17T02:47:43Z

/assign @zhuwenxing
please verify it.

zhuwenxing · 2022-06-17T14:31:58Z

failed job: https://github.com/zhuwenxing/milvus/runs/6936878605?check_suite_focus=true

/assign @sunby
Please take a look

sunby · 2022-06-21T02:17:32Z

Fixed in #17642

xiaofan-luan · 2022-06-21T08:39:51Z

@zhuwenxing please help verify it.

zhuwenxing · 2022-06-21T13:51:50Z

/assign @sunby
version master-20220621-746aeea3
failed job: http://10.100.32.144:8080/blue/organizations/jenkins/chaos-test/detail/chaos-test/6943/pipeline
log: artifacts-pulsar-pod-kill-6943-server-logs.tar.gz

sunby · 2022-06-22T07:53:19Z

There is a bug in datacoord. Consider this situation: a segment is allocated by datacoord and the proxy tries to insert some records into it, but pulsar is killed at that moment, so the segment stays empty. When Flush is then called on the proxy, datacoord returns a list of the segments that are waiting to be flushed. But datacoord won't flush an empty segment, so this segment's state stays Sealed forever and the flush hangs.

I will fix it by setting the segment's state to Dropped if datacoord finds that the segment is empty. GetFlushState will also check whether the segment is empty.
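
The scenario can be reproduced in a toy model. The sketch below is only an illustration of the behaviour described in this comment, not the Milvus implementation: `ToyDataCoord`, `Segment`, `State`, `run_flush_cycle`, and `simulate` are hypothetical names, and the state machine is reduced to the four states needed here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class State(Enum):
    GROWING = "Growing"
    SEALED = "Sealed"
    FLUSHED = "Flushed"
    DROPPED = "Dropped"


@dataclass
class Segment:
    id: int
    num_rows: int = 0
    state: State = State.GROWING


class ToyDataCoord:
    """Toy model of the behaviour described above, not the Milvus code."""

    def __init__(self, segments: List[Segment], drop_empty: bool) -> None:
        self.segments: Dict[int, Segment] = {s.id: s for s in segments}
        self.drop_empty = drop_empty

    def flush(self) -> List[int]:
        """Seal growing segments and return the IDs the caller must wait on."""
        for seg in self.segments.values():
            if seg.state is State.GROWING:
                seg.state = State.SEALED
        return [s.id for s in self.segments.values() if s.state is State.SEALED]

    def run_flush_cycle(self) -> None:
        """Background work: flush sealed segments that hold data."""
        for seg in self.segments.values():
            if seg.state is not State.SEALED:
                continue
            if seg.num_rows > 0:
                seg.state = State.FLUSHED
            elif self.drop_empty:
                seg.state = State.DROPPED  # the proposed fix
            # Otherwise the empty segment stays Sealed forever (the bug).

    def get_flush_state(self, ids: List[int]) -> bool:
        """True once none of the given segments is still waiting to flush."""
        done = {State.FLUSHED, State.DROPPED}
        return all(self.segments[i].state in done for i in ids)


def simulate(drop_empty: bool) -> bool:
    # Segment 2 was allocated but never received rows because pulsar was killed.
    coord = ToyDataCoord([Segment(1, num_rows=100), Segment(2)], drop_empty)
    waiting = coord.flush()
    coord.run_flush_cycle()
    return coord.get_flush_state(waiting)
```

In this model `simulate(drop_empty=False)` never reaches a flushed state because the empty segment stays Sealed, while `simulate(drop_empty=True)` completes.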

Thanks for the great job. @zhuwenxing

sunby · 2022-06-24T11:06:40Z

/unassign

sunby · 2022-06-24T11:07:12Z

/assign @zhuwenxing please verify

zhuwenxing · 2022-06-27T06:20:35Z

Not reproduced so far; removing the critical label.

stale · 2022-07-27T20:18:21Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

xiaofan-luan · 2022-08-21T02:50:01Z

I have one question about this issue.
By flushing (we should really call it sync; flush is more like sealing the segment), the user wants to ensure that all inserted data is synced to MinIO.
After the change, this semantic is broken: we have no guarantee that inserted data is on MinIO, which may cause data loss during backup or migration.
My suggestion is:

- Flush should still return the list of all known related segments.
- Handle the empty-segment flush case separately to avoid the hang.

Any thoughts?
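
One possible reading of this suggestion, sketched under assumptions: Flush keeps returning every known segment, and the state check treats a segment with no rows as already synced because there is nothing to persist. The names `flush_targets`, `is_synced`, and `persisted_rows` are hypothetical and do not correspond to Milvus code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    id: int
    num_rows: int
    persisted_rows: int = 0  # rows already written to object storage


def flush_targets(segments: Dict[int, Segment]) -> List[int]:
    """Return every known segment, including empty ones."""
    return sorted(segments)


def is_synced(seg: Segment) -> bool:
    """An empty segment has nothing to persist, so it counts as synced."""
    return seg.persisted_rows >= seg.num_rows


def get_flush_state(segments: Dict[int, Segment], ids: List[int]) -> bool:
    """True only when every listed segment has all of its rows persisted."""
    return all(is_synced(segments[i]) for i in ids)


segments = {1: Segment(1, num_rows=100), 2: Segment(2, num_rows=0)}
ids = flush_targets(segments)
before = get_flush_state(segments, ids)  # segment 1 is not persisted yet
segments[1].persisted_rows = 100         # simulate the sync completing
after = get_flush_state(segments, ids)
```

Under this reading the caller still waits on segment 1 until its rows are persisted, so the sync guarantee is kept, while the empty segment 2 never blocks the wait.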

zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 13, 2022

zhuwenxing assigned yanliang567 Jun 13, 2022

sre-ci-robot assigned XuanYang-cn and unassigned yanliang567 Jun 13, 2022

yanliang567 added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jun 13, 2022

yanliang567 added this to the 2.1-RC1 milestone Jun 13, 2022

sre-ci-robot assigned zhuwenxing Jun 13, 2022

XuanYang-cn unassigned zhuwenxing Jun 13, 2022

zhuwenxing added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Jun 13, 2022

yanliang567 removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 14, 2022

sre-ci-robot assigned xige-16 Jun 14, 2022

sre-ci-robot assigned sunby and unassigned xige-16 and XuanYang-cn Jun 14, 2022

sunby mentioned this issue Jun 16, 2022

Use v0.6.6 pulsar go client #17594

Merged

sre-ci-robot assigned zhuwenxing Jun 17, 2022

sunby mentioned this issue Jun 20, 2022

Use pulsar client go v0.6.8 #17642

Merged

xiaofan-luan unassigned sunby Jun 21, 2022

sre-ci-robot assigned sunby Jun 21, 2022

sunby mentioned this issue Jun 22, 2022

Drop empty sealed segment #17708

Merged

sre-ci-robot unassigned sunby Jun 24, 2022

zhuwenxing removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Jun 27, 2022

congqixia mentioned this issue Jun 28, 2022

Drop empty sealed segment (#17708) #17782

Closed

stale bot added the stale indicates no udpates for 30 days label Jul 27, 2022

stale bot closed this as completed Aug 3, 2022

congqixia mentioned this issue Aug 18, 2022

Drop empty sealed segment (#17708) #18714

Merged
