Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Flush hangs when running hello_milvus.py after pulsar recovered from pod kill chaos #17508

Closed
1 task done
zhuwenxing opened this issue Jun 13, 2022 · 22 comments
Closed
1 task done
Assignees
Labels
kind/bug Issues or changes related a bug stale indicates no udpates for 30 days triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@zhuwenxing
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20220610-36ad9895
- Deployment mode(standalone or cluster):cluster
- SDK version(e.g. pymilvus v2.0.0rc2):pymilvus==2.1.0.dev69
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The action has timed out when running hello_milvus.py
Outputting duplicate logs in proxy pod

[2022/06/12 18:57:51.851 +00:00] [INFO] [impl.go:3587] ["received get flush state response"] [response="status:<> "]
[2022/06/12 18:57:52.355 +00:00] [INFO] [impl.go:3573] ["received get flush state request"] [request="segmentIDs:433864038933069825 segmentIDs:433864038933069826 "]
[2022/06/12 18:57:52.356 +00:00] [INFO] [impl.go:3587] ["received get flush state response"] [response="status:<> "]
[2022/06/12 18:57:52.861 +00:00] [INFO] [impl.go:3573] ["received get flush state request"] [request="segmentIDs:433864038933069825 segmentIDs:433864038933069826 "]
[2022/06/12 18:57:52.866 +00:00] [INFO] [impl.go:3587] ["received get flush state response"] [response="status:<> "]
[2022/06/12 18:57:53.370 +00:00] [INFO] [impl.go:3573] ["received get flush state request"] [request="segmentIDs:433864038933069825 segmentIDs:433864038933069826 "]
[2022/06/12 18:57:53.370 +00:00] [INFO] [impl.go:3587] ["received get flush state response"] [response="status:<> "]
[2022/06/12 18:57:53.873 +00:00] [INFO] [impl.go:3573] ["received get flush state request"] [request="segmentIDs:433864038933069825 segmentIDs:433864038933069826 "]
[2022/06/12 18:57:53.877 +00:00] [INFO] [impl.go:3587] ["received get flush state response"] [response="status:<> "]
[2022/06/12 18:57:54.379 +00:00] [INFO] [impl.go:3573] ["received get flush state request"] [request="segmentIDs:433864038933069825 segmentIDs:433864038933069826 "]
[2022/06/12 18:57:54.381 +00:00] [INFO] [impl.go:3587] ["received get flush state response"] [response="status:<> "]
[2022/06/12 18:57:54.884 +00:00] [INFO] [impl.go:3573] ["received get flush state request"] [request="segmentIDs:433864038933069825 segmentIDs:433864038933069826 "]
[2022/06/12 18:57:54.885 +00:00] [INFO] [impl.go:3587] ["received get flush state response"] [response="status:<> "]

Expected Behavior

all test cases passed

Steps To Reproduce

see https://github.com/milvus-io/milvus/runs/6851761481?check_suite_focus=true

Milvus Log

failed job: https://github.com/milvus-io/milvus/runs/6851761481?check_suite_focus=true
log: https://github.com/milvus-io/milvus/suites/6898913792/artifacts/267696164

Anything else?

No response

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 13, 2022
@yanliang567
Copy link
Contributor

@XuanYang-cn one more flush hang issue...
/assign @XuanYang-cn
/unassign

@yanliang567 yanliang567 added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jun 13, 2022
@yanliang567 yanliang567 added this to the 2.1-RC1 milestone Jun 13, 2022
@XuanYang-cn
Copy link
Contributor

may related to #17335, working on it

@XuanYang-cn
Copy link
Contributor

/assign @zhuwenxing
Could you please have another try with the latest master commit?

@zhuwenxing
Copy link
Contributor Author

zhuwenxing commented Jun 13, 2022

It still happened in version master-20220613-e6225d92
Your PR is merged in the version

failed job: https://github.com/zhuwenxing/milvus/runs/6857375279?check_suite_focus=true
but the log is not collected due to the job is timeout, not the failure

I will try it again

@zhuwenxing
Copy link
Contributor Author

@zhuwenxing zhuwenxing added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Jun 13, 2022
@zhuwenxing
Copy link
Contributor Author

@yanliang567 yanliang567 removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jun 14, 2022
@XuanYang-cn
Copy link
Contributor

XuanYang-cn commented Jun 14, 2022

15143 time="2022-06-13T10:27:40Z" level=info msg="[Connected consumer]" consumerID=2 name=rmflb subscription=by-dev-dataNode-11-433878722880733185 topic="persistent://public/default/by-dev-rootcoord-dml_2"
15144 time="2022-06-13T10:27:40Z" level=warning msg="Closed connection unable add consumer with id=2" local_addr="10.97.4.243:47140" remote_addr="pulsar://pulsar-pod-kill-6447-proxy:6650"
15145 time="2022-06-13T10:27:40Z" level=info msg="[Reconnected consumer to broker]" consumerID=2 name=rmflb subscription=by-dev-dataNode-11-433878722880733185 topic="persistent://public/default/by-dev-rootcoord-dml_2"
15146 time="2022-06-13T10:27:40Z" level=warning msg="Connection closed unable register listener id=24" local_addr="10.97.4.243:47140" remote_addr="pulsar://pulsar-pod-kill-6447-proxy:6650"

The same reason as #14577, this time it's DataNode's consumer didn't reconnected with pulsar

Related issues:

apache/pulsar-client-go#785

apache/pulsar-client-go#733

apache/pulsar-client-go#698

@XuanYang-cn
Copy link
Contributor

/assign @xige-16

@XuanYang-cn
Copy link
Contributor

failed job: https://github.com/zhuwenxing/milvus/runs/6858068872?check_suite_focus=true log: https://github.com/zhuwenxing/milvus/suites/6904892315/artifacts/268116140

In GitHub action, this issue seems stable reproduced

 3955 time="2022-06-13T09:06:13Z" level=info msg="[Connected consumer]" consumerID=2 name=tiuxt subscription=by-dev-dataNode-10-433877445462920513 topic="persistent://public/default/by-dev-rootcoord-dml_0"
 3956 time="2022-06-13T09:06:13Z" level=warning msg="Closed connection unable add consumer with id=2" local_addr="10.244.0.8:39730" remote_addr="pulsar://test-pulsar-pod-kill-proxy:6650"
 3957 time="2022-06-13T09:06:13Z" level=info msg="[Reconnected consumer to broker]" consumerID=2 name=tiuxt subscription=by-dev-dataNode-10-433877445462920513 topic="persistent://public/default/by-dev-rootcoord-dml_0"
 3958 time="2022-06-13T09:06:13Z" level=info msg="[Connected consumer]" consumerID=46 name=cierj subscription=by-dev-dataNode-10-433877462149496833 topic="persistent://public/default/by-dev-rootcoord-dml_45"
 3959 time="2022-06-13T09:06:13Z" level=warning msg="Closed connection unable add consumer with id=46" local_addr="10.244.0.8:39730" remote_addr="pulsar://test-pulsar-pod-kill-proxy:6650"

@XuanYang-cn
Copy link
Contributor

/unassign @xige-16 @XuanYang-cn
/assign @sunby

Please help to fix this

@sre-ci-robot sre-ci-robot assigned sunby and unassigned xige-16 and XuanYang-cn Jun 14, 2022
@zhuwenxing
Copy link
Contributor Author

@sunby
Copy link
Contributor

sunby commented Jun 17, 2022

/assign @zhuwenxing
please verify it.

@zhuwenxing
Copy link
Contributor Author

@sunby
Copy link
Contributor

sunby commented Jun 21, 2022

Fixed in #17642

@xiaofan-luan
Copy link
Collaborator

@zhuwenxing pls help on verifying it

@zhuwenxing
Copy link
Contributor Author

@sunby
Copy link
Contributor

sunby commented Jun 22, 2022

There is a bug in datacoord. Consider this situation, a segment is allocated by datacoord and the proxy tries to insert some records to it. But pulsar is killed at this moment and this segment is still empty. After calling Flush on proxy, datacoord will retrun a segment list containing segments that are waiting to be flushed. But datacoord won't flush empty segment, so this segment's state is always Sealed and the flush hang.

I will fix it by setting the state of segment to Dropped if datacoord find this segment is empty. And the GetFlushState will check the segment whether is empty.

Thanks for the great job. @zhuwenxing

@sunby
Copy link
Contributor

sunby commented Jun 24, 2022

/unassign

@sunby
Copy link
Contributor

sunby commented Jun 24, 2022

/assign @zhuwenxing please verify

@zhuwenxing
Copy link
Contributor Author

Not reproduced yet, remove the critical label.

@zhuwenxing zhuwenxing removed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Jun 27, 2022
@stale
Copy link

stale bot commented Jul 27, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no udpates for 30 days label Jul 27, 2022
@stale stale bot closed this as completed Aug 3, 2022
@xiaofan-luan
Copy link
Collaborator

I have one question about this issue.
By flushing(we should definitely name it as sync, flush is more like seal the segment), user want to ensure all the inserted data is synced to Minio.
After the change the semantic has been breaked, we have no guarantee that inserted data is on minio, which may cause data loss when you do backup or migration.
My suggestion is:

  1. Flush should still return all known segment list related
  2. Handle the case of empty segment flush to avoid hang, any thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Issues or changes related a bug stale indicates no udpates for 30 days triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

6 participants