Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tidb panic when injection pd leader io delay #8781

Closed
Lily2025 opened this issue Nov 7, 2024 · 3 comments · Fixed by #8783
Closed

tidb panic when injection pd leader io delay #8781

Lily2025 opened this issue Nov 7, 2024 · 3 comments · Fixed by #8783
Assignees
Labels
affects-8.5 This bug affects the 8.5.x(LTS) versions. impact/panic severity/major type/bug The issue is confirmed as a bug.

Comments

@Lily2025
Copy link

Lily2025 commented Nov 7, 2024

Bug Report

What did you do?

1、run sysbench
2、add index for one table
3、injection pd leader io delay
tidb-2024-11-04T11-37-04.520.log.zip

What did you expect to see?

no panic

What did you see instead?

tidb abnormally restart due to internal panic
[2024/11/04 10:53:00.821 +08:00] [FATAL] [tso_stream.go:330] ["tsoStream.recvLoop internal panic"] [stacktrace="github.com/tikv/pd/client.(*tsoStream).recvLoop.func1\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:330\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:785\ngo.uber.org/zap/zapcore.CheckWriteAction.OnWrite\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:196\ngo.uber.org/zap/zapcore.(*CheckedEntry).Write\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262\ngo.uber.org/zap.(*Logger).Panic\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/logger.go:285\ngithub.com/pingcap/log.Panic\n\t/root/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go:54\ngithub.com/tikv/pd/client.(*tsoDispatcher).checkMonotonicity\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:647\ngithub.com/tikv/pd/client.(*tsoDispatcher).processRequests.func2\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:597\ngithub.com/tikv/pd/client.(*tsoStream).recvLoop\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:444"] [panicMessage="[tso] timestamp fallback"] [stack="github.com/tikv/pd/client.(*tsoStream).recvLoop.func1\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:330\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:785\ngo.uber.org/zap/zapcore.CheckWriteAction.OnWrite\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:196\ngo.uber.org/zap/zapcore.(*CheckedEntry).Write\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262\ngo.uber.org/zap.(*Logger).Panic\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/logger.go:285\ngithub.com/pingcap/log.Panic\n\t/root/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go:54\ngithub.com/tikv/pd/client.(*tsoDispatcher).checkMonotonicity\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:647\ngithub.com/tikv/pd/client.(*tsoDispatcher).processRequests.func2\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:597\ngithub.com/tikv/pd/client.(*tsoStream).recvLoop\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:444"] [2024/11/04 10:53:03.033 +08:00] [INFO] [meminfo.go:179] ["use cgroup memory hook because TiDB is in the container"] [2024/11/04 10:53:03.034 +08:00] [INFO] [printer.go:52] ["Welcome to TiDB."] ["Release Version"=v8.5.0-alpha-31-g241c10294b] [Edition=Community] ["Git Commit Hash"=241c10294b9c0305db590576700c00de306be573] ["Git Branch"=HEAD] ["UTC Build Time"="2024-11-01 16:29:12"] [GoVersion=go1.23.2] ["Race Enabled"=false] ["Check Table Before Drop"=false]

What version of PD are you using (pd-server -V)?

./pd-server -V
Release Version: v8.5.0-alpha-3-g60b960a4
Edition: Community
Git Commit Hash: 60b960a
Git Branch: HEAD
UTC Build Time: 2024-11-01 10:24:22
2024-11-04T10:31:24.388+0800

@Lily2025 Lily2025 added the type/bug The issue is confirmed as a bug. label Nov 7, 2024
@Lily2025
Copy link
Author

Lily2025 commented Nov 7, 2024

/type bug
/severity major
/assign lhy1024

@lhy1024
Copy link
Contributor

lhy1024 commented Nov 7, 2024

After #8744, we support dynamic check tso service, which make startTSOJobsIfNeeded happen after stopTSOJobsIfNeeded

  1. tso sync failed and reset
[2024/11/04 10:32:44.337 +08:00] [WARN] [tso.go:371] ["save timestamp failed"] [] [dc-location=global] [timestamp-path=timestamp] [error="context deadline exceeded"]
[2024/11/04 10:32:44.337 +08:00] [WARN] [allocator_manager.go:781] ["failed to update allocator's timestamp"] [] [dc-location=global] [name=tc-pd-2] [error="context deadline exceeded"]
[2024/11/04 10:32:44.337 +08:00] [INFO] [tso.go:437] ["reset the timestamp in memory"] []
  1. leader switch because keep alive failed and close service, I guess it also has run stopTSOJobsIfNeeded
[2024/11/04 10:32:44.339 +08:00] [WARN] [lease.go:185] ["lease keep alive failed"] [purpose="leader election"] [start=2024/11/04 10:32:43.758 +08:00] [error="context canceled"]
[2024/11/04 10:32:44.339 +08:00] [INFO] [lease.go:202] ["stop lease keep alive worker"] [purpose="leader election"]
[2024/11/04 10:32:44.377 +08:00] [INFO] [server.go:1777] ["no longer a leader because lease has expired, pd leader will step down"]
  1. checkTSOService call startTSOJobsIfNeeded , although it was not leader
[2024/11/04 10:32:49.350 +08:00] [INFO] [cluster.go:476] ["initializing the global TSO allocator"]
[2024/11/04 10:32:49.352 +08:00] [INFO] [tso.go:161] ["start to sync timestamp"] []
[2024/11/04 10:32:49.356 +08:00] [INFO] [tso.go:221] ["sync and save timestamp"] [] [last=2024/11/04 10:32:37.334 +08:00] [last-saved=0001/01/01 00:00:00.000 +00:00] [save=2024/11/04 10:32:52.355 +08:00] [next=2024/11/04 10:32:49.355 +08:00]
[2024/11/04 10:32:49.359 +08:00] [INFO] [cluster.go:459] ["service check job is stopped"]
[2024/11/04 10:33:19.423 +08:00] [INFO] [cluster.go:862] ["raft cluster is stopped"]
  1. it became leader again and no longer initialize tso until next sync
[2024/11/04 10:53:00.806 +08:00] [INFO] [server.go:1768] ["PD leader is ready to serve"] [leader-name=tc-pd-2]
....
[2024/11/04 10:53:00.834 +08:00] [WARN] [tso.go:335] ["clock offset"] [] [jet-lag=5m0.968119947s] [prev-physical=2024/11/04 10:47:59.866 +08:00] [now=2024/11/04 10:53:00.834 +08:00] [update-physical-interval=50ms]
  1. If client call tso will meet fallback
[2024/11/04 10:53:00.821 +08:00] [PANIC] [tso_dispatcher.go:647] ["[tso] timestamp fallback"] [dc-location=global] [keyspace=4294967295] [last-ts="(1730688755816, 13414)"] [cur-ts="(1730688479866, 1)"] [last-tso-server=tc-pd-1.tc-pd-peer.endless-ha-test-add-index-tps-7651740-1-143.svc:2379] [cur-tso-server=tc-pd-2.tc-pd-peer.endless-ha-test-add-index-tps-7651740-1-143.svc:2379] [last-keyspace-group-in-request=0] [cur-keyspace-group-in-request=0] [last-keyspace-group-in-response=0] [cur-keyspace-group-in-response=0] [last-response-received-at=2024/11/04 10:52:40.175 +08:00] [cur-response-received-at=2024/11/04 10:53:00.821 +08:00] [last-stream-id=tc-pd-1.tc-pd-peer.endless-ha-test-add-index-tps-7651740-1-143.svc:2379-274] [cur-stream-id=tc-pd-2.tc-pd-peer.endless-ha-test-add-index-tps-7651740-1-143.svc:2379-522]
[2024/11/04 10:53:00.821 +08:00] [FATAL] [tso_stream.go:330] ["tsoStream.recvLoop internal panic"] [stacktrace="github.com/tikv/pd/client.(*tsoStream).recvLoop.func1\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:330\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:785\ngo.uber.org/zap/zapcore.CheckWriteAction.OnWrite\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:196\ngo.uber.org/zap/zapcore.(*CheckedEntry).Write\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262\ngo.uber.org/zap.(*Logger).Panic\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/logger.go:285\ngithub.com/pingcap/log.Panic\n\t/root/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go:54\ngithub.com/tikv/pd/client.(*tsoDispatcher).checkMonotonicity\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:647\ngithub.com/tikv/pd/client.(*tsoDispatcher).processRequests.func2\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:597\ngithub.com/tikv/pd/client.(*tsoStream).recvLoop\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:444"] [panicMessage="[tso] timestamp fallback"] [stack="github.com/tikv/pd/client.(*tsoStream).recvLoop.func1\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:330\nruntime.gopanic\n\t/usr/local/go/src/runtime/panic.go:785\ngo.uber.org/zap/zapcore.CheckWriteAction.OnWrite\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:196\ngo.uber.org/zap/zapcore.(*CheckedEntry).Write\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/zapcore/entry.go:262\ngo.uber.org/zap.(*Logger).Panic\n\t/root/go/pkg/mod/go.uber.org/zap@v1.27.0/logger.go:285\ngithub.com/pingcap/log.Panic\n\t/root/go/pkg/mod/github.com/pingcap/log@v1.1.1-0.20240314023424-862ccc32f18d/global.go:54\ngithub.com/tikv/pd/client.(*tsoDispatcher).checkMonotonicity\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:647\ngithub.com/tikv/pd/client.(*tsoDispatcher).processRequests.func2\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_dispatcher.go:597\ngithub.com/tikv/pd/client.(*tsoStream).recvLoop\n\t/root/go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20241016064947-b70107ec31e6/tso_stream.go:444"]

last-ts 1730688479866 is 2024-11-04 10:52:35.816 and cur-ts 1730688479866 is 2024-11-04 10:47:59.866, which is the same with pd-server log

@lhy1024 lhy1024 added the affects-8.5 This bug affects the 8.5.x(LTS) versions. label Nov 7, 2024
@okJiang
Copy link
Member

okJiang commented Nov 7, 2024

Supplementary code screenshots for better understanding.

  1. Entered checkTSOService, at this point ctx is not yet Done.
    img_v3_02gd_4ab2c4f2-438d-4bc6-b17c-8ec2c8d1e52g
  2. The RaftCluster called Stop and canceled the context.
    img_v3_02gd_a50fe763-634b-4675-b61a-7e9224ee529g
  3. Stop called StopTSOJob.
  4. checkTSOService called StartTSOJob,

Caused TSO to not shut down properly.

@ti-chi-bot ti-chi-bot bot closed this as completed in #8783 Nov 7, 2024
@ti-chi-bot ti-chi-bot bot closed this as completed in b19cec5 Nov 7, 2024
ti-chi-bot bot pushed a commit that referenced this issue Nov 7, 2024
close #8781

Signed-off-by: lhy1024 <admin@liudos.us>

Co-authored-by: lhy1024 <admin@liudos.us>
ti-chi-bot bot pushed a commit that referenced this issue Nov 8, 2024
ref #8781

Signed-off-by: lhy1024 <admin@liudos.us>
ti-chi-bot bot pushed a commit that referenced this issue Nov 8, 2024
ref #8781

Signed-off-by: lhy1024 <admin@liudos.us>

Co-authored-by: lhy1024 <admin@liudos.us>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-8.5 This bug affects the 8.5.x(LTS) versions. impact/panic severity/major type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants