
PD panics when listing resource groups with some resource groups defined #7206

Closed
AndreMouche opened this issue Oct 16, 2023 · 12 comments · Fixed by #7623
Labels: affects-7.1, affects-7.5, report/customer (Customers have encountered this bug.), severity/major, type/bug (The issue is confirmed as a bug.)

Comments

@AndreMouche (Member)

Bug Report

What did you do?

Create some resource groups and try to list them.

What did you expect to see?

No panic; all resource groups are returned.

What did you see instead?

PD panics with the following stack trace:

panic: json: unsupported value: NaN

goroutine 331601 [running]:
github.com/tikv/pd/pkg/mcs/resource_manager/server.(*ResourceGroup).Copy(0x40179af5d8?)
    /mnt/data1/jenkins/workspace/build-common@2/go/src/github.com/pingcap/pd/pkg/mcs/resource_manager/server/resource_group.go:68 +0x13c
github.com/tikv/pd/pkg/mcs/resource_manager/server.(*Manager).GetResourceGroupList(0x4000511ec0)
    /mnt/data1/jenkins/workspace/build-common@2/go/src/github.com/pingcap/pd/pkg/mcs/resource_manager/server/manager.go:245 +0x124
github.com/tikv/pd/pkg/mcs/resource_manager/server.(*Service).ListResourceGroups(0x4000209638?, {0x402a3abc80?, 0x3?}, 0x3?)
    /mnt/data1/jenkins/workspace/build-common@2/go/src/github.com/pingcap/pd/pkg/mcs/resource_manager/server/grpc_service.go:114 +0x74
github.com/pingcap/kvproto/pkg/resource_manager._ResourceManager_ListResourceGroups_Handler.func1({0x3891a90, 0x402a3abb30}, {0x2a33aa0?, 0x4035da1b80})
    /root/go/pkg/mod/github.com/pingcap/kvproto@v0.0.0-20230407040905-68d0eebd564a/pkg/resource_manager/resource_manager.pb.go:1868 +0x74
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1({0x3891a90?, 0x402a3abb30?}, {0x2a33aa0?, 0x4035da1b80?})
    /root/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:31 +0x9c
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).UnaryServerInterceptor.func1({0x3891a90, 0x402a3abb30}, {0x2a33aa0, 0x4035da1b80}, 0x20?, 0x401cdd8e10)
    /root/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:107 +0x74
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1({0x3891a90?, 0x402a3abb30?}, {0x2a33aa0?, 0x4035da1b80?})
    /root/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:34 +0x74
go.etcd.io/etcd/etcdserver/api/v3rpc.newUnaryInterceptor.func1({0x3891a90, 0x402a3abb30}, {0x2a33aa0?, 0x4035da1b80}, 0x3366471900000000?, 0x401cdd8e10)
    /root/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20220915004622-85b640cee793/etcdserver/api/v3rpc/interceptor.go:70 +0x2c4
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1({0x3891a90?, 0x402a3abb30?}, {0x2a33aa0?, 0x4035da1b80?})
    /root/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:34 +0x74
go.etcd.io/etcd/etcdserver/api/v3rpc.newLogUnaryInterceptor.func1({0x3891a90, 0x402a3abb30}, {0x2a33aa0, 0x4035da1b80}, 0x4035da1ba0, 0x401cdd8e10)
    /root/go/pkg/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20220915004622-85b640cee793/etcdserver/api/v3rpc/interceptor.go:77 +0x80
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1({0x3891a90, 0x402a3abb30}, {0x2a33aa0, 0x4035da1b80}, 0x4035da1ba0, 0x40281687c8)
    /root/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.0.1-0.20190118093823-f849b5445de4/chain.go:39 +0x17c
github.com/pingcap/kvproto/pkg/resource_manager._ResourceManager_ListResourceGroups_Handler({0x29e8040?, 0x4000209638}, {0x3891a90, 0x402a3abb30}, 0x4035caaae0, 0x400157c1b0)
    /root/go/pkg/mod/github.com/pingcap/kvproto@v0.0.0-20230407040905-68d0eebd564a/pkg/resource_manager/resource_manager.pb.go:1870 +0x12c
google.golang.org/grpc.(*Server).processUnaryRPC(0x4001cf2480, {0x38a10e0, 0x402dc29800}, 0x400cd0c300, 0x400243f590, 0x48e45c0, 0x0)
    /root/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:1024 +0xb18
google.golang.org/grpc.(*Server).handleStream(0x4001cf2480, {0x38a10e0, 0x402dc29800}, 0x400cd0c300, 0x0)
    /root/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:1313 +0x854
google.golang.org/grpc.(*Server).serveStreams.func1.1()
    /root/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:722 +0x84
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /root/go/pkg/mod/google.golang.org/grpc@v1.26.0/server.go:720 +0xdc
Stream closed EOF for prod-tidb/prod-redash-pd-0 (pd)

Deleting all resource groups stops the panics.

What version of PD are you using (pd-server -V)?

v7.1.0

@hongshaoyang

I am facing this issue in one of our TiDB 7.1 clusters.

Looking at the stack trace, could it be related to Prometheus scraping of metrics?

@hongshaoyang commented Dec 26, 2023

This is the snippet causing the issue: json.Marshal fails to serialize the ResourceGroup struct when one of its float fields holds NaN.

```go
func (rg *ResourceGroup) Copy() *ResourceGroup {
	// TODO: use a better way to copy
	rg.RLock()
	defer rg.RUnlock()
	res, err := json.Marshal(rg)
	if err != nil {
		panic(err)
	}
	var newRG ResourceGroup
	err = json.Unmarshal(res, &newRG)
	if err != nil {
		panic(err)
	}
	return &newRG
}
```
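
For reference, encoding/json cannot represent NaN or ±Inf at all, so any NaN float inside the group makes this marshal/unmarshal round trip panic. A minimal standalone sketch (not PD code; the tokenState struct here is made up) reproduces the exact error from the stack trace:

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
)

// tokenState is a stand-in for any struct with a float64 field, such as the
// token bucket state carried inside a ResourceGroup.
type tokenState struct {
	Tokens float64 `json:"tokens"`
}

func main() {
	_, err := json.Marshal(tokenState{Tokens: math.NaN()})
	fmt.Println(err) // json: unsupported value: NaN
}
```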

@CabinfeverB (Member)

(quoting the Copy() snippet above)

Yes, but we don't know how some fields change to NaN.
Can you help reproduce it?
cc @nolouch

@CabinfeverB (Member)

cc @glorv

@hongshaoyang

Yes, but we don't know how some fields change to NaN. Can you help reproduce it? cc @nolouch

Yes, sure, here is the list of resource groups that we used.
[Screenshot: list of the resource groups in use, 2023-12-25]

@CabinfeverB (Member)

@hongshaoyang After the panic, does PD panic again when the resource groups are listed again?

@hongshaoyang commented Dec 26, 2023

@hongshaoyang After the panic, does PD panic again when the resource groups are listed again?

@CabinfeverB Yes, PD panics again repeatedly. The TiDB cluster is deployed on Kubernetes, and the PD pods keep going into CrashLoopBackOff. The logs show the same stack trace. This implies that some hidden process is listing the resource groups repeatedly.

It is not a human listing the resource groups, as the PD pods crashed outside office hours, when there were no changes to resource groups or their configurations.

@nolouch (Contributor) commented Dec 27, 2023

@hongshaoyang
How often does it panic? Could you help us export some data with this command:

curl -sl  http://{pd-leader-ip}:{pd-port}/resource-manager/api/v1/config/groups | jq ".[].r_u_settings" > data.json

@hongshaoyang
Copy link

Here is the r_u_settings data:

{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"initialized":false}}}
{"r_u":{"settings":{"fill_rate":2147483647,"burst_limit":-1},"state":{"tokens":29860685960413220,"last_update":"2023-12-27T08:19:16.269363735Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":14000,"last_update":"2023-12-27T08:19:17.269332808Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":14000,"last_update":"2023-12-27T08:19:05.143659794Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":-28216.64129750421,"last_update":"2023-12-27T08:19:16.4862813Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":14000,"last_update":"2023-12-27T08:19:17.420912119Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":1163.6089377586882,"last_update":"2023-12-27T08:19:15.252112524Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":14000,"last_update":"2023-12-27T08:19:10.78038052Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":11678.85839950952,"last_update":"2023-12-27T08:19:16.269380275Z","initialized":true}}}
{"r_u":{"settings":{"fill_rate":14000,"burst_limit":14000},"state":{"tokens":14000,"last_update":"2023-12-27T08:19:09.270797923Z","initialized":true}}}

@hongshaoyang commented Dec 27, 2023

#7206 (comment)

It panics every 5-8 days; we are not sure why it is such an infrequent occurrence. The only workaround is to drop all resource groups.

ti-chi-bot bot pushed a commit that referenced this issue Dec 27, 2023
close #7206

resource_mananger: deep clone resource group

Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: tongjian <1045931706@qq.com>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Dec 27, 2023
close tikv#7206

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Dec 27, 2023
close tikv#7206

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@nolouch nolouch reopened this Dec 27, 2023
ti-chi-bot bot pushed a commit that referenced this issue Jan 2, 2024
close #7206

resource_mananger: deep clone resource group

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ShuNing <nolouch@gmail.com>
Co-authored-by: nolouch <nolouch@gmail.com>
ti-chi-bot bot pushed a commit that referenced this issue Jan 3, 2024
close #7206

resource_mananger: deep clone resource group

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Signed-off-by: nolouch <nolouch@gmail.com>

Co-authored-by: ShuNing <nolouch@gmail.com>
Co-authored-by: nolouch <nolouch@gmail.com>
ti-chi-bot bot added a commit that referenced this issue Jan 3, 2024
…7626)

ref #7206

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot added a commit that referenced this issue Jan 4, 2024
…7626) (#7658)

ref #7206

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>

Co-authored-by: Cabinfever_B <cabinfeveroier@gmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
ti-chi-bot bot pushed a commit that referenced this issue Jan 4, 2024
…7626) (#7657)

ref #7206

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>

Co-authored-by: Cabinfever_B <cabinfeveroier@gmail.com>
pingandb pushed a commit to pingandb/pd that referenced this issue Jan 18, 2024
…ikv#7626)

ref tikv#7206

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>

Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Signed-off-by: pingandb <songge102@pingan.com.cn>
@nolouch nolouch closed this as completed Mar 1, 2024
@nolouch (Contributor) commented Mar 1, 2024

Fixed. We could not reproduce the NaN problem, but we replaced the JSON round trip with a different way of copying the data, so this issue should be fixed.
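
For illustration, a hedged sketch of what a field-by-field deep clone can look like is below. The struct shapes here are simplified and hypothetical, not the actual PD definitions; the authoritative change is the one merged via #7623. The key point is that float fields are copied as values, so encoding/json never sees them.

```go
package server

import (
	"sync"
	"time"
)

// GroupTokenBucketState and RequestUnitSettings are simplified stand-ins for
// the real PD types, used only to illustrate the cloning approach.
type GroupTokenBucketState struct {
	Tokens      float64 // may be NaN/Inf; copied as a value, never marshalled
	LastUpdate  *time.Time
	Initialized bool
}

type RequestUnitSettings struct {
	RU *GroupTokenBucketState
}

type ResourceGroup struct {
	sync.RWMutex
	Name       string
	Mode       int32
	RUSettings *RequestUnitSettings
}

// Copy clones the group field by field instead of round-tripping through
// encoding/json, so a NaN or Inf token value can no longer trigger a panic.
func (rg *ResourceGroup) Copy() *ResourceGroup {
	rg.RLock()
	defer rg.RUnlock()

	newRG := &ResourceGroup{Name: rg.Name, Mode: rg.Mode}
	if rg.RUSettings != nil && rg.RUSettings.RU != nil {
		state := *rg.RUSettings.RU // value copy of all scalar fields, including Tokens
		if state.LastUpdate != nil {
			t := *state.LastUpdate // clone the pointed-to timestamp too
			state.LastUpdate = &t
		}
		newRG.RUSettings = &RequestUnitSettings{RU: &state}
	}
	return newRG
}
```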

@seiya-annie

/found customer

@ti-chi-bot ti-chi-bot bot added the report/customer Customers have encountered this bug. label Jun 4, 2024
7 participants