Sentinel cluster settings 1 node network iface down, Probability unable to query the master node, MasterAddr error: context deadline exceeded #3172

Open
kwenzh opened this issue Oct 28, 2024 · 1 comment

kwenzh commented Oct 28, 2024

In a 3-node cluster (3 sentinels + 3 redis-server instances on nodes A, B, and C), take the network card of node C offline, e.g. ifconfig eth0 down. The client then reconnects to Redis Sentinel with NewFailoverClient to look up the master address.

Expected Behavior

  • redis-server fails over, and the client connects to the new master successfully.

Current Behavior

  • Intermittently, the lookup fails with the error context deadline exceeded. The error is returned at https://github.com/redis/go-redis/blob/master/sentinel.go#L559 while the client is trying to reach the sentinel on node C, even though the sentinels on A and B are working normally. This happens when the shuffle of the sentinel addresses puts the faulty node C first: the attempt on C uses up the whole context deadline, so no time is left for A and B.
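
The failure mode can be illustrated without Redis. The following is a minimal sketch, not the go-redis code: fakeQuery, lookupSharedCtx, runLookup, and the address 10.0.0.1:6379 are hypothetical names and values, and the overall timeout is shortened to 50ms so the example runs quickly. It only mirrors the shape of a sequential loop in which all sentinels share one context.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fakeQuery stands in for a sentinel query (hypothetical; not the go-redis API).
// The address "C" simulates the unreachable node: it blocks until ctx expires.
func fakeQuery(ctx context.Context, addr string) (string, error) {
	if addr == "C" {
		<-ctx.Done()
		return "", ctx.Err()
	}
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return "10.0.0.1:6379", nil // made-up master address
}

// lookupSharedCtx queries every sentinel with the same ctx, and a deadline
// or cancellation error aborts the whole lookup.
func lookupSharedCtx(ctx context.Context, addrs []string) (string, error) {
	for _, addr := range addrs {
		master, err := fakeQuery(ctx, addr)
		if err != nil {
			if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
				return "", err
			}
			continue
		}
		return master, nil
	}
	return "", errors.New("all sentinels are unreachable")
}

// runLookup runs one lookup with a short overall timeout and reports the outcome as text.
func runLookup(addrs []string) string {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	master, err := lookupSharedCtx(ctx, addrs)
	if err != nil {
		return "error: " + err.Error()
	}
	return "master: " + master
}

func main() {
	fmt.Println(runLookup([]string{"A", "B", "C"})) // healthy sentinel first
	fmt.Println(runLookup([]string{"C", "A", "B"})) // faulty sentinel first
}
```

With a healthy sentinel first, the sketch returns the placeholder address. With the faulty sentinel first, the whole budget is spent waiting on C and the lookup ends with context deadline exceeded, although A and B would have answered.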

Possible Solution

  • In the function that obtains the master address, instead of querying each sentinel address sequentially, consider querying them concurrently in goroutines, or use a separate context for each attempt.
  • Make the context of each iteration independent, for example by reading the deadline from the caller's context (ctx.Deadline) and creating a fresh context for each attempt. The loop in question is:
for i, sentinelAddr := range c.sentinelAddrs {
	sentinel := NewSentinelClient(c.opt.sentinelOptions(sentinelAddr))

	masterAddr, err := sentinel.GetMasterAddrByName(ctx, c.opt.MasterName).Result()
	if err != nil {
		_ = sentinel.Close()
		if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
			return "", err
		}
		internal.Logger.Printf(ctx, "sentinel: GetMasterAddrByName master=%q failed: %s",
			c.opt.MasterName, err)
		continue
	}

	// Push working sentinel to the top.
	c.sentinelAddrs[0], c.sentinelAddrs[i] = c.sentinelAddrs[i], c.sentinelAddrs[0]
	c.setSentinel(ctx, sentinel)

	addr := net.JoinHostPort(masterAddr[0], masterAddr[1])
	return addr, nil
}
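
Both ideas can be sketched independently of go-redis. Everything below is an assumption for illustration: queryFunc, lookupPerAttempt, lookupConcurrent, fakeQuery, run, and the address 10.0.0.1:6379 are hypothetical, and the timeouts are scaled down to milliseconds. This is one possible shape for the change, not a patch against sentinel.go.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// queryFunc is a hypothetical stand-in for asking one sentinel for the master address.
type queryFunc func(ctx context.Context, addr string) (string, error)

var errNoSentinel = errors.New("all sentinels are unreachable")

// lookupPerAttempt gives every sentinel its own child context, so one slow
// sentinel can use at most perAttempt of the caller's overall budget.
func lookupPerAttempt(ctx context.Context, addrs []string, perAttempt time.Duration, query queryFunc) (string, error) {
	for _, addr := range addrs {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		master, err := query(attemptCtx, addr)
		cancel()
		if err == nil {
			return master, nil
		}
		// Give up only when the caller's own context is done.
		if ctx.Err() != nil {
			return "", ctx.Err()
		}
	}
	return "", errNoSentinel
}

// lookupConcurrent asks all sentinels at once and returns the first success.
func lookupConcurrent(ctx context.Context, addrs []string, query queryFunc) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stops the remaining queries once a result is chosen

	type result struct {
		master string
		err    error
	}
	results := make(chan result, len(addrs)) // buffered: senders never block
	for _, addr := range addrs {
		go func(addr string) {
			master, err := query(ctx, addr)
			results <- result{master, err}
		}(addr)
	}
	for range addrs {
		if r := <-results; r.err == nil {
			return r.master, nil
		}
	}
	return "", errNoSentinel
}

// fakeQuery simulates three sentinels: "C" is unreachable and blocks until its
// context ends; the others answer with a made-up master address.
func fakeQuery(ctx context.Context, addr string) (string, error) {
	if addr == "C" {
		<-ctx.Done()
		return "", ctx.Err()
	}
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return "10.0.0.1:6379", nil
}

// run performs one lookup with the faulty sentinel first and reports the outcome as text.
func run(concurrent bool) string {
	ctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
	defer cancel()
	addrs := []string{"C", "A", "B"}
	var master string
	var err error
	if concurrent {
		master, err = lookupConcurrent(ctx, addrs, fakeQuery)
	} else {
		master, err = lookupPerAttempt(ctx, addrs, 50*time.Millisecond, fakeQuery)
	}
	if err != nil {
		return "error: " + err.Error()
	}
	return "master: " + master
}

func main() {
	fmt.Println(run(false)) // per-attempt timeout
	fmt.Println(run(true))  // concurrent queries
}
```

In this sketch both variants reach a healthy sentinel even though C comes first. The per-attempt variant only helps if the per-attempt budget is clearly smaller than the caller's overall timeout; with the values from this report (3s context, 5s dial timeout) a single attempt could still consume everything.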

Steps to Reproduce

  1. Deploy a cluster with 3 sentinels and 3 Redis servers.
  2. Take the NIC of one node offline so that the node is unreachable, e.g. ifconfig eth0 down.
  3. Have the client connect to the Redis cluster repeatedly with NewFailoverClient.
  4. Check whether the master Redis address can be obtained.
  5. Intermittently, the lookup fails with the error: context deadline exceeded.

Context (Environment)

  • CentOS 8, kernel 4.18
  • go-redis: v9.6.0
  • ctx timeout: 3s
  • dialTimeout: 5s (default)

Detailed Description

I think the key points are:

  • First, why does the master address lookup query each sentinel sequentially? A failed node near the front of the list can affect the healthy nodes behind it.
  • Second, on repeated initialization the shuffle is pseudo-random with a fixed seed of 1, so every fresh initialization produces the same sequence of orderings. The failure therefore recurs at a fixed point, namely whenever the faulty node is shuffled into first place.
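
The second point can be checked with the standard library alone. This is a minimal sketch that assumes the shuffle is driven by a math/rand generator created from a fixed seed, as described above; the helper name shuffled is hypothetical. A generator built from seed 1 yields the same order in every fresh process, while a time-based seed varies between runs.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
	"time"
)

// shuffled returns "A", "B", "C" in the order produced by a fresh generator
// created from the given seed, the way a newly started process would see it.
func shuffled(seed int64) string {
	r := rand.New(rand.NewSource(seed))
	addrs := []string{"A", "B", "C"}
	r.Shuffle(len(addrs), func(i, j int) {
		addrs[i], addrs[j] = addrs[j], addrs[i]
	})
	return strings.Join(addrs, " ")
}

func main() {
	// A fixed seed gives the same order on every start.
	fmt.Println(shuffled(1) == shuffled(1)) // prints "true"

	// A time-based seed is one way to vary the order between starts.
	fmt.Println(shuffled(time.Now().UnixNano()))
}
```

Varying the seed only changes which runs hit the faulty node first; it does not remove the underlying problem that one unreachable sentinel can exhaust the shared deadline.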

kwenzh changed the title from "Sentinel cluster set 1 node network iface down, unable to elect a master, context deadline exceeded" to "Sentinel cluster settings 1 node network iface down, Probability unable to query the master node, MasterAddr error: context deadline exceeded" on Oct 28, 2024

kwenzh commented Oct 29, 2024

  1. Simulating the shuffle of the sentinel nodes several times shows that node C lands in first position in the second round. The results are also identical on every run, because the generator is pseudo-random with a seed of 1.
for cnt := 0; cnt < 10; cnt++ {
	arrs := []string{"A", "B", "C"}
	Shuffle(3, func(i, j int) {
		fmt.Println(">>>>>>>>", i, j)
		arrs[i], arrs[j] = arrs[j], arrs[i]
	})
	fmt.Println(">>>>>>>>", arrs)
}

output:

>>>>>>>> 2 1
>>>>>>>> 1 1     
>>>>>>>> [A C B] 
>>>>>>>> 2 1     
>>>>>>>> 1 0     
>>>>>>>> [C A B] 
>>>>>>>> 2 1     
>>>>>>>> 1 1     
>>>>>>>> [A C B] 
>>>>>>>> 2 0     
>>>>>>>> 1 0     
>>>>>>>> [B C A] 
>>>>>>>> 2 0     
>>>>>>>> 1 0     
>>>>>>>> [B C A] 
>>>>>>>> 2 1     
>>>>>>>> 1 1     
>>>>>>>> [A C B] 
>>>>>>>> 2 0     
>>>>>>>> 1 0     
>>>>>>>> [B C A] 
>>>>>>>> 2 0     
>>>>>>>> 1 0     
>>>>>>>> [B C A] 
>>>>>>>> 2 0     
>>>>>>>> 1 0                                                                    
>>>>>>>> [B C A]                                                                
>>>>>>>> 2 2                                                                    
>>>>>>>> 1 0                                                                    
>>>>>>>> [B A C]    

  2. Simulating repeated initialization of the failover client while node C is down: the error occurs in the second round of the loop, and the program exits because of the context timeout.

func mock_sentinel() {
	for i := 0; i < 10; i++ {
		// Placeholder sentinel addresses for nodes A, B, and C.
		addr := []string{
			"A", "B", "C",
		}
		sent := redis.NewFailoverClient(&redis.FailoverOptions{
			SentinelAddrs: addr,
			MasterName:    "mymaster",
		})
		ctx, cancel := context.WithTimeout(context.Background(), time.Second*3)
		_, err := sent.Ping(ctx).Result()
		// Release the context and the client every round instead of deferring inside the loop.
		cancel()
		_ = sent.Close()
		if err != nil {
			panic(err)
		}
		fmt.Println("connect failover client ok", i)
	}
}
