Is generation in clusterNodes.GC() too old? #3284

DingYuan0118 opened this issue Feb 19, 2025 · 1 comment
DingYuan0118 commented Feb 19, 2025

Discussed in #3283

Originally posted by DingYuan0118 February 19, 2025

version: github.com/go-redis/redis/v8 v8.11.5

The Redis cluster client reloads its state every 10 seconds (assuming traffic always exists), like below:

func (c *clusterStateHolder) Get(ctx context.Context) (*clusterState, error) {
	v := c.state.Load()
	if v == nil {
		return c.Reload(ctx)
	}

	state := v.(*clusterState)
	if time.Since(state.createdAt) > 10*time.Second {
		c.LazyReload()
	}
	return state, nil
}

A GC is then scheduled one minute after the Reload(), like:

func newClusterState(
	nodes *clusterNodes, slots []ClusterSlot, origin string,
) (*clusterState, error) {
	c := clusterState{
		nodes: nodes,

		slots: make([]*clusterSlot, 0, len(slots)),

		generation: nodes.NextGeneration(),
		createdAt:  time.Now(),
	}

	originHost, _, _ := net.SplitHostPort(origin)
	isLoopbackOrigin := isLoopback(originHost)

	for _, slot := range slots {
		// ......
	}
	sort.Sort(clusterSlotSlice(c.slots))

	time.AfterFunc(time.Minute, func() {
		nodes.GC(c.generation)
	})

	return &c, nil
}
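
To make the timing concrete, below is a minimal standalone simulation (my own sketch, not go-redis code, with compressed timescales: 100ms stands in for the ~10-second reload interval and 600ms for the one-minute GC delay). It only illustrates how the value captured in the AfterFunc closure falls behind the counter that the periodic reloads keep advancing:

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	var generation uint32

	// "newClusterState": take the next generation and schedule a GC that
	// captures this value, just like time.AfterFunc(time.Minute, ...) above.
	captured := atomic.AddUint32(&generation, 1)
	gcDone := make(chan struct{})
	time.AfterFunc(600*time.Millisecond, func() {
		fmt.Printf("GC uses generation %d, but nodes are already at %d\n",
			captured, atomic.LoadUint32(&generation))
		close(gcDone)
	})

	// "LazyReload" keeps creating newer states and bumping the generation
	// while the GC timer is still pending.
	for i := 0; i < 6; i++ {
		time.Sleep(100 * time.Millisecond)
		atomic.AddUint32(&generation, 1)
	}

	<-gcDone
	// Prints roughly: GC uses generation 1, but nodes are already at 7
}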

When the GC runs, the generation it uses was captured one minute earlier. This creates a gap between clusterNode.generation and the GC generation, and causes activeAddrs not to be updated in time. I added some logging to verify this:

// GC removes unused nodes.
func (c *clusterNodes) GC(generation uint32) {
	//nolint:prealloc
	var collected []*clusterNode

	c.mu.Lock()
	fmt.Printf("time %s: GC, generation: %d\n", time.Now().Format("2006-01-02 15:04:05"), generation)  // logs
	fmt.Printf("time %s: GC, current nodes: %d\n", time.Now().Format("2006-01-02 15:04:05"), len(c.nodes)) // logs

	c.activeAddrs = c.activeAddrs[:0]
	for addr, node := range c.nodes {
		fmt.Printf("time %s: GC, node addr: %s, generation: %d\n", time.Now().Format("2006-01-02 15:04:05"), addr, node.Generation())
		if node.Generation() >= generation {
			c.activeAddrs = append(c.activeAddrs, addr)
			if c.opt.RouteByLatency {
				go node.updateLatency()
			}
			continue
		}

		delete(c.nodes, addr)
		collected = append(collected, node)
	}
	fmt.Printf("time %s: GC, current activeAddrs: %v\n", time.Now().Format("2006-01-02 15:04:05"), c.activeAddrs) // logs
	c.mu.Unlock()

	for _, node := range collected {
		_ = node.Client.Close()
	}
}

The output looks like this:

time 2025-02-19 09:01:44: GC, generation: 1
time 2025-02-19 09:01:44: GC, current nodes: 7
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.12:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.16:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.11:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.15:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.13:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: 172.19.0.14:6379, generation: 6
time 2025-02-19 09:01:44: GC, node addr: redis.example.com:6379, generation: 0
time 2025-02-19 09:01:44: GC, current activeAddrs: [172.19.0.12:6379 172.19.0.16:6379 172.19.0.11:6379 172.19.0.15:6379 172.19.0.13:6379 172.19.0.14:6379]

We can see that the node generation (which is 6) is greater than the GC generation (which is 1).
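This gap matches the timing above: with a reload roughly every 10 seconds and the GC deferred by one minute, about 60 s / 10 s ≈ 6 newer generations are created before the timer fires, so live nodes are always several generations ahead of the value the GC was scheduled with.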

If I want to switch Redis clusters by updating a DNS record and shutting down the old cluster, it takes one minute for activeAddrs to become the new cluster's IPs. If the old cluster reboots within that minute, the actual connections can drift between the two clusters.

I want to check: is this the expected behavior, or is it intended to prevent some other problem?

ndyakov (Collaborator) commented Feb 27, 2025

Hello @DingYuan0118, thank you for reaching out. We are officially supporting v9+ at the moment. If I have the bandwidth I will review this issue in the following weeks.
