Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Service Safepoint with infinite TTL may keep disappearing after upgrading PD #3366

Closed
lichunzhu opened this issue Jan 19, 2021 · 3 comments · Fixed by #3371
Closed

Service Safepoint with infinite TTL may keep disappearing after upgrading PD #3366

lichunzhu opened this issue Jan 19, 2021 · 3 comments · Fixed by #3371
Labels
severity/major type/bug The issue is confirmed as a bug.

Comments

@lichunzhu
Copy link

lichunzhu commented Jan 19, 2021

Bug Report

An extended problem of #3128

What did you do?

Upgrade from PD from v4.0.8 to v4.0.9.

What did you expect to see?

PD should contain service_gc_safe_points gc_worker in PD etcd key.

Expected:

./pd-ctl service-gc-safepoint
{
  "service_gc_safe_points": [
    {
      "service_id": "gc_worker",
      "expired_at": 9223372036854775807,
      "safe_point": 422298434977136640
    }
    ...
}

What did you see instead?

./pd-ctl service-gc-safepoint
{
  "service_gc_safe_points": [
    {
      "service_id": "ticdc",
      "expired_at": 1612125548,
      "safe_point": 422298434977136640
    }
  ],
  "gc_safe_point": 422141635726409728
}

service_id: "gc_worker" is still lost.

What version of PD are you using (pd-server -V)?

Release Version: v4.0.9
Edition: Community
Git Commit Hash: 328752221729cca17b618fca75d161865e4da736
Git Branch: heads/refs/tags/v4.0.9
UTC Build Time:  2020-12-21 03:55:3
@lichunzhu lichunzhu added the type/bug The issue is confirmed as a bug. label Jan 19, 2021
@lichunzhu
Copy link
Author

Root cause:

Although we have fixed #3128, when we upgrade from a lower PD version to a higher one, TiDB still cannot add gc_worker key in PD because PD won't accept a smaller safe-ttl.

if request.TTL > 0 && request.SafePoint >= min.SafePoint {

This will make UpdateGCSafePoint keep responding wrong safe point value which may cause some problems to the client who uses this interface, for example, TiCDC.

Solution:

  1. PD should check whether do PD has 'gc_worker' key in etcd key. If not, we should create one and set it to current min safePoint. It's safe because TiDB will always try to update this key when it trys to do a GC job.
  2. We should have a better overflow check instead of this:
    if request.TTL == math.MaxInt64 {

    When we set TTL to a big value instead of math.MaxInt64, this code will still form a negative ExpiredAt value. I think check now.Unix() with math.MaxInt64-request.TTL will simply fix this problem. When it's true, we can just set ExpiredAt to math.MaxInt64.

@disksing
Copy link
Contributor

Please take a look @MyonKeminta

@MyonKeminta
Copy link
Contributor

I see. I did a fix for the gc_worker in #3146 but it doesn't work for upgraded cluster where there exists other alive service safepoints. If other safepoints are deleted before gc_worker's safepoint being successfully written, it's still possible that a regressing safepoint is incorrectly accepted.
Thanks for reporting. I'll fix it later today or tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
severity/major type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants