scheduler database contains same hostname in multiple "active" state #3566

Open
succa opened this issue Oct 10, 2024 · 4 comments
@succa

succa commented Oct 10, 2024

Bug report:

The scheduler database contains multiple entries in the "active" state for the same hostname.

mysql> select * from scheduler where host_name="dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local" order by id;

+-----+---------------------+---------------------+--------+-------------------------------------------------------------+------+----------+-----------------+------+----------+------------------------+----------------------+
| id  | created_at          | updated_at          | is_del | host_name                                                   | idc  | location | ip              | port | state    | features               | scheduler_cluster_id |
+-----+---------------------+---------------------+--------+-------------------------------------------------------------+------+----------+-----------------+------+----------+------------------------+----------------------+
| 102 | 2024-07-11 23:16:59 | 2024-07-24 19:17:41 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.155.76.106  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 113 | 2024-07-24 19:17:53 | 2024-08-01 17:42:18 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.155.21.36   | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 141 | 2024-08-01 17:42:29 | 2024-08-13 15:07:32 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.157.82.18   | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 157 | 2024-08-13 15:07:38 | 2024-08-19 13:43:24 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.154.131.24  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 173 | 2024-08-19 13:43:31 | 2024-08-23 00:34:01 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.152.28.247  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 187 | 2024-08-23 00:34:07 | 2024-08-26 10:34:39 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.156.224.51  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 194 | 2024-08-26 10:35:01 | 2024-09-04 06:43:24 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.157.188.97  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 206 | 2024-09-04 06:43:33 | 2024-09-10 23:27:24 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.154.180.220 | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 217 | 2024-09-10 23:28:07 | 2024-09-10 23:28:07 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.152.63.74   | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 218 | 2024-09-10 23:28:26 | 2024-09-13 01:39:23 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.157.112.122 | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 228 | 2024-09-13 01:39:38 | 2024-09-21 02:37:27 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.155.125.176 | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 237 | 2024-09-21 02:37:47 | 2024-09-24 17:47:28 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.159.130.24  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 249 | 2024-09-24 17:47:59 | 2024-09-25 02:29:57 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.157.96.143  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 254 | 2024-09-25 02:30:10 | 2024-10-03 14:26:41 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.152.105.34  | 8002 | active   | ["schedule","preheat"] |                    1 |
| 264 | 2024-10-03 17:39:02 | 2024-10-03 17:39:08 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.158.20.78   | 8002 | active   | ["schedule","preheat"] |                    1 |
| 265 | 2024-10-03 18:15:14 | 2024-10-03 20:14:48 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.153.85.210  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 266 | 2024-10-03 20:15:12 | 2024-10-09 19:01:06 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.158.244.225 | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 283 | 2024-10-09 19:01:16 | 2024-10-10 09:02:42 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.155.231.191 | 8002 | active   | ["schedule","preheat"] |                    1 |
+-----+---------------------+---------------------+--------+-------------------------------------------------------------+------+----------+-----------------+------+----------+------------------------+----------------------+

Notice also the strange time gap between an old entry incorrectly left in the "active" state and the subsequent entry.
This prevents peers from using this scheduler pod, because they resolve it to a wrong IP.

Expected behavior:

There should be only one active entry per host_name at any point in time.
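As a sanity check, a query along these lines (a sketch against the table shown above, assuming state holds the literal strings 'active'/'inactive') lists any host_name that violates this:

-- Sketch: find host_names with more than one row still in the "active" state.
select host_name, count(*) as active_rows, group_concat(ip) as active_ips
from scheduler
where state = 'active' and is_del = 0
group by host_name
having count(*) > 1;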

How to reproduce it:

I am not able to reproduce it; my guess is that it happens while the scheduler database is being updated.

Environment:

  • Dragonfly version: 2.1.50
@succa succa added the bug label Oct 10, 2024
@gaius-qi
Member

@succa This will happen if the scheduler instance is force deleted. It can also occur if the manager service is unavailable when the scheduler is deleted.

@succa
Author

succa commented Oct 10, 2024

@gaius-qi Thanks for the very quick answer!
Is there a fix for it? My scheduler pods are not long-lived due to cluster node rotation.

@gaius-qi
Member

@succa It is necessary to ensure that there are active manager instances during the scheduler upgrade process.

@succa
Author

succa commented Oct 17, 2024

@gaius-qi I have 10 instances running at all times. I ended up creating a cronjob to clean up the database, but this is something you might want to consider adding directly in the code, as a safety check performed by the manager.
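For reference, such a cleanup boils down to something like the following (a sketch against the schema shown above, not necessarily what the cronjob runs; verify it before using it against a production database):

-- Sketch: demote every duplicate "active" row, keeping only the newest entry per host_name.
update scheduler s
join (
    select host_name, max(id) as max_id
    from scheduler
    where state = 'active' and is_del = 0
    group by host_name
) latest on s.host_name = latest.host_name
set s.state = 'inactive'
where s.state = 'active' and s.is_del = 0 and s.id < latest.max_id;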
