scheduler database contains same hostname in multiple "active" state #3566

Open
succa opened this issue Oct 10, 2024 · 4 comments
@succa

succa commented Oct 10, 2024

Bug report:

The scheduler database contains multiple entries in the "active" state for the same hostname.

mysql> select * from scheduler where host_name="dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local" order by id;

+-----+---------------------+---------------------+--------+-------------------------------------------------------------+------+----------+-----------------+------+----------+------------------------+----------------------+
| id  | created_at          | updated_at          | is_del | host_name                                                   | idc  | location | ip              | port | state    | features               | scheduler_cluster_id |
+-----+---------------------+---------------------+--------+-------------------------------------------------------------+------+----------+-----------------+------+----------+------------------------+----------------------+
| 102 | 2024-07-11 23:16:59 | 2024-07-24 19:17:41 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.155.76.106  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 113 | 2024-07-24 19:17:53 | 2024-08-01 17:42:18 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.155.21.36   | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 141 | 2024-08-01 17:42:29 | 2024-08-13 15:07:32 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.157.82.18   | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 157 | 2024-08-13 15:07:38 | 2024-08-19 13:43:24 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.154.131.24  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 173 | 2024-08-19 13:43:31 | 2024-08-23 00:34:01 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.152.28.247  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 187 | 2024-08-23 00:34:07 | 2024-08-26 10:34:39 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.156.224.51  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 194 | 2024-08-26 10:35:01 | 2024-09-04 06:43:24 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.157.188.97  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 206 | 2024-09-04 06:43:33 | 2024-09-10 23:27:24 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.154.180.220 | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 217 | 2024-09-10 23:28:07 | 2024-09-10 23:28:07 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.152.63.74   | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 218 | 2024-09-10 23:28:26 | 2024-09-13 01:39:23 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.157.112.122 | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 228 | 2024-09-13 01:39:38 | 2024-09-21 02:37:27 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.155.125.176 | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 237 | 2024-09-21 02:37:47 | 2024-09-24 17:47:28 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.159.130.24  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 249 | 2024-09-24 17:47:59 | 2024-09-25 02:29:57 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.157.96.143  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 254 | 2024-09-25 02:30:10 | 2024-10-03 14:26:41 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.152.105.34  | 8002 | active   | ["schedule","preheat"] |                    1 |
| 264 | 2024-10-03 17:39:02 | 2024-10-03 17:39:08 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.158.20.78   | 8002 | active   | ["schedule","preheat"] |                    1 |
| 265 | 2024-10-03 18:15:14 | 2024-10-03 20:14:48 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.153.85.210  | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 266 | 2024-10-03 20:15:12 | 2024-10-09 19:01:06 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.158.244.225 | 8002 | inactive | ["schedule","preheat"] |                    1 |
| 283 | 2024-10-09 19:01:16 | 2024-10-10 09:02:42 |      0 | dragonfly-scheduler-7.scheduler.dragonfly.svc.cluster.local |      |          | 241.155.231.191 | 8002 | active   | ["schedule","preheat"] |                    1 |
+-----+---------------------+---------------------+--------+-------------------------------------------------------------+------+----------+-----------------+------+----------+------------------------+----------------------+

Notice also the strange time gap between an old entry incorrectly left in the "active" state and the subsequent entry.
This prevents peers from using this scheduler pod, because they resolve it to a wrong IP.

Expected behavior:

There should be only one active entry per host_name at any point in time.
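As a sanity check, a query along these lines (a sketch against the table shown above, assuming state holds the literal strings 'active'/'inactive') lists any host_name that violates this:

-- Sketch: find host_names with more than one row still in the "active" state.
select host_name, count(*) as active_rows, group_concat(ip) as active_ips
from scheduler
where state = 'active' and is_del = 0
group by host_name
having count(*) > 1;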

How to reproduce it:

I am not able to reproduce it; my guess is that it happens while the scheduler database is being updated.

Environment:

  • Dragonfly version: 2.1.50
@succa succa added the bug label Oct 10, 2024
@gaius-qi
Member

@succa This will happen if the scheduler instance is force deleted. It can also occur if the manager service is unavailable when the scheduler is deleted.

@succa
Author

succa commented Oct 10, 2024

@gaius-qi Thanks for the very quick answer!
Is there a fix for it? My scheduler pods are not long-lived due to cluster node rotation.

@gaius-qi
Member

@succa It is necessary to ensure that there are active manager instances during the scheduler upgrade process.

@succa
Author

succa commented Oct 17, 2024

@gaius-qi I have 10 instances running at all times. I ended up creating a cronjob to clean up the database, but this is something you might want to consider adding directly in the code, as a safety check performed by the manager.
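For reference, such a cleanup boils down to something like the following (a sketch against the schema shown above, not necessarily what the cronjob runs; verify it before using it against a production database):

-- Sketch: demote every duplicate "active" row, keeping only the newest entry per host_name.
update scheduler s
join (
    select host_name, max(id) as max_id
    from scheduler
    where state = 'active' and is_del = 0
    group by host_name
) latest on s.host_name = latest.host_name
set s.state = 'inactive'
where s.state = 'active' and s.is_del = 0 and s.id < latest.max_id;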
