Node count > 1000 causes failure to schedule #1825

@gisjedi

Description

When a Scale instance runs against a pre-existing database containing more than 1000 nodes, nothing will schedule if every ready node falls outside the first 1000 records returned. The problem appears to be that Scale requests only the first 1000 records from the node table, so it cannot match incoming offers to the nodes tracked in memory.

One potential solution is simply to raise the maximum number of records returned; a better solution would be to page over all the active nodes.
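To make the paging idea concrete, here is a minimal sketch in plain Python. It is not Scale's actual code: `iter_all_nodes`, `fetch_page_from_table`, and `NODE_TABLE` are hypothetical names standing in for whatever query the scheduler really uses, and the page size of 1000 is assumed to match the current cap.

```python
from typing import Callable, Iterator, List

PAGE_SIZE = 1000  # assumption: matches the current cap on returned records


def iter_all_nodes(fetch_page: Callable[[int, int], List[dict]],
                   page_size: int = PAGE_SIZE) -> Iterator[dict]:
    """Yield every node by requesting successive pages until one comes back short."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            # A short (or empty) page means the table is exhausted.
            return
        offset += page_size


# Hypothetical stand-in for the node table: 2500 records ordered by id.
NODE_TABLE = [{"id": i, "hostname": f"node-{i}"} for i in range(2500)]


def fetch_page_from_table(offset: int, limit: int) -> List[dict]:
    """Hypothetical query: return up to `limit` records starting at `offset`."""
    return NODE_TABLE[offset:offset + limit]


first_page_only = fetch_page_from_table(0, PAGE_SIZE)    # what a capped query sees
all_nodes = list(iter_all_nodes(fetch_page_from_table))  # what paging sees
```

With a single capped request only the first 1000 records are visible, while the paged loop walks the whole table. Simply raising the cap would move the failure point to a larger node count rather than remove it.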

Reproduction Steps
Steps to reproduce the problem:

  1. In the node table, create at least 1000 node records whose IPs are not present in the cluster.
  2. Launch the Scheduler and confirm that offers are coming in from new nodes.
  3. Queue a couple of test jobs and observe that they are never scheduled.
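
The effect of these steps can be illustrated with a toy model. This is only a sketch of the suspected mechanism, not the scheduler's real matching logic: `schedulable`, `node_table`, and the IP addresses are made up for illustration, and it assumes the stale records sort ahead of the live nodes in the query results.

```python
QUERY_LIMIT = 1000  # the cap described in this issue

# 1000 stale records with IPs outside the cluster, followed by 3 live nodes.
stale = [f"10.99.{i // 250}.{i % 250 + 1}" for i in range(1000)]
live = ["192.168.0.1", "192.168.0.2", "192.168.0.3"]
node_table = stale + live


def schedulable(offer_hosts, known_nodes):
    """Return the offers whose host matches a node tracked in memory."""
    known = set(known_nodes)
    return [host for host in offer_hosts if host in known]


# Offers arrive only from the live nodes.
capped = schedulable(live, node_table[:QUERY_LIMIT])  # only stale nodes were loaded
full = schedulable(live, node_table)                  # every node was loaded
```

In this model `capped` is empty because none of the live nodes made it into memory, so no offer can be matched, while `full` matches all three offers.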

Labels: bug, reverify (issue needs reverification)
