Graphs are showing incorrect value for queued jobs #66

Open
ericfranz opened this issue Nov 8, 2018 · 6 comments

@ericfranz
Contributor

The graph reports the "Eligible" count as "Queued" and, in parentheses, shows "(and 2000 blocked jobs)".

efranz@owens-login02:~$ showq -s --xml; echo ""; echo "Queued $(qselect -s Q | wc -l)"; echo "Running $(qselect -s R | wc -l)"; echo "Held: $(qselect -s HWT | wc -l)"

<Data><Object>queue</Object><cluster LocalActiveNodes="757" LocalAllocProcs="19319" LocalConfigNodes="823" LocalIdleNodes="60" LocalIdleProcs="3829" LocalUpNodes="817" LocalUpProcs="23148" RemoteActiveNodes="0" RemoteAllocProcs="0" RemoteConfigNodes="0" RemoteIdleNodes="0" RemoteIdleProcs="0" RemoteUpNodes="0" RemoteUpProcs="0" time="1541714186"/><queue count="1237" option="active"/><queue count="445" option="eligible"/><queue count="2002" option="blocked"/></Data>

Queued 2008
Running 1236
Held: 440

The corresponding graph (run right after, so the numbers are slightly off):

[Screenshot: screen 2018-11-08 at 4 57 15 pm]

The solution is to:

  1. Use the correct values in the graphs
  2. Change the text below the graph from "(and 2002 blocked jobs)" to "(and 440 Held jobs)"
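
As a rough sketch of where the Moab numbers come from (not the dashboard's actual code), the `<queue>` counts in the `showq -s --xml` output above can be pulled out with `sed`. The helper name `showq_count` is hypothetical, and the attribute order (`count` before `option`) is assumed from the sample output.

```shell
# Hypothetical helper: print the count of one <queue> option ("active",
# "eligible" or "blocked") from `showq -s --xml` output read on stdin.
# Assumes the count attribute precedes the option attribute, as in the sample.
showq_count() {
  sed -n "s/.*<queue count=\"\([0-9]*\)\" option=\"$1\".*/\1/p"
}

# Trimmed copy of the XML from this comment (the <cluster> element is omitted).
xml='<Data><Object>queue</Object><queue count="1237" option="active"/><queue count="445" option="eligible"/><queue count="2002" option="blocked"/></Data>'

printf '%s\n' "$xml" | showq_count eligible   # prints 445
printf '%s\n' "$xml" | showq_count blocked    # prints 2002
```

Comparing these against the `qselect` counts shows the mismatch directly: eligible is 445 while `qselect -s Q` reports 2008.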
@ericfranz
Contributor Author

Interestingly qstat displays this information:

efranz@owens-login02:~$ qstat -B -f | grep state_count -A 1
    state_count = Transit:0 Queued:2003 Held:438 Waiting:0 Running:1299 Exitin
	g:0 Complete:25

But not in XML format.
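
Since there is no XML, the text would have to be parsed. A minimal sketch, assuming that folded continuation lines always start with a tab (as in the output above); `state_count` is a hypothetical helper name, not part of Torque or the dashboard.

```shell
# Hypothetical helper: read `qstat -B -f` output on stdin and print the
# count for one state name (e.g. Queued, Held, Running).
# Assumption: a folded continuation line starts with a tab character.
state_count() {
  awk -v want="$1" '
    /^\t/ { sub(/^\t/, ""); buf = buf $0; next }   # join folded line
    { if (buf != "") lines[++n] = buf; buf = $0 }  # start a new logical line
    END {
      lines[++n] = buf
      for (i = 1; i <= n; i++) {
        if (lines[i] !~ /state_count = /) continue
        sub(/.*state_count = /, "", lines[i])
        m = split(lines[i], pairs, " ")
        for (j = 1; j <= m; j++) {
          split(pairs[j], kv, ":")
          if (kv[1] == want) print kv[2]
        }
      }
    }'
}

# Sample input copied from the output above, including the folded "Exiting".
sample() {
  printf '    state_count = Transit:0 Queued:2003 Held:438 Waiting:0 Running:1299 Exitin\n\tg:0 Complete:25\n'
}

sample | state_count Queued    # prints 2003
sample | state_count Exiting   # prints 0
```

Joining the folded lines first matters because a state name such as `Exiting` can be split across the line break.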

ericfranz assigned ericfranz and unassigned qianyuanzhu on Nov 8, 2018
@ericfranz
Contributor Author

I thought that f0f79a7 fixed it. However, that is not the case: it only fixed the problem on Owens (sort of), not on Pitzer. Here is the summary, followed by the details:

  1. On Pitzer moab eligible 367 + blocked 769 == 1136. qselect queued 1133 + held 3 == 1136. But there is no 1:1 correspondence between the two values
  2. On Owens, "blocked jobs" approximately equals "queued jobs"
  3. On Oakley, "blocked jobs" == "held jobs"
  4. On Ruby eligible + blocked jobs == queued jobs

On Pitzer, moab eligible + blocked == 1136. qselect queued 1133 + held 3 == 1136. But there is no 1:1 correspondence between the two values:

efranz@pitzer-login01:~$ showq -s; echo ""; echo "Queued $(qselect -s Q | wc -l)"; echo "Running $(qselect -s R | wc -l)"; echo "Held: $(qselect -s HWT | wc -l)"; echo "Exiting $(qselect -s E | wc -l)"; qstat -B -f | grep state_count -A 1

active jobs: 124  eligible jobs: 367  blocked jobs: 769

Total jobs:  1260

NOTE:  system reservation blocking all nodes

Queued 1133
Running 124
Held: 3
Exiting 0
    state_count = Transit:0 Queued:1133 Held:3 Waiting:0 Running:124 Exiting:0
	 Complete:3

On Owens, "blocked jobs" approximately equals "queued jobs":

efranz@owens-login01:~$ showq -s; echo ""; echo "Queued $(qselect -s Q | wc -l)"; echo "Running $(qselect -s R | wc -l)"; echo "Held: $(qselect -s HWT | wc -l)"; echo "Exiting $(qselect -s E | wc -l)"; qstat -B -f | grep state_count -A 1

active jobs: 1288  eligible jobs: 374  blocked jobs: 2005

Total jobs:  3667


Queued 1924
Running 1288
Held: 455
Exiting 0
    state_count = Transit:0 Queued:1924 Held:455 Waiting:0 Running:1288 Exitin
	g:0 Complete:51

On Oakley, "blocked jobs" == "held jobs":

efranz@oakley01:~$ showq -s; echo ""; echo "Queued $(qselect -s Q | wc -l)"; echo "Running $(qselect -s R | wc -l)"; echo "Held: $(qselect -s HWT | wc -l)"; echo "Exiting $(qselect -s E | wc -l)"; qstat -B -f | grep state_count -A 1

active jobs: 463  eligible jobs: 0  blocked jobs: 7

Total jobs:  470


Queued 0
Running 462
Held: 7
Exiting 0
    state_count = Transit:0 Queued:0 Held:7 Waiting:0 Running:462 Exiting:0 Co
	mplete:2

On Ruby eligible + blocked jobs == queued jobs:

efranz@ruby01:~$ showq -s; echo ""; echo "Queued $(qselect -s Q | wc -l)"; echo "Running $(qselect -s R | wc -l)"; echo "Held: $(qselect -s HWT | wc -l)"; echo "Exiting $(qselect -s E | wc -l)"; qstat -B -f | grep state_count -A 1

active jobs: 56  eligible jobs: 2  blocked jobs: 2

Total jobs:  60


Queued 4
Running 56
Held: 0
Exiting 0
    state_count = Transit:0 Queued:4 Held:0 Waiting:0 Running:56 Exiting:0 Com
	plete:3

@ericfranz
Contributor Author

The manpages don't help.

Active jobs are those that are Running or Starting and consuming resources.

Eligible Jobs are those that are queued and eligible to be scheduled.

Blocked jobs are those that are ineligible to be run or queued. Jobs
listed here could be in a number of states for the following reasons:
Idle, UserHold, SystemHold, BatchHold, Deferred, NotQueued

@ericfranz
Contributor Author

It is possible that f0f79a7 was completely wrong, and that qstat reports jobs as queued even when the scheduler is blocking them from running for whatever reason. That could explain the discrepancy. But for Owens this does seem excessive:

active jobs: 1274 eligible jobs: 355 blocked jobs: 1993

2000 blocked jobs?

@ericfranz
Contributor Author

I reverted f0f79a7 on the master branch for now. Any necessary fix can go on a separate branch.

@qianyuanzhu
Contributor

Pitzer's rule works for all clusters (Eligible jobs + Blocked jobs = Queued + Held).
From your data:
On Owens: eligible jobs 374 + blocked jobs 2005 = Queued 1924 + Held 455
On Ruby: eligible jobs 2 + blocked jobs 2 = Queued 4 + Held 0
On Oakley: eligible jobs 0 + blocked jobs 7 = Queued 0 + Held 7

Another Oakley sample:
eligible jobs 52 + blocked jobs 330 = Queued 98 + Held 284

azhu ~ $ hostname
oakley01.osc.edu
azhu ~ $ showq -s; echo ""; echo "Queued $(qselect -s Q | wc -l)"; echo "Running $(qselect -s R | wc -l)"; echo "Held: $(qselect -s HWT | wc -l)"; echo "Exiting $(qselect -s E | wc -l)"; qstat -B -f | grep state_count -A 1

active jobs: 474  eligible jobs: 52  blocked jobs: 330  
Total jobs:  856

Queued 98
Running 474
Held: 284
Exiting 0
    state_count = Transit:0 Queued:98 Held:284 Waiting:0 Running:474 Exiting:0
         Complete:4 
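
The rule can be checked against every sample posted in this thread with a few lines of shell arithmetic. This is only an illustrative sketch: `check_identity` is a hypothetical helper, and the numbers are copied from the `showq -s` and `qselect` outputs above.

```shell
# Hypothetical helper: compare Moab's eligible+blocked with Torque's
# queued+held for one sample.
# Usage: check_identity NAME ELIGIBLE BLOCKED QUEUED HELD
check_identity() {
  moab=$(( $2 + $3 ))
  torque=$(( $4 + $5 ))
  if [ "$moab" -eq "$torque" ]; then
    echo "$1: eligible+blocked=$moab queued+held=$torque OK"
  else
    echo "$1: eligible+blocked=$moab queued+held=$torque MISMATCH"
  fi
}

# Samples taken from the outputs posted in this issue.
check_identity Pitzer   367  769 1133   3
check_identity Owens    374 2005 1924 455
check_identity Oakley     0    7    0   7
check_identity Ruby       2    2    4   0
check_identity Oakley-2  52  330   98 284
```

All five samples report OK, with sums of 1136, 2379, 7, 4 and 382 respectively.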
