JobDB.update() breaks for single-home multi-queue clusters #3

tallakahath · 2016-06-30T19:27:02Z

Specific (generic) example:

A computing system has three clusters, A, B, and C. Each cluster has its own queue (i.e., jobs submitted while logged into A do NOT show up when logged into B and running qstat/squeue), BUT, all clusters share the same /home (e.g., A:/home/liz and B:/home/liz point to the same place).

If I run any pbs command (e.g. pstat) on A, /home/liz/.pbs/jobs.db is created and populated with info from A's queue. If I then construct a pbs.Job and call pbs.Job.submit, a job is entered into /home/liz/.pbs/jobs.db; let's call this job 1001. 1001 is marked as 'Q' and then 'R' as I run 'pstat' a few times and wait for the queue to clear.

Now, I exit A, and ssh into B. I run 'pstat' again, and pbs.JobDB.update is called. squeue/qstat doesn't see the job I submitted on A/in A's queue, so, as per the behavior in pbs/pbs/jobdb/update:

          # any jobs that we don't find with qstat should be marked as 'C'
          for f in sql_iter(self.curs):  
              newstatus[f["jobid"]] = "C"

The job I submitted on A, 1001, then gets marked as "C" in jobs.db. Now, even if I ssh back into A and run pstat again, I have a problem:

          # select jobs that are not yet marked complete
          self.curs.execute("SELECT jobid FROM jobs WHERE jobstatus!='C'")

Job 1001 is never checked during JobDB.update again, because pbs thinks it's complete!
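
To make the failure concrete, here is a minimal sketch that reproduces the sequence above against an in-memory SQLite database. The `update` and `status` helpers and the three-column `jobs` table are hypothetical stand-ins for illustration, not the actual pbs code; only the SELECT query and the "mark unseen jobs as 'C'" rule are taken from the excerpts quoted in this issue.

```python
import sqlite3


def update(conn, visible_jobids):
    """Simplified stand-in for JobDB.update (assumption, not the real method).

    visible_jobids plays the role of whatever qstat/squeue reports
    on the cluster we are currently logged in to.
    """
    curs = conn.cursor()
    # select jobs that are not yet marked complete (query from the issue)
    curs.execute("SELECT jobid FROM jobs WHERE jobstatus!='C'")
    for (jobid,) in curs.fetchall():
        # any job the queue does not report is assumed to be complete
        if jobid not in visible_jobids:
            conn.execute("UPDATE jobs SET jobstatus='C' WHERE jobid=?", (jobid,))
    conn.commit()


def status(conn, jobid):
    """Return the stored status of one job."""
    row = conn.execute(
        "SELECT jobstatus FROM jobs WHERE jobid=?", (jobid,)
    ).fetchone()
    return row[0]


# One shared jobs.db, as with a shared /home (in memory here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (jobid TEXT, jobstatus TEXT, hostname TEXT)")
conn.execute("INSERT INTO jobs VALUES ('1001', 'R', 'A-login1')")

update(conn, {"1001"})          # pstat on A: A's queue reports job 1001
status_on_a = status(conn, "1001")

update(conn, set())             # pstat on B: B's queue knows nothing about 1001
status_after_b = status(conn, "1001")

update(conn, {"1001"})          # pstat on A again: 1001 is no longer selected
status_back_on_a = status(conn, "1001")
```

In this sketch the job is still 'R' after the first call, flips to 'C' after the call made "on B", and stays 'C' after returning to A even though A's queue still reports it.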

I'm not sure of the best way to handle this, but maybe jobs.db needs to carry data about which cluster/queue each job was submitted to. Then only jobs native to the current cluster/queue would be updated (so, since 1001 is native to A, a 'pstat' on B that triggers a JobDB.update call will NOT check 1001). The naive solution is to change the query to

          self.curs.execute("SELECT jobid FROM jobs")

thereby checking all jobs in the jobs.db. But this will quickly become time-consuming if a user is not regularly purging ~/.pbs/jobs.db (which they shouldn't have to do!).

Thoughts?


tallakahath · 2016-06-30T21:14:00Z

OK, this is fixed by using (abusing) the 'hostname' field. My branch, at commit tallakahath/pbs@d24a465, does this by adding a few lines and changing the SQL query:

# Parse our hostname so we can only select jobs from THIS host
#   Otherwise, if we're on a multiple-clusters-same-home setup,
#   we may incorrectly update jobs from one cluster onto the other
m = re.search(r"(.*?)(?=[^a-zA-Z0-9]*login.*)", self.hostname)   #pylint: disable=invalid-name
if m:
    hostname_regex = m.group(1) + ".*"
else:
    hostname_regex = self.hostname + ".*"

# select jobs that are not yet marked complete
self.curs.execute("SELECT jobid FROM jobs WHERE jobstatus!='C' AND hostname REGEXP ?",
                  (hostname_regex, ))

So now the current hostname is grabbed and stripped (e.g. mycluster-login1 becomes mycluster, since all login nodes of mycluster should share the same queue), then matched by regexp against the stored hostname entries. As a result, if you're on cluster A, logged in to A-login1 or similar, a JobDB.update call only marks jobs as 'C' (complete) if they belong to A AND are not returned by a qstat/squeue run on A.
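
For anyone who wants to try the idea outside of pbs, here is a self-contained sketch of the hostname filter. The regular expression and the SQL query are the ones from the commit excerpt above; everything else (`cluster_pattern`, `regexp`, `active_jobs_for`, the table layout, and the sample rows) is hypothetical and only there for illustration. One assumption worth flagging: SQLite has no built-in implementation of the REGEXP operator, so the sketch registers one through Python's `sqlite3.Connection.create_function`. I chose `re.match` so the pattern is anchored at the start of the stored hostname; how the real code registers REGEXP is not shown in this thread.

```python
import re
import sqlite3


def cluster_pattern(hostname):
    """Strip the login-node suffix and build a pattern for the whole cluster."""
    # Same regex as in the commit excerpt: keep everything before "login"
    # and any non-alphanumeric separator in front of it.
    m = re.search(r"(.*?)(?=[^a-zA-Z0-9]*login.*)", hostname)
    prefix = m.group(1) if m else hostname
    return prefix + ".*"


def regexp(pattern, value):
    """Backs SQL 'value REGEXP pattern'; SQLite passes the pattern first."""
    # Assumption: anchor at the start of the hostname with re.match.
    return value is not None and re.match(pattern, value) is not None


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE jobs (jobid TEXT, jobstatus TEXT, hostname TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [
        ("1001", "R", "A-login1"),   # running job submitted on cluster A
        ("1002", "C", "A-login2"),   # already complete on cluster A
        ("2001", "R", "B-login2"),   # running job submitted on cluster B
    ],
)


def active_jobs_for(hostname):
    """Jobs that an update run on `hostname` would be allowed to touch."""
    rows = conn.execute(
        "SELECT jobid FROM jobs WHERE jobstatus!='C' AND hostname REGEXP ? "
        "ORDER BY jobid",
        (cluster_pattern(hostname),),
    )
    return [row[0] for row in rows]
```

With these sample rows, `active_jobs_for("A-login2")` selects only job 1001 and `active_jobs_for("B-login1")` selects only job 2001, so an update run on B can no longer mark A's job as complete. Two caveats of this sketch: the prefix is used as a raw regex, so hostnames containing characters such as `.` may need `re.escape`, and a short prefix like `A` would also match any other cluster whose name starts with `A`.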

tallakahath · 2016-06-30T21:14:42Z

Also, this commit can be cherry-picked and applied to stock pbs without needing all of my config/SLURM stuff.
