JobDB.update() breaks for single-home multi-queue clusters #3

Open
tallakahath opened this issue Jun 30, 2016 · 2 comments

Comments

@tallakahath
Contributor

tallakahath commented Jun 30, 2016

Specific (generic) example:

A computing system has three clusters, A, B, and C. Each cluster has its own queue (i.e., jobs submitted while logged into A do NOT show up when logged into B and running qstat/squeue), but all clusters share the same /home (e.g., A:/home/liz and B:/home/liz point to the same place).

If I run any pbs command (e.g. pstat) on A, /home/liz/.pbs/jobs.db is created and populated with info from A's queue. If I then construct a pbs.Job and call pbs.Job.submit, a job is entered into /home/liz/.pbs/jobs.db; let's call this job 1001. 1001 is marked as 'Q' and then 'R' as I run 'pstat' a few times and wait for the queue to clear.

Now, I exit A, and ssh into B. I run 'pstat' again, and pbs.JobDB.update is called. squeue/qstat doesn't see the job I submitted on A/in A's queue, so, as per the behavior in pbs/pbs/jobdb/update:

          # any jobs that we don't find with qstat should be marked as 'C'
          for f in sql_iter(self.curs):  
              newstatus[f["jobid"]] = "C"

The job I submitted on A, 1001, then gets marked as "C" in jobs.db. Now, even if I ssh back into A and run pstat again, I have a problem:

          # select jobs that are not yet marked complete
          self.curs.execute("SELECT jobid FROM jobs WHERE jobstatus!='C'")

Job 1001 is never checked during JobDB.update ever again, because pbs thinks it's complete!
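The sequence above can be reproduced in isolation with a minimal sketch. It assumes a stripped-down, hypothetical jobs table (only jobid and jobstatus) and a simplified update() that takes the set of job IDs the local qstat/squeue reports; the real JobDB.update does more than this.

```python
import sqlite3


def update(curs, visible_jobids):
    """Simplified stand-in for JobDB.update (hypothetical, not the real code).

    Any job not yet marked complete that the local queue does not report
    is marked 'C', mirroring the two snippets quoted above.
    """
    curs.execute("SELECT jobid FROM jobs WHERE jobstatus!='C'")
    for (jobid,) in curs.fetchall():
        if jobid not in visible_jobids:
            curs.execute("UPDATE jobs SET jobstatus='C' WHERE jobid=?", (jobid,))


def status(curs, jobid):
    curs.execute("SELECT jobstatus FROM jobs WHERE jobid=?", (jobid,))
    return curs.fetchone()[0]


# One database file shared through /home, modelled here as an in-memory DB.
conn = sqlite3.connect(":memory:")
curs = conn.cursor()
curs.execute("CREATE TABLE jobs (jobid TEXT PRIMARY KEY, jobstatus TEXT)")
curs.execute("INSERT INTO jobs VALUES ('1001', 'R')")  # submitted on A, running

update(curs, {"1001"})              # pstat on A: A's queue reports 1001
status_on_a = status(curs, "1001")

update(curs, set())                 # pstat on B: B's queue knows nothing of 1001
status_on_b = status(curs, "1001")

update(curs, {"1001"})              # pstat on A again: 1001 is no longer selected
status_back_on_a = status(curs, "1001")

print(status_on_a, status_on_b, status_back_on_a)
```

The last call is the important one: although A's queue still reports the job, the row was already excluded by the jobstatus!='C' filter, so its status is never revisited.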

I'm not sure of the best way to handle this, but jobs.db may need to record which cluster/queue each job was submitted to. Then only jobs native to the current cluster/queue would be updated (so, since 1001 is native to A, a 'pstat' on B that triggers a JobDB.update call will NOT check 1001). The naive solution is to change the query to

          self.curs.execute("SELECT jobid FROM jobs")

thereby checking all jobs in the jobs.db. But this will quickly become time-consuming if a user is not regularly purging ~/.pbs/jobs.db (which they shouldn't have to do!).

Thoughts?

@tallakahath
Contributor Author

OK, by using/abusing the 'hostname' field this is fixed. Commit tallakahath/pbs@d24a465 on my branch fixes it by adding a few lines and changing the SQL query:

# Parse our hostname so we can only select jobs from THIS host
#   Otherwise, if we're on a multiple-clusters-same-home setup,
#   we may incorrectly update jobs from one cluster onto the other
m = re.search(r"(.*?)(?=[^a-zA-Z0-9]*login.*)", self.hostname)   #pylint: disable=invalid-name
if m:
    hostname_regex = m.group(1) + ".*"
else:
    hostname_regex = self.hostname + ".*"

# select jobs that are not yet marked complete
self.curs.execute("SELECT jobid FROM jobs WHERE jobstatus!='C' AND hostname REGEXP ?",
                             (hostname_regex, ))

So now the current hostname is grabbed and stripped (e.g. mycluster-login1 becomes mycluster, since all login nodes of mycluster should share the same queue), then matched by regexp against the existing hostname entries. As a result, if you're on cluster A, logged in to A-login1 or similar, a JobDB.update call will only mark jobs as C[omplete] if they belong to A AND are not returned by a qstat/squeue run on A.
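To see the stripping and the host-scoped query working together, here is a small self-contained sketch. The regex is the one from the commit; everything else is an assumption for illustration: the helper name cluster_pattern, the three-column table, and the sample hostnames are hypothetical. SQLite has no built-in implementation behind the REGEXP operator, so the sketch registers one with sqlite3's create_function; how the actual pbs code provides it is not shown in this thread.

```python
import re
import sqlite3


def cluster_pattern(hostname):
    """Strip a trailing '<sep>login...' part and return a prefix pattern.

    Hypothetical helper wrapping the regex quoted in the comment above.
    """
    m = re.search(r"(.*?)(?=[^a-zA-Z0-9]*login.*)", hostname)
    return (m.group(1) if m else hostname) + ".*"


def regexp(pattern, value):
    # SQLite evaluates "X REGEXP Y" as regexp(Y, X): pattern first, value second.
    return value is not None and re.match(pattern, value) is not None


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
curs = conn.cursor()
curs.execute("CREATE TABLE jobs (jobid TEXT, jobstatus TEXT, hostname TEXT)")
curs.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [
        ("1001", "R", "clusterA-login1"),  # running, submitted on A
        ("2001", "R", "clusterB-login2"),  # running, submitted on B
        ("1000", "C", "clusterA-login2"),  # already complete on A
    ],
)

# Pretend we are logged in to a different login node of cluster A.
curs.execute(
    "SELECT jobid FROM jobs WHERE jobstatus!='C' AND hostname REGEXP ? "
    "ORDER BY jobid",
    (cluster_pattern("clusterA-login2"),),
)
selected = [row[0] for row in curs.fetchall()]
print(selected)
```

Only the unfinished job recorded from cluster A is selected, so an update run on A can no longer touch B's job. Note that a hostname without "login" in it falls through to the else branch and is matched as a prefix in full.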

@tallakahath
Contributor Author

Also, this commit can be cherry-picked and applied to stock pbs without needing all of my config/SLURM stuff.
