automatic splitting fails on missing throughput file #8792

Open
belforte opened this issue Nov 14, 2024 · 10 comments

@belforte
Member

I found this while looking at a stuck automatic splitting task in the CI pipeline
https://cmsweb-testbed.cern.ch/crabserver/ui/task/241113_203248%3Acrabint1_crab_20241113_213248

[crabtw@vocms059 SPOOL_DIR]$ cat prejob_logs/predag.1.txt 
Wed, 13 Nov 2024 21:59:30 CET(+0100):INFO:PreDAG Pre-DAG started with output redirected to /data/srv/glidecondor/condor_local/spool/6837/0/cluster9976837.proc0.subproc0/prejob_logs/predag.1.txt
Wed, 13 Nov 2024 21:59:30 CET(+0100):INFO:PreDAG found 1 completed jobs
Wed, 13 Nov 2024 21:59:30 CET(+0100):INFO:PreDAG jobs remaining to process: 1
Wed, 13 Nov 2024 21:59:30 CET(+0100):INFO:PreDAG found 5 completed jobs
Wed, 13 Nov 2024 21:59:30 CET(+0100):INFO:PreDAG jobs remaining to process: 1
Got a fatal exception: [Errno 2] No such file or directory: 'automatic_splitting/throughputs/0-4'
Traceback (most recent call last):
  File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/srv/glidecondor/condor_local/spool/6837/0/cluster9976837.proc0.subproc0/TaskWorker/TaskManagerBootstrap.py", line 24, in <module>
    retval = bootstrap()
  File "/data/srv/glidecondor/condor_local/spool/6837/0/cluster9976837.proc0.subproc0/TaskWorker/TaskManagerBootstrap.py", line 18, in bootstrap
    return PreDAG.PreDAG().execute(*sys.argv[2:])
  File "/data/srv/glidecondor/condor_local/spool/6837/0/cluster9976837.proc0.subproc0/TaskWorker/Actions/PreDAG.py", line 135, in execute
    retval = self.executeInternal(*args)
  File "/data/srv/glidecondor/condor_local/spool/6837/0/cluster9976837.proc0.subproc0/TaskWorker/Actions/PreDAG.py", line 244, in executeInternal
    with open(fn, 'r', encoding='utf-8') as fd:
FileNotFoundError: [Errno 2] No such file or directory: 'automatic_splitting/throughputs/0-4'
[crabtw@vocms059 SPOOL_DIR]$ ls automatic_splitting/
processed  throughputs
[crabtw@vocms059 SPOOL_DIR]$ ls automatic_splitting/throughputs/
0-1  0-2  0-3  0-5
[crabtw@vocms059 SPOOL_DIR]$ 


@belforte belforte self-assigned this Nov 14, 2024
@belforte
Member Author

probe #4 failed, but that's "normal" and usually does not result in an error.
Also the main processing job was submitted, in spite of the error message. But it failed and tail jobs were not submitted.

@belforte
Member Author

looks like the problem is running a tail step w/o info on the main one. The file automatic_splitting/processed only has info from the probe jobs:

>>> import pickle
>>> f=open('automatic_splitting/processed','rb')
>>> r=pickle.load(f)
>>> r
{'0-1', '0-3', '0-4', '0-5', '0-2'}
>>> 

which may have something to do with the odd log

Wed, 13 Nov 2024 21:59:30 CET(+0100):INFO:PreDAG found 1 completed jobs
Wed, 13 Nov 2024 21:59:30 CET(+0100):INFO:PreDAG jobs remaining to process: 1
Wed, 13 Nov 2024 21:59:30 CET(+0100):INFO:PreDAG found 5 completed jobs
Wed, 13 Nov 2024 21:59:30 CET(+0100):INFO:PreDAG jobs remaining to process: 1
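
As a cross-check, a small sketch (illustration only; the real read loop lives in PreDAG.executeInternal; run from the SPOOL_DIR) shows that of the probe IDs recorded in automatic_splitting/processed, only the failed probe 0-4 has no throughput report, which is exactly the open() that blows up in the traceback above:

import os
import pickle

with open('automatic_splitting/processed', 'rb') as f:
    processed = pickle.load(f)          # {'0-1', '0-2', '0-3', '0-4', '0-5'}

for jobid in sorted(processed):
    fn = os.path.join('automatic_splitting', 'throughputs', jobid)
    print(fn, 'present' if os.path.exists(fn) else 'MISSING')   # only 0-4 is MISSING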

@belforte
Member Author

last lines of dag_bootstrap.out

Entering TaskManagerBootstrap with args: ['/data/srv/glidecondor/condor_local/spool/6837/0/cluster9976837.proc0.subproc0/TaskWorker/TaskManagerBootstrap.py', 'PREDAG', 'tail', '1', '1']
Wed, 13 Nov 2024 21:59:30 CET(+0100):DEBUG:PreDAG Acquiring PreDAG lock
Wed, 13 Nov 2024 21:59:30 CET(+0100):DEBUG:PreDAG PreDAGlock acquired

@belforte
Member Author

after some debugging I believe that the problem is in here:

def completedJobs(self, stage, processFailed=True):
    """Yield job IDs of completed (finished or failed) jobs. All
    failed jobs are saved in self.failedJobs, too.
    """
    stagere = {}
    stagere['processing'] = re.compile(r"^0-\d+$")
    stagere['tail'] = re.compile(r"^[1-9]\d*$")
    completedCount = 0
    for jobnr, jobdict in self.statusCacheInfo.items():
        state = jobdict.get('State')
        if stagere[stage].match(jobnr) and state in ('finished', 'failed'):
            if state == 'failed' and processFailed:
                self.failedJobs.append(jobnr)
            completedCount += 1
            yield jobnr
    self.logger.info("found %s completed jobs", completedCount)

Line 123 (the if state == 'failed' and processFailed: check) is wrong; it should instead be

if state == 'failed' and not processFailed: 
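
In context, the modified loop body would read as follows (a sketch; only the condition changes, the rest is exactly the snippet above):

for jobnr, jobdict in self.statusCacheInfo.items():
    state = jobdict.get('State')
    if stagere[stage].match(jobnr) and state in ('finished', 'failed'):
        if state == 'failed' and not processFailed:  # proposed change
            self.failedJobs.append(jobnr)
        completedCount += 1
        yield jobnr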

That will make the error in this issue go away. But since things are usually working, there may be something more.

@belforte
Member Author

I believe this problem only happens when all processing jobs fail (in our test there is only one such job, so the chances of this increase!) and some probe failed.
These lines tell PreDAG to use the probe jobs for the new splitting estimate, since the processing jobs failed:

if self.stage == 'tail' and len(estimates-set(self.failedJobs)) == 0:
    estimates = set(self.completedJobs(stage='processing', processFailed=False))

but for reasons obscure to me processFailed is set to False, which makes failed jobs NOT skipped (!!):

if state == 'failed' and processFailed:
    self.failedJobs.append(jobnr)

while "normally" the default processFailed=True is used
completed = set(self.completedJobs(stage=self.stage))

Surely the variable naming is confusing!
Also confusing is the fact that the completedJobs() method looks for completed jobs in the stage preceding the one indicated by its argument:

stagere['processing'] = re.compile(r"^0-\d+$")
stagere['tail'] = re.compile(r"^[1-9]\d*$")

At this point I have no idea why the processFailed argument was introduced. It seems like the original implementer wanted to say:
"if all jobs in this stage failed, go back to the previous stage to generate the splitting parameters, but also take into account the failed jobs which were skipped before."

But failed jobs have no throughput report, so they can't be used!
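
To make the scenario concrete, here is a toy walk-through (illustration only; minimal stand-ins for self.statusCacheInfo and completedJobs(), not the real PreDAG code) of the task in this issue: five probes of which 0-4 failed, and a single processing job that also failed. The failed probe ends up in the estimates without being recorded as failed, and whatever later opens automatic_splitting/throughputs/0-4 hits the FileNotFoundError above.

import re

# toy status cache: five probes (one failed) plus the single processing job (failed)
status = {'0-1': 'finished', '0-2': 'finished', '0-3': 'finished',
          '0-4': 'failed', '0-5': 'finished', '1': 'failed'}
failedJobs = []
stagere = {'processing': re.compile(r"^0-\d+$"),   # matches the probe jobs
           'tail': re.compile(r"^[1-9]\d*$")}      # matches the processing jobs

def completedJobs(stage, processFailed=True):
    for jobnr, state in status.items():
        if stagere[stage].match(jobnr) and state in ('finished', 'failed'):
            if state == 'failed' and processFailed:
                failedJobs.append(jobnr)
            yield jobnr

# tail stage: the only processing job ('1') failed, so estimates-failedJobs is empty ...
estimates = set(completedJobs(stage='tail'))                    # {'1'}
if len(estimates - set(failedJobs)) == 0:
    # ... and PreDAG falls back to the probes with processFailed=False:
    # the failed probe '0-4' is yielded but NOT recorded in failedJobs
    estimates = set(completedJobs(stage='processing', processFailed=False))

print(sorted(estimates), failedJobs)   # ['0-1', '0-2', '0-3', '0-4', '0-5'] ['1']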

@belforte
Member Author

I made that task's DAG complete successfully by rerunning PreDAG manually after changing

estimates = set(self.completedJobs(stage='processing', processFailed=False))

to

estimates = set(self.completedJobs(stage='processing', processFailed=True))

which basically forces submission of a tail job with the same configuration as the processing one (OK, since the failure was an accidental 8028).

But I am still worried that making the change in the code for everybody may trigger problems in different situations which I cannot imagine/test now.

@belforte
Member Author

belforte commented Nov 14, 2024

Maybe there are situations where processing jobs fail but still produce a report? E.g. if they hit the time limit?

# time-limited running is used by automatic splitting probe jobs
if opts.maxRuntime:
    maxSecondsUntilRampdown = f"customTypeCms.untracked.int32({opts.maxRuntime})"
    tweak.addParameter("process.maxSecondsUntilRampdown.input", maxSecondsUntilRampdown)

Or will they count as successful?

@belforte
Member Author

that processFailed argument was introduced in 90c0841.

No comments, no issue.

I am still unsure what to do.

@belforte
Member Author

belforte commented Nov 14, 2024

some (but not all) probe jobs failing and all processing jobs failing is, all in all, a very rare case.
Also, fixing this does not help the CI pipeline to complete, because the current code still sees one failed job (the processing one) and keeps trying to resubmit.
We will have to fix the pipeline first to properly deal with automatic splitting: #8794

belforte added a commit to belforte/CRABServer that referenced this issue Nov 14, 2024
@belforte
Member Author

I have prepared a PR with that fix, but I need to think more about possible side effects.
