bisect: deadlock with updating ground truth in auto mode #229

Closed
mikebentley15 opened this issue Oct 10, 2018 · 1 comment · Fixed by #231
Labels
python (Involves touching python code), tests (Involves touching tests)

Comments

@mikebentley15
Collaborator

Bug Report

Describe the problem
When running flit bisect in --auto-sqlite-run mode, a deadlock can occur if one of the parallel worker processes fails while running the ground-truth update. There may be other places where this can happen too. I think the underlying cause is uncaught exceptions in these child processes: the main process never waits on them, so they become defunct processes that cannot send the parent the information it needs to make progress.

This was output to the console for me right before it deadlocked:

flit bisect --precision double "./clang++-wrap -O1 -freciprocal-math" ex05_Test
Updating ground-truth results - ground-truth.csvProcess Process-1:
Traceback (most recent call last):
  File "/uufs/chpc.utah.edu/common/home/u0415196/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/uufs/chpc.utah.edu/common/home/u0415196/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/uufs/chpc.utah.edu/common/home/u0415196/git/FLiT/scripts/flitcli/flit_bisect.py", line 1762, in auto_bisect_worker
    num, libs, srcs, syms, ret = run_bisect(row_args)
  File "/uufs/chpc.utah.edu/common/home/u0415196/git/FLiT/scripts/flitcli/flit_bisect.py", line 1536, in run_bisect
    update_gt_results(args.directory, verbose=args.verbose, jobs=args.jobs)
  File "/uufs/chpc.utah.edu/common/home/u0415196/git/FLiT/scripts/flitcli/flit_bisect.py", line 329, in update_gt_results
    ['make', '-j', str(jobs), '-C', directory, gt_resultfile], **kwargs)
  File "/uufs/chpc.utah.edu/common/home/u0415196/anaconda3/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['make', '-j', '56', '-C', '.', 'ground-truth.csv']' returned non-zero exit status 2.

And then when I killed the process with CTRL-C (i.e. a SIGINT signal), I got the following traceback:

Traceback (most recent call last):
  File "/uufs/chpc.utah.edu/common/home/u0415196/bin/flit", line 233, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/uufs/chpc.utah.edu/common/home/u0415196/bin/flit", line 172, in main
    return _main_impl(arguments)
  File "/uufs/chpc.utah.edu/common/home/u0415196/bin/flit", line 230, in _main_impl
    arguments, prog='{0} {1}'.format(sys.argv[0], subcommand))
  File "/uufs/chpc.utah.edu/common/home/u0415196/git/FLiT/scripts/flitcli/flit_bisect.py", line 1946, in main
    return parallel_auto_bisect(arguments, prog)
  File "/uufs/chpc.utah.edu/common/home/u0415196/git/FLiT/scripts/flitcli/flit_bisect.py", line 1895, in parallel_auto_bisect
    row, num, libs, srcs, syms, ret = result_queue.get()
  File "/uufs/chpc.utah.edu/common/home/u0415196/anaconda3/lib/python3.6/multiprocessing/queues.py", line 335, in get
    res = self._reader.recv_bytes()
  File "/uufs/chpc.utah.edu/common/home/u0415196/anaconda3/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/uufs/chpc.utah.edu/common/home/u0415196/anaconda3/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/uufs/chpc.utah.edu/common/home/u0415196/anaconda3/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt

The deadlock happens because no entry is put on the return queue when an uncaught exception occurs in a worker, so the parent blocks forever in result_queue.get().

Suggested Fix
Wrap the body of the worker process in a try/except block that puts an empty entry on the return queue before re-raising the uncaught exception. Re-raising keeps the exception information printed to the console, which I think is useful for future bug reports.
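A minimal sketch of that idea, not the actual FLiT code. The names auto_bisect_worker and result_queue come from the tracebacks above; arg_queue, run_bisect_stub, the None stop sentinel, and the simplified (row, result) tuple are assumptions made for illustration.

```python
import multiprocessing as mp


def run_bisect_stub(row):
    """Hypothetical stand-in for run_bisect(); fails for rows marked 'fail'."""
    if row.get("fail"):
        raise RuntimeError("ground-truth update failed")
    return row["id"] * 2


def auto_bisect_worker(arg_queue, result_queue):
    """Handle rows from arg_queue until a None sentinel arrives.

    Every row taken from arg_queue produces exactly one entry on
    result_queue, even when the bisect step raises.
    """
    while True:
        row = arg_queue.get()
        if row is None:
            return
        try:
            result = run_bisect_stub(row)
        except Exception:
            # Report an empty result first so the parent's get() returns,
            # then re-raise so the traceback still reaches the console.
            result_queue.put((row, None))
            raise
        result_queue.put((row, result))


if __name__ == "__main__":
    args, results = mp.Queue(), mp.Queue()
    args.put({"id": 1, "fail": True})
    args.put(None)
    proc = mp.Process(target=auto_bisect_worker, args=(args, results))
    proc.start()
    print(results.get())  # returns instead of blocking forever
    proc.join()
```

Note that a worker which re-raises also stops taking rows, so the parent would still need to handle any rows that no surviving worker picks up.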

It would also be good to add another except block that catches subprocess.CalledProcessError and prints the command's error output to the console and the log. Alternatively, add that functionality to update_gt_results and to every other place where check_call or check_output is called.
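One possible shape for such a helper, as a sketch only. The name run_logged and its log parameter are hypothetical and not part of FLiT. It uses subprocess.run with piped output because check_call does not capture output on its own.

```python
import subprocess
import sys


def run_logged(cmd, log=sys.stderr):
    """Run cmd; on failure write its captured output to log, then re-raise."""
    try:
        return subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into the captured output
            universal_newlines=True,   # decode the output to str
        )
    except subprocess.CalledProcessError as err:
        print("Command {} returned exit status {}".format(err.cmd, err.returncode),
              file=log)
        print(err.output, file=log)
        raise
```

A caller such as update_gt_results could then pass the make command to this helper and hand it an open log file in place of sys.stderr.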

Alternative approaches:
Use something other than a queue to communicate between parent and children.
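As one illustration of a non-queue approach, concurrent.futures hands exceptions back to the parent through each future, so a failed worker call cannot leave the parent waiting on a result that never arrives. This is a sketch under assumptions: bisect_all, work, and the executor_cls parameter are hypothetical names, and rows are assumed to be hashable.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def bisect_all(rows, work, executor_cls=ProcessPoolExecutor, jobs=2):
    """Run work(row) for each row; collect results and failures separately."""
    results, failures = {}, {}
    with executor_cls(max_workers=jobs) as pool:
        futures = {pool.submit(work, row): row for row in rows}
        for fut in as_completed(futures):
            row = futures[fut]
            try:
                results[row] = fut.result()
            except Exception as err:
                # The worker's exception is re-raised here in the parent.
                failures[row] = err
    return results, failures
```

With ProcessPoolExecutor, work must be a picklable top-level function. If a worker process dies abruptly, the pending futures raise BrokenProcessPool instead of hanging.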

mikebentley15 added the python and tests labels on Oct 10, 2018
@mikebentley15
Collaborator Author

Also, the Popen call in build_bisect is suspect, since it might fail silently and let execution continue. I say "might" because a failure there may instead surface as an exception from the subsequent check_call.
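If the Popen call does turn out to fail silently, one way to make the failure visible is to check its return code explicitly. This is a generic sketch, not the code in build_bisect; run_checked is a hypothetical helper name.

```python
import subprocess
import sys


def run_checked(cmd):
    """Start cmd with Popen, wait for it, and raise on a non-zero exit."""
    proc = subprocess.Popen(cmd)
    retcode = proc.wait()
    if retcode != 0:
        # Same exception type that check_call raises on failure.
        raise subprocess.CalledProcessError(retcode, cmd)
    return retcode
```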
