Skip to content
This repository has been archived by the owner on Feb 4, 2021. It is now read-only.

Job failed to delete workspace #161

Closed
sloretz opened this issue Jan 18, 2019 · 21 comments
Closed

Job failed to delete workspace #161

sloretz opened this issue Jan 18, 2019 · 21 comments
Assignees
Labels
bug Something isn't working

Comments

@sloretz
Copy link

sloretz commented Jan 18, 2019

Similar to #45 (linux) and #85 (OSX) which are also errors about failing to delete a folder.

Failing nightlies on windows_portable with this error

windows_portable failed to delete the workspace at the start of a job

07:30:34 C:\J\workspace\nightly_win_deb>python -u run_ros2_batch.py !CI_ARGS! 
07:30:35 II> run_ros2_batch called with args:
07:30:35 II>   - packaging=False
07:30:35 II>   - repo_file_url=https://raw.githubusercontent.com/ros2/ros2/master/ros2.repos
07:30:35 II>   - supplemental_repo_file_url=None
07:30:35 II>   - test_branch=None
07:30:35 II>   - colcon_branch=None
07:30:35 II>   - white_space_in=None
07:30:35 II>   - do_venv=False
07:30:35 II>   - os=None
07:30:35 II>   - ignore_rmw=['rmw_fastrtps_dynamic_cpp', 'rmw_opensplice_cpp']
07:30:35 II>   - connext_debs=False
07:30:35 II>   - isolated=False
07:30:35 II>   - force_ansi_color=True
07:30:35 II>   - ros1_path=None
07:30:35 II>   - test_bridge=False
07:30:35 II>   - cmake_build_type=Debug
07:30:35 II>   - build_args=['--event-handlers', 'console_cohesion+', 'console_package_list+', '--cmake-args', '-DINSTALL_EXAMPLES=OFF', '-DSECURITY=ON']
07:30:35 II>   - test_args=['--event-handlers', 'console_direct+', '--executor', 'sequential', '--retest-until-pass', '10']
07:30:35 II>   - src_mounted=False
07:30:35 II>   - coverage=False
07:30:35 II>   - workspace_path=C:\J\workspace\nightly_win_deb
07:30:35 II>   - python_interpreter=C:\Python37\python_d.exe
07:30:35 II> Using workspace: ws
07:30:35 WW> Deleting the folder at 'ws'.
07:30:35 Traceback (most recent call last):
07:30:35   File "C:\Python37\lib\shutil.py", line 395, in _rmtree_unsafe
07:30:35     os.unlink(fullname)
07:30:35 PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'ws\\log\\build_2019-01-15_23-32-01\\logger_all.log'
07:30:35 
07:30:35 During handling of the above exception, another exception occurred:
07:30:35 
07:30:35 Traceback (most recent call last):
07:30:35   File "run_ros2_batch.py", line 32, in <module>
07:30:35     sys.exit(main())
07:30:35   File "C:\J\workspace\nightly_win_deb\ros2_batch_job\__main__.py", line 132, in main
07:30:35     return run(args, build_function, blacklisted_package_names=blacklisted_package_names)
07:30:35   File "C:\J\workspace\nightly_win_deb\ros2_batch_job\__main__.py", line 404, in run
07:30:35     remove_folder(args.workspace)
07:30:35   File "C:\J\workspace\nightly_win_deb\ros2_batch_job\util.py", line 114, in remove_folder
07:30:35     shutil.rmtree(path, onerror=del_rw)
07:30:35   File "C:\Python37\lib\shutil.py", line 513, in rmtree
07:30:35     return _rmtree_unsafe(path, onerror)
07:30:35   File "C:\Python37\lib\shutil.py", line 392, in _rmtree_unsafe
07:30:35     _rmtree_unsafe(fullname, onerror)
07:30:35   File "C:\Python37\lib\shutil.py", line 392, in _rmtree_unsafe
07:30:35     _rmtree_unsafe(fullname, onerror)
07:30:35   File "C:\Python37\lib\shutil.py", line 397, in _rmtree_unsafe
07:30:35     onerror(os.unlink, fullname, sys.exc_info())
07:30:35   File "C:\J\workspace\nightly_win_deb\ros2_batch_job\util.py", line 113, in del_rw
07:30:35     os.remove(name)
07:30:35 PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'ws\\log\\build_2019-01-15_23-32-01\\logger_all.log'
@clalancette
Copy link

I saw another instance of this problem here: https://ci.ros2.org/job/ci_windows/5940/console

@dirk-thomas
Copy link
Member

I saw another instance of this problem here:

Please post the event log of the build which wasn't cleaned up somewhere in order to debug the cause - either here or on colcon/colcon-core#157.

@clalancette

This comment has been minimized.

@dirk-thomas

This comment has been minimized.

@dirk-thomas
Copy link
Member

Once the necessary debug informtation has been obtained revert the hack from ros2/ci#242 by merging ros2/ci#243.

@sloretz
Copy link
Author

sloretz commented Feb 14, 2019

I think I see the same issue on portable. Slightly different text on the job that failed, but there is a very long running colcon process using no resources causing later jobs to fail.

nightly_win_dep 1158 failed with

"Access is denied" when uninstalling colcon
00:21:58 ==> "C:\Python37\python_d.exe" -m pip uninstall -y colcon-core colcon-defaults colcon-library-path colcon-metadata colcon-output colcon-package-information colcon-package-selection colcon-parallel-executor colcon-powershell colcon-python-setup-py colcon-recursive-crawl colcon-test-result colcon-cmake colcon-ros
00:21:59 C:\Python37\lib\site-packages\pip\_vendor\html5lib\_trie\_base.py:3: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
00:21:59   from collections import Mapping
00:21:59 Uninstalling colcon-core-0.3.17:
00:21:59   Successfully uninstalled colcon-core-0.3.17
00:22:02 Exception:
00:22:02 Traceback (most recent call last):
00:22:02   File "C:\Python37\lib\shutil.py", line 395, in _rmtree_unsafe
00:22:02     os.unlink(fullname)
00:22:02 PermissionError: [WinError 5] Access is denied: 'C:\\WINDOWS\\TEMP\\pip-uninstall-6na5dfs1\\colcon.exe'
00:22:02 
00:22:02 During handling of the above exception, another exception occurred:
00:22:02 
00:22:02 Traceback (most recent call last):
00:22:02   File "C:\Python37\lib\site-packages\pip\_internal\cli\base_command.py", line 179, in main
00:22:02     status = self.run(options, args)
00:22:02   File "C:\Python37\lib\site-packages\pip\_internal\commands\uninstall.py", line 78, in run
00:22:02     uninstall_pathset.commit()
00:22:02   File "C:\Python37\lib\site-packages\pip\_internal\req\req_uninstall.py", line 412, in commit
00:22:02     self._moved_paths.commit()
00:22:02   File "C:\Python37\lib\site-packages\pip\_internal\req\req_uninstall.py", line 263, in commit
00:22:02     save_dir.cleanup()
00:22:02   File "C:\Python37\lib\site-packages\pip\_internal\utils\temp_dir.py", line 83, in cleanup
00:22:02     rmtree(self.path)
00:22:02   File "C:\Python37\lib\site-packages\pip\_vendor\retrying.py", line 49, in wrapped_f
00:22:02     return Retrying(*dargs, **dkw).call(f, *args, **kw)
00:22:02   File "C:\Python37\lib\site-packages\pip\_vendor\retrying.py", line 212, in call
00:22:02     raise attempt.get()
00:22:02   File "C:\Python37\lib\site-packages\pip\_vendor\retrying.py", line 247, in get
00:22:02     six.reraise(self.value[0], self.value[1], self.value[2])
00:22:02   File "C:\Python37\lib\site-packages\pip\_vendor\six.py", line 693, in reraise
00:22:02     raise value
00:22:02   File "C:\Python37\lib\site-packages\pip\_vendor\retrying.py", line 200, in call
00:22:02     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
00:22:02   File "C:\Python37\lib\site-packages\pip\_internal\utils\misc.py", line 121, in rmtree
00:22:02     onerror=rmtree_errorhandler)
00:22:02   File "C:\Python37\lib\shutil.py", line 513, in rmtree
00:22:02     return _rmtree_unsafe(path, onerror)
00:22:02   File "C:\Python37\lib\shutil.py", line 397, in _rmtree_unsafe
00:22:02     onerror(os.unlink, fullname, sys.exc_info())
00:22:02   File "C:\Python37\lib\site-packages\pip\_internal\utils\misc.py", line 133, in rmtree_errorhandler
00:22:02     func(path)
00:22:02 PermissionError: [WinError 5] Access is denied: 'C:\\WINDOWS\\TEMP\\pip-uninstall-6na5dfs1\\colcon.exe'
00:22:02 <== '"C:\Python37\python_d.exe" -m pip uninstall -y colcon-core colcon-defaults colcon-library-path colcon-metadata colcon-output colcon-package-information colcon-package-selection colcon-parallel-executor colcon-powershell colcon-python-setup-py colcon-recursive-crawl colcon-test-result colcon-cmake colcon-ros' exited with return code '2'

I see a colcon process that has been up for a long time, not using much CPU, and has spawned a child cmd.exe process in working directory
C:\J\workspace\ci_windows\ws\build\test_rclcpp\

with the command

C:\WINDOWS\system32\cmd.exe /c ""C:\J\workspace\ci_windows\ws\build\test_rclcpp\colcon_command_prefix_test.bat" && set"

I think this process came from ci_windows 6041 which failed due to a java traceback earlier.

Jenkins log for ci_windows 6041: consoleText_ci_windows_6041.txt
Colcon event log for ci_windows 6041: ci_windows_events.zip

@sloretz
Copy link
Author

sloretz commented Feb 14, 2019

The last thing I see in the colcon event log is starting the job test_rclcpp, followed by TimerEvent for the rest of time. @dirk-thomas it looks like the event loop in colcon isn't stuck, but one of the commands it is trying to run is. Any ideas on what other info to collect?

[2980.076807] (test_cli_remapping) CommandEnded: {'returncode': 3221225786}
[2980.084785] (test_cli_remapping) JobEnded: {'identifier': 'test_cli_remapping', 'rc': 3221225786}
[2980.087778] (test_rclcpp) JobStarted: {'identifier': 'test_rclcpp'}
[2980.131525] (-) TimerEvent: {}
[2980.241547] (-) TimerEvent: {}
[2980.352250] (-) TimerEvent: {}
...
[53451.068315] (-) TimerEvent: {}
[53451.178410] (-) TimerEvent: {}
[53451.290470] (-) TimerEvent: {}

@sloretz
Copy link
Author

sloretz commented Feb 14, 2019

There is another colcon process on portable from nightly_win_rel 1096. The stuck process spawned by colcon is in working directory

C:\J\workspace\nightly_win_rel\ws\build\qt_gui_core\

with command

C:\WINDOWS\system32\cmd.exe /c ""C:\J\workspace\nightly_win_rel\ws\build\qt_gui_core\colcon_command_prefix_build.bat" && set"

However the event's log looks very different. I don't see TimerEvent spam, just a note about receiving SIGINT.

nightly_win_rel_1096_events.log

[883.493055] (rmw_connext_shared_cpp) StdoutLine: {'line': b'Done Building Project "C:\\J\\workspace\\nightly_win_rel\\ws\\build\\rmw_connext_shared_cpp\\rmw_connext_shared_cpp.vcxproj" (default targets) -- FAILED.\r\n'}
[883.493055] (rmw_connext_shared_cpp) StdoutLine: {'line': b'\r\n'}
[883.493055] (rmw_connext_shared_cpp) StdoutLine: {'line': b'Build FAILED.\r\n'}
[883.508676] (rmw_connext_shared_cpp) StdoutLine: {'line': b'    0 Warning(s)\r\n'}
[883.508676] (rmw_connext_shared_cpp) StdoutLine: {'line': b'    0 Error(s)\r\n'}
[883.508676] (rmw_connext_shared_cpp) StdoutLine: {'line': b'\r\n'}
[883.524297] (rmw_connext_shared_cpp) StdoutLine: {'line': b'Time Elapsed 00:00:07.86\r\n'}
[883.524297] (rmw_connext_shared_cpp) JobEnded: {'identifier': 'rmw_connext_shared_cpp', 'rc': 'SIGINT'}

Maybe nightly_win_rel is an instance of colcon/colcon-core#58 while the ci_windows failure is something else? They have similar symptoms: both start with a job failing with a java traceback, and result in a cmd.exe process spawned by colcon not terminating.

@sloretz
Copy link
Author

sloretz commented Feb 14, 2019

Icecube also has 4 stuck colcon processes from different jobs. These are interesting because the most child process is python.exe for 3 of them and cmd.exe for one.

Stuck process from packaging_windows 1278

  • Jenkins job https://ci.ros2.org/job/packaging_windows/1278/
  • Stuck process
    • Working directory
      C:\J\workspace\packaging_windows\ws\
      
    • command
      "c:\python37\python.exe"  "C:\Python37\Scripts\colcon.exe" build --base-paths "src" --build-base "build" --install-base "install" --merge-install --event-handlers console_cohesion+ console_package_list+ --cmake-args -DBUILD_TESTING=OFF --no-warn-unused-cli -DCMAKE_BUILD_TYPE=RelWithDebInfo -DINSTALL_EXAMPLES=OFF -DSECURITY=ON
      
  • Colcon event log

Stuck process from nightly_win_deb 1157

  • Jenkins job https://ci.ros2.org/job/nightly_win_deb/1157/
  • Stuck process
    • Working directory
      C:\J\workspace\nightly_win_deb\ws\
      
    • command
      "c:\python37\python_d.exe"  "C:\Python37\Scripts\colcon.exe" build --base-paths "src" --build-base "build" --install-base "install" --merge-install --event-handlers console_cohesion+ console_package_list+ --cmake-args -DBUILD_TESTING=ON --no-warn-unused-cli -DCMAKE_BUILD_TYPE=Debug -DINSTALL_EXAMPLES=OFF -DSECURITY=ON
      
  • Colcon event log

Stuck process from ci_windows 6046

  • Jenkins job https://ci.ros2.org/job/ci_windows/6046/
  • Stuck process
    • Working directory
      C:\J\workspace\ci_windows\ws\
      
    • command
      "c:\python37\python.exe"  "C:\Python37\Scripts\colcon.exe" test --base-paths "src" --build-base "build" --install-base "install" --merge-install --event-handlers console_direct+ --executor sequential --retest-until-pass 10
      
  • Colcon event log

Stuck process from ci_windows 6051

  • Jenkins job https://ci.ros2.org/job/ci_windows/6051/
  • Stuck process
    • Working directory
      C:\J\workspace\ci_windows\ws\build\test_rclcpp\
      
    • command
      C:\WINDOWS\system32\cmd.exe /c ""C:\J\workspace\ci_windows\ws\build\test_rclcpp\colcon_command_prefix_test.bat" && set"
      
  • Colcon event log

@dirk-thomas
Copy link
Member

Just for the record the order of failed jobs and their summary:

The pattern is always the same: one job fails because Jenkins looses the connection to the node and as a consequence doesn't terminate the colcon process (and/or one of its subprocessed). The next job (running either long time later as in the first case or shortly afterwards as in the other three cases) fails due to the open file handle on the colcon executable when trying to uninstall colcon.

  • Variation A: The colcon event log of the first job continues to be filled with [<timestamp>] (-) TimerEvent: {}. In this case a process running colcon is still around and waiting for pending jobs to finish. The assumption is that these jobs have ended already (since sometimes there are hours between the two builds) but that wasn't signaled correctly.
  • Variation B: The colcon event log of the first job ends in the middle while some jobs are still pending. In this case one of the subprocessed is still around while the main colcon invocation has been terminated.

For a more detailed analysis about the internal state of colcon we need the logger_all.log from the first build (which causes the still running process). (While those would provide even more information I am not sure they will lead us to a fix.)

While this problem could be Windows specific we would likely not notice if the same happens on other platforms since their second build wouldn't fail since colcon can be uninstalled even though a process of it is still running. So it could also be a general problem with the termination of processed on the node in case the connection to the master is broken.

@dirk-thomas dirk-thomas added the bug Something isn't working label Feb 14, 2019
@dirk-thomas
Copy link
Member

Thinking about this sounds like a problem in Jenkins something we have dealt with in the past and working around in ros_buildfarm with the subprocess_reaper.py script...

@wjwwood
Copy link
Member

wjwwood commented Jun 13, 2019

Two more instances of this today with both portable and icecube:

@wjwwood
Copy link
Member

wjwwood commented Jul 19, 2019

Another instance on icecube today: https://ci.ros2.org/job/ci_windows/7463/

@sloretz
Copy link
Author

sloretz commented Jul 26, 2019

Another instance on windshield today https://ci.ros2.org/job/ci_windows/7520/console

@clalancette
Copy link

All 3 of the Windows hosts (icecube, windshield, and portable) had this problem today. Just killing off colcon.exe wasn't enough; there still seemed to be rogue Python processes and "Windows Command Processors" running. I ended up rebooting them all.

@wjwwood
Copy link
Member

wjwwood commented Jul 29, 2019

Thanks @clalancette for getting the farm turning again (we needed that).

However, if anyone else sees this in the future, please leave one aside so I can try out a script that I might add to CI so it can resolve this itself in the future.

@sloretz
Copy link
Author

sloretz commented Aug 8, 2019

Another instance of a failure to delete workspace on icecube https://ci.ros2.org/job/ci_windows/7683/console

There were two colcon processes and one launch_test process running. One colon process was for nightly_win_rep, and the other for ci_windows.

@nuclearsandwich
Copy link
Member

This issue has been recurring frequently. For recent occurrences I've been trying to use ProcessExplorer to find out what extant processes are still holding filehandles open. I wasn't able to find one this morning but while cleaning up after aborting a hung job on windshield, I found a ros2 daemon process still running and holding on to dynamic libraries.

Process info:

windows-daemon-process

Open filehandles matching the Jenkins workspace path
windows-filehandles

I don't know if the daemon process was created during the aborted CI run. This might be a one-off or it might be part of why we're still having issues even after ros2/ci#323.

I'll keep looking into new instances of this.

@nuclearsandwich
Copy link
Member

I've been reading up on process control mechanisms available on Windows that might help us combat this. If we can wrap our process tree in a Job Object closing all handles to the job should kill all processes that are part of that job and their descendant processes. I tried to get an experiment together but today my Windows VM is causing my whole machine to lock up so I haven't been able to explore further.

@nuclearsandwich
Copy link
Member

ros2/ci#374 brings in the subprocess_reaper.py script as recommended by @dirk-thomas

@clalancette
Copy link

We aren't using this repository anymore for buildfarm issues, so I'm going to archive it. Thus I'm closing out this issue. If you continue to have problems, please report another bug against https://github.com/ros2/ros2. Thank you.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants