remote-init: fix bad unlink taking down the suite #2887
I can't see any downside to this small change - might as well put in next release?
Yes.
@kinow - quick sanity check on this, then merge? (I think 0% patch coverage is OK in this case :-)
Reproduced it locally! 🚀
Funny story. I worked with an infra team for almost two years at an insurance company. The funniest problem we had was when we had to set up Redmine with Puppet, and it kept giving random errors. Long story short, we had automated the clean-up of the temporary directories on the server (requested by the IT manager), and Redmine was using Ruby + Apache with WSGI, and it kept an important file in

We spent a long time chasing that bug, just to find we had caused it in the first place 😬 The manager was not happy, but lesson learned.

Later I found other companies that had cron jobs or IT policies to delete temp files after midnight, or other users' processes doing that. So I suspected that's what was happening here.

Started the distributed Docker containers, and created the following suite

Then just did a

But it was created and destroyed too fast for me to reproduce the error. So I added some code, à la PHP style:

diff --git a/lib/cylc/task_remote_mgr.py b/lib/cylc/task_remote_mgr.py
index 3c6ce0847..d9c034fed 100644
--- a/lib/cylc/task_remote_mgr.py
+++ b/lib/cylc/task_remote_mgr.py
@@ -309,6 +309,9 @@ class TaskRemoteMgr(object):
     def _remote_init_callback(self, proc_ctx, host, owner, tmphandle):
         """Callback when "cylc remote-init" exits"""
         self.ready = True
+        LOG.debug("############ REMOTE INIT!")
+        import time
+        time.sleep(10)
         tmphandle.close()
         if proc_ctx.ret_code == 0:
             for status in (REMOTE_INIT_DONE, REMOTE_INIT_NOT_REQUIRED):

That gave me 10 seconds to look around

Removing the file, I think I reproduced the issue your user had @matthewrmshin.

We could ignore

diff --git a/lib/cylc/task_remote_mgr.py b/lib/cylc/task_remote_mgr.py
index 3c6ce0847..292914861 100644
--- a/lib/cylc/task_remote_mgr.py
+++ b/lib/cylc/task_remote_mgr.py
@@ -309,7 +309,9 @@ class TaskRemoteMgr(object):
     def _remote_init_callback(self, proc_ctx, host, owner, tmphandle):
         """Callback when "cylc remote-init" exits"""
         self.ready = True
-        tmphandle.close()
+        # Ignore if the temporary file has already been removed
+        if os.path.isfile(tmphandle.name):
+            tmphandle.close()
         if proc_ctx.ret_code == 0:
             for status in (REMOTE_INIT_DONE, REMOTE_INIT_NOT_REQUIRED):
                 if status in proc_ctx.out:

But also happy to merge your fix @matthewrmshin, WDYT?
Looks good to me @hjoliver, just going to wait for Matt's reply and then merge 👍
(and I really have no strong preference for my fix, Matt; I assume yours would also catch a file permission error, trying to remove a directory, etc.)
The temporary file is created by the method that submits the remote-init command. On completion of the command, the callback function is called. Nothing in between should have removed the temporary file... Really, this error should be caught or handled by the tempfile's close method. 😒 I'm not able to work on this until tomorrow morning, so merging my change is probably the best option for now to avoid delaying the release.
Nice investigative work @kinow, but I think we'll stick with Matt's change as it's more general (I guess we don't know if the problem is deletion or permissions or what). |
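The trade-off discussed above can be sketched outside of cylc. The merged change itself is not shown in this conversation, so the following is only an assumption about what a "more general" fix could look like: catch `OSError` around `close()` instead of checking for the file first. The helper name `close_quietly` is hypothetical, and the sketch assumes a POSIX platform where an open file can be unlinked.

```python
import os
import tempfile


def close_quietly(tmphandle):
    """Close a temporary file handle, tolerating OSError from its cleanup.

    Hypothetical helper, not taken from the cylc code base.
    Returns True if close() completed normally, False if an OSError
    (missing file, permission problem, ...) was caught and ignored.
    """
    try:
        tmphandle.close()
    except OSError:
        return False
    return True


# Simulate another process removing the file before the callback runs.
handle = tempfile.NamedTemporaryFile()
os.unlink(handle.name)
closed_cleanly = close_quietly(handle)

# A handle whose file is still present closes normally.
other = tempfile.NamedTemporaryFile()
other_closed_cleanly = close_quietly(other)
```

Compared with the `os.path.isfile` guard, this approach still closes the underlying file object when the file has gone missing, and it leaves no window between the existence check and the close in which the file could disappear.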
A user has sent us a report that on closing the temporary file, the attempted os.unlink issued by the tempfile module on the temporary file was taking down the suite with OSError: [Errno 2] No such file or directory. Occurs on a platform that we have to support. However, I am still not sure how to repeat this in general.
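One way to approximate the reported failure in isolation is to delete a named temporary file behind the handle's back and then close it. This is a minimal sketch, not the user's actual scenario: the function name `close_after_external_delete` is hypothetical, a POSIX platform is assumed, and whether `close()` raises depends on the Python version (some newer releases appear to ignore a missing file), so the sketch reports either outcome instead of assuming one.

```python
import errno
import os
import tempfile


def close_after_external_delete():
    """Return the OSError raised by close(), or None if close() succeeded."""
    handle = tempfile.NamedTemporaryFile()
    # Stand-in for a cron job or tmp cleaner removing the file.
    os.unlink(handle.name)
    try:
        handle.close()
    except OSError as exc:
        return exc
    return None


result = close_after_external_delete()
if result is None:
    print("close() tolerated the missing file")
else:
    print("close() raised: %r" % (result,))
```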