remote-init: fix bad unlink taking down the suite #2887

Merged

Conversation

matthewrmshin
Contributor

A user has reported that, when the temporary file is closed, the os.unlink issued by the tempfile module fails and takes down the suite with OSError: [Errno 2] No such file or directory. This occurs on a platform that we have to support. However, I am still not sure how to repeat this in general.

matthewrmshin added this to the soon milestone Nov 26, 2018
matthewrmshin self-assigned this Nov 26, 2018
Member

@hjoliver hjoliver left a comment


I can't see any downside to this small change - might as well put in next release?

@matthewrmshin
Contributor Author

Yes.

hjoliver modified the milestones: soon, next release Nov 26, 2018
hjoliver requested a review from kinow November 26, 2018 20:28
@hjoliver
Member

@kinow - quick sanity check on this, then merge?

(I think 0% patch coverage is OK in this case :-)

@kinow
Member

kinow commented Nov 26, 2018

Reproduced it locally! 🚀

However, I am still not sure how to repeat this in general.

Funny story. I worked with an infra team for almost two years in an insurance company. The funniest problem we had was when we had to set up Redmine with Puppet, and it kept giving some random errors.

Long story short, we had automated the clean-up of the temporary directories in the server (requested by IT manager), and Redmine was using Ruby + Apache with WSGI, and it kept an important file in /tmp to keep track of the process running (similar to Cylc's contact).

We spent a long time chasing that bug, just to find we caused it in the first place 😬 the manager was not happy, but lesson learned. Later found other companies that had cron jobs or IT policies to delete temp files after midnight, or other users' processes doing that.

So I suspected that's what was happening.

Started the distributed Docker containers, and created the following suite

[scheduling]
    [[dependencies]]
        graph = "a => b"

[runtime]
    [[a]]
        script = "sleep 5"
    [[b]]
        script = "sleep 10" # enough time to see the file created
        [[[remote]]]
            host = 172.26.0.3 # this is the Docker SSH slave 1
            owner = root # bad, but test only

Then just did a watch -n 1 ls -lah /tmp on the slave/master Docker containers to confirm that the temporary file was created.

screenshot_2018-11-27_09-58-03

But it was created and destroyed too fast for me to reproduce the error.

So I added a few lines of debugging code, PHP style:

diff --git a/lib/cylc/task_remote_mgr.py b/lib/cylc/task_remote_mgr.py
index 3c6ce0847..d9c034fed 100644
--- a/lib/cylc/task_remote_mgr.py
+++ b/lib/cylc/task_remote_mgr.py
@@ -309,6 +309,9 @@ class TaskRemoteMgr(object):
     def _remote_init_callback(self, proc_ctx, host, owner, tmphandle):
         """Callback when "cylc remote-init" exits"""
         self.ready = True
+        LOG.debug("############ REMOTE INIT!")
+        import time
+        time.sleep(10)
         tmphandle.close()
         if proc_ctx.ret_code == 0:
             for status in (REMOTE_INIT_DONE, REMOTE_INIT_NOT_REQUIRED):

That gave me 10 seconds to look around /tmp and remove the file (/tmp is empty in the Docker container, so quite easy to just use auto-complete and delete it).

Removing the file, I think I reproduced the issue your user had @matthewrmshin .

screenshot_2018-11-27_09-58-41
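
For anyone wanting to try this outside Cylc, here is a minimal standalone sketch of the same failure mode. It is not taken from the Cylc code base: the function name `close_after_external_delete` is made up for illustration, and it assumes a POSIX system where an open file can be unlinked. Whether `close()` actually raises seems to depend on the Python version (Python 2.7 raises; some newer Python 3 releases appear to ignore a missing file), so the sketch only reports what happened rather than assuming an error.

```python
import os
import tempfile


def close_after_external_delete():
    """Create a named temp file, delete it behind tempfile's back, then close.

    Hypothetical helper for illustration only.
    Returns (path, error), where error is the OSError raised by close(),
    or None if close() completed without raising.
    """
    handle = tempfile.NamedTemporaryFile()
    path = handle.name
    # Stand-in for a cron job or tmp cleaner removing the file (POSIX only).
    os.unlink(path)
    try:
        # tempfile now tries to unlink a path that no longer exists.
        handle.close()
    except OSError as exc:
        return path, exc
    return path, None


path, error = close_after_external_delete()
print("close() raised: %r" % (error,))
```

If an error is reported, it should be the same `[Errno 2] No such file or directory` seen in the suite log.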

We could ignore OSError, meaning we don't care whether somebody changed the file's permissions, or deleted it and created a directory in its place. Alternatively, we could check that the file is still there:

diff --git a/lib/cylc/task_remote_mgr.py b/lib/cylc/task_remote_mgr.py
index 3c6ce0847..292914861 100644
--- a/lib/cylc/task_remote_mgr.py
+++ b/lib/cylc/task_remote_mgr.py
@@ -309,7 +309,9 @@ class TaskRemoteMgr(object):
     def _remote_init_callback(self, proc_ctx, host, owner, tmphandle):
         """Callback when "cylc remote-init" exits"""
         self.ready = True
-        tmphandle.close()
+        # Ignore if the temporary file has already been removed
+        if os.path.isfile(tmphandle.name):
+            tmphandle.close()
         if proc_ctx.ret_code == 0:
             for status in (REMOTE_INIT_DONE, REMOTE_INIT_NOT_REQUIRED):
                 if status in proc_ctx.out:

But also happy to merge your fix @matthewrmshin , WDYT?

@kinow
Member

kinow commented Nov 26, 2018

Looks good to me @hjoliver , just going to wait for Matt's reply and then merge 👍

@kinow
Member

kinow commented Nov 26, 2018

(and I really have no strong preference for my fix Matt, I assume yours would catch file permission error, trying to remove a directory, etc)

@matthewrmshin
Contributor Author

The temporary file is created by the method that submits the remote-init command. On completion of the command, the callback function is called. Nothing in between should have removed the temporary file...

Really, this error should be caught or handled by the tempfile's close method.😒

I'm not able to work on this until tomorrow morning, so merging my change is probably the best option for now to avoid delaying the release.

@hjoliver
Member

Nice investigative work @kinow, but I think we'll stick with Matt's change as it's more general (I guess we don't know if the problem is deletion or permissions or what).
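
The merged diff is not shown in this thread, so the following is only a sketch of what the more general option (ignoring OSError around the close) might look like. `close_ignoring_oserror` is a hypothetical helper, not a Cylc function, and the failure case assumes a POSIX system where an open file can be unlinked.

```python
import os
import tempfile


def close_ignoring_oserror(handle):
    """Close a temp file handle, tolerating cleanup failures.

    Hypothetical helper, not the code merged in this PR.
    Returns True if close() completed cleanly, False if an OSError
    from the cleanup was ignored.
    """
    try:
        handle.close()
    except OSError:
        # Covers a missing file, changed permissions, a directory
        # created in its place, and so on.
        return False
    return True


# Normal case: the file is still there, so close() removes it.
intact = tempfile.NamedTemporaryFile()
clean_result = close_ignoring_oserror(intact)

# Failure case: simulate an external cleaner removing the file first.
gone = tempfile.NamedTemporaryFile()
os.unlink(gone.name)
missing_result = close_ignoring_oserror(gone)
```

Compared with checking `os.path.isfile` first, this avoids the window between the check and the close in which the file could still disappear.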

hjoliver added the small label Nov 26, 2018
hjoliver merged commit a113190 into cylc:master Nov 26, 2018
matthewrmshin deleted the fix-remote-init-tempfile-unlink-oserror branch January 8, 2019 11:04