Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

task remote init: fix strange tempfile issue #2976

Merged

Conversation

matthewrmshin
Copy link
Contributor

@matthewrmshin matthewrmshin commented Mar 6, 2019

Users are experiencing issues on one of the platforms we deploy Cylc
where remote submission fails intermittently with a file not found error
complaining that a temporary file cannot be found. (The file is used for
piping service files as a tar archive via STDIN (to the SSH) command.) This should not
be the case, as the temporary file handle should still be referenced,
but clearly there was an issue.

In this change, we pass the temporary file handle directly instead of
passing the name of the temporary file to the process pool. This appears
to fix the issue. (Note: Before #2590, we were unable to pass the file
handle because the context needs to be serialised for the
multiprocessing.Pool. This is no longer a requirement in our own
subprocess pool, so it is now safe to pass the handle.)

Previously flow of logic roughly looks like this:

# In cylc.task_remote_mgr.TaskRemoteMgr.remote_init:
handle = NamedTemporaryFile()
# ... write to handle, and seek to position 0
# ... pass the "tmphandle" to the subprocess pool by file name
self.proc_pool.put_command(SubProcContext(..., [tmphandle.name], ...))
# ... time passes
# ... cylc.subprocpool.SubProcPool._run_command_init opens the the file
stdin_file = open(ctx.cmd_kwargs['stdin_file_paths'][0], 'rb')  # <= FAILED HERE
# ... SubProcPool done with command
# In cylc.task_remote_mgr.TaskRemoteMgr._remote_init_callback:
tmphandle.close()  # Should delete tmphandle.name as well

The new flow of logic roughly looks like this:

# In cylc.task_remote_mgr.TaskRemoteMgr.remote_init:
handle = NamedTemporaryFile()
# ... write to handle, and seek to position 0
# ... pass the "tmphandle" to the subprocess pool by file handle
self.proc_pool.put_command(SubProcContext(..., [tmphandle], ...))
# ... time passes
# ... cylc.subprocpool.SubProcPool._run_command_init opens the the file
if hasattr(ctx.cmd_kwargs['stdin_file_paths'][0], 'read'):
    stdin_file = ctx.cmd_kwargs['stdin_file_paths'][0]
else:
    # as before
# ... SubProcPool done with command
# In cylc.task_remote_mgr.TaskRemoteMgr._remote_init_callback:
tmphandle.close()  # Should delete tmphandle.name as well

Also renamed SuiteProcPool to SubProcPool to match module name and SubProcContext.

@matthewrmshin matthewrmshin added this to the next-release milestone Mar 6, 2019
@matthewrmshin matthewrmshin self-assigned this Mar 6, 2019
@hjoliver
Copy link
Member

hjoliver commented Mar 6, 2019

Python tempfile module docs for NamedTemporaryFile(): the file is guaranteed to have a visible name in the file system

So.. it seems suprising that "file not found" would occur for access via the filename, but not via the filehandle, if the file actually exists?? (and if it doesn't exist, then the filehandle ain't going to help!)

Just wondering!

@matthewrmshin
Copy link
Contributor Author

No, I don't understand it either. The failure is very repeatable on that system without the fix, so this fix is not a coincidence. I'll polish it up slightly tomorrow, and may even add a unit test for the new logic.

@matthewrmshin matthewrmshin requested review from hjoliver and wxtim March 7, 2019 12:25
@kinow
Copy link
Member

kinow commented Mar 8, 2019

Had a look at https://bugs.python.org/ but found nothing that could explain it.

Reading our code in task_remote_mgr.py, the only possible explanation for why this fix works, while keeping only the name would be garbage collector. This part

self.proc_pool.put_command(SubProcContext(..., [tmphandle.name], ...))
# ... time passes <--------------------- kinow: ENOUGH TIME FOR GARBAGE COLLECTOR (possibly...)
# ... cylc.subprocpool.SubProcPool._run_command_init opens the the file
stdin_file = open(ctx.cmd_kwargs['stdin_file_paths'][0], 'rb')  # <= FAILED HERE

In the comment "time passes"... we have a reference to the name only, which would make the object eligible to GC. But we also pass an array with the reference to the file, so it shouldn't be collected... unless that array is not stored anywhere of if it's variable is del'd or empty()'d.

Note from their documentation for TemporaryFile: It will be destroyed as soon as it is closed (including an implicit close when the object is garbage collected). I think it would be applied to NamedTemporaryFile too.

That's my best guess :-) but this fix looks good for me 👍

@kinow
Copy link
Member

kinow commented Mar 8, 2019

(and there are differences for GC between Python versions, as well as OS's... so it could be some special behaviour... Py 3.7 I think is getting a threaded GC... so that would be... 2 GC algorithms for Python I think, with one being able to run in a thread? I think the JVM has something like 10? anywho)

@matthewrmshin
Copy link
Contributor Author

It only happens on a platform that runs Python 2.6.9 (but it does not happen on our other platforms that run Python 2.6.6 and Python 2.7.5).

Any way, I'll not investigate further and just treat it as a brain surgery - we poke at the system and hope to see better result. (In this case, we have a working system on the platform + seemingly better code.)

@hjoliver
Copy link
Member

hjoliver commented Mar 11, 2019

Garbage collection in the main process does sound like a plausible explanation here, as the temporary file is used in the subprocess??. If so, are you sure that use of filehandle instead of file name is not in danger of suffering the same problem? (I still can't see why that would make any difference!).

Python tempfile docs says the file...
will be destroyed as soon as it is closed (including an implicit close when the object is garbage collected).

Perhaps we should be using non-temporary files and managing their deletion ourselves?

(Apart from not understanding this, the code looks fine!)

@matthewrmshin
Copy link
Contributor Author

The file handle is not closed until the callback, so it should be safe.

@hjoliver
Copy link
Member

Ah, so your fix does make sense from a GC perspective? (Are references to the file object lost earlier on master?)

@matthewrmshin
Copy link
Contributor Author

No, in the master version, the file name is passed down to the command context, but the file handle should also be preserved as the callback argument. There should be no GC.

@kinow
Copy link
Member

kinow commented Mar 11, 2019

I think this one should work. Assuming it was GC, now that we are explicitly keeping an instance of the file handle, I think it should not be collected.

@hjoliver
Copy link
Member

OK, let's merge this.

@hjoliver
Copy link
Member

Sorry @matthewrmshin - this just got conflicted by the Python 3 merge! (you'll also need a 7.8.x PR).

@matthewrmshin matthewrmshin force-pushed the fix-task-remote-init-tempfile-issue branch 2 times, most recently from 973ac51 to 6c4fb4a Compare March 13, 2019 14:28
@matthewrmshin
Copy link
Contributor Author

Tests happy again.

@matthewrmshin
Copy link
Contributor Author

@wxtim please sanity check.

Users are experiencing issues on one of the platforms we deloy Cylc
where remote submission fails intermittently with a file not found error
complaining that a temporary file cannot be found. (The file is used for
piping service files via STDIN (to the SSH) command.)

In this change, we pass the temporary file handle directly instead of
passing the name of the temporary file to the process pool. This appears
to fix the issue. (Note: Before cylc#2590, we were unable to pass the file
handle because the context needs to be serialised for the
multiprocessing.Pool. This is no longer a requirement in our own
subprocess pool, so it is now safer to pass the handle.)

Add basic unit tests for running command with STDIN in subprocess pool.
Rename SuiteProcPool to SubProcPool.
@matthewrmshin matthewrmshin force-pushed the fix-task-remote-init-tempfile-issue branch from 6c4fb4a to 430b880 Compare March 15, 2019 12:45
@matthewrmshin
Copy link
Contributor Author

Squashed, rebased, broke up tests into separate methods..

Copy link
Member

@wxtim wxtim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty sane to me. 👍

@wxtim wxtim merged commit a77f223 into cylc:master Mar 15, 2019
@matthewrmshin matthewrmshin mentioned this pull request Mar 19, 2019
@matthewrmshin matthewrmshin deleted the fix-task-remote-init-tempfile-issue branch June 3, 2019 11:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants