Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: remove switching directory to load results #3047

Closed
wants to merge 1 commit into from

Conversation

satra
Copy link
Member

@satra satra commented Sep 24, 2019

Summary

Fixes #3014 . @stilley2 - quick change to test directory changes.

List of changes proposed in this PR (pull-request)

  • remove indirectory

Acknowledgment

  • (Mandatory) I acknowledge that this contribution will be available under the Apache 2 license.

@satra satra requested a review from oesteban September 24, 2019 15:40
@effigies
Copy link
Member

Wasn't the reason for changing directories to resolve path resolution if relative paths are saved?

@@ -707,34 +707,33 @@ def loadpkl(infile):
pkl_metadata = None

unpkl = None
with indirectory(infile.parent):
with SoftFileLock('%s.lock' % infile.name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to continue changing directories while correctly locking, what if you just switched (and locked the absolute path):

-    with indirectory(infile.parent):
-        with SoftFileLock('%s.lock' % infile.name):
+    with SoftFileLock('%s.lock' % infile):
+        with indirectory(infile.parent):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still has potential race conditions when it comes to loading the actual results file.

@satra
Copy link
Member Author

satra commented Sep 24, 2019

that's why i'm asking for @oesteban's input.

for loading, we can resolve the new path, without changing directories. i don't see the need for switching directories.

@stilley2
Copy link
Contributor

Wouldn't a cleaner solution be to change the behavior of _async_callback?

def _async_callback(self, args):
Or at least call don't call it asynchronously. Having an asynchronous chdir seems dangerous

@stilley2
Copy link
Contributor

In fact, can we just remove that chdir in _async_callback? It was added in bcc25fd, but maybe it's not needed anymore.

@satra
Copy link
Member Author

satra commented Sep 24, 2019

In fact, can we just remove that chdir in _async_callback? It was added in bcc25fd, but maybe it's not needed anymore.

i'll let @oesteban comment on this. indeed, i think we have to be careful of where we are switching directories and what impact it has on the main thread.

@stilley2 - regarding the race condition, i thought about it a little more, and i think it's unlikely we can do better.

the worker plugin basically indicates if a job has finished. this happens only after the result file is written (softlock comes into play). the issue we were getting was that the file on the filesystem is not available to another process (because of filesystem latencies), because multiproc has passed info back to the main thread that job has finished. because of filesystem latencies it can still be the case that even the lock file has not come back up to the client, but this could only be handled with a timeout delay.

@stilley2
Copy link
Contributor

Sorry I guess I shouldn't have said race condition. What I mean is that the directory can still be changed between the indirectory context manager, and the loading of the results file itself. To recap, the only way to fix that is to either not change directories in loadpkl, or remove chdir from _async_callback. My personal preference is towards the latter, as I think all directory changes in the main process should be obvious and that having async directory changes will come back to bite us again in the future. However, I don't know why it was but there in the first place, and whether removing it will break something.

@oesteban
Copy link
Contributor

for loading, we can resolve the new path, without changing directories. i don't see the need for switching directories.

Sorry for joining late to the party. Unfortunately, this is not possible because we are unpickling traits with the exists=True metadata set most of the times. If we considered a serialization method other than pickling, or workaround the set_value of FileTraits when setting state then we could do it from some other directory.

@oesteban
Copy link
Contributor

In fact, can we just remove that chdir in _async_callback? It was added in bcc25fd, but maybe it's not needed anymore.

I would not remove this chdir for the reasons stated in that commit message.

Making it possible to save/load results without changing directories while keeping the rebase/resolve feature would be preferred. But that would entail a little refactor of how the BasePath traits are pickled (probably adding metadata to store the full path, and make it modifiable while unpickling).

@stilley2
Copy link
Contributor

So I was looking again at bcc25fd, and there seem to be two main changes

  1. Initialize working by calling os.chdir(self._cwd)
  2. After the working finishes, call os.chdir(self._cwd)

Regarding 2, the comment is to ensure that the "runtime is not left at a dubious working directory". How would this be possible, as any chdir in the execution of the node/working would not be reflected in the main process, where _async_callback is called.

Regarding 1, it is unclear to me why this is required, as wouldn't the Pool start in the current directory anyways? And I can't see how the cwd would change between L141 where we have self._cwd = os.getcwd() and L160 where that value is passed to the working.

So I guess in summary I can't see how removing the chdir in _async callback (2) would change anything, but I also don't see how the working initialization (1) changes anything either, and clearly it must have if bcc25fd solved some bug, so I'm probably misunderstanding something.

I hope my stubbornness isn't out of line, as I'm an outsider on this project, but I really feel the asynchronous chdir is asking for trouble, and as someone who as started using (and really liking) nipype for their pipelines, I would feel much better at the very least really understanding this issue.

Thanks!

@oesteban
Copy link
Contributor

I hope my stubbornness isn't out of line

Not at all, thanks for your patience. I've been checking out the commit in more detail - and you are right, that chdir is asking for trouble.

@oesteban
Copy link
Contributor

Hi @satra, any news on this? Could you find a way of unpickling from another dir?

@satra
Copy link
Member Author

satra commented Sep 30, 2019

@oesteban - sorry i haven't had a chance to look at this.

Copy link
Contributor

@oesteban oesteban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to think about how we can work around the unpickling in the interface's folder.

# Could not get version info
pkl_file.seek(0)
try:
unpkl = pickle.load(pkl_file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still needs to be done in the folder where outputs required to exist are.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about the other chdir?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error checking MapNode hash with fMRIPrep
4 participants