
Nipype support with fuse filesystems #3288

Open
sulantha2006 opened this issue Jan 7, 2021 · 6 comments · May be fixed by #3289

@sulantha2006
Contributor

Summary

Support for FUSE filesystems as the working directory (workdir).

Actual behavior

The working directory reads and writes many intermediate files, and FUSE mounts have higher-than-normal latency. This leads to workflows failing because files are not yet present. (FUSE filesystems keep a copy of the file in RAM and write it out once the handle is closed, but the time it takes for files to become visible to a new process can vary.)

Expected behavior

Nipype should have a delay-and-retry mechanism when a file read fails, possibly with exponential backoff, as sketched below.
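
A minimal sketch of what such a retry could look like (wait_for_path is a hypothetical helper, not an existing Nipype function):

import os
import time

def wait_for_path(path, max_retries=5, initial_delay=1.0):
    # Hypothetical helper: poll for a path that a FUSE mount may expose
    # late, doubling the wait between attempts (exponential backoff).
    delay = initial_delay
    for _ in range(max_retries):
        if os.path.exists(path):
            return True
        time.sleep(delay)
        delay *= 2
    return os.path.exists(path)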

Platform details:

I tested this on GCP with gcsfuse. There were random workflow failures due to files not being present.

Execution environment

GCP machine with a gcsfuse-mounted workdir. Singularity container with the required tools and Nipype installed.

@satra
Member

satra commented Jan 7, 2021

@sulantha2006 - there is an execution config option (job_finished_timeout) that you can increase; it controls how long nipype waits when checking for results files and other things.

https://miykael.github.io/nipype_tutorial/notebooks/basic_execution_configuration.html#Execution
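
For reference, a sketch of setting this option, either globally or per workflow (120 seconds is an arbitrary example value):

from nipype import config

# Wait longer for result files to appear before declaring a job failed
# (value in seconds; 120 is an arbitrary example).
config.set('execution', 'job_finished_timeout', '120')

# Or on a specific workflow object:
# wf.config['execution']['job_finished_timeout'] = 120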

@sulantha2006
Contributor Author

Sure, I will try, but does this param work when SGE/PBS or any other execution plugin is not used? I run the pipeline as workflow.run().

@satra
Member

satra commented Jan 7, 2021

any distributed plugin, if i remember correctly, but it may have been created specifically for slurm/pbs/etc. however, i would not necessarily expect an issue with fuse on a single system (since the local fuse mapping should always be ok), only across systems.

but perhaps you are noticing this on a single system and not using a batch system like slurm. are you running into this error when running the container on a given machine, simply pointing at a working directory that is over fuse? one option would be to make the work dir local and then copy it over if necessary; see the sketch below.
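
a rough sketch of that workaround (workflow name and paths are placeholders, not taken from this issue):

import shutil
from nipype import Workflow

wf = Workflow(name='main_wf')       # placeholder workflow
wf.base_dir = '/tmp/nipype_work'    # fast local scratch instead of the FUSE mount
# ... add nodes and connections here ...
wf.run()

# copy the working directory onto the gcsfuse mount afterwards, if it
# needs to be preserved (destination is a placeholder path)
shutil.copytree('/tmp/nipype_work/main_wf', '/process_workdir/main_wf')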

@sulantha2006
Contributor Author

sulantha2006 commented Jan 7, 2021 via email

@satra
Member

satra commented Jan 7, 2021

@sulantha2006 - thank you for the details. in this case it is then interesting that the failure happens - i would not have expected this given how things work.

also the job_finished_timeout will not help you here. the local job is likely using multiproc plugin, which does not have such a wait period. this would require a retry-timeout check in the _get_inputs function.

https://github.com/nipy/nipype/blob/master/nipype/pipeline/engine/nodes.py#L552

every node retrieves its inputs to evaluate the hash before computing or moving on. so this would be the place to check.
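
for illustration only (this is not the actual nipype code or the linked PR), a retry wrapper around loading an input result could look like:

import time

def retry_with_backoff(fn, retries=5, delay=1.0, exceptions=(OSError,)):
    # illustrative helper, not part of nipype: retry a callable with
    # exponential backoff when the filesystem is slow to expose a file
    for attempt in range(retries):
        try:
            return fn()
        except exceptions:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2

# inside _get_inputs, loading a results file could then be wrapped as
# something like:
#   results = retry_with_backoff(lambda: loadpkl(results_file))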

sulantha2006 added a commit to sulantha2006/nipype that referenced this issue Jan 7, 2021
…he node. Simplified implementation for all inputs (not just File type). Need to be improved for File type later on.
sulantha2006 linked a pull request Jan 7, 2021 that will close this issue
sulantha2006 added a commit to sulantha2006/nipype that referenced this issue Jan 8, 2021
@sulantha2006
Contributor Author

I am not sure the above PR solution would work. It will surely solve the issue of input files not being ready, but I am now getting other failures with the same root cause. See the exception below. Note that, as before, this is completely random.

/process_workdir/ is the gcsfuse-mounted folder.


Traceback: 
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/nipype/pipeline/plugins/linear.py", line 46, in run
    node.run(updatehash=updatehash)
  File "/usr/local/lib/python3.9/site-packages/nipype/pipeline/engine/nodes.py", line 512, in run
    write_node_report(self, is_mapnode=isinstance(self, MapNode))
  File "/usr/local/lib/python3.9/site-packages/nipype/pipeline/engine/utils.py", line 117, in write_node_report
    report_file.parent.mkdir(exist_ok=True, parents=True)
  File "/usr/local/lib/python3.9/pathlib.py", line 1312, in mkdir
    self._accessor.mkdir(self, mode)
OSError: [Errno 5] Input/output error: '/process_workdir/GCP/MK_BL_Proc/902-234_Baseline/temporary/main_wf/t1_wf/t1_extract/_report'
