
WIP: Check size of input files to handle nfs sync issues #62

Open · wants to merge 6 commits into master from mvogel-check-output-size

Conversation
Conversation

@mirkovogel commented on Jul 2, 2021

After a discussion with @critias, we opted for the following design, which handles sync issues both between jobs and between tasks.

  • When a task finishes, the LoggingThread writes the size and the mtime of all files below work and output to the resources file (-> Job._sis_get_file_stats).
  • When a task is set up, it calls Task._wait_for_input_to_sync. There the expected sizes are obtained by calling Job._sis_get_expected_file_sizes(job_dir, task) ...
    • for all jobs it depends upon, if it is the first task (then only the files listed as inputs are retained),
    • for the preceding task otherwise.
  • The expected file sizes are then compared to the actual sizes.

There are two new config keys, used by the wait loop sketched after this list:

  • WAIT_PERIOD_CHECK_FILE_SIZE
  • MAX_WAIT_FILE_SYNC
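
A minimal sketch of how the two keys could drive the wait loop; the function signature and helper logic below are illustrative, not the code of this PR:

import os
import time
import logging

def wait_for_input_to_sync(expected_sizes, check_period, max_wait):
    # expected_sizes maps a relative path to the size recorded by the LoggingThread.
    # check_period corresponds to WAIT_PERIOD_CHECK_FILE_SIZE,
    # max_wait corresponds to MAX_WAIT_FILE_SYNC.
    start = time.time()
    while True:
        pending = [p for p, size in expected_sizes.items()
                   if not os.path.exists(p) or os.path.getsize(p) != size]
        if not pending:
            return
        if time.time() - start > max_wait:
            logging.error("Inputs not synced after %ds: %s", max_wait, pending)
            raise TimeoutError
        logging.info("Waiting for %d input file(s) to sync", len(pending))
        time.sleep(check_period)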

@curufinwe (Collaborator)

Note that the finished file is also put in the finished.tar.gz file at job cleanup.

@mirkovogel force-pushed the mvogel-check-output-size branch from 494f778 to 0f2e67b on July 5, 2021 10:54
@critias (Contributor) left a comment

The overall approach looks good to me, let's see if it works as expected.

Review threads: sisyphus/job.py (resolved), sisyphus/task.py (outdated, resolved).
@mirkovogel (Author)

Side note: This change does not break old setups. If no size info is available for a given job / task, no checks are run.

@@ -210,6 +210,10 @@ def file_caching(path):
WAIT_PERIOD_JOB_CLEANUP = 10
#: How many seconds should all inputs be available before starting a job to avoid file system synchronization problems
WAIT_PERIOD_MTIME_OF_INPUTS = 60

Contributor:

WAIT_PERIOD_MTIME_OF_INPUTS can be removed, since the check inside of the task was removed. Setting it in toolkit.setup_script_mode should also be removed.

except AttributeError:
job_dir = job

if os.path.exists(os.path.join(job_dir, gs.JOB_FINISHED_ARCHIVE)):

@critias commented on Jul 8, 2021:

This could alternatively be checked by doing something like this:

import tarfile
from ast import literal_eval

with tarfile.open(os.path.join(job_dir, gs.JOB_FINISHED_ARCHIVE)) as tf:
    for n in tf:
        if n.name.startswith('usage.'):
            usage = literal_eval(tf.extractfile(n).read().decode())

but I'm not sure if this would be worth the effort. The given solution should work in nearly all cases if the cleanup timeout is large enough.

# If the job has been cleaned up, no size info is available, but we can safely
# assume that enough time has passed so that all files are synced.
if other_job_sizes:
    expected_sizes[i.rel_path()] = other_job_sizes[i.rel_path()]

Contributor:

This fails if a path is only used as a prefix without representing a real file. We could change it to something like this:

try:
    expected_sizes[rel_path] = other_job_sizes[rel_path]
except KeyError:
    for k, v in other_job_sizes.items():
        if k.startswith(rel_path):
            expected_sizes[k] = v

This would also cover the case where the path points to a directory.

if time.time() - start > timeout:
    logging.error("%s not synced for more than %ds, file_stats still empty.", fn, timeout)
    raise TimeoutError
logging.info("%s not synced yet, file_stats still empty.", fn)

Contributor:

This will cause problems if a path is available and should be used before the job is finished, e.g. during the training of a neural model.
We could require that such paths have a special attribute set. They could then be excluded from this check (see the sketch below).
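
A minimal sketch of such an opt-out, assuming a hypothetical marker attribute on the path object (the name available_before_finished is purely illustrative):

def paths_to_check(inputs):
    # Skip paths explicitly marked as usable while the producing job is still running,
    # e.g. checkpoints written during neural network training.
    for path in inputs:
        if getattr(path, "available_before_finished", False):
            continue
        yield path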

while True:
    with open(fn) as f:
        try:
            stats = literal_eval(f.read())["file_stats"]

Contributor:

It can happen that this raises a SyntaxError if the file is accessed while it's being written.
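
One way to handle this would be to treat a partially written file like a missing one and simply retry on the next iteration of the wait loop; a sketch under that assumption:

from ast import literal_eval

try:
    with open(fn) as f:
        stats = literal_eval(f.read())["file_stats"]
except (SyntaxError, ValueError, KeyError):
    stats = None  # file incomplete or still being written; retry on the next loop iteration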

logging.info("Inputs:\n%s", "\n".join( str(i) for i in self._job._sis_inputs))

try:
self._wait_for_input_to_sync()

Contributor:

It would be good to be able to switch off the new check with an entry in the settings file if something goes wrong. Even better: be able to switch back to the old timeout.
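
A minimal sketch of what such a switch could look like; the flag CHECK_INPUT_FILE_SIZES and the fallback helper _wait_for_mtime_of_inputs are hypothetical names, not part of this PR:

# Hypothetical settings entry:
CHECK_INPUT_FILE_SIZES = True  # set to False to fall back to the old timeout behaviour

# In the task setup:
if gs.CHECK_INPUT_FILE_SIZES:
    self._wait_for_input_to_sync()
else:
    self._wait_for_mtime_of_inputs()  # old mtime-based wait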

try:
    self._wait_for_input_to_sync()
except TimeoutError:
    self.error(task_id, True)

Contributor:

This doesn't really stop the task from running. It sets the error marker and then continues. Once the task is finished, a finished marker is set and the error marker is ignored.
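
One way to actually abort would be to return (or re-raise) right after setting the error marker; a sketch, not the code of this PR:

try:
    self._wait_for_input_to_sync()
except TimeoutError:
    self.error(task_id, True)
    return  # abort here so the task body never runs and no finished marker is written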

@mirkovogel (Author)

@critias: I wanted to get this PR merged before "vacation" (= no kindergarten) starts, which didn't happen because I spent last week in bed, not in front of a computer screen. As I won't be able to do so until 8/13, I invite you to take over this PR. :-)

@mirkovogel (Author)

@critias I got sidetracked for quite some time ... How is the current situation on the cluster? Is it running so smoothly that this PR has become obsolete? Starting next week I'd have time to implement the changes you suggested.

@critias (Contributor) commented on Sep 14, 2021

I continued working on it here: https://github.com/rwth-i6/sisyphus/tree/check-output-size
It seems to be working OK for me, but I also got sidetracked since the overall situation got better. Let's have a call to discuss how to continue from here.
