Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix: Use safe rename which works across filesystems when writing checkpoints #2225

Merged

Conversation

dantreiman
Copy link
Contributor

On some systems, the temp directory and model output directory live on different filesystems. When we write checkpoint to temp directory, then os.replace or os.rename to the destination, we get:

    134 # replace is an atomic operation in python
    135 # it is POSIX compliant according to docs
    136 # https://docs.python.org/3/library/os.html#os.replace
--> 137 os.replace(tmp_path, save_path)
    138 logging.debug(f"Saved checkpoint at {save_path}.")

OSError: [Errno 18] Invalid cross-device link: '/tmp/tmphmthsa9s/temp.ckpt' -> '/tf/app/.../model/training_checkpoints/latest.ckpt'

This PR adds safe_move_file, which attempts to os.replace(), and does an atomic shutil.copyfile if we get the cross-filesystem error (errno.EXDEV == 18).

@github-actions
Copy link

Unit Test Results

       5 files   -     1         5 suites   - 1   1h 51m 14s ⏱️ - 45m 56s
2 908 tests ±    0  2 862 ✔️ ±    0    46 💤 ±  0  0 ±0 
8 556 runs   - 168  8 437 ✔️  - 145  119 💤  - 23  0 ±0 

Results for commit 696ec6d. ± Comparison against base commit 9d7da5f.

@dantreiman dantreiman merged commit bc206f3 into ludwig-ai:master Jul 5, 2022
@dantreiman dantreiman deleted the daniel/copy_across_filesystems branch July 5, 2022 18:44
arnavgarg1 pushed a commit that referenced this pull request Aug 2, 2022
…checkpoints (#2225)

* Implement safe rename which works across filesystems.

* os.replace

Co-authored-by: Daniel Treiman <daniel@predibase.com>
arnavgarg1 pushed a commit that referenced this pull request Aug 2, 2022
…checkpoints (#2225)

* Implement safe rename which works across filesystems.

* os.replace

Co-authored-by: Daniel Treiman <daniel@predibase.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants