Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

contact file creation: fsync parent directory #2943

Conversation

matthewrmshin
Copy link
Contributor

On file system such as an NFS mounted file system, we may have a delay
before the file becomes visible from other process running on other
host. We already have a 1st fsync to ensure that the data of the contact
file is written to disk. This change adds a 2nd fsync to ensure that the
file metadata of the contact file is written as well - by doing an fsync
on its directory.

(The issue is causing rose suite-run to fail to launch cylc gui after calling cylc run successfully to start up a suite.)

(Need to back port to 7.8.2.)

@matthewrmshin matthewrmshin added this to the next-release milestone Feb 5, 2019
@matthewrmshin matthewrmshin self-assigned this Feb 5, 2019
@matthewrmshin matthewrmshin force-pushed the fsync-dot-service-after-creating-contact-file branch from ed5a05c to 6809e2a Compare February 5, 2019 14:19
On file system such as an NFS mounted file system, we may have a delay
before the file becomes visible from other process running on other
host. We already have a 1st fsync to ensure that the data of the contact
file is written to disk. This change adds a 2nd fsync to ensure that the
file metadata of the contact file is written as well - by doing an fsync
on its directory.
@matthewrmshin matthewrmshin force-pushed the fsync-dot-service-after-creating-contact-file branch from 6809e2a to 57cb26b Compare February 5, 2019 15:57
Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unable to replicate issue, change looks sensible.

@matthewrmshin
Copy link
Contributor Author

metomi/rose#2290 is companion of this PR.

@kinow
Copy link
Member

kinow commented Feb 6, 2019

+1, also cannot replicate, but looks OK. Also spent some minutes reading about fsync, NFS, NFS' cache, why I never had to bother about it with Java (I believe calling .flush() on OutputStream has a fsync hidden somewhere in the JVM/JDK libs).

All checks passed, merging!

@kinow kinow merged commit 268270e into cylc:master Feb 6, 2019
@matthewrmshin matthewrmshin deleted the fsync-dot-service-after-creating-contact-file branch February 6, 2019 21:09
@matthewrmshin matthewrmshin mentioned this pull request Feb 8, 2019
@kinow
Copy link
Member

kinow commented Feb 10, 2019

@matthewrmshin watching the Fosdem talks from this year, and found this good one about an old issue in Postgres with fsync: PostgreSQL vs. fsync - How is it possible that PostgreSQL used fsync incorrectly for 20 years, and what we'll do about it.

Give away, they are planning to tackle this issue over 2 or 3 years 🙂 and quite interesting the issues with Kernel, memory pages, fsync, and even different behaviour with different linux file systems.

Decided to see if anybody ever had issues with fsync in Python, and looks like there is a 8 years old bug: Change os.fsync() to support physical backing store syncs, which doesn't seem to have had any work since 2013, and doesn't look to have an easy fix.

So I think that if some day some user comes with some crash in Cylc, and complains that it looks like there was an error writing to the disk, and he is using NFS, then it is possible that there was an error on flush/fsync/write/etc, but it depends on operating system/file system/kernel version/python version/NFS versions/etc. And there are situations when we just cannot explain what happened.

@hjoliver
Copy link
Member

(The issue is causing rose suite-run to fail to launch cylc gui after calling cylc run successfully to start up a suite.)

BTW I've noticed this many times in the past, but never looked into it (easy enough to manually fire up cylc gui...)

@matthewrmshin
Copy link
Contributor Author

Yes, but users panic and keep blaming the new version.

@hjoliver
Copy link
Member

Not saying it shouldn't be fixed - just highlighting my own tardiness on never following this one up!

@matthewrmshin
Copy link
Contributor Author

Watched the video at the link posted by @kinow - all 40-something minutes of it. While not the problem here, it does highlight how little understanding I have with fsync. Yes, what happens if fsync returns an error? (I think we are OK here - the suite should fail to start.) But what if fsync does not return an error when there is one? (Then we are stuffed - but the suite should die on health check.)

@kinow
Copy link
Member

kinow commented Feb 21, 2019

At least we are not alone @matthewrmshin . I am reading the Notebook documentation around security, authentication, etc. And found the note

We are aware that SQLite doesn’t work well on NFS and we are working out better ways to do this.

The "working out better ways" links to this issue: jupyter/notebook#1782

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants