Can't unmount and re-mount in same process #4
Comments
(cc Waleed)
I presume that shutdown_fs not setting initialized back to 0 is a bug.
There is currently no clean way to shut down kernfs from an
external process, but you should be able to add a TERM signal handler
that performs a clean shutdown. That would let you shut kernfs down
cleanly by sending SIGTERM.
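Something like the following sketch could work (the handler name, the kernfs main-loop structure, and the use of shutdown_fs() as the cleanup entry point are assumptions, not taken from the actual code):

```c
#include <signal.h>
#include <string.h>

extern void shutdown_fs(void);   /* assumed kernfs cleanup entry point */

static volatile sig_atomic_t should_exit = 0;

static void term_handler(int sig)
{
	(void)sig;
	should_exit = 1;   /* defer the real work to the main loop */
}

int main(void)
{
	struct sigaction sa;
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = term_handler;
	sigaction(SIGTERM, &sa, NULL);

	/* ... existing kernfs initialization ... */

	while (!should_exit) {
		/* ... existing kernfs event/digest loop ... */
	}

	shutdown_fs();   /* clean shutdown before the process exits */
	return 0;
}
```

Setting a flag and shutting down from the main loop (rather than calling the cleanup function directly in the handler) avoids doing non-async-signal-safe work inside a signal handler.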
-- Simon
On Wed, Apr 21, 2021 at 3:49 PM hayley-leblanc wrote:
Hi,
I am actually having this issue with Strata, but I think it is in Assise
as well. I have a program that attempts to perform the following set of
steps two times:
1. Initialize a new instance of Strata (call mkfs on two emulated PM devices, run a command to set up kernfs, run init_fs to start the libfs)
2. Create, fsync, and close a file in Strata
3. Unmount Strata by calling shutdown_fs() and killing the kernfs process
In the first iteration, everything works as expected. The second time,
when I get to step 2, the process is just killed. It appears to be killed
while Strata is trying to open the file, because none of my error handling
code after the open() call runs. Strata doesn't print any error messages.
I noticed that the LibFS only initializes the file system in init_fs() if
a variable initialized is 0. init_fs() sets this variable to 1, but
shutdown_fs() doesn't set it back to 0. Is this intentional? I added a
line in shutdown_fs() so that initialized is set to 0 when the system is
shut down, and things started working as expected.
Also - is there a way to shut down kernfs cleanly from an external
process? I see that it has a shutdown_fs() function but I don't
immediately see a way to invoke it externally, and I'd like to be able to
umount kernfs after running arbitrary workloads.
Thanks!
This is indeed a bug. Thanks for pointing it out. I'll fix this in an upcoming patch. We didn't get around to implementing mount/umount, so Simon's suggestion is sensible here. If you manage to come up with an implementation for these commands, I'd be more than happy to integrate it.
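For reference, the change being discussed is roughly a one-line addition to LibFS's teardown path. A minimal sketch, with the surrounding code abbreviated and the variable name taken from the report above:

```c
/* libfs: abbreviated sketch of the init/shutdown pair */
static int initialized = 0;

int init_fs(void)
{
	if (initialized)
		return 0;        /* already mounted in this process */

	/* ... set up shared memory, the update log, caches ... */

	initialized = 1;
	return 0;
}

void shutdown_fs(void)
{
	if (!initialized)
		return;

	/* ... flush the update log and release resources ... */

	initialized = 0;     /* allow a later init_fs() in the same process */
}
```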
Thanks! I have another question about Strata, if that's alright. I'm currently working on a tool to test PM file systems for crash consistency, and we are currently extending it to Strata. I've encountered some unexpected behavior and I'd like to see whether it's correct. I have been able to trigger it without this tool with the following steps. Here's what I'm doing:
1. Set up Strata and run a program that calls init_fs() to start libfs, then creates a file /mlfs/foo. This program does *not* call shutdown_fs().
2. Kill kernfs (without a TERM handler set up, so it doesn't run shutdown_fs() on the kernfs side either).
3. Start kernfs up again.
4. Run a program that starts libfs and attempts to stat /mlfs/foo.
By ending the first program without shutdown_fs() and by killing kernfs, I think this set of steps essentially injects a power-loss crash or similar after the creation of /mlfs/foo. In the second program, the stat call on /mlfs/foo fails and returns -2. errno is 0 after the stat call. If I try to open /mlfs/foo instead of calling stat, the same thing happens (returns -2, errno is 0 after the call) and Strata prints "incorrect fd -2: file /mlfs/foo".
Since Strata is synchronous, I would expect /mlfs/foo to be present in the second program even though libfs and kernfs don't shut down correctly. Is that correct?
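A minimal sketch of the two programs from steps 1 and 4, assuming LibFS exposes init_fs() (signature assumed here) and intercepts the usual POSIX calls under /mlfs:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

extern int init_fs(void);   /* LibFS entry point; signature assumed */

int main(int argc, char **argv)
{
	init_fs();

	if (argc > 1 && strcmp(argv[1], "create") == 0) {
		/* Step 1: create /mlfs/foo and fsync it, then exit
		 * without calling shutdown_fs() to simulate a crash. */
		int fd = open("/mlfs/foo", O_CREAT | O_RDWR, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		fsync(fd);
		close(fd);
		return 0;
	}

	/* Step 4: after killing and restarting kernfs, check whether
	 * the file survived the simulated crash. */
	struct stat st;
	int ret = stat("/mlfs/foo", &st);
	printf("stat(/mlfs/foo) = %d, errno = %d (%s)\n",
	       ret, errno, strerror(errno));
	return ret == 0 ? 0 : 1;
}
```

Run once with `create`, kill and restart kernfs, then run again without arguments; in a synchronous design the second run would be expected to find /mlfs/foo.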
Your assumption is correct. What I think should happen is that each libfs
gets linked somewhere in persistent file system state (likely the
superblock) and that, each time kernfs starts, kernfs first replays any log
contents from the set of previously open libfs update logs, as identified
by the superblock. My bet is that it's not fully implemented.
Henry (cc'ed) did experiments that should involve these or similar steps.
He might have some pointers for you as to how to get the proper behavior.
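As a rough outline of the design described above (not the actual Strata/Assise code; every name below is hypothetical), kernfs startup would do something like:

```c
#include <stdint.h>

/* Hypothetical sketch: the superblock records which per-LibFS update
 * logs were open, and each one is digested at startup before the file
 * system accepts new work. */
struct libfs_log_ref {
	int      libfs_id;
	uint64_t start_digest;   /* first undigested log entry */
	uint32_t n_digest;       /* number of undigested entries */
};

struct superblock {
	int                  n_registered_libfs;
	struct libfs_log_ref libfs_logs[16];
};

/* Digest (replay) n entries of a LibFS update log into shared state. */
void digest_log_entries(int libfs_id, uint64_t start, uint32_t n);
void persist_superblock(struct superblock *sb);

void kernfs_recover(struct superblock *sb)
{
	for (int i = 0; i < sb->n_registered_libfs; i++) {
		struct libfs_log_ref *log = &sb->libfs_logs[i];

		if (log->n_digest == 0)
			continue;   /* log was fully digested before the crash */

		digest_log_entries(log->libfs_id, log->start_digest,
				   log->n_digest);
		log->n_digest = 0;   /* mark the log clean */
	}
	persist_superblock(sb);
}
```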
Thanks for the update. We'd like to try to test the parts of the crash recovery code that have been implemented - could you point us towards what those might be? Thanks!
cc Henry and Waleed, who should be able to tell you.
Hi all - hopefully this email response will propagate through to GitHub...
The Assise artifact doesn’t include a generalized implementation of recovery/reconfiguration; we instead set up specific experiments to demonstrate the scenarios in the paper. If there is a specific scenario you’re interested in studying, I am happy to provide input on how to set it up.
Best,
Henry
Thanks! I think the scenario we're trying to set up is most similar to the OS failover experiment described in the paper; we're basically simulating power-loss crashes. I'd love some info on how you set up and ran that experiment. We are not looking at distributed file systems at the moment, so I have been working with Strata so far; does Strata have the same recovery mechanisms implemented as Assise, or should I switch to a local-only instance of Assise? Thanks again for your help!
Assise will be the better choice. Henry or Waleed will be able to help you
with setup.
I agree that in general, Assise would be the best choice here, given the range of fixes since Strata's release.
Just to be clear - the Strata release never supported replaying logs after a crash, and Assise’s support is limited to my testing for the OS failover experiment. Other workloads may be buggy. Unlike Strata, Assise’s distributed failure scenario, where another replica takes over the workload, requires throwing out any undigested logs during recovery instead of replaying the logs.
That said, here’s how I set up the single-node/OS failover experiment. SharedFS and the app/LibFS process are killed, then both restarted. When the processes restart, SharedFS digests any old log entries which weren’t digested before failure.
To measure this log recovery time and subsequently start the app workload, I added a synchronous digest request to LibFS’s init_log (log.c). This request uses the (old, crashed) log’s start_digest and n_digest from the log superblock to digest any remaining log entries from the crashed process. This takes place before init_log() clears the log superblock for the new process.
Note that to support recovery with multiple processes, this logic should be moved from init_log() to SharedFS, which would check all LibFS logs, before allowing any LibFS to finish init_fs().
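For concreteness, a sketch of the shape of that change in init_log(): the helper names and types below are placeholders rather than the actual Strata/Assise API; only start_digest and n_digest mirror the log superblock fields named above.

```c
#include <stdint.h>

/* Placeholder types/helpers; the real definitions live in log.c. */
struct log_superblock {
	uint64_t start_digest;   /* first undigested log entry */
	uint32_t n_digest;       /* number of undigested entries */
};

struct log_superblock *read_log_superblock(int dev);
void write_log_superblock(int dev, struct log_superblock *lsb);
void request_digest_sync(int dev, uint64_t start, uint32_t n);

void init_log(int dev)
{
	struct log_superblock *lsb = read_log_superblock(dev);

	/* If the previous (crashed) process left undigested entries,
	 * ask SharedFS to digest them synchronously before the log is
	 * reused; this is the recovery time measured in the experiment. */
	if (lsb->n_digest > 0)
		request_digest_sync(dev, lsb->start_digest, lsb->n_digest);

	/* Only after the old entries are digested is the log
	 * superblock cleared for the new process. */
	lsb->start_digest = 0;
	lsb->n_digest = 0;
	write_log_superblock(dev, lsb);

	/* ... existing init_log() setup continues here ... */
}
```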
Awesome, thank you! I'll try getting Assise set up and replicating that experiment. I'll reach out if I have any problems. I'll leave this issue open for now until the original bug I reported is patched. Thanks again for all your help!