Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

status: Display previous failure of ostree-finalize-staged #1567

Closed
cgwalters opened this issue Sep 20, 2018 · 16 comments
Closed

status: Display previous failure of ostree-finalize-staged #1567

cgwalters opened this issue Sep 20, 2018 · 16 comments
Assignees
Labels
jira for syncing to jira kind/enhancement

Comments

@cgwalters
Copy link
Member

We should run the equivalent of journalctl -b -1 -u ostree-finalize-staged and note very prominently in status if it failed.

A Silverblue user has a slow hard drive and had systemd kill the service.

@jlebon
Copy link
Member

jlebon commented Sep 20, 2018

A Silverblue user has a slow hard drive and had systemd kill the service.

Hmm, is this a legitimate case for still allowing a StageDeployments=no setting? Regardless, we should strive for making staging work for all environments (e.g. bump the timeout?). But some users might still prefer paying the finalization cost upfront rather than at shutdown if it's an expensive operation on their systems.

@dustymabe
Copy link
Member

I'm hitting this today on the latest stable release of F28AH: 28.20180923.0 with rpm-ostree-2018.8-1.fc28.x86_64. I run this on a Digital Ocean node (DO uses SSD storage). I'm running openshift on top and it takes a while for these nodes to reboot because of that.

I first ran rpm-ostree install oci-kvm-hook. Before reboot I see:

[root@origin-master-1 ~]# rpm-ostree status                                                                                                                                                                                                      
State: idle                                                                                                                                                                                                
AutomaticUpdates: disabled                                                                                                                                                                                                                  
Deployments:                                                                                                                                                                                                                                
  ostree://fedora-atomic:fedora/28/x86_64/atomic-host                                                                                                                                                                                       
                   Version: 28.20180923.0 (2018-09-23 18:54:39)                                                                                                                                                                             
                BaseCommit: 8b82902a0ca203645e9e288b8a0bbf9fe525749ec67cd674e6715daa64e5b7dd                                                                                                                                                
              GPGSignature: Valid signature by 128CF232A9371991C8A65695E08E7E629DB62FB1                                                                                                                                                          
           LayeredPackages: oci-kvm-hook                                                                                                                                                                                                        
                                                                                                                                                                                                                                            
● ostree://fedora-atomic:fedora/28/x86_64/atomic-host                                                                                                                                                                                       
                   Version: 28.20180923.0 (2018-09-23 18:54:39)                                                                                                                                                                             
                    Commit: 8b82902a0ca203645e9e288b8a0bbf9fe525749ec67cd674e6715daa64e5b7dd                                                                                                                                                         
              GPGSignature: Valid signature by 128CF232A9371991C8A65695E08E7E629DB62FB1                                                                                                                                                     
                                                                                                                                                                                                                                                    
  ostree://fedora-atomic:fedora/28/x86_64/atomic-host                                                                                                                                                                                                                                                                        
                   Version: 28.20180902.0 (2018-09-02 04:10:33)                                                                                                                                                                                                                                                               
                    Commit: 24d4ceb612af5abfa84f275ca9d8aff862e69d47aaeb177c0d64f30865fd7a9f                                                                                                                                                
              GPGSignature: Valid signature by 128CF232A9371991C8A65695E08E7E629DB62FB1

After reboot the new deployment is lost:

[root@origin-master-1 ~]# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● ostree://fedora-atomic:fedora/28/x86_64/atomic-host
                   Version: 28.20180923.0 (2018-09-23 18:54:39)
                    Commit: 8b82902a0ca203645e9e288b8a0bbf9fe525749ec67cd674e6715daa64e5b7dd
              GPGSignature: Valid signature by 128CF232A9371991C8A65695E08E7E629DB62FB1

  ostree://fedora-atomic:fedora/28/x86_64/atomic-host
                   Version: 28.20180902.0 (2018-09-02 04:10:33)
                    Commit: 24d4ceb612af5abfa84f275ca9d8aff862e69d47aaeb177c0d64f30865fd7a9f
              GPGSignature: Valid signature by 128CF232A9371991C8A65695E08E7E629DB62FB1

I see the timeout in the journal:

[root@origin-master-1 ~]# journalctl -b -1 -u ostree-finalize-staged
-- Logs begin at Sat 2018-09-29 08:08:05 UTC, end at Tue 2018-10-02 21:14:03 UTC. --
Oct 02 20:59:01 origin-master-1 systemd[1]: Started OSTree Finalize Staged Deployment.
Oct 02 21:02:02 origin-master-1 systemd[1]: Stopping OSTree Finalize Staged Deployment...
Oct 02 21:02:03 origin-master-1 ostree[21246]: Copying /etc changes: 32 modified, 0 removed, 500 added
Oct 02 21:03:32 origin-master-1 systemd[1]: ostree-finalize-staged.service: Stopping timed out. Terminating.
Oct 02 21:03:32 origin-master-1 systemd[1]: ostree-finalize-staged.service: Control process exited, code=killed status=15
Oct 02 21:03:32 origin-master-1 systemd[1]: ostree-finalize-staged.service: Failed with result 'timeout'.
Oct 02 21:03:32 origin-master-1 systemd[1]: Stopped OSTree Finalize Staged Deployment.

@jlebon
Copy link
Member

jlebon commented Oct 2, 2018

OK, will take a look at this! (I've already gotten some experience with the journal API from experimenting with it for #1489).

@dustymabe
Copy link
Member

any advice for getting out of the stuck situation I'm in? can I disable staged deployments for now ?

@jlebon
Copy link
Member

jlebon commented Oct 2, 2018

(BTW, one way you can "work around" this for now is to just run ostree admin finalize-staged manually after it's staged).

@jlebon jlebon self-assigned this Oct 2, 2018
@dustymabe
Copy link
Member

Regardless, we should strive for making staging work for all environments (e.g. bump the timeout?)

is there a way to make the finalize run after all the other riffraff from the system has been stopped or killed?

@dustymabe
Copy link
Member

btw kudos for making that failure not blow up the system

@ajeddeloh here is an example where the old deployment (i.e. N-2) wasn't cleaned up because the new deployment failed to be finalized. I know that's something we discussed recently.

@jlebon jlebon added the jira for syncing to jira label Oct 3, 2018
@jlebon
Copy link
Member

jlebon commented Oct 3, 2018

Was hacking on this, though... it turns out there's no really good journal entry to check here (see systemd/systemd#10265). We could definitely use heuristics, e.g. by priority or trying to parse MESSAGE, though I'm inclined to just make the service unit drop a file in /var which gets cleaned up after ostree admin finalize-staged. So then if that file doesn't get cleaned up, we know it failed?

@dustymabe
Copy link
Member

I almost wonder.. should we take a step back from staged deployments and take a two pronged approach?

  • rpm-ostree upgrade(or install or override)
  • stage deployment during operation (like we used to; do /etc/ merge here) this is our fallback if the operation during shutdown fails for any reason
  • set up staged deployment for shutdown, this is what we have today, but if it fails then we fallback to the staged deployment that happened during the rpm-ostree operation.

i.e. in the case that we fail to finalize the staged deployment (new behavior) on shutdown we just fallback to a deployment that was created during the operation (old behavior). Any changes to /etc/ between the upgrade and reboot would be lost (old behavior).

This could produce some inconsistent behavior so maybe not a good idea, but just something that popped into my head as I was thinking about it.

@jlebon
Copy link
Member

jlebon commented Oct 4, 2018

Hmm, or finding ways to do more work upfront and making finalization cheaper? E.g. write to /boot during staging, but not actually swapping the symlink until finalization. Or somehow doing an initial /etc merge and finding a cheaper way to do an "incremental" merge at finalization?

@cgwalters
Copy link
Member Author

I have a suspicion that a factor here is the syncfs() invocations. We should log the timing for those in the journal too.

@cgwalters
Copy link
Member Author

i.e. in the case that we fail to finalize the staged deployment (new behavior) on shutdown we just fallback to a deployment that was created during the operation (old behavior). Any changes to /etc/ between the upgrade and reboot would be lost (old behavior).

Not a huge fan of randomly swapping bugs.

@cgwalters
Copy link
Member Author

I'm inclined to just make the service unit drop a file in /var which gets cleaned up after ostree admin finalize-staged. So then if that file doesn't get cleaned up, we know it failed?

SGTM

jlebon added a commit to jlebon/rpm-ostree that referenced this issue Oct 5, 2018
Sample output:

```
$ rpm-ostree status
State: idle
Warning: failed to finalize previous deployment
         check `journalctl -b -1 -u ostree-finalize-staged.service`
AutomaticUpdates: disabled
...
```

(Though open to tweaking it).

I also played with directly invoking `journalctl` for the user, but that
can get really spammy with e.g. `os-prober` output and such.

I wrote this in Rust using journal API wrappers because I also plan to
implement the `history` command in Rust and will also enhance that new
`journal` module there for that.

Requires: ostreedev/ostree#1750
Requires: codyps/rust-systemd#54
(Though I've pointed the manifest at my branch for now for CI).

Closes: coreos#1567
jlebon added a commit to jlebon/rpm-ostree that referenced this issue Oct 11, 2018
Sample output:

```
$ rpm-ostree status
State: idle
Warning: failed to finalize previous deployment
         check `journalctl -b -1 -u ostree-finalize-staged.service`
AutomaticUpdates: disabled
...
```

(Though open to tweaking it).

I also played with directly invoking `journalctl` for the user, but that
can get really spammy with e.g. `os-prober` output and such.

I wrote this in Rust using journal API wrappers because I also plan to
implement the `history` command in Rust and will also enhance that new
`journal` module there for that.

Requires: ostreedev/ostree#1750
Requires: codyps/rust-systemd#54
(Though I've pointed the manifest at my branch for now for CI).

Closes: coreos#1567
rh-atomic-bot pushed a commit that referenced this issue Oct 12, 2018
Sample output:

```
$ rpm-ostree status
State: idle
Warning: failed to finalize previous deployment
         check `journalctl -b -1 -u ostree-finalize-staged.service`
AutomaticUpdates: disabled
...
```

(Though open to tweaking it).

I also played with directly invoking `journalctl` for the user, but that
can get really spammy with e.g. `os-prober` output and such.

I wrote this in Rust using journal API wrappers because I also plan to
implement the `history` command in Rust and will also enhance that new
`journal` module there for that.

Requires: ostreedev/ostree#1750
Requires: codyps/rust-systemd#54
(Though I've pointed the manifest at my branch for now for CI).

Closes: #1567

Closes: #1601
Approved by: cgwalters
jlebon added a commit to jlebon/rpm-ostree that referenced this issue Oct 12, 2018
Sample output:

```
$ rpm-ostree status
State: idle
Warning: failed to finalize previous deployment
         check `journalctl -b -1 -u ostree-finalize-staged.service`
AutomaticUpdates: disabled
...
```

(Though open to tweaking it).

I also played with directly invoking `journalctl` for the user, but that
can get really spammy with e.g. `os-prober` output and such.

I wrote this in Rust using journal API wrappers because I also plan to
implement the `history` command in Rust and will also enhance that new
`journal` module there for that.

Requires: ostreedev/ostree#1750
Requires: codyps/rust-systemd#54
(Though I've pointed the manifest at my branch for now for CI).

Closes: coreos#1567
rh-atomic-bot pushed a commit that referenced this issue Oct 12, 2018
Sample output:

```
$ rpm-ostree status
State: idle
Warning: failed to finalize previous deployment
         check `journalctl -b -1 -u ostree-finalize-staged.service`
AutomaticUpdates: disabled
...
```

(Though open to tweaking it).

I also played with directly invoking `journalctl` for the user, but that
can get really spammy with e.g. `os-prober` output and such.

I wrote this in Rust using journal API wrappers because I also plan to
implement the `history` command in Rust and will also enhance that new
`journal` module there for that.

Requires: ostreedev/ostree#1750
Requires: codyps/rust-systemd#54
(Though I've pointed the manifest at my branch for now for CI).

Closes: #1567

Closes: #1601
Approved by: cgwalters
rh-atomic-bot pushed a commit that referenced this issue Oct 12, 2018
Sample output:

```
$ rpm-ostree status
State: idle
Warning: failed to finalize previous deployment
         check `journalctl -b -1 -u ostree-finalize-staged.service`
AutomaticUpdates: disabled
...
```

(Though open to tweaking it).

I also played with directly invoking `journalctl` for the user, but that
can get really spammy with e.g. `os-prober` output and such.

I wrote this in Rust using journal API wrappers because I also plan to
implement the `history` command in Rust and will also enhance that new
`journal` module there for that.

Requires: ostreedev/ostree#1750
Requires: codyps/rust-systemd#54
(Though I've pointed the manifest at my branch for now for CI).

Closes: #1567

Closes: #1601
Approved by: jlebon
rh-atomic-bot pushed a commit that referenced this issue Oct 12, 2018
Sample output:

```
$ rpm-ostree status
State: idle
Warning: failed to finalize previous deployment
         check `journalctl -b -1 -u ostree-finalize-staged.service`
AutomaticUpdates: disabled
...
```

(Though open to tweaking it).

I also played with directly invoking `journalctl` for the user, but that
can get really spammy with e.g. `os-prober` output and such.

I wrote this in Rust using journal API wrappers because I also plan to
implement the `history` command in Rust and will also enhance that new
`journal` module there for that.

Requires: ostreedev/ostree#1750
Requires: codyps/rust-systemd#54
(Though I've pointed the manifest at my branch for now for CI).

Closes: #1567

Closes: #1601
Approved by: jlebon
@victortoso
Copy link

This is marked as closed but I see this on Fedora 29 silverblue, rpm-ostree version '2019.1' - That is, I thought I had my system up to date but due some failures, deployment failed. I only realized it now after trying to install a new package and, well, could not use it after reboot.

Shouldn't rpm-ostree install warn about it as suggested in fist comment?

@jlebon
Copy link
Member

jlebon commented Feb 4, 2019

Did rpm-ostree status after rebooting say there was an error, though? If not, you may not have access to the system journal. We're thinking of moving the journal querying to the daemon in #1729, which should make this more reliable.

Shouldn't rpm-ostree install warn about it as suggested in fist comment?

The finalization happens at reboot time, so rpm-ostree install wouldn't be able to warn users before that.

@jlebon
Copy link
Member

jlebon commented Feb 4, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira for syncing to jira kind/enhancement
Projects
None yet
Development

No branches or pull requests

4 participants