Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix eclipsing vms in ctor #2101

Merged
merged 10 commits into from
May 18, 2021
Merged

Fix eclipsing vms in ctor #2101

merged 10 commits into from
May 18, 2021

Conversation

ricab
Copy link
Collaborator

@ricab ricab commented May 10, 2021

Let exceptions that arise when creating a VM in the daemon constructor through. This avoids erasing instance data that could be recoverable and important for the user, but aborts the daemon since it can't honor its database, which would otherwise be out of sync. More robust approaches for particular causes can always added. Fixes #1658.

@ricab ricab force-pushed the fix-eclipsing-vms-in-ctor branch from c70ebd7 to dddae4d Compare May 11, 2021 09:46
ricab added 3 commits May 11, 2021 11:48
Let exceptions that arise when creating a VM through. This avoids
erasing instance data that could be recoverable and important for the
user, but aborts the daemon since it can't honor its database, which
would otherwise be out of sync. More robust approaches for particular
causes can always added. Fixes #1658.
Log and remove, from the db, instances whose images are missing when the
daemon comes back up. This covers cases where the user deletes instances
directly from the backend, or otherwise removes them from disk.
@ricab ricab force-pushed the fix-eclipsing-vms-in-ctor branch from dddae4d to 900fe31 Compare May 11, 2021 10:49
ricab added 2 commits May 11, 2021 18:33
Drop prepare call from the mock vault's default action for fetch_image,
which when called through `fetch_images_for` in daemon.cpp, returned an
empty image. Fixes a few test failures now that the daemon checked that
instance images exist in the ctor.
Replace instance creation throw in ctor with image verification failure.
@codecov
Copy link

codecov bot commented May 11, 2021

Codecov Report

Merging #2101 (178cd47) into main (981cf15) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2101      +/-   ##
==========================================
- Coverage   81.48%   81.47%   -0.01%     
==========================================
  Files         184      184              
  Lines        9456     9457       +1     
==========================================
  Hits         7705     7705              
- Misses       1751     1752       +1     
Impacted Files Coverage Δ
src/daemon/daemon.cpp 54.36% <100.00%> (-0.05%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 981cf15...178cd47. Read the comment docs.

@ricab ricab marked this pull request as ready for review May 13, 2021 10:03
@townsend2010 townsend2010 self-assigned this May 17, 2021
Copy link
Contributor

@townsend2010 townsend2010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I think failing the daemon is better than deleting someone's instance wholesale, I'm still concerned that if the daemon can't start, then how does a user recover from this?

In the end, I'll approve this, but I'd like some feedback from the team here on how we plan to address a situation where the daemon just won't start due to hitting this.

@Saviq
Copy link
Collaborator

Saviq commented May 17, 2021

In the end, I'll approve this, but I'd like some feedback from the team here on how we plan to address a situation where the daemon just won't start due to hitting this.

Oh sure, this is a stop-gap. Long-term we should mark the instance "Broken" and skip over it. Another multipass start on it could recover it - if not, back to "Broken". Details TBD.

@townsend2010
Copy link
Contributor

Oh sure, this is a stop-gap. Long-term we should mark the instance "Broken" and skip over it. Another multipass start on it could recover it - if not, back to "Broken". Details TBD.

Sure, I definitely understand this is a stop-gap and is better than just deleting the instance. But we are trading deleting an instance wholesale with the potential of the daemon not starting at all (which is a better failure). My question is, if/when users complain that Multipass is not starting at all, how will we deal with that? Have users modify the instance DB to remove the problematic instance without deleting it in order to get the daemon to start? Wing it and just cross that bridge when we get to it?

Again, I'm not saying this shouldn't go in, I just want us all to be on the same page about how we'll deal with the daemon not starting. I guess where I'm going with this is we really need a full solution for this ASAP.

@ricab
Copy link
Collaborator Author

ricab commented May 17, 2021

There are two sorts of situations that this PR covers by aborting the daemon:

  1. general problem preventing any instance from being launched (e.g. backend doesn't work)
  2. a corrupt/buggy instance.

If I am not mistaken, most reports have been of the first kind. In such cases, marking all instances as BROKEN is not much better then quitting: if instances can't be launched, there isn't much that daemon could do anyway. That would be useful mainly for situation 2. But, depending on how frequent we find case 2 to be, maybe guiding users to remove/recover instances manually won't be that big a deal and that "full solution" may not be all that urgent. I do think it's less urgent than fixing permanently erased instances.

Of course, if we find specific measures to help on a case-by-case basis (e.g. a bigger timeout somewhere), we can implement them.

Copy link
Contributor

@townsend2010 townsend2010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is better than deleting an instance.

bors merge

bors bot added a commit that referenced this pull request May 17, 2021
2101: Fix eclipsing vms in ctor r=townsend2010 a=ricab

Let exceptions that arise when creating a VM in the daemon constructor through. This avoids erasing instance data that could be recoverable and important for the user, but aborts the daemon since it can't honor its database, which would otherwise be out of sync. More robust approaches for particular causes can always added. Fixes #1658.

Co-authored-by: Ricardo Abreu <ricardo.abreu@canonical.com>
@bors
Copy link
Contributor

bors bot commented May 17, 2021

Build failed:

@Saviq
Copy link
Collaborator

Saviq commented May 17, 2021

bors retry

bors bot added a commit that referenced this pull request May 17, 2021
2099: [qemu] ensure `dnsmasq.hosts` exists r=townsend2010 a=surahman

**_With respect to issue #1118:_**
Added a check on `DNSMasq` bootstrapping for `dnsmasq.hosts` file in the constructor. If absent, an empty file will be created in the `data_dir`.

2101: Fix eclipsing vms in ctor r=townsend2010 a=ricab

Let exceptions that arise when creating a VM in the daemon constructor through. This avoids erasing instance data that could be recoverable and important for the user, but aborts the daemon since it can't honor its database, which would otherwise be out of sync. More robust approaches for particular causes can always added. Fixes #1658.

Co-authored-by: Saad Ur Rahman <saad.ur.rahman@gmail.com>
Co-authored-by: Ricardo Abreu <ricardo.abreu@canonical.com>
@bors
Copy link
Contributor

bors bot commented May 17, 2021

Build failed (retrying...):

bors bot added a commit that referenced this pull request May 17, 2021
2101: Fix eclipsing vms in ctor r=townsend2010 a=ricab

Let exceptions that arise when creating a VM in the daemon constructor through. This avoids erasing instance data that could be recoverable and important for the user, but aborts the daemon since it can't honor its database, which would otherwise be out of sync. More robust approaches for particular causes can always added. Fixes #1658.

Co-authored-by: Ricardo Abreu <ricardo.abreu@canonical.com>
@bors
Copy link
Contributor

bors bot commented May 17, 2021

Build failed:

  • Windows

@Saviq
Copy link
Collaborator

Saviq commented May 17, 2021

bors retry

bors bot added a commit that referenced this pull request May 17, 2021
2101: Fix eclipsing vms in ctor r=townsend2010 a=ricab

Let exceptions that arise when creating a VM in the daemon constructor through. This avoids erasing instance data that could be recoverable and important for the user, but aborts the daemon since it can't honor its database, which would otherwise be out of sync. More robust approaches for particular causes can always added. Fixes #1658.

Co-authored-by: Ricardo Abreu <ricardo.abreu@canonical.com>
@bors
Copy link
Contributor

bors bot commented May 17, 2021

Build failed:

@Saviq
Copy link
Collaborator

Saviq commented May 18, 2021

bors retry

bors bot added a commit that referenced this pull request May 18, 2021
2099: [qemu] ensure `dnsmasq.hosts` exists r=townsend2010 a=surahman

**_With respect to issue #1118:_**
Added a check on `DNSMasq` bootstrapping for `dnsmasq.hosts` file in the constructor. If absent, an empty file will be created in the `data_dir`.

2101: Fix eclipsing vms in ctor r=townsend2010 a=ricab

Let exceptions that arise when creating a VM in the daemon constructor through. This avoids erasing instance data that could be recoverable and important for the user, but aborts the daemon since it can't honor its database, which would otherwise be out of sync. More robust approaches for particular causes can always added. Fixes #1658.

Co-authored-by: Saad Ur Rahman <saad.ur.rahman@gmail.com>
Co-authored-by: Ricardo Abreu <ricardo.abreu@canonical.com>
@bors
Copy link
Contributor

bors bot commented May 18, 2021

Build failed (retrying...):

bors bot added a commit that referenced this pull request May 18, 2021
2101: Fix eclipsing vms in ctor r=townsend2010 a=ricab

Let exceptions that arise when creating a VM in the daemon constructor through. This avoids erasing instance data that could be recoverable and important for the user, but aborts the daemon since it can't honor its database, which would otherwise be out of sync. More robust approaches for particular causes can always added. Fixes #1658.

Co-authored-by: Ricardo Abreu <ricardo.abreu@canonical.com>
@bors
Copy link
Contributor

bors bot commented May 18, 2021

Build failed:

@Saviq
Copy link
Collaborator

Saviq commented May 18, 2021

bors retry

bors bot added a commit that referenced this pull request May 18, 2021
2101: Fix eclipsing vms in ctor r=townsend2010 a=ricab

Let exceptions that arise when creating a VM in the daemon constructor through. This avoids erasing instance data that could be recoverable and important for the user, but aborts the daemon since it can't honor its database, which would otherwise be out of sync. More robust approaches for particular causes can always added. Fixes #1658.

Co-authored-by: Ricardo Abreu <ricardo.abreu@canonical.com>
@ricab
Copy link
Collaborator Author

ricab commented May 18, 2021

This is getting funny 😆

@townsend2010
Copy link
Contributor

🤦

@bors
Copy link
Contributor

bors bot commented May 18, 2021

Build failed:

@Saviq
Copy link
Collaborator

Saviq commented May 18, 2021

Gah.

bors merge

@bors
Copy link
Contributor

bors bot commented May 18, 2021

@bors bors bot merged commit 2facb04 into main May 18, 2021
@bors bors bot deleted the fix-eclipsing-vms-in-ctor branch May 18, 2021 15:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lost multipass instances on macOS
3 participants