Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support transient storage mode #1424

Merged
merged 6 commits into from
Nov 21, 2022
Merged

Conversation

alexlarsson
Copy link
Contributor

This is the containers/storage part of containers/podman#16371

It adds a new global storage mode option called 'transient_store'. If this is true then the primary databases for container-specific data (i.e. not image data) is made transient, by placing the storage in runroot (i.e. on tmpfs).

In particular, for c/s this implies:

  • .../storage/overlay-containers is renamed .../storage/overlay-transientcontainers, and its layers.json is stored in the runroot.
  • There is a new location ../storage/overlay-transientlayers/ that is similar to ../storage/overlay-layers/, but is only used for the container layers (not images), and that has the layers.json file in runroot.

This completely erases the time spent fdatasync() calls for the json files written during podman run.

Note that this only moves the "database files" to runroot. The actual storage etc is still on persistent storage (to avoid filling memory). On a fresh boot we will always get empty databases so the old files will not be referenced. This means no old state will get in the way and mess things up in case of an unclean shutdown. However, the space is not reclaimed, so this should ideally be paired with code in the podman packaging to clear these directories on boot. For this reason the transient directories have different names so they are easy to clean up,

@rhatdan
Copy link
Member

rhatdan commented Nov 7, 2022

@giuseppe @mtrmac @nalind @flouthoc PTAL

@rhatdan
Copy link
Member

rhatdan commented Nov 7, 2022

@saschagrunert PTAL

Copy link
Collaborator

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level:

Overall this feels like quite a lot of complexity, and it seems fairly likely to be a maintenance headache:

  • Users must manually clean up half of the storage on reboot
  • Where is the the lock file stored, in the persistent or non-persistent data?
  • IIRC in general, creating cross-store layers is unexpected; AFAIK that only works for overlay.
  • How are we going to keep this working? Do we need a whole new third mode for running all tests?

As a thought experiment, what would happen if

  • there were no separate transient mode
  • the “fast podman run” case was instead using a fresh graphroot in a temporary location, not even necessarily tmpfs, pointing at the long-term layer+image store using the existing “additional stores” mechanism?
  • there were some option to just turn off the fdatasync, so that all the metadata I/O has negligible latency

? I think the major difference would be podman run triggering an image pull, because the image would be pulled into the temporary layer store, not the primary one. Is that a problem?


Individual review comments are under the assumption that the high-level concerns are overcome or irrelevant — and probably not exhaustive.

containers.go Outdated
Comment on lines 264 to 307
if r.transient {
return filepath.Join(r.rundir, "containers.json")
}
return filepath.Join(r.dir, "containers.json")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Locally, it would be simpler and more efficient to replace this method with a containersJSONPath field set in the constructor instead of recomputing all the time.

Then we wouldn’t even need the rundir and transient fields.


Applies similarly to the layer store.

containers.go Show resolved Hide resolved
containers.go Outdated
@@ -120,6 +120,8 @@ type rwContainerStore interface {
type containerStore struct {
lockfile Locker
dir string
rundir string
transient bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely non-blocking: runDir is more idiomatic Go. The existing fields don’t conform, maybe we can just stop following that precedent.

containers.go Outdated Show resolved Hide resolved
containers.go Outdated Show resolved Hide resolved
store.go Outdated Show resolved Hide resolved
store.go Outdated
return ErrNotALayer
}

if l, err := lstore.Get(id); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An unrelated six-year-old bug: This must have been intended to be err == nil

store.go Outdated Show resolved Hide resolved
store.go Outdated Show resolved Hide resolved
store.go Outdated Show resolved Hide resolved
@alexlarsson
Copy link
Contributor Author

I see my comment went to the wrong place, pasting here too:

High-level:

Overall this feels like quite a lot of complexity, and it seems fairly likely to be a maintenance headache:

* Users must manually clean up half of the storage on reboot

I don't think this is actually a real problem. The intent is to ship podman with some systemd unit file that just does this for you. Then all you need to worry about is ordering that before starting any systemd services using podman.

* Where is the the lock file stored, in the persistent or non-persistent data?

It is important to get this right (i.e use the same in all places), and I think it makes sense to have it on the transient location as persistent locks have no meaning. However, I don't see any higher-level issue with this?

* IIRC _in general_, creating cross-store layers is unexpected; AFAIK that only works for overlay.

This however seems like a real issue. I'm pretty sure vfs is easy to make work, but for devmapper and btrfs this is not really possible. We could make it possible for the driver to expose a flag for supporting this or not, and then we could do the splitting of the layers store optionally.

* How are we going to keep this working? Do we need a whole new third mode for running all tests?

That seems like the "easiest" way, but I haven't yet looked at the storage testsuite. Is long test runtimes a problem for c/s currently?

As a thought experiment, what would happen if

* there were no separate transient mode

* the “fast `podman run`” case was instead using a fresh graphroot in a temporary location, not even necessarily `tmpfs`, pointing at the long-term layer+image store using the existing “additional stores” mechanism?

* there were some option to just turn off the `fdatasync`, so that all the metadata I/O has negligible latency

What would generally happen is that you disconnect the new container from all the global podman state. Like, unless you jump through hoops you can't: get logs for it, see it in podman ps, use podman health-checks, use named volumes, share namespaces with other containers, etc, etc. I'm also unsure how well podman handles the "read-only additional store" changing over time, although I see read locks for it, so that may be ok. (I guess some of these issues are due to the related boltdb transient store changes and not necessarily the storage changes, but we want a complete solution so they need to be considered together.)

The initial idea for transient mode was actually similar to this, i.e. just point rootdir and runroot somewhere on tmpfs, or mount a tmpfs on rootdir. Having a "no fsync" mode for storage is quite similar to this.

The core issue here is that there is a mix of things in storage, some that we want to persist (backing store for image layers, image db, backing store for volumes, etc), and some we don't. The databases mix information about these, so any update has to be fully robust, even when just changing information about transient stuff. Splitting this up is really where the complexity of this PR comes from.

I don't really see a way to avoid the splitting though, because fundamentally we do want to have some persisted state, and we do want to use that global state with the transient state. However, maybe we can split it in a different way. For example, we could have a shared /var/lib/containers/storage/overlay-layers/ directory, but we can have two json files in it, one for transient layers and one for permanent. That may make the changes less complex. Of course this also makes it harder to clean up to a "fresh" state in case of an unclean shutdown or other issue.

I'll have a look at the two-json-files approach and see how it would look.

? I think the major difference would be podman run triggering an image pull, because the image would be pulled into the temporary layer store, not the primary one. Is that a problem?

For our particular usecase, I don't think this is a problem, because we expect the install of an image to be separate from the running of it, but that depends on the user though.

@mtrmac
Copy link
Collaborator

mtrmac commented Nov 8, 2022

I guess a disclaimer: I’ve been digging into the code recently, but I don’t really have a long-term understanding of the concerns and trade-offs.

  • Users must manually clean up half of the storage on reboot

I don't think this is actually a real problem. The intent is to ship podman with some systemd unit file that just does this for you. Then all you need to worry about is ordering that before starting any systemd services using podman.

(That wouldn’t work well for unprivileged users in “rootless” mode, OTOH they probably wouldn’t enable the transient mode.)

  • IIRC in general, creating cross-store layers is unexpected; AFAIK that only works for overlay.

This however seems like a real issue. I'm pretty sure vfs is easy to make work, but for devmapper and btrfs this is not really possible. We could make it possible for the driver to expose a flag for supporting this or not, and then we could do the splitting of the layers store optionally.

I don’t have a good understanding how this works, but IIRC the current cross-store support for additional stores is quite ad-hoc: some parts are implemented at the high level of store.go, and some are only inside the overlay driver.

Now that could be an argument for cleaning up and generalizing, as well as an argument for trying to minimize use of complexity like this…

  • How are we going to keep this working? Do we need a whole new third mode for running all tests?

That seems like the "easiest" way, but I haven't yet looked at the storage testsuite. Is long test runtimes a problem for c/s currently?

My first guess is that the tests could be run in parallel, at an extra monetary cost.


As a thought experiment, what would happen if

* there were no separate transient mode

* the “fast `podman run`” case was instead using a fresh graphroot in a temporary location, not even necessarily `tmpfs`, pointing at the long-term layer+image store using the existing “additional stores” mechanism?

* there were some option to just turn off the `fdatasync`, so that all the metadata I/O has negligible latency

What would generally happen is that you disconnect the new container from all the global podman state. Like, unless you jump through hoops you can't: get logs for it, see it in podman ps, use podman health-checks, use named volumes, share namespaces with other containers, etc, etc.

Right, the BoltDB database (or just some parts of it?) would also need to be moved to tmpfs...


I'm also unsure how well podman handles the "read-only additional store" changing over time, although I see read locks for it, so that may be ok.

IIRC it’s expected to work, at least we suggest usage where the additional store is centrally-managed, which basically requires it to be robust against changes. (Removing a layer from the additional store can always break the primary one, until an image is (deleted and?) re-pulled.)


The core issue here is that there is a mix of things in storage, some that we want to persist (backing store for image layers, image db, backing store for volumes, etc), and some we don't. The databases mix information about these, so any update has to be fully robust, even when just changing information about transient stuff. Splitting this up is really where the complexity of this PR comes from.

That’s true. I guess where I struggle is whether this absolutely needs to be an additional mode. Maybe it does, because we need some updates to the usually-persistent state anyway, and that’s where the actual cost is?


Or, uh, should Podman actually be much smarter, and automatically use transient stores, with non-durable writes, for podman run --rm (but record things to persistent storage for non---rm runs)?

That idea looks too tempting and almost “cute”, that I’m almost sure there’s a good reason that couldn’t work… and of course it would mean extending the current two-layer-store approach to the container store (and the BoltDB data)?


I don't really see a way to avoid the splitting though, because fundamentally we do want to have some persisted state, and we do want to use that global state with the transient state. However, maybe we can split it in a different way. For example, we could have a shared /var/lib/containers/storage/overlay-layers/ directory, but we can have two json files in it, one for transient layers and one for permanent. That may make the changes less complex. Of course this also makes it harder to clean up to a "fresh" state in case of an unclean shutdown or other issue.

I'll have a look at the two-json-files approach and see how it would look.

I really don’t know enough to predict how this would end up. It might be much simpler if layerStore.lookup just searched in two locations, OTOH the write paths would all need updating.

(BTW I don’t think having two JSON files inherently requires them to be recorded in the same directory; they could use the proposed two-directory layout probably just as easily.)


What’s the core reason of the performance difference? Is almost all of it the fdatasync cost? Is it a lot of small things? Avoiding the complexity of the filesystem / journaling code, even if all in RAM? Something else entirely?

@alexlarsson
Copy link
Contributor Author

I guess a disclaimer: I’ve been digging into the code recently, but I don’t really have a long-term understanding of the concerns and trade-offs.

  • Users must manually clean up half of the storage on reboot

I don't think this is actually a real problem. The intent is to ship podman with some systemd unit file that just does this for you. Then all you need to worry about is ordering that before starting any systemd services using podman.

(That wouldn’t work well for unprivileged users in “rootless” mode, OTOH they probably wouldn’t enable the transient mode.)

Actually, the cleanup issue needs a bit more thinking in general. It turns out that the overlay-layers directory doesn't really have much in it other than the layers.json file, at least for container layers (it does also have tar-splits for pulled image layers). The "overlay-containers" directory does have a bunch of config.json files, which would make sense to clean up on boot. On the other hand, these are small and we should maybe just put them on the rundir in the transient store mode anyway.

However, the layers.json file, as well as the config.json file does then reference the actual backing store for the driver, which in the overlay backend is the overlay directory. This will have a mix of image layers and container layers, and this is actually the more important directory to clean up stuff from, as it has the container filesystem on it. Its harder to clean this up as it is mixed though. Maybe we should pass down the "is container layer" to the driver so it can mark the directory for cleanup on boot somehow.

What would generally happen is that you disconnect the new container from all the global podman state. Like, unless you jump through hoops you can't: get logs for it, see it in podman ps, use podman health-checks, use named volumes, share namespaces with other containers, etc, etc.

Right, the BoltDB database (or just some parts of it?) would also need to be moved to tmpfs...

Yes, this is important, and already done in the podman pr (containers/podman#16371)

The core issue here is that there is a mix of things in storage, some that we want to persist (backing store for image layers, image db, backing store for volumes, etc), and some we don't. The databases mix information about these, so any update has to be fully robust, even when just changing information about transient stuff. Splitting this up is really where the complexity of this PR comes from.

That’s true. I guess where I struggle is whether this absolutely needs to be an additional mode. Maybe it does, because we need some updates to the usually-persistent state anyway, and that’s where the actual cost is?

Or, uh, should Podman actually be much smarter, and automatically use transient stores, with non-durable writes, for podman run --rm (but record things to persistent storage for non---rm runs)?

This is problematic at least for the bold db. We rely on the transactions it uses for atomicity and verification of cross-table links (like which containers use which volume, etc). We can't split the db file in two files and still get ACID behaviour. This is why this was made a global mode, because I agree with you that it would be cool if --rm containers were always faster.

That idea looks too tempting and almost “cute”, that I’m almost sure there’s a good reason that couldn’t work… and of course it would mean extending the current two-layer-store approach to the container store (and the BoltDB data)?

I don't really see a way to avoid the splitting though, because fundamentally we do want to have some persisted state, and we do want to use that global state with the transient state. However, maybe we can split it in a different way. For example, we could have a shared /var/lib/containers/storage/overlay-layers/ directory, but we can have two json files in it, one for transient layers and one for permanent. That may make the changes less complex. Of course this also makes it harder to clean up to a "fresh" state in case of an unclean shutdown or other issue.
I'll have a look at the two-json-files approach and see how it would look.

I really don’t know enough to predict how this would end up. It might be much simpler if layerStore.lookup just searched in two locations, OTOH the write paths would all need updating.

(BTW I don’t think having two JSON files inherently requires them to be recorded in the same directory; they could use the proposed two-directory layout probably just as easily.)

I'm currently looking at this, and it seems to end up as a much smaller change, so it seems a better approach. Will push new version later today.

What’s the core reason of the performance difference? Is almost all of it the fdatasync cost? Is it a lot of small things? Avoiding the complexity of the filesystem / journaling code, even if all in RAM? Something else entirely?

The performance is basically all in the fdatasync(). Although in the automotive usecase we are also quite interested in the more robust behaviour that comes with starting from a fresh set of databases on boot too, because that limits the risk of weird shutdown issues making the car not start up when booted again.

@mtrmac
Copy link
Collaborator

mtrmac commented Nov 8, 2022

This is problematic at least for the bold db. We rely on the transactions it uses for atomicity and verification of cross-table links (like which containers use which volume, etc). We can't split the db file in two files and still get ACID behaviour.

Thanks, that’s interesting to know.

@rhatdan
Copy link
Member

rhatdan commented Nov 8, 2022

Could you just use $RUNDIR/overlay-containers for transient containers by default? Since only small stuff is stored there.

@alexlarsson
Copy link
Contributor Author

Ok, new version. Rather than having two stores now we just have one container store and one (rw) layer store. However, both stores save two json files, the regular one and a "volatile" one. The volatile one is saved using NoSync, meaninging it is faster two write, but any write to it risks losing all the data in it (not just the modification written), so we don't want to store any persistent info in it. In addition, when in transient mode, the volatile files are stored in the rundir.

For normal mode, volatile containers are those with --rm, and volatile layers are container layers of such containers.

In transient mode, all container and all layers for containers are volatile.

This brings us a performance improvement for regular podman run --rm containers:

orig:

Benchmark 1: bin/podman.orig run --rm --pull=never --network=host --security-opt seccomp=unconfined fedora true
Time (mean ± σ): 174.9 ms ± 6.4 ms [User: 45.7 ms, System: 27.0 ms]
Range (min … max): 164.4 ms … 185.8 ms 16 runs

new:

Benchmark 1: bin/podman run --rm --pull=never --network=host --security-opt seccomp=unconfined fedora true
Time (mean ± σ): 137.8 ms ± 6.9 ms [User: 45.6 ms, System: 28.0 ms]
Range (min … max): 128.2 ms … 152.0 ms 19 runs

And an even better improvement (includes e.g. transient bolt db) transient mode:

Benchmark 1: bin/podman run --transient-store=true --rm --pull=never --network=host --security-opt seccomp=unconfined fedora true
Time (mean ± σ): 101.8 ms ± 5.4 ms [User: 40.9 ms, System: 24.3 ms]
Range (min … max): 93.2 ms … 114.0 ms 29 runs

@alexlarsson
Copy link
Contributor Author

I resolved a bunch of outdated comments.

@giuseppe
Copy link
Member

giuseppe commented Nov 9, 2022

the last version is much nicer, LGTM

I've some comments with the podman integration though. I think the setting in storage must be always honored if set (unless it is overridden to false)

Copy link
Collaborator

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a fairly quick skim for now.

This looks very nice!


The containers.go review comments mostly apply to layers.go equally.

containers.go Outdated Show resolved Hide resolved
containers.go Outdated Show resolved Hide resolved
containers.go Outdated Show resolved Hide resolved
containers.go Outdated Show resolved Hide resolved
containers.go Outdated Show resolved Hide resolved
return err
for _, location := range containerLocations {
rpath := r.containerspath(location)
if err := os.MkdirAll(filepath.Dir(rpath), 0700); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually done by newContainerStore already, so maybe we can avoid that syscall or two on every write.

Alternatively, at least move it below the locationModified check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexlarsson WDYT ^^

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dunno. All the various stores seem to do this, there might be some weird reason for it. If we do this, that should be in its own PR imho.

containers.go Outdated Show resolved Hide resolved
layers.go Outdated Show resolved Hide resolved
store.go Show resolved Hide resolved
containers.go Outdated Show resolved Hide resolved
@alexlarsson
Copy link
Contributor Author

New version again, this time the location are treated as flags instead of an enum, so we can just pass the set of all locations to write to save(), and we can drop the whole "mark as dirty" thing. Its not quite what @mtrmac said, but I think this is clean while at the same time keeping the old public Save() method around with unchanged behaviour for compat.

@alexlarsson
Copy link
Contributor Author

Now dropped the Save() calls

@alexlarsson
Copy link
Contributor Author

Hmmm, it feels wrong to have layers.lock in rundir, while the non-transient layers.json is not in rundir. What if something uses this directory as a additionaldirectory. Seems safer to have the lock next to the layers.json. For the containers case it should be fine though.

@mtrmac
Copy link
Collaborator

mtrmac commented Nov 10, 2022

Hmmm, it feels wrong to have layers.lock in rundir, while the non-transient layers.json is not in rundir. What if something uses this directory as a additionaldirectory.

Great point, yes.

(And if this intentionally differs between containers and layers, please add a comment documenting this rationale to emphasize the two should stay different.)

@alexlarsson
Copy link
Contributor Author

So. I'm pretty happy with how this is now. The only thing remaining imho is to add some kind of operation that cleans up the store for leftovers from transient containers.

This shouldn't really be that hard. For the overlay-containers directory

For the overlay-containers directory I don't think we have to do anything, all we need to do is remove all the subdirectories that are not listed in the current containers.json and transient-containers.json. These should all be essentially empty anyway (except some minor directory structure), so this will be fast too.

For overlay-layers it is a bit trickier. There is nothing actually in that directory that persists, but there is the storage side data. For example, for the overlay driver we need to delete all the subdirectories in the overlay subdirectory that are not listed in the layers.json file. This is going to need some (optional) extensions to the driver interface I think.

To avoid races i think you need to (in this order):

  1. list all existing layer ids in driver
  2. lock layer store and list all all ids that should be there
  3. driver.Remove(id) for all existing ids that should not be there, ignoring enoexist errors if there were some race between 1 and 2

We can make this optional and just implement if for overlay and vfs initially.

@rhatdan
Copy link
Member

rhatdan commented Nov 10, 2022

needs a rebase

@alexlarsson alexlarsson force-pushed the transient-store branch 3 times, most recently from bd8effb to f16e34d Compare November 11, 2022 14:39
@rhatdan
Copy link
Member

rhatdan commented Nov 14, 2022

LGTM, but there are a couple of @mtrmac questions remaining.

@nalind @mtrmac @vrothberg @saschagrunert @flouthoc @umohnani8 PTAL

@alexlarsson
Copy link
Contributor Author

@rhatdan most of the questions were stale, i cleared those. The remaining i think needs to be an separate pr.

@alexlarsson
Copy link
Contributor Author

Lemme rebase also.

This is to allow easy use of the NoSync option.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
This just adds the support for setting and handling this key, it
doesn't really do anything yet.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
This splits up the containers.json file in the containers store into
two, adding the new file `volatile-containers.json`. This new file is
saved using the NoSync options, which is faster but isn't robust in
the case of an unclean shutdown.

In the standard case, only containers marked as "volatile" (i.e. those
started with `--rm`) are stored in the volatile json file. This means
such containers are faster, but may get lost in case of an unclean
shutdown. This is fine for these containers though, as they are not
meant to persist.

In the transient store case, all containers are stored in the volatile
json file, and it (plus the matching lock file) is stored on runroot
(i.e. tmpfs) instead of the regular directory. This mean all
containers are fast to write, and none are persisted across boot.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
We store information about the layers in two files, layers.json and
volatile-layers.json. This allows us to treat the two stores
differently, for example saving the volatile json with the NoSync
option, which is faster (but less robust).

In normal mode we store only layers for the containers that are marked
volatile (`--rm`), as these are not expected to persist anyway. This way
informantion about such containers are faster to save, and in case
of an unclean shutdown we only risk loss of information about other
such volatile containers.

In transient mode all container layers (but not image layers) are
stored in the volatile json, and additionally it is stored in tmpfs.
This improving performance as well as automatically making sure no
container layers are persisted across boots.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
This looks in the container store for existing data dirs with ids not in
the container files and removes them. It also adds an (optional) driver
method to list available layers, then uses this and compares it to the
layers json file and removes layers that are not references.

Losing track of containers and layers can potentially happen in the
case of some kind of unclean shutdown, but mainly it happens at reboot
when using transient storage mode. Such users are recommended to run
a garbage collect at boot.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
If STORAGE_TRANSIENT is set to 1 then transient mode will be enabled
in the tests.

Also adds overlay-transient to the cirrus CI.

Signed-off-by: Alexander Larsson <alexl@redhat.com>
Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rhatdan rhatdan merged commit 71fd3e8 into containers:main Nov 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants