Can consul-replicate be used with Vault? #674
It's not really a known issue, in that this isn't an officially supported setup. My guess is that you're running into the fact that Vault maintains a read cache in front of the physical backend. I would imagine you're seeing this problem mostly for changed data, not new data. In that case, your best current option would be to maintain a local fork that disables the cache. This will considerably impact performance, however.
That's what I wondered as well. Our use case for Vault is simple: it's an encrypted key-value store that's referenced on app startup, and that's about it. We run Consul on the same hosts that we run Vault on, so network RTT would be non-existent. I'll see if I can disable the read cache and find out what the impact is. I found an issue over in consul-replicate (referenced above) where it looked like someone else was using it to replicate Vault data; I've posted there to see if he had more luck than I did. There's also a post from the mailing list where it looked like someone else had the same use case, and some other potential issues were brought up: https://groups.google.com/forum/#!msg/vault-tool/cnNqgpZxvh0/naMaRHXZAgAJ To that point, do you think I could achieve what I'm striving for by using a different storage backend?
I've had success with replication across sites with consul-replicate and haven't run into a caching issue. However, I use the secondary site in more of a DR scenario, so we don't have people pulling from it as often. I can try doing some more testing on my side and see if I get the same scenario. How long does it take for you to hit caching issues? You say you saw initial success -- is there some time frame after which the remote instance stops picking up new values?
@justintime BTW, I marked this as enhancement/thinking earlier because I think it should be pretty easy to make the cache tunable. It would be quite useful to know whether it's actually the source of your issue, though. If you like, I can help you with a patch to disable it locally so you can see if that fixes your problems.
@roycrowder I think the only reason I saw initial success was because I hadn't started Vault in my destination DC yet. I wanted to replicate, and then make sure that the start->unseal->read workflow worked. That did work. I'm not 100% sure whether it was an add or an edit, but basically any changes seemed to not show up until after I restarted and unsealed the Vault leader. I'm going to set up some Vagrant boxes so I can test this more easily and make sure. My guess is that if you have Vault running in both places, you can add a new k/v and see it replicate. You'll likely be able to read that from the destination Vault as well. Changing the value on that same key will likely cause a replication event, but you won't be able to read the new value from the destination Vault until you restart. If you can check that and report back, it would help a lot. @jefferai If you have the time to generate a patch, I'll make the time to test it out.
Hey everyone, chiming in here from the Consul Replicate (CR) side of things 😄. Putting any Vault-specific issues aside, CR is designed to mirror a particular prefix from one Consul DC to another. If you are using Consul as the backend storage mechanism, CR will definitely replicate the data. There are a few things I would like to mention.

CR will only replicate one-way. It can replicate to multiple datacenters, but there has to be one "primary" datacenter. Consider 3 datacenters A, B, and C. B and C both pull replication data from A, so A is the "primary" datacenter. Any data written to B or C's KV store that falls under the replicated prefix will not be replicated back to A or to its peer follower. In fact, during the next replication, A will likely overwrite that change in B or C's KV to match A. Assuming Vault's cache is disabled, this means that if there's a Vault instance in each DC, any writes to Vault-B and Vault-C would effectively be overwritten on the next change to the KV prefix in DC-A. This behavior might be okay if Vault-B and Vault-C are read-only, but I think even that behavior may be undefined.

Similarly, depending on the number of keys watched, replication can occur quite frequently and can take up to a few seconds to complete. This opens up a race condition: Vault-A writes data such as a new transit key (Vault-B and Vault-C still have the old key), and an application asks Vault-B to decrypt a secret that was encrypted with Vault-A's new transit key. The decryption would fail because the updated Vault data (the new transit key) hadn't yet been populated by CR.
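For readers landing here later: the one-way setup Seth describes amounts to running Consul Replicate in each secondary datacenter and pointing it at the primary's prefix. A minimal sketch, assuming Vault's default `vault/` storage prefix and a primary datacenter named `dc-a` (the exact flag spelling can vary between CR versions, so treat this as illustrative):

```sh
# Run inside each secondary datacenter. Pulls the "vault/" prefix from
# dc-a into the local Consul cluster; drop -once to run continuously.
consul-replicate \
  -prefix "vault@dc-a" \
  -once
```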
To echo what @sethvargo just mentioned, I'm happy to work on a patch to test whether disabling the cache helps, and I'll try to post one later today, but there are many potential drawbacks to replication and you'll need to be careful. This would be a purely read-only replication, and you'd need to limit it to specific prefixes. As I said on the ML thread, you would not, for instance, want to be mirroring your expiration data, or you could end up with many revocations attempting to happen for every lease.

You'd also probably want to build some robustness into your applications, or think carefully about the configuration knobs set within Vault. For instance, for the transit key example, you'd want to avoid setting your minimum allowed decryption version to the current key version, so that if it takes a bit for a rotated key to propagate, applications can still decrypt data -- and that's to say nothing of an actual new key with no prior versions. This is definitely a "hope it works, but it's definitely not officially supported" scenario :-)
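As a concrete illustration of the transit caveat above (the key name and mount path are hypothetical, and this assumes the standard transit key-config endpoint): keeping the minimum decryption version at an older value means ciphertext produced before a rotation can still be decrypted by a replica that hasn't yet received the newest key version.

```sh
# Rotate the key on the primary, but keep older versions decryptable so
# replicas that lag behind the rotation can still service decrypt requests.
vault write -f transit/keys/my-app-key/rotate
vault write transit/keys/my-app-key/config min_decryption_version=1
```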
Yeah, I think everything Seth mentioned still fits my "master with many slaves" use case. If Vault were aware of an "mtime"-style change on S3 or MySQL and flushed its read cache accordingly, could a different storage backend get me there?
@justintime The biggest probable issues with this are not really in Consul, per se -- they're with things like propagating leases across to other datacenters. That's a concern regardless of the backend choice. So the main concern here is probably that some backends will make it easier or harder to sync only the proper entries.

It turns out that it's actually easy to disable the cache -- it's just not exposed in configuration currently. Open up `vault/core.go` and find this block:

```go
// Wrap the backend in a cache unless disabled
if !conf.DisableCache {
	_, isCache := conf.Physical.(*physical.Cache)
	_, isInmem := conf.Physical.(*physical.InmemBackend)
	if !isCache && !isInmem {
		cache := physical.NewCache(conf.Physical, conf.CacheSize)
		conf.Physical = cache
	}
}
```

If you remove that wrapping (or invert the `!conf.DisableCache` check) and rebuild, the physical cache will be disabled. I'd be interested in knowing how this works for you. If it solves your problem we could possibly expose this in configuration.
I'm having trouble getting a build to work. I have 0.2.0 deployed in my source datacenter, but I'm unable to get v0.2.0 to build due to what look like godep issues. I made the mistake of running 0.3.1 in my destination datacenter, but discovered quickly that version mismatches are not a good thing :) Any tips on getting the 0.2.0 release to build?
Make sure you have
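The rest of that comment didn't survive the formatting here. For reference, a rough sketch of how a tagged Vault release from that era was typically built with godep (the exact steps are an assumption and may differ for v0.2.0):

```sh
# Fetch godep, check out the tag, restore pinned dependencies, and build
# a dev binary with the repo's Makefile.
go get github.com/tools/godep
cd "$GOPATH/src/github.com/hashicorp/vault"
git checkout v0.2.0
godep restore
make dev
```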
I'm not so sure that disabling the cache is going to fix my problems anyway. I saw in the code that reads on keys that don't exist aren't cached. However, adding a new key, replicating it, and attempting to read it on the remote was still failing. That seems to indicate there's something else preventing replication from working. I wonder if I've missed something -- @roycrowder I'd love to hear if you found any similar issues when you get the time to check.
@justintime it might be interesting to attempt a read from two other places when the read is failing on Vault as you're attempting to do it now:
If (1) works but (2) doesn't, that will tell us something, because (2) is a straight passthrough to the physical storage (although it does go through the cache). If both (1) and (2) work, try a normal read again and see if it works -- maybe by that time something will have finished catching up/propagating. Let me know!
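The two alternative reads didn't survive the formatting above, so the following is only a guess at what was meant: (1) reading the replicated entry straight out of the local Consul KV store, and (2) reading through Vault's raw physical passthrough. The key below is a placeholder -- Vault keeps its physical entries under internal prefixes such as `core/` and `logical/<uuid>/`, not under the logical secret path.

```sh
# (1) Read an entry directly from the local Consul agent. Vault data lives
#     under the "vault/" KV prefix by default; values are encrypted blobs.
curl -s "http://127.0.0.1:8500/v1/kv/vault/core/mounts?raw" | head -c 200; echo

# (2) Read the same physical key through Vault's sys/raw passthrough
#     (requires a root token).
vault read sys/raw/core/mounts
```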
I was having some weird issues running CR in '-once' mode, so I wiped the remote Consul DC's store and started over. Good news: adding a new key to the source replicates almost immediately and is readable by the destination Vault without a restart. As you guessed, updating an existing entry replicates (at least according to CR's verbose logs), but Vault serves up the old data. I'll see if I can catch you on IRC about building a modified v0.2.0 to test whether disabling the read cache fixes that.
Disabling the cache in Vault did indeed fix the problem. For documentation's sake, I just removed the '!' at https://github.com/hashicorp/vault/blob/master/vault/core.go#L302 and recompiled. At least for my use case, exposing DisableCache in vault.hcl would be awesome. Even better would be an official, supported way to do what I'm doing, but I'll take what I can get :)
@justintime I believe that should do it -- simply set disable_cache to true in your Vault config file.
Awesome. I'll build it and test it today. You mentioned the other day that master was in a good state -- I'll hold you to that statement :)
Just to follow up, this functionality is working as advertised for me. Thanks for the help.
Great! And I'm looking at #706 separately.
I think that looks fine. You may not want to replicate
After cleaning /var/lib/consul/* and starting fresh with the above settings, I'm at least getting consistent results. I verified with curl that I'm now being redirected to the proper leader in each DC rather than the leader of the source DC. After updating the value in the source DC and running my read script, I see that all Vaults in the source DC have the new value, but all other Vaults have the old value. After restarting and unsealing all Vaults, they all agree on the new value again. I'm starting to think we didn't quite get the disable_cache functionality working 100% -- what are your thoughts?
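For context, the "read script" here can be as simple as looping over each Vault address and reading the same key. A minimal sketch, assuming hypothetical hostnames, a throwaway secret path, and a VAULT_TOKEN already exported:

```sh
# Compare what each Vault instance returns for the same key.
for host in vault-src-1 vault-src-2 vault-dst-1 vault-dst-2; do
  echo "== $host =="
  VAULT_ADDR="http://$host:8200" vault read secret/replication-test
done
```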
Is this using the config value, or the one-line change you made before?
config value:
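(The snippet itself didn't survive the formatting here. For illustration only, a sketch of a minimal Vault server config with the cache disabled -- the backend and listener values are placeholders, not the poster's actual settings:)

```sh
# Write a minimal Vault server config with the read cache turned off.
cat > /etc/vault/vault.hcl <<'EOF'
disable_cache = true

backend "consul" {
  address = "127.0.0.1:8500"
  path    = "vault"
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = 1
}
EOF
```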
Any chance you can test with the same code change you tried before? I don't see why the config value wouldn't be propagating through, but just in case...
Progress! I did nothing but remove the ! in the same spot as before, and reads of replicated data now work.
Great; I'll dig in there. I did write some (passing) unit tests to ensure that the value was being read from the config files, so somewhere it seems like it's getting lost.
@justintime Sorry for the delay on this...weekend + daycare woes. I just pushed a change to master that should address it. Let me know!
Just verified this works as advertised, thanks so much for all of your help!
@justintime Once you get some experience with this it may be cool for you to put a writeup of your setup and running notes into the mailing list. Other people may want to do something similar.
Yes, please -- we would love to have a single Vault master in a multi-DC environment.
@jefferai I'll try and put together a (gasp) new blog post about it. Is it going to make it into the next release?
The next planned release is 0.4, but yes, it will be in there. While I'd love to see this post, I do have a couple of requests, to ensure that people trying to follow in your footsteps don't run into problems:
Thanks!
It won't be supported in 0.4, or it won't be supported in the current release?
Clash of terminology. The ability to turn off the cache will be present in 0.4, hence this functionality will be supported. The behavior of using consul-replicate in this way to distribute data to multiple datacenters will not be officially supported. If problems are hit that are fixable (such as adding the flag to disable the read cache), I'm happy to work to fix them, but the overall solution is at your own risk, at least for now.
OK, will wait until it's officially supported.
There are currently no plans on the roadmap to officially support this kind of replication, so it might be a long wait. However, a better way to look at it might be that there is a not-officially-written-down-but-listed-above-in-the-comments set of prefixes and keys that should not be replicated to any read-only set of Vault servers. Maybe at some point that list can become more official, but then it lets users manage the replication however they wish, as appropriate for their chosen storage backend.
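One practical way to build such a list for a particular installation is to look at what Vault has actually written under its prefix in Consul and decide subtree by subtree what is safe to mirror. A sketch, assuming the default `vault/` prefix and a local Consul agent:

```sh
# List every key Vault has written to Consul so you can decide which
# subtrees to replicate and which (expiration/lease data, etc.) to skip.
curl -s "http://127.0.0.1:8500/v1/kv/vault/?keys" | python -m json.tool
```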
While I appreciate @justintime's research in this matter, I'm reticent to implement a solution that may not work with a future Vault release. Can we get this on the roadmap, please?
@mfischer-zd The set of paths one would want to replicate will vary from one installation to the next. The only things that are fairly constant are the things you don't want to replicate. That could be codified into a set of best practices, but any such best-practices document would continue to inform the user that messing about with the storage layer of Vault will never be supported.
It's fine, in my view, to declare that real-time replication is supported only when Consul is the storage backend and
This is what worries me. You're not willing to make any stability guarantees here?
I cannot promise that Vault's internal data store will never change. This means that if you're relying on specifics of the internal data store's layout, you'd want to perform due diligence during an upgrade cycle -- the same as with any other aspect of Vault that changes from one version to the next.
How about notices to replication users in release notes, or bumping the major version per semver when the storage layout changes?
Vault is still very young and nowhere near 1.0, much less a major version bump, regardless of what changes. Storage-level changes would never happen in anything less than a minor version bump, currently.
When I'm running something at version 0.3.1 in production, I fully expect to put a lot of extra rigor into even minor version upgrades. Comes with the territory, IMO. I'll make 0.3.1 work now, and when I upgrade, it will have weeks of testing in dev and staging before it goes to production.
@justintime Are you able to replicate secret engines from source to destination?
Our use case dictates that we need to have Vault and its secrets available in multiple physical datacenters. However, writes to Vault only need to happen at one location. This makes for a very typical "one master with many slaves" arrangement.
Since we're using consul for the backend, I figured consul-replicate would be the perfect tool for the job. After setting it all up and running consul-replicate, I was happy to have initial success. However, after using it a bit more, I think I've found a showstopper for this setup.
It seems that each time consul-replicate copies a key/value from one consul dc to another, I have to restart the vault servers in the second datacenter before they will "see" that data.
Is this a known issue? Are there any workarounds, or another way to accomplish this kind of setup?