DataProtection: IDataProtector.UnProtect can force refresh the key ring if it fails to unprotect #3975

kevinlo · 2018-11-08T01:12:16Z

I have an App1 that will create the cookie and have 20 instances of App2 and 20 instances of App3 that will read the cookie. The data protection key is stored in Couchbase.

When these applications are first deployed without the key in the Couchbase, all 20 instances of App2 and 20 instances of App3 are started. They all CreateNewKey. Because the Couchbase indexes are updated cyclicly, every 20ms, the GetAllKeys call may not see the key added by the CreateNewKey immediately. I finally see 62 keys created.

I think because of that 20ms time lag, the GetAllKeys call in some instances of App2 and/or App3 get more keys than others. The App1 is started last and it will get all the keys and so its default key is newer than some instances of App2 and App3. Therefore, the cookie protected by the newest default key in the App1 cannot be unprotected by some instances of App2 and App3.

Currently, KeyRingBasedDataProtector.UnprotectCore calls the _keyRingProvider.GetCurrentKeyRing to get the CacheableKeyRing. The CacheableKeyRing.IsValid just checks if the keys in the key ring is not revoked and expired. In the above case, the keys are all valid and the CacheableKeyRing will be used, and it does not have the latest key from the DB.

Should the IKeyRingProvider.GetCurrentKeyRing have a forceRefresh parameter default to false so that if the KeyRingBasedDataProtector.UnprotectCore fails, it will call GetCurrentKeyRing(true) to force it to refresh the key rings and try again OR provide another way to tell the IKeyRingProvider to discard the cache? KeyRingBasedDataProtector.UnprotectCore can try to unprotect again with the refreshed key ring. At least it can force it to discard the cache and get the refresh one so next unprotect call would be good.

The current workaround I think of is making the App1 startup first so it will generated the key. When the App2 and App3 are started, they can read that key and App2 and App3 always have the same or newer key as App1.

I think that application dependency is a possible solution but it would be nice if the ASP.NET Core Data Protection can handle this situation magically so I don't need to care about the application dependency.

blowdart · 2018-11-08T18:40:08Z

We don't refresh like this because it's easy to send a fake payload with a new key ID in every request, which would then block all incoming requests until the refresh finishes. It's too easy a DoS vector.

Your workaround really is the only solution. You could, in startup, just manually spin up a protector and call protect so you're not waiting for the first inbound call to start creating a keyring.

kevinlo · 2018-11-08T22:42:01Z

The DataProtectionStartupFilter.Config function will call the IKeyRingProvider.GetCurrentKeyRing to preload the ring, I don't need to put anything extra in the Startup. The problem is IIS does not start the process until it has the request. IIS8 or later has the preloadEnabled attribute that I can try.

My current workaround works only because the protect\unprotect is one way - always have the App1 protect the cookie and App2 and App3 to unprotect it. If the data protection for the shared apps is in any direction, it will still have the problem. Is there a better solution to make all shared apps having the same keyring?

natemcmaster · 2019-04-15T16:52:38Z

We're seeing issues related to this pop up and are going to investigate possible solutions. I've assigned myself and put it in a 3.0 milestone so this is tracked and reviewed in the coming weeks.

kevinlo · 2019-04-29T18:18:17Z

I just come across another problem during a few days of performance testing. As said in the above thread, we have App1 to create a cookie and App2 and App3 will use the same data protection key to read the cookie. App2 and App3 are started and ended per user session, During the few days of performance, each day we have a few hours that can see the performance hit. I find out that, when some of the App2 or App3 started, they had transient problem in getting the data protection keys from the Couchbase and the IXmlRepository.GetAllElements fails. Therefore, a new key is created and the StoreElement is called and the new key is saved to the Couchbase. We have 3 nodes before the load balancer. After around 18-24 hrs (KeyRingRefreshPeriod), the App1 in one of the node has the GetAllElements call, it will now have that new key set as Default key. The cookie created by the App1 in that node cannot be decrypted by any App2 and App3 while those App1 in the other two nodes are still ok. However, the App1 in the other two nodes will have the GetAllElements call one by one and they are used the new key. Then, App2 and App3 will have the GetAllElements call and then they can use the new key to decrypt. Therefore, from the first App1 in a node getting the new key till all App2 and App3 in all nodes getting that new key, we have seen performance degradation.

One workaround now is changing App2 and App3 to DisableAutomaticKeyGeneration so only App1 can generate new key. But then after the 90 days key expiration, App1 in one node will still generate a new key and we still need to make sure App1 in other nodes and App2 and App3 get that new key.

What is the best practice to achieve this key synchronization among different apps and in different nodes?

natemcmaster · 2019-06-12T04:20:20Z

I haven't had time to make progress on this one. @blowdart @Pilchie who can we hand this off to while I'm on leave? This one needs some design and further investigation.

Pilchie · 2019-06-19T17:13:55Z

@HaoK - is this something you can take a look at while @natemcmaster is on leave?

HaoK · 2019-06-19T17:15:48Z

Sure I will take a look

HaoK · 2019-06-19T20:31:01Z

Discussed with @blowdart initial plan for the fix will be the following:

When an unknown key id is encountered during a short(configurable?) window of time after dataprotection first starts up, we will automatically refresh the key ring (potentially multiple times if multiple unknown keys are encountered during this window) to mitigate the start up race condition. If the refresh doesn't discover the key, we will fail the request. This will be done in a blocking way, but since its only during a window around the apps starting up this seems like an acceptable tradeoff

HaoK · 2019-06-19T21:26:07Z

Moving to preview 8 since I'm not familiar enough with this code to get this in immediately which would be what it takes to for preview 7

GrabYourPitchforks · 2019-06-19T22:52:13Z

The tradeoff seems appropriate as a short-term solution.

From an internal email thread: The best fix would probably be a fix at the storage layer itself, since the storage layer is intended to be quasi-transactional with respect to updates. If this issue manifests itself primarily with the file system storage provider, one potential solution would be to use a monotonically incrementing counter instead of a random GUID for the filename. That is, if the file "key-00000000.xml" doesn't exist then the application would attempt to write that file, if it does then they'd try "key-00000001.xml", and so on. This means that if two applications try to write "key-xxxxxxxx.xml" at the same time, one of them will succeed and one of them will see the conflict and can perform a refresh. This restores the quasi-transactional properties of the file system store.

Note that there's no relationship between the filename and the contents of the XML file. The name of the file can be completely divorced from the key id contained within. Updating the file system store to use a predictable filename as in the proposal here can be done without introducing a breaking change to the DataProtection layer, as all .xml files are read by default anyway, so all existing keys will still be honored regardless of filename.

kevinlo · 2019-06-19T23:49:21Z

@HaoK Your suggestion handles the Startup case. How's about the key renew case? The key will be renewed after NewKeyLifetime (90 days by default) and the KeyRingRefreshPeriod is hard coded to 24 hrs and the GetRefreshPeriodWithJitter will fudge the refresh period up to -20%. If App1 renew first, there is that 4-5 hrs period where App2 and App3 are using old keys.

@GrabYourPitchforks You talk about the file system only. How's about those that implement the IXmlRepository and save the key in DB?

GrabYourPitchforks · 2019-06-19T23:56:04Z

@kevinlo The IXmlRepository is supposed to be quasi-transactional. IMO the bug is in the file system repository, hence that's where I suggested making the change. If a custom-written repository has a similar bug, then a similar change can be made.

Edit: I see now that the work item is talking about a specific DB repository, not the file system repository (as has been the culprit in other work items which have been opened). In your particular case, the 20ms delay in the database itself violates a fundamental assumption of the DataProtection layer: all machines must have a consistent view of the key ring (IXmlRepository). If Machine A updates the repository at time t = 0, and if Machine B queries the repository at time t = 1, the changes made by machine A must be visible to Machine B.

kevinlo · 2019-06-20T17:38:40Z

@GrabYourPitchforks I prefer to use file system for the data protection key, but my architect choose DB. Do you have any documentation that I can show my architect that IXmlRepository needs to be quasi-transactional?

Back to your file system case, could you please explain more details how it works. When you say

if the file "key-00000000.xml" doesn't exist

Do you mean it does not exist in the application memory or in the file system?

E.g when both App1 and App2 are started, they both try to read the keys from the file system. If the file system is empy, they both try to create a new key and write "key-00000000.xml". Say App1 successfully writes it and App2 fails. Does App2 retry to refresh the key ring first before it tries to write it as "key-00000001.xml"? If App2 does not refresh and write it as "key-00000001.xml", how does App1 and App2 to refresh and get both keys. As you said the key name has nothing to do with the content, it is possible the key name with larger number has older key than the one with smaller number, right?

GrabYourPitchforks · 2019-06-21T02:37:45Z

Do you have any documentation that I can show my architect that IXmlRepository needs to be quasi-transactional?

I don't know if the docs say this explicitly. When we wrote the docs we were primarily concerned with things like correct usage of the APIs and describing configurable behaviors. We do try to write down assumptions made by the APIs, but on occasion we forget to list something because it never occurs to us that somebody might violate a behavior that we treat as a core assumption. In this case, the assumption is that changes made to the repository by one client are immediately visible to all other clients. And I can say as the original author of the DataProtection stack that this was definitely an assumption. :)

Whether this is a good assumption is a different matter, and perhaps that's a debate worth having, but then this moves from "bug" to "feature request". And even if the feature request is approved, there's no guarantee that IXmlRepository-derived types would ever be able to support the requested behavior. Perhaps an IXmlRepository2 or similar would be introduced for extension points that need the new behavior.

As you said the key name has nothing to do with the content, it is possible the key name with larger number has older key than the one with smaller number, right?

That's correct. If there's a conflict at write time, the keyring would need to refresh so that it would see any keys written by the client that "won" the race. You may even be able to perform this trick inside your database layer. I assume that there's some kind of consistency mechanism in the database so that two clients don't try to write the same entry within a single 200 millisecond window. If the DB can somehow detect "Hey - you're asking me to write 00000 but somebody has already beaten you to it. Wait a few milliseconds and try querying to see the value." then you might be able to fulfill the necessary assumptions within your repository.

Hope this helps!

kevinlo · 2019-06-21T16:48:42Z

@GrabYourPitchforks I think that assumption make sense, but now we have Couchbase consistency issue with data vs. index. Even with the quasi-transactional file system, do you think it still has the key renew case that I ask @HaoK?

If there's a conflict at write time, the keyring would need to refresh

My question is how to refresh the keyring from my app. At first, I submit this issue expecting ASP.NET Core would do it. But blowdart says it can cause DoS. From the Key management extensibility doc, it says the IKeyManager can Get all keys from storage, but it also has the warnings:

Writing an IKeyManager is a very advanced task, and the majority of developers shouldn't attempt it. Instead, most developers should take advantage of the facilities offered by the XmlKeyManager class.

Do you have sample codes how to force the App to refresh the keys?

HaoK · 2019-07-09T20:42:56Z

Fixed by #11987

natemcmaster added the area-dataprotection Includes: DataProtection label Nov 8, 2018

natemcmaster added Needs: Design This issue requires design work before implementating. 1 - Ready labels Apr 15, 2019

natemcmaster self-assigned this Apr 15, 2019

natemcmaster added this to the 3.0.0-preview7 milestone Apr 15, 2019

natemcmaster added the PRI: 3 - Optional label Jun 12, 2019

Pilchie unassigned natemcmaster Jun 19, 2019

HaoK self-assigned this Jun 19, 2019

HaoK modified the milestones: 3.0.0-preview7, 3.0.0-preview8 Jun 19, 2019

GrabYourPitchforks mentioned this issue Jun 21, 2019

AddDataProtection race condition #3514

Closed

HaoK mentioned this issue Jul 9, 2019

DataProtection: Refresh key rings for a short window #11987

Merged

HaoK added Done This issue has been fixed and removed 1 - Ready labels Jul 9, 2019

HaoK removed the Needs: Design This issue requires design work before implementating. label Jul 9, 2019

HaoK closed this as completed Jul 9, 2019

aKzenT mentioned this issue Aug 27, 2019

DataProtection permanently fails to recover when backing storage is purged #13476

Closed

ghost locked as resolved and limited conversation to collaborators Dec 3, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataProtection: IDataProtector.UnProtect can force refresh the key ring if it fails to unprotect #3975

DataProtection: IDataProtector.UnProtect can force refresh the key ring if it fails to unprotect #3975

kevinlo commented Nov 8, 2018

blowdart commented Nov 8, 2018

kevinlo commented Nov 8, 2018

natemcmaster commented Apr 15, 2019

kevinlo commented Apr 29, 2019

natemcmaster commented Jun 12, 2019

Pilchie commented Jun 19, 2019

HaoK commented Jun 19, 2019

HaoK commented Jun 19, 2019

HaoK commented Jun 19, 2019

GrabYourPitchforks commented Jun 19, 2019

kevinlo commented Jun 19, 2019 •

edited

Loading

GrabYourPitchforks commented Jun 19, 2019 •

edited

Loading

kevinlo commented Jun 20, 2019

GrabYourPitchforks commented Jun 21, 2019

kevinlo commented Jun 21, 2019

HaoK commented Jul 9, 2019

DataProtection: IDataProtector.UnProtect can force refresh the key ring if it fails to unprotect #3975

DataProtection: IDataProtector.UnProtect can force refresh the key ring if it fails to unprotect #3975

Comments

kevinlo commented Nov 8, 2018

blowdart commented Nov 8, 2018

kevinlo commented Nov 8, 2018

natemcmaster commented Apr 15, 2019

kevinlo commented Apr 29, 2019

natemcmaster commented Jun 12, 2019

Pilchie commented Jun 19, 2019

HaoK commented Jun 19, 2019

HaoK commented Jun 19, 2019

HaoK commented Jun 19, 2019

GrabYourPitchforks commented Jun 19, 2019

kevinlo commented Jun 19, 2019 • edited Loading

GrabYourPitchforks commented Jun 19, 2019 • edited Loading

kevinlo commented Jun 20, 2019

GrabYourPitchforks commented Jun 21, 2019

kevinlo commented Jun 21, 2019

HaoK commented Jul 9, 2019

kevinlo commented Jun 19, 2019 •

edited

Loading

GrabYourPitchforks commented Jun 19, 2019 •

edited

Loading