DataProtection permanently fails to recover when backing storage is purged #13476
Data Protection is built on the premise that the key ring is persisted and won't go away. As Redis can be configured to persist data in case of a restart, that is where we'd advise people to go, rather than start to insert code to support a scenario it was never designed for.
I understand that. However, data persistence is only available in the Azure premium tiers, which start at ~400 dollars a month. This will not be an option for many if not most users of the framework. And even with that, no service will offer a 100% guarantee of data persistence. Failures can and will happen. As I said, I would completely understand that existing sessions cannot be restored in this case, but I think it is unfortunate that the framework essentially gets into an unrecoverable "corrupted" state in these situations without any way to recover from it. My suggestion would improve these situations without any fundamental changes to the framework. My guess is that the changes to the code would be quite small and might even make the previous change from #3975 obsolete.
In that case blob storage is an option; it's not like data protection needs Redis, it's a read operation that happens once every couple of weeks.
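For readers in the same situation, here is a minimal sketch of the blob storage option mentioned above (assuming the Azure blob storage Data Protection package; the SAS URI is a placeholder, not something taken from this thread):

```csharp
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Persist the key ring to durable blob storage instead of the Redis cache,
        // so a cache purge does not destroy the keys. The SAS URI is a placeholder.
        services.AddDataProtection()
            .PersistKeysToAzureBlobStorage(
                new System.Uri("https://<account>.blob.core.windows.net/dataprotection/keys.xml?<sas-token>"));
    }
}
```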
I think blob storage is certainly an option, which we will look into to avoid the problem in the future in case the framework remains as-is. I still think it would be a good idea to make the Redis provider more resilient out of the box, as most users won't be aware of this potential issue and using Redis instead of blob storage is a natural choice for many projects that already use Redis for caching. At the very least the documentation should have a prominent warning that using Redis is only recommended if persistence is enabled, as otherwise the application can become broken until a manual restart.
A doc update is certainly a good start. @Rick-Anderson could you compose something suitably scary?
@aKzenT I plan on adding the scary warning in Configure ASP.NET Core Data Protection and adding links to that section in ASP.NET Core Data Protection.
Thanks Rick :)
Describe the bug
If several applications/workers share the same key ring, e.g. using the Redis provider, the Data Protection keys of the various instances get out of sync if the backing storage is purged (e.g. a Redis server restart) and one of the workers is restarted (or another instance is started).
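For context, the shared key ring setup described here typically looks something like the following (a sketch assuming the Microsoft.AspNetCore.DataProtection.StackExchangeRedis package; the connection string and key name are placeholders):

```csharp
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.DependencyInjection;
using StackExchange.Redis;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // All instances point at the same Redis key, so they share one key ring.
        // If that Redis entry is purged, a freshly started instance sees an empty
        // key ring and generates a brand-new key, which the older instances never load.
        var redis = ConnectionMultiplexer.Connect("<redis-connection-string>");
        services.AddDataProtection()
            .PersistKeysToStackExchangeRedis(redis, "DataProtection-Keys");
    }
}
```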
We had this issue happen on a standard Azure hosting setup using a Redis server and a web app with automatic scaling. The Redis server lost its cache entries, probably because of a maintenance restart. The running instances were unaffected, as they use the key ring loaded on startup and their keys were not expiring for a long time. However, when a new instance was started, it did not find the key ring in the Redis cache and created a new key which was instantly activated. Data encrypted by this new instance was then no longer decryptable by the previous instances, causing massive failures on the production system ("Key {...} was not found in the key ring").
I understand that purging of the key ring is not a directly "supported" scenario and that at the very least it will cause some encrypted data to be permanently lost. However, I do believe that the framework should be robust enough to recover from this situation within a reasonable amount of time, especially since it can occur in a very common hosting setup. The only way to get our instances back in sync was to restart all instances manually. Otherwise the issue would probably have persisted for several days, until either the original keys expired or all instances had been recycled for other reasons.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Errors may occur, but a recovery mechanism is in place which will automatically reload the keyring if necessary.
I see that similar issues were discussed in issue #3975 and code was added that allows an automatic key refresh in the first minutes of an application's lifetime. However, this addition is not enough to fix the problem described here, as the problem does not occur during application startup. IIRC, the referenced issue also described problems just after key expiration, which might likewise not be addressed by the original fix.
As I understand it, we do not want to simply refresh the key ring whenever an unknown key is found, as this might open up a DoS attack where an attacker sends lots of unknown key IDs, forcing the DP framework to constantly go to the backing store.
My proposal is to use a sliding-window approach to mitigate this concern. The first time we encounter an unknown key, we refresh the key ring from the backing store and then block additional refreshes for a certain amount of time (e.g. 5 minutes). This would fix this and related issues and still prevent DoS attacks.
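To make the idea concrete, here is a rough illustrative sketch of the throttling behaviour being proposed. None of these types exist in the framework today; the real change would live inside the key ring provider rather than in user code:

```csharp
using System;

// Hypothetical sketch: rate-limit key ring refreshes triggered by unknown key IDs.
public class ThrottledKeyRingRefresher
{
    private readonly TimeSpan _minInterval = TimeSpan.FromMinutes(5);
    private DateTimeOffset _lastRefresh = DateTimeOffset.MinValue;
    private readonly object _lock = new object();

    // Called when decryption encounters a key id that is not in the cached key ring.
    // Returns true if a refresh from the backing store was actually triggered.
    public bool TryRefreshForUnknownKey(Action refreshFromBackingStore)
    {
        lock (_lock)
        {
            var now = DateTimeOffset.UtcNow;
            if (now - _lastRefresh < _minInterval)
            {
                // Within the blocking window: do not hit the backing store again,
                // which caps the rate at which an attacker can force reloads.
                return false;
            }
            _lastRefresh = now;
        }

        refreshFromBackingStore();
        return true;
    }
}
```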
I would be willing to work on a PR if the overall idea is acceptable to you.