DataProtection permanently fails to recover when backing storage is purged #13476
Data Protection is built on the premise that the key ring is persisted and won't go away. As Redis can be configured to persist data in case of a restart, that is where we'd advise people to go, rather than start to insert code to support a scenario it was never designed for.
I understand that. However, data persistence is only available in the Azure premium tiers, which start at ~400 dollars a month. This will not be an option for many if not most users of the framework. And even with that, no service will offer a 100% guarantee of data persistence. Failures can and will happen. As I said, I would completely understand that existing sessions cannot be restored in this case, but I think it is unfortunate that the framework essentially gets into an unrecoverable "corrupted" state in these situations without any way to recover from it. My suggestion would improve these situations without any fundamental changes to the framework. My guess is that the changes to the code would be quite small and might even make the previous change from #3975 obsolete.
In that case blob storage is an option; it's not like data protection needs Redis, it's a read operation that happens once every couple of weeks.
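For readers in the same situation, here is a minimal sketch of the blob storage option mentioned above (assuming the Azure blob storage Data Protection package; the SAS URI is a placeholder, not something taken from this thread):

```csharp
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Persist the key ring to durable blob storage instead of the Redis cache,
        // so a cache purge does not destroy the keys. The SAS URI is a placeholder.
        services.AddDataProtection()
            .PersistKeysToAzureBlobStorage(
                new System.Uri("https://<account>.blob.core.windows.net/dataprotection/keys.xml?<sas-token>"));
    }
}
```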
I think blob storage is certainly an option, which we will look into to avoid the problem in the future in case the framework remains as-is. I still think it would be a good idea to make the Redis provider more resilient out of the box, as most users won't be aware of this potential issue and using Redis instead of blob storage is a natural choice for many projects that already use Redis for caching. At the very least the documentation should have a prominent warning that using Redis is only recommended if persistence is enabled, as otherwise the application can become broken until a manual restart.
A doc update is certainly a good start. @Rick-Anderson could you compose something suitably scary?
@aKzenT I plan on adding the scary warning in Configure ASP.NET Core Data Protection and adding links to that section in ASP.NET Core Data Protection.
Thanks Rick :)
Describe the bug
If several applications/workers share the same key ring, e.g. using the Redis provider, the Data Protection keys of the various instances get out of sync if the backing storage is purged (e.g. a Redis server restart) and one of the workers is restarted (or another instance is started).
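For context, the shared key ring setup described here typically looks something like the following (a sketch assuming the Microsoft.AspNetCore.DataProtection.StackExchangeRedis package; the connection string and key name are placeholders):

```csharp
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.DependencyInjection;
using StackExchange.Redis;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // All instances point at the same Redis key, so they share one key ring.
        // If that Redis entry is purged, a freshly started instance sees an empty
        // key ring and generates a brand-new key, which the older instances never load.
        var redis = ConnectionMultiplexer.Connect("<redis-connection-string>");
        services.AddDataProtection()
            .PersistKeysToStackExchangeRedis(redis, "DataProtection-Keys");
    }
}
```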
We had this issue happen on a standard Azure hosting setup using a Redis server and a web app with automatic scaling. The Redis server lost its cache entries, probably because of a maintenance restart. The running instances were unaffected, as they use the key ring loaded on startup and their keys were not expiring for a long time. However, when a new instance was started, it did not find the key ring in the Redis cache and created a new key which was instantly activated. Data encrypted by this new instance was then no longer decryptable by the previous instances, causing massive failures on the production system ("Key {...} was not found in the key ring").
I understand that purging of the key ring is not a directly "supported" scenario and that at the very least it will cause some encrypted data to be permanently lost. However, I do believe that the framework should be robust enough to recover from this situation within a reasonable amount of time, especially since it can occur in a very common hosting setup. The only way to get our instances back in sync was to restart all instances manually. Otherwise the issue would probably have persisted for several days, until either the original keys expired or all instances had been recycled for other reasons.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Errors may occur, but a recovery mechanism is in place which will automatically reload the keyring if necessary.
I see that similar issues were discussed in issue #3975 and code was added that allows an automatic key refresh in the first minutes of an application's lifetime. However, this addition is not enough to fix the problem described here, as the problem does not occur during application startup. IIRC, the referenced issue also described problems just after key expiration, which might likewise not be addressed by the original fix.
As I understand it, we do not want to simply refresh the key ring whenever an unknown key is found, as this might open up a DoS attack where an attacker sends lots of unknown key IDs, forcing the DP framework to constantly go to the backing store.
My proposal is to use a sliding-window approach to mitigate this concern. The first time we encounter an unknown key, we refresh the key ring from the backing store and then block additional refreshes for a certain amount of time (e.g. 5 minutes). This would fix this and related issues and still prevent DoS attacks.
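To make the idea concrete, here is a rough illustrative sketch of the throttling behaviour being proposed. None of these types exist in the framework today; the real change would live inside the key ring provider rather than in user code:

```csharp
using System;

// Hypothetical sketch: rate-limit key ring refreshes triggered by unknown key IDs.
public class ThrottledKeyRingRefresher
{
    private readonly TimeSpan _minInterval = TimeSpan.FromMinutes(5);
    private DateTimeOffset _lastRefresh = DateTimeOffset.MinValue;
    private readonly object _lock = new object();

    // Called when decryption encounters a key id that is not in the cached key ring.
    // Returns true if a refresh from the backing store was actually triggered.
    public bool TryRefreshForUnknownKey(Action refreshFromBackingStore)
    {
        lock (_lock)
        {
            var now = DateTimeOffset.UtcNow;
            if (now - _lastRefresh < _minInterval)
            {
                // Within the blocking window: do not hit the backing store again,
                // which caps the rate at which an attacker can force reloads.
                return false;
            }
            _lastRefresh = now;
        }

        refreshFromBackingStore();
        return true;
    }
}
```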
I would be willing to work on a PR if the overall idea is acceptable to you.