-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle cached result column names and rows separately #6504
Conversation
Hi! After reading this thread, I think the result cache is unlikely to get cleared between deploys. Because of that, there will be a performance hit. What else will there be? How will DBAL behave when attempting to unserialize a row cached by the previous version? I think an intermediate solution between doing nothing and migrating the cache at runtime could be to discard cache entries that look like they were generated by the previous version, also at runtime. Would that be significantly less effort? |
👋
I don't think there are any other implications. I can imagine that unless cache is implemented ideally (which is rarely the case), it needs to be cleared every now and then. A dependency update looks like an adequate reason to clear cache.
The current implementation doesn't support this scenario. Likely, there will be some notices like "Undefined index" and maybe more severe errors.
Yeah, this behavior looks reasonable in general but it will have the same performance implications as clearing cache by the user. The downside of it is if the code is capable of ignoring cache by design, it's prone to the issues where an error in the code or infrastructure, instead of manifesting in application-level errors will manifest in performance degradation. These situations are much more subtle and harder to diagnose. As an operator of such an application, I would be comfortable with seeing the errors, reading the upgrade documentation, clearing the cache and getting the problem solved. I would be less calm if I knew that this application can ignore cache w/o reporting it in any way. |
I see what you mean. Here is another idea: what if, upon detecting an old version of the cache, we threw a dedicated exception instead of letting the users get to the errors about undefined index and so on? I'm assuming there is a reliable way to detect old versions of the cache, and that there aren't too many places in the code where that would need to be done. In any case, I think this should ship in 4.2.0, because a good upgrade guide should make that OK. |
I think we should reproduce the exact scenario of the cache format mismatch and act based on that. Besides that, since we're discussing this, I though maybe instead of storing a tuple of column names and rows in the cache, we can split |
Throwing an exception will unnecessarily complicate the upgrade. It will block the execution of the application and will prevent it from naturally populating cache with the data formatted the new way. If it was a warning, the operator could either see them and clear cache or even ignore them and get the cache naturally updated over time. Caching is not a functional feature of an application, so its misbehavior shouldn't be silenced but shouldn't be blocking either. |
Sorry, I don't understand what you mean by that, can you elaborate? Regarding the warning, I think that would be a sensible solution indeed 👍 |
In my current version of the feature,
The column names and rows are stored in cache as a tuple:
Using a tuple looks clumsy because:
I propose that we break readonly class CachedData {
public function __construct(private array $columnNames, private array $rows) {}
}
class CachedResult implements Result {
private int $num = 0;
public function __construct(private CachedData $data) {}
} This way, Another idea (which doesn't invalidate but complements the above) is that instead of storing an associative array of results in the cache item (see below), we could store an object there. Lines 808 to 811 in 7c9d983
This way, instead of validating, discarding and updating elements of this array one by one, we could discard the whole value, if it's not an object of the expected class. Even though this is unlikely, if we need to change the cache format again, this approach will allow the introduction of format versioning and further automatic cache invalidation during upgrades. |
Thanks for the explanation, I think moving towards more object is a good idea. Using objects will result in a slightly bigger cache because of the way |
Thank you for working on this issue! We must be able to handle an old cache gracefully. We can recommend clearing the cache, we can document it, and people will still deploy without clearing the cache. Not clearing the cache is fine in ~99% of the deployments and it's the favorable option if the alternative is starting with a cold cache after deployment. We should either upgrade to discard old cache entries. If we don't want to deal with them at all, we could just change our naming strategy for cache keys: Change the strategy for cache key generationThis way, we will never read an old entry and since we don't access them anymore, they should be purged from the cache eventually. However, this might mean that the cache might grow to twice its usual size for a short period of time. Also, the app will effectively have a cold cache after deployment. Detect and discard old cache data when readingOld cache data would be treated as if it were expired. This adds a little complexity to our code which we might want to remove in 5.0. On the other hand, the cache size should remain roughly the same. The detection logic however could have a performance impact that we should keep as low as possible. Apart from that, the effect would be the same as in the previous option. Detect and upgrade cache data when readingOld cache data can easily be upgraded to the new format. This way, the app would keep its warm cache, but the data transformation will have a runtime impact. If that impact is lower than a full roundtrip to the database, it might be worth it. We could even write back the upgraded data to the cache. However, since the old entries should expire eventually, that should not really be necessary. |
We might have another inconsistency with regards to column names: How should a result behave if we call Which behavior do we actually want? |
I forgot to mention this but the current implementation is a deliberate choice (not necessarily correct). Lines 91 to 94 in a41d81d
So I believe we should retain the metadata upon clearing. |
Personally, I'm not sold on the improvement of the upgrade experience at the cost of complicating the code at this point. It's not data, it's cache. My major concern is not really about the code (which should be easy to write), it's primarily about the scenarios which we need to account for while designing the upgrade logic. E.g. how will a given solution work in a rolling upgrade scenario where multiple processes running different versions of the DBAL will read from and write to the cache? This eliminates the approach of upgrading cache in place because it will break the process(es) still running the old version. The second concern is automated testing. The upgrade scenarios would have to be tested somehow, and we don't have the tooling for testing upgrades. It means that we'd have to build some one-off tooling for that. With that said, I will totally understand if the proposed change in its current form is not acceptable for a minor release (maybe we can move it to the next major) or at all. But I'm not interested in building automation for a scenario that could be handled by clearing cache. Hopefully, it doesn't sound like I'm irritated or upset. I don't have any attachment to this pull request and am just stating the facts. |
I'd be happy if the new version simply does not choke on the old cache format because I'm pretty sure that this is going to backfire at us. The code we would need for that should be less annoying to maintain than the tickets we will need to close.
I wouldn't overthink this. Just pre-populate an array cache with the old format and try reading a result from it.
Shall I take over the part with the cache upgrading? Either way, we should document in our UPGRADE.md file that the result cache format will change and recommend to clear the caches when upgrading. |
That would be great. Thank you! |
All right. Let's ignore cache upgrading for now and I'll work on that in a follow-up. Can you please rebase your PR to resolve the conflicts and document the cache change in UPGRADE.md? |
@derrabus, done. |
…buh) This PR was merged into the 5.4 branch. Discussion ---------- [Messenger] fix test to be compatible with DBAL 4.2 | Q | A | ------------- | --- | Branch? | 7.1 | Bug fix? | no | New feature? | no | Deprecations? | no | Issues | | License | MIT see doctrine/dbal#6504 Commits ------- 447bf7e fix test to be compatible with DBAL 4.2
| Q | A |------------- | ----------- | Type | improvement | Fixed issues | Follows #6504 #### Summary I'm changing the query cache format once again: Instead of storing a tuple of arrays, I'm now storing the whole `ArrayResult` object. This way, we can easily detect if we support the data returned from the cache and we can handle future cache format changes in the `__unserialize()` methods. After #6504, we would run into errors if an app would attempt to us a cache written with DBAL 4.1. With my change, the old cache is ignored and the app would behave as if it had encountered a cache miss instead. I've also added tests covering the serialization of `ArrayResult` objects.
Summary
Prior to #6428, there were two problems with caching results by design:
SELECT a.id, b.id FROM a, b
), they will clash even if fetched in the numeric mode.See #6428 (comment) for reference.
The solution is:
fetchNumeric()
.Additional notes:
TestUtil::generateResultSetQuery()
were necessary for the integration testing of the changes inArrayResult
. However, later, I realized that the modified code is database-agnostic, so running it in integration with each database platform would effectively test the same code path and would be redundant. I replaced the integration tests with the unit ones but left the change inTestUtil
as it makes the calling code cleaner.ArrayStatementTest
was renamed toArrayResultTest
, which Git doesn't seem to recognize due to the amount of changes.TODO: