Performance issues as a side effect of the file lookup optimizations #9725
To clarify, we have no reason to assume that the code underneath tries to keep the entire output of the lookup query in memory all at once (although the way that particular dataset crashed the application on our performance test system at least once was suspicious). I would assume that the code goes through the list of hits sequentially and builds the entities out of it as needed, which does not require caching it all. Still, going through 2 million hits and gigabytes of data in order to build a list of 2,200 files feels a bit excessive, even if it does not make the application run out of memory.

What makes this specific dataset even trickier is that it has 220 versions, with the 2,200 files spread between them. There are only 12 files in the last published version, and those are all the files that need to be looked up for the purposes of the dataset page load - but the new code goes to all the trouble above to look up all of them.

This is an exotic case, but we definitely have more like it. AND we are seeing performance degradation on datasets with less exotic structure too. The number of guestbookresponse entries appears to be the factor in the cases observed so far, but I don't think it is safe to assume that no other type of object that cascades off of DataFile could cause a similar issue.
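To illustrate the "goes through the hits sequentially" assumption above: a single sorted join result can be folded into a much smaller entity list in one pass, holding only the previous key rather than the whole result set. This is a hypothetical sketch of that pattern, not the actual EclipseLink internals:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Each joined row repeats the owning file's id once per joined child
// (guestbook response, category, ...). If the rows arrive grouped by
// file id, building the distinct file list needs only the previously
// seen id - constant extra memory, regardless of the row count.
class RowFolder {
    static List<Long> distinctFileIds(Iterator<Long> fileIdPerRow) {
        List<Long> files = new ArrayList<>();
        Long prev = null;
        while (fileIdPerRow.hasNext()) {
            Long id = fileIdPerRow.next();
            if (!id.equals(prev)) { // a new file starts at this row
                files.add(id);
                prev = id;
            }
        }
        return files;
    }
}
```

Even so, as noted above, the pass itself still has to consume every row the query produces, which is where the time goes.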
(documenting what was discussed locally on Slack)

(While it is of course possible to add a hint in that method for the FileDownload entities as well, …) The performance is still not very good on datasets with a more exotic version structure, like the one described in the previous comment (2,200 files in 220 versions): while the page loads in 3-4 seconds under 5.13 (there are only 12 files in the latest published version), it takes more than a minute under the current release candidate. We don't have that many datasets like this one, but we do have them, so we need to figure out a way to address this. As an experiment, I'm trying to put together a similar …

I am still not 100% sure that there are no other special/exotic cases where this single-query approach may become problematic ("too many" of something else in the tree structure of the dataset/version... too many fileAccessRequests? too many categories? - idk...), so we need to think about this. Also, we will retest more carefully, on more datasets, how the performance of the dataset page in its current state (with the GuestbookResponse lookup removed) compares to that of 5.13.
… "findDeep" on the requested version, instead of the full dataset. (Needs more testing). #9725
The approach I described above - make the dataset page use an optimization similar to the "findDeep" lookup in the commit referenced above, applied to the requested version instead of the full dataset.
In our IQSS production instance, this solution - retrieving all the files plus nested objects with one query with multiple left joins - appears to become somewhat counter-productive on datasets with roughly 10,000+ files, at which point the time it takes to execute that single query exceeds the time it takes to load the page in 5.13. We only have ~10 such datasets, and the degradation is less than a factor of 2 in the worst case. Considering that for most of our datasets the performance is better than under 5.13, we can probably live with that in a release.

As a longer-term solution, we will need to modify the page so that it does not need to initialize ALL the files in the version, as opposed to the page-worth of files that we are actually displaying. We will also want to create an API for this paging-style retrieval of files. We have an issue already scheduled for this, so we can start working on it immediately after 5.14. (#9763)
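For the paging-style retrieval described above, in JPA this would typically map to `Query.setFirstResult(offset)` / `setMaxResults(pageSize)`; the sketch below just models the offset/limit arithmetic with plain lists (names and page size are hypothetical, not the actual Dataverse API):

```java
import java.util.ArrayList;
import java.util.List;

class FilePager {
    // Models fetching one page of files at a time - the way a JPA query
    // would with setFirstResult(offset) and setMaxResults(pageSize) -
    // instead of initializing every file in the version up front.
    static List<List<Integer>> pages(List<Integer> allFileIds, int pageSize) {
        List<List<Integer>> pages = new ArrayList<>();
        for (int offset = 0; offset < allFileIds.size(); offset += pageSize) {
            int end = Math.min(offset + pageSize, allFileIds.size());
            pages.add(allFileIds.subList(offset, end)); // one "query" per page
        }
        return pages;
    }
}
```

The point of the design is that the expensive per-file initialization cost is then bounded by the page size, not by the total file count of the version.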
(The good news is that this back-end pagination is already being addressed in #9693.)
I said earlier:
I may have finally gotten it to the point where the above is no longer true, mostly by further trimming the lookup query. The monstrous dataset we have in prod. that has 42K files now loads much faster than under 5.13.
Adding @ErykKul, since there may be some kinks that need to be ironed out after the optimization PR #9558 was merged.
The PR above has significantly improved the performance in many scenarios, but there may be some unintended consequences/side effects. As part of releasing v5.14, performance tests were run on some of the IQSS real-life prod. datasets, and we noticed that the performance of the dataset page dropped for some datasets - for some of them catastrophically so. It did not appear to be a direct function of the number of files in the dataset, but rather related to the number of cascade-related objects. Specifically, the first thing I noticed was that there was now 1 extra query for every entry in the guestbookresponse table for every file in the dataset; i.e., for one specific dataset that became particularly unusable, that was 68K extra queries - but we definitely have datasets with many more downloads. This is the result of the 1:1 relationship between GuestbookResponse and FileDownload.

It's not just the GuestbookResponse, though. It appears that the optimization underneath relies on multiple `LEFT OUTER JOIN`s in the query that's run underneath to get all the files in one pass. Here's the monstrous query issued: …

Because of all these JOINs, for this specific dataset (2,200 files) the query, if run on the command line with psql, takes 3.5 minutes and produces 2M hits and 9.5GB of output (that's text output, of course; it should take less as memory objects, but is most likely still prohibitive for the application to handle). Dropping the guestbookresponse from the query (i.e., removing all the t13 entries from it) makes it run in 1+ minute and produce 46K hits and 200MB of output. (This is probably the equivalent of dropping

`setHint("eclipselink.left-join-fetch", "o.files.guestbookResponses")`

from the code? - not sure.) I haven't tested yet whether this makes the page work for the dataset. There are definitely many more repeated queries on many other tables; it was just the filedownload ones that were the most dramatic and jumped out at me right away.
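Rough arithmetic behind those row counts: when several collections are join-fetched in a single query, each file's rows are roughly the product of its collection sizes (a cartesian fan-out per file), and an empty collection still contributes one NULL-padded row under a `LEFT OUTER JOIN`. A back-of-the-envelope sketch, with made-up per-file counts (the real ones vary file to file):

```java
class JoinFanOut {
    // Rows a single file contributes when several collections are
    // join-fetched in one query: the product of the collection sizes,
    // where an empty collection still yields one NULL-padded row
    // under LEFT OUTER JOIN.
    static long rowsPerFile(long... collectionSizes) {
        long rows = 1;
        for (long size : collectionSizes) {
            rows *= Math.max(size, 1);
        }
        return rows;
    }
}
```

Consistent with the numbers above: ~2M hits over 2,200 files is roughly 900 rows per file with the guestbookresponse join in place, versus ~21 per file (46K hits) once it is dropped - which is why removing a single hint changes the output by an order of magnitude.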