Investigate Solr performance issues (again) #10469
Comments
@landreev do you have a size estimate for how much time you'd want to spend on a preliminary investigation? (Understanding, of course, that a deep dive could be a major undertaking.)
Re: Solr logs: I realized that the "slow query" logging was not turned on in the Solr config either (I'm going to turn it back on in a moment).
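For anyone following along: Solr's slow-query logging is enabled in `solrconfig.xml` rather than a separate log config. A minimal sketch, assuming the stock Dataverse `solrconfig.xml` layout (the 1000 ms threshold is an illustrative value, not what IQSS prod uses):

```xml
<!-- In solrconfig.xml, inside the <query> section: any request that takes
     longer than this threshold is logged at WARN level to the
     org.apache.solr.core.SolrCore.SlowRequest logger. -->
<query>
  <slowQueryThresholdMillis>1000</slowQueryThresholdMillis>
</query>
```

The slow entries then show up in `solr.log` (or a dedicated `solr_slow_requests.log`, depending on the `log4j2.xml` shipped with the Solr version), which makes it possible to see whether the slowness comes from a few pathological queries or from uniform degradation.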
I'm not sure if this is helpful or not, but @landreev invited us to put ideas in this issue. Perhaps we should investigate distributing the load with SolrCloud. As I mentioned in Slack, a contributor tried to add it 8 years ago in #2985, but we weren't ready to think about it. Fast forward to today: we had a recent conversation with Chris Tate from Red Hat, who talked quite a bit about how much he recommends SolrCloud, in the 2024-03-21 Containerization Working Group meeting (notes, recording).
+1 for a shift to SolrCloud!
This is the definition of "done" for this effort/this phase of it:
Some rambling, FWIW: in creating #10579, I've discovered that Dataverse is creating a datafile_id_draft_permission doc in the case where a file is published and its file metadata hasn't changed, in which case the code at
In looking through the code, it's also clear that the code linked above misses many cases where we shouldn't have to update the doc, e.g. when a file only exists in a draft and the dataset is updated, or when a new version is published (although the datasetVersionId would change in this case). Refactoring to use the last modified date of a file to indicate when it (its file metadata) last changed, versus when the dataset was last modified, might provide a better way to detect when reindexing a file is required than looking for filemetadata changes between versions. Atomic updates could also be a big improvement - for full text indexing for sure, and also in combination with a date-based check (so we could just update the datasetVersionId, for example). For this we might need to know that the doc already exists, but we're already doing the necessary query (used now to find which docs aren't needed, so it would be easy to also keep track of which existing docs will be needed).
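To make the atomic-update idea concrete: instead of re-sending a file's whole document (including the expensive full-text field), Solr's atomic update syntax lets us modify individual fields in place. A hedged sketch of the request body, assuming the field and doc-id naming conventions described above (`datafile_…` id, `datasetVersionId` field); the exact names in Dataverse's schema may differ:

```json
[
  {
    "id": "datafile_123",
    "datasetVersionId": { "set": 456 }
  }
]
```

POSTed to the collection's `/update` handler as JSON, this rewrites only the `datasetVersionId` field of the existing doc (Solr internally re-creates the doc from stored/docValues fields). Note the standard caveat: atomic updates require all non-copyField fields to be stored or have docValues, which would need to be verified against our schema before relying on this.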
I'll need to re-read the above to fully grasp it, but it sounds like a good lead/another area where we can optimize things further.
Oh - yes :-) I think there were other orphan dataset/perms docs being created, but #10579 should remove those/no longer generate them. The scenario above is the one I didn't get rid of.
(Sorry, I haven't been paying attention to the developments in #10579, but will catch up next week!)
Since this now meets the definition of "done" for this phase as defined above (with the first patch, containing #10547 and #10555, deployed in prod), I'm going to close it, for accounting purposes etc.
Solr-related issues are becoming quite severe in IQSS prod. There's some anecdotal evidence of other instances experiencing this as well, so opening the issue here, in the main repo.
At IQSS, there are general reports from users that searching is becoming increasingly slow. There are also datasets that fail to get indexed, on creation and/or publication. Initial harvests of large remote archives (thousands of datasets or more; something we used to be able to do routinely) no longer work: the records get successfully harvested and imported, but indexing stops after 2-3K, sometimes after only hundreds of records.

A recently merged PR (#10388) addresses indexing performance by ensuring that only a limited number of indexing jobs can be executed in parallel, via semaphores. We will see what effect it has on performance once 6.2 is deployed. However, I am worried that parallel/simultaneous execution may not be the only, or even the main, issue. When trying to reindex a huge collection of harvested datasets, even with sleeping for 10 or 15 sec. between calls, it appears that Solr becomes overwhelmed around 3K datasets in. Is this a memory leak? Is this something exponential in the number of datasets in a collection that we have introduced recently? Is this simply a side effect of upgrading to Solr 9.* (the problem appears to have become especially severe since the 6.0 upgrade)?
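For readers unfamiliar with the semaphore approach in #10388: the idea is simply that a counting semaphore caps how many indexing jobs run concurrently, so a burst of publish/harvest events queues up instead of overwhelming Solr. A minimal, self-contained sketch of that pattern (class and method names here are illustrative, not Dataverse's actual code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch of throttling indexing jobs with a counting semaphore:
// at most maxConcurrent jobs touch Solr at once; the rest block.
public class IndexThrottle {
    private final Semaphore permits;
    private final AtomicInteger current = new AtomicInteger();
    private final AtomicInteger maxObserved = new AtomicInteger();

    public IndexThrottle(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    public void index(Runnable job) throws InterruptedException {
        permits.acquire(); // blocks if too many jobs are already running
        try {
            int now = current.incrementAndGet();
            maxObserved.accumulateAndGet(now, Math::max); // track peak concurrency
            job.run(); // the actual Solr update would happen here
        } finally {
            current.decrementAndGet();
            permits.release();
        }
    }

    public int maxObservedConcurrency() { return maxObserved.get(); }

    public static void main(String[] args) throws Exception {
        IndexThrottle throttle = new IndexThrottle(4);
        ExecutorService pool = Executors.newFixedThreadPool(32);
        for (int i = 0; i < 100; i++) {
            pool.submit(() -> {
                try {
                    throttle.index(() -> {
                        try { Thread.sleep(5); } catch (InterruptedException ignored) { }
                    });
                } catch (InterruptedException ignored) { }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        // 32 submitting threads, but only 4 permits: peak concurrency stays <= 4.
        System.out.println(throttle.maxObservedConcurrency() <= 4);
    }
}
```

The point of my worry above is that this only fixes contention from parallel execution; if Solr degrades with sequential load too (as the harvest reindexing suggests), throttling alone won't be enough.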