Investigate Solr performance issues (again) #10469

Closed · landreev opened this issue Apr 8, 2024 · 10 comments
Assignees: landreev
Labels: GREI 3 Search and Browse · Size: 30 (a percentage of a sprint; 21 hours; formerly size:33) · Status: Needs Input (applied to issues in need of input from someone currently unavailable)

Comments
@landreev (Contributor) commented Apr 8, 2024

Solr-related issues are becoming quite severe in IQSS prod. There's some anecdotal evidence of other instances experiencing this as well, so opening the issue here, in the main repo.

At IQSS, there are general reports from users that searching is becoming increasingly slow. There are also datasets that fail to get indexed on creation and/or publication. Initial harvests of large remote archives (thousands of datasets or more; something we used to be able to do routinely) no longer work: the records get successfully harvested and imported, but indexing stops after 2-3 thousand records, sometimes after only a few hundred. A recently merged PR (#10388) addresses indexing performance by ensuring, via semaphores, that only a limited number of indexing jobs can be executed in parallel. We will see what effect it has on performance once 6.2 is deployed. However, I am worried that parallel/simultaneous execution may not be the only, or even the main, issue. When trying to reindex a huge collection of harvested datasets, even with 10-15 sec. of sleep between calls, Solr appears to become overwhelmed around 3K datasets in. Is this a memory leak? Is this something we introduced recently that grows exponentially with the number of datasets in a collection? Is this simply a side effect of upgrading to Solr 9.x? (It appears that the problem has become especially severe since the 6.0 upgrade.)
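For readers new to the throttling idea, here is a minimal sketch of limiting concurrent indexing with a counting semaphore; the class name, method names, and the permit count are illustrative assumptions, not the actual #10388 implementation:

```java
import java.util.concurrent.Semaphore;

// Illustrative sketch only, not the Dataverse code: cap the number of
// indexing jobs that can hit Solr at the same time.
public class IndexThrottle {

    // Hypothetical cap of 4 concurrent jobs; in practice the limit would be configurable.
    private static final Semaphore PERMITS = new Semaphore(4);

    public static void indexDataset(long datasetId) throws InterruptedException {
        PERMITS.acquire();          // blocks if 4 jobs are already running
        try {
            reindex(datasetId);     // stand-in for the real indexing call
        } finally {
            PERMITS.release();      // always return the permit, even on failure
        }
    }

    private static void reindex(long datasetId) {
        // placeholder for the actual Solr indexing logic
        System.out.println("indexing dataset " + datasetId);
    }
}
```

Throttling like this protects Solr from bursts of parallel work, but it would not by itself explain a slowdown that grows with the number of datasets already indexed, which is why the questions above still matter.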

@cmbz commented Apr 19, 2024

@landreev do you have a size estimate for how much time you'd want to spend on a preliminary investigation? (Understanding, of course, that a deep dive could be a major undertaking)

@landreev landreev added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label Apr 25, 2024
@cmbz cmbz moved this to SPRINT READY in IQSS Dataverse Project May 2, 2024
@landreev (Contributor, Author) commented May 6, 2024

Re: the Solr logs: I realized that "slow query" logging was not turned on in the Solr config either (going to turn it back on in a moment).
Otherwise, I'll put up a snapshot of the currently available logs on demo (?) or someplace else where they are accessible to the dev. team.
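For reference, a minimal example of what turning slow-query logging back on looks like in solrconfig.xml; the 1000 ms threshold here is an arbitrary choice, and queries slower than it are written to Solr's separate slow-requests log:

```xml
<!-- solrconfig.xml, inside the <query> section:
     log any search that takes longer than 1 second to the slow-requests log -->
<query>
  <slowQueryThresholdMillis>1000</slowQueryThresholdMillis>
  <!-- other existing <query> settings unchanged -->
</query>
```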

@pdurbin (Member) commented May 6, 2024

I'm not sure if this is helpful or not but @landreev invited us to put ideas in this issue.

Perhaps we should investigate distributing the load with SolrCloud. As I mentioned in Slack, a contributor tried to add it 8 years ago in #2985 but we weren't ready to think about it.

Fast forward to today and we had a recent conversation with Chris Tate from Red Hat who talked quite a bit about how much he recommends SolrCloud in the 2024-03-21 Containerization Working Group meeting (notes, recording).
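As a rough illustration of what such a move would mean on the client side (a hedged SolrJ sketch with placeholder URLs and collection name, not Dataverse's actual search code): a SolrCloud-aware client discovers the cluster state and spreads requests across shards and replicas instead of talking to a single core.

```java
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudHttp2SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

// Sketch only: query a SolrCloud collection; the client fetches cluster state
// from the listed node(s) and routes each request to an appropriate shard/replica.
public class CloudQuerySketch {
    public static void main(String[] args) throws Exception {
        try (CloudHttp2SolrClient solr =
                 new CloudHttp2SolrClient.Builder(List.of("http://solr-node1:8983/solr")).build()) {
            QueryResponse rsp = solr.query("collection1", new SolrQuery("*:*"));
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}
```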

@landreev landreev moved this from SPRINT READY to In Progress 💻 in IQSS Dataverse Project May 8, 2024
@landreev landreev self-assigned this May 8, 2024
@cmbz cmbz added Size: 80 A percentage of a sprint. 56 hours. and removed Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) labels May 8, 2024
@johannes-darms (Contributor) commented

+1 for a shift to solrCloud!

@cmbz cmbz added Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) and removed Size: 80 A percentage of a sprint. 56 hours. labels May 22, 2024
@landreev (Contributor, Author) commented

This is the definition of "done" for this effort/this phase of it:

  1. Merge the Solr-related PRs currently in the works.
  2. Build a deployable prod. patch; ideally with all of the above, but, at a minimum, #10555, for starters.
  3. Deploy the above in prod. Debating if we should do this when the prod. Solr instance is upgraded to RedHat 8 by LTS (which will require some search engine downtime), tentatively scheduled for June 14. May also use the opportunity to force a full reindex. [Maybe worth deploying something prior to that, #10555 specifically?]
  4. New issues opened for all the known problems and/or potential optimizations not covered in the PRs above [pending].

@pdurbin pdurbin added the Status: Needs Input Applied to issues in need of input from someone currently unavailable label May 29, 2024
@qqmyers (Member) commented Jun 13, 2024

Some rambling, FWIW: in creating #10579, I've discovered that Dataverse creates a datafile_id_draft_permission doc in the case where a file is published and its file metadata hasn't changed, even though the linked code avoids creating the corresponding datafile_id_draft doc in that case. That produces an orphan permission doc. Changes in #10579 now allow the status and clear-orphans calls to find/delete these, but I decided it was out of scope to dig into the permission-doc creation code to find a way to suppress creating that doc.

In looking through the code, it's also clear that the check linked above misses many cases where we shouldn't have to update the doc, e.g. when a file only exists in a draft and the dataset is updated, or when a new version is published (although the datasetVersionId would change in that case). Refactoring to use the last-modified date of a file to indicate when it (its file metadata) last changed, versus when the dataset was last modified, might provide a better way to detect when reindexing a file is required than looking for filemetadata changes between versions. Atomic updates could also be a big improvement: for full-text indexing for sure, and also in combination with a date-based check (so that we could just update the datasetVersionId, for example). For this we might need to know that the doc already exists, but we're already doing the necessary query (used now to find which docs aren't needed, but it would be easy to also keep track of which existing docs will be needed).
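To make the atomic-update idea concrete, here is a hedged SolrJ sketch; the field and document names are hypothetical rather than Dataverse's actual schema, and atomic updates assume the document's other fields are stored or docValues so Solr can rebuild the doc:

```java
import java.util.Map;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.common.SolrInputDocument;

// Sketch only: update a single field on an existing Solr document
// (e.g. the datasetVersionId) instead of re-sending the whole document.
public class AtomicUpdateSketch {
    public static void main(String[] args) throws Exception {
        try (Http2SolrClient solr =
                 new Http2SolrClient.Builder("http://localhost:8983/solr/collection1").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "datafile_1234");                    // id of an existing doc (hypothetical)
            doc.addField("datasetVersionId", Map.of("set", 5678L)); // "set" = atomic replace of this one field
            solr.add(doc);
            solr.commit();                                          // explicit commit just for the example
        }
    }
}
```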

@landreev (Contributor, Author) commented

I'll need to re-read the above to fully grasp it, but it sounds like a good lead/another area where we can optimize things further.
The scenario described would not fully account for #10597, would it? I'm fairly positive we end up with unnecessary permission docs for datasets as well, not just for files.

@qqmyers (Member) commented Jun 13, 2024

Oh - yes :-) I think there were other orphan dataset/perms docs being created, but #10579 should remove those/no longer generate them. The scenario above is the one I didn't get rid of.

@landreev (Contributor, Author) commented

(Sorry, I haven't been paying attention to the developments in #10579, but will catch up next week!)

@landreev (Contributor, Author) commented

Since this is now meeting the definition of "done" for this phase, as defined above (with the first patch, containing #10547 and #10555, deployed in prod.), I'm going to close it, for accounting purposes etc.
We should open a new parent issue to keep track of the individual Solr-related optimization PRs and of any further steps in this effort.

@landreev landreev moved this from In Progress 💻 to Done 🧹 in IQSS Dataverse Project Jun 20, 2024
@DS-INRAE DS-INRAE moved this to Done in Recherche Data Gouv Jul 10, 2024