Investigate Solr performance issues (again) #10469

Closed · landreev opened this issue Apr 8, 2024 · 10 comments
Assignees: landreev
Labels: GREI 3 Search and Browse · Size: 30 (a percentage of a sprint; 21 hours; formerly size:33) · Status: Needs Input (applied to issues in need of input from someone currently unavailable)

Comments
@landreev (Contributor) commented Apr 8, 2024

Solr-related issues are becoming quite severe in IQSS prod. There's some anecdotal evidence of other instances experiencing this as well, so opening the issue here, in the main repo.

At IQSS, there are general reports from users that searching is becoming increasingly slow. There are also datasets that fail to get indexed on creation and/or publication. Initial harvests of large remote archives (thousands of datasets or more; something we used to be able to do routinely) no longer work: the records get successfully harvested and imported, but indexing stops after 2-3 thousand records, sometimes after only a few hundred. A recently merged PR (#10388) addresses indexing performance by ensuring, via semaphores, that only a limited number of indexing jobs can be executed in parallel. We will see what effect it has on performance once 6.2 is deployed. However, I am worried that parallel/simultaneous execution may not be the only, or even the main, issue. When trying to reindex a huge collection of harvested datasets, even with 10-15 sec. of sleep between calls, Solr appears to become overwhelmed around 3K datasets in. Is this a memory leak? Is this something we introduced recently that grows exponentially with the number of datasets in a collection? Is this simply a side effect of upgrading to Solr 9.x? (It appears that the problem has become especially severe since the 6.0 upgrade.)
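For readers new to the throttling idea, here is a minimal sketch of limiting concurrent indexing with a counting semaphore; the class name, method names, and the permit count are illustrative assumptions, not the actual #10388 implementation:

```java
import java.util.concurrent.Semaphore;

// Illustrative sketch only, not the Dataverse code: cap the number of
// indexing jobs that can hit Solr at the same time.
public class IndexThrottle {

    // Hypothetical cap of 4 concurrent jobs; in practice the limit would be configurable.
    private static final Semaphore PERMITS = new Semaphore(4);

    public static void indexDataset(long datasetId) throws InterruptedException {
        PERMITS.acquire();          // blocks if 4 jobs are already running
        try {
            reindex(datasetId);     // stand-in for the real indexing call
        } finally {
            PERMITS.release();      // always return the permit, even on failure
        }
    }

    private static void reindex(long datasetId) {
        // placeholder for the actual Solr indexing logic
        System.out.println("indexing dataset " + datasetId);
    }
}
```

Throttling like this protects Solr from bursts of parallel work, but it would not by itself explain a slowdown that grows with the number of datasets already indexed, which is why the questions above still matter.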

@cmbz commented Apr 19, 2024

@landreev do you have a size estimate for how much time you'd want to spend on a preliminary investigation? (Understanding, of course, that a deep dive could be a major undertaking)

@landreev landreev added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label Apr 25, 2024
@cmbz cmbz moved this to SPRINT READY in IQSS Dataverse Project May 2, 2024
@landreev (Contributor, Author) commented May 6, 2024

Re: the Solr logs: I realized that "slow query" logging was not turned on in the Solr config either (going to turn it back on in a moment).
Otherwise, I'll put up a snapshot of the currently available logs on demo (?) or someplace else where they are accessible to the dev. team.
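For reference, a minimal example of what turning slow-query logging back on looks like in solrconfig.xml; the 1000 ms threshold here is an arbitrary choice, and queries slower than it are written to Solr's separate slow-requests log:

```xml
<!-- solrconfig.xml, inside the <query> section:
     log any search that takes longer than 1 second to the slow-requests log -->
<query>
  <slowQueryThresholdMillis>1000</slowQueryThresholdMillis>
  <!-- other existing <query> settings unchanged -->
</query>
```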

@pdurbin (Member) commented May 6, 2024

I'm not sure if this is helpful or not but @landreev invited us to put ideas in this issue.

Perhaps we should investigate distributing the load with SolrCloud. As I mentioned in Slack, a contributor tried to add it 8 years ago in #2985 but we weren't ready to think about it.

Fast forward to today and we had a recent conversation with Chris Tate from Red Hat who talked quite a bit about how much he recommends SolrCloud in the 2024-03-21 Containerization Working Group meeting (notes, recording).
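As a rough illustration of what such a move would mean on the client side (a hedged SolrJ sketch with placeholder URLs and collection name, not Dataverse's actual search code): a SolrCloud-aware client discovers the cluster state and spreads requests across shards and replicas instead of talking to a single core.

```java
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudHttp2SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

// Sketch only: query a SolrCloud collection; the client fetches cluster state
// from the listed node(s) and routes each request to an appropriate shard/replica.
public class CloudQuerySketch {
    public static void main(String[] args) throws Exception {
        try (CloudHttp2SolrClient solr =
                 new CloudHttp2SolrClient.Builder(List.of("http://solr-node1:8983/solr")).build()) {
            QueryResponse rsp = solr.query("collection1", new SolrQuery("*:*"));
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}
```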

@landreev landreev moved this from SPRINT READY to In Progress 💻 in IQSS Dataverse Project May 8, 2024
@landreev landreev self-assigned this May 8, 2024
@cmbz cmbz added Size: 80 A percentage of a sprint. 56 hours. and removed Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) labels May 8, 2024
@johannes-darms (Contributor) commented

+1 for a shift to solrCloud!

@cmbz cmbz added Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) and removed Size: 80 A percentage of a sprint. 56 hours. labels May 22, 2024
@landreev (Contributor, Author) commented

This is the definition of "done" for this effort/this phase of it:

  1. Merge the Solr-related PRs currently in the works.
  2. Build a deployable prod. patch; ideally with all of the above, but, at a minimum, #10555, for starters.
  3. Deploy the above in prod. Debating if we should do this when the prod. Solr instance is upgraded to RedHat 8 by LTS (which will require some search engine downtime), tentatively scheduled for June 14. May also use the opportunity to force a full reindex. [Maybe worth deploying something prior to that, #10555 specifically?]
  4. New issues opened for all the known problems and/or potential optimizations not covered in the PRs above [pending].

@pdurbin pdurbin added the Status: Needs Input Applied to issues in need of input from someone currently unavailable label May 29, 2024
@qqmyers (Member) commented Jun 13, 2024

Some rambling, FWIW: in creating #10579, I've discovered that Dataverse creates a datafile_id_draft_permission doc in the case where a file is published and its file metadata hasn't changed, even though the linked code avoids creating the corresponding datafile_id_draft doc in that case. That produces an orphan permission doc. Changes in #10579 now allow the status and clear-orphans calls to find/delete these, but I decided it was out of scope to dig into the permission-doc creation code to find a way to suppress creating that doc.

In looking through the code, it's also clear that the check linked above misses many cases where we shouldn't have to update the doc, e.g. when a file only exists in a draft and the dataset is updated, or when a new version is published (although the datasetVersionId would change in that case). Refactoring to use the last-modified date of a file to indicate when it (its file metadata) last changed, versus when the dataset was last modified, might provide a better way to detect when reindexing a file is required than looking for filemetadata changes between versions. Atomic updates could also be a big improvement: for full-text indexing for sure, and also in combination with a date-based check (so that we could just update the datasetVersionId, for example). For this we might need to know that the doc already exists, but we're already doing the necessary query (used now to find which docs aren't needed, but it would be easy to also keep track of which existing docs will be needed).
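To make the atomic-update idea concrete, here is a hedged SolrJ sketch; the field and document names are hypothetical rather than Dataverse's actual schema, and atomic updates assume the document's other fields are stored or docValues so Solr can rebuild the doc:

```java
import java.util.Map;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.common.SolrInputDocument;

// Sketch only: update a single field on an existing Solr document
// (e.g. the datasetVersionId) instead of re-sending the whole document.
public class AtomicUpdateSketch {
    public static void main(String[] args) throws Exception {
        try (Http2SolrClient solr =
                 new Http2SolrClient.Builder("http://localhost:8983/solr/collection1").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "datafile_1234");                    // id of an existing doc (hypothetical)
            doc.addField("datasetVersionId", Map.of("set", 5678L)); // "set" = atomic replace of this one field
            solr.add(doc);
            solr.commit();                                          // explicit commit just for the example
        }
    }
}
```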

@landreev (Contributor, Author) commented

I'll need to re-read the above to fully grasp it, but it sounds like a good lead/another area where we can optimize things further.
The scenario described would not fully account for #10597, would it? I'm fairly positive we end up with unnecessary permission docs for datasets as well, not just for files.

@qqmyers (Member) commented Jun 13, 2024

Oh - yes :-) I think there were other orphan dataset/perms docs being created, but #10579 should remove those/no longer generate them. The scenario above is the one I didn't get rid of.

@landreev (Contributor, Author) commented

(Sorry, I haven't been paying attention to the developments in #10579, but will catch up next week!)

@landreev (Contributor, Author) commented

Since this is now meeting the definition of "done" for this phase, as defined above (with the first patch, containing #10547 and #10555, deployed in prod.), I'm going to close it, for accounting purposes etc.
We should open a new parent issue to keep track of the individual Solr-related optimization PRs and of any further steps in this effort.

@landreev landreev moved this from In Progress 💻 to Done 🧹 in IQSS Dataverse Project Jun 20, 2024
@DS-INRAE DS-INRAE moved this to Done in Recherche Data Gouv Jul 10, 2024