... More Solr improvements #454

nickumia-reisys · 2022-04-27T14:35:16Z

These changes cover our most recent, most successful test case for SolrCloud, which implements the following changes:

Zookeeper timeout options
- Theory: This is the timeout between Solr and Zookeeper. If a Solr node times out with ZK, it gets out of sync, and ZK can't communicate with it reliably after that.
- Action: Increase from 30,000 to 600,000. (Borrowed from the other team; no real derivation.)
- Allow more tolerance for ZK latency GSA-TTS/datagov-brokerpak-solr#35
- Update solrcloud brokerpak v1.0.7 to v1.0.8 datagov-ssb#138
Solr Java Memory/Garbage Collection tuning
- Theory: If a very expensive operation in Solr causes its memory to spike or crash alongside garbage collection cleanup, Solr goes down. If CKAN keeps retrying that operation against all of the nodes, no leader is left and the collection crashes.
- Action: Increase memory from 14GB to 32GB. (Note: I didn't change the type of garbage collection because the most suitable GC seems to be implemented already.)
- Implemented in this PR
Latency within AWS Availability Zones that causes sync issues between Solr/Zookeeper
- Theory: Solr and ZK require a very low-latency connection to stay in sync and for ZK to do proper node management. The guidance is to not run ZK across multiple AZs.
- Action: Deploy managed node group to just one availability zone
- Single AZ Support GSA-TTS/datagov-brokerpak-eks#93
Solr Performance Specs for our Larger Document Sizes
- Theory: Based on this guidance, there are different recommended values for autoCommit and autoSoftCommit depending on whether the index holds large numbers of small documents or very large documents.
- Action: I decreased these values to help keep Java memory utilization low for our larger documents.
- Implemented in this PR
Disable Solr Restarts
- Theory: Performing a Solr restart while index writes are occurring causes problems when the nodes start coming back up again. Read queries are fine during restarts. The problem is index writes: if they aren't coordinated with ZK properly during leader transitions or transaction logging, the collection becomes unresponsive.
- Action: "Disable" it by setting the next restart to be a very long time in the future (Jan 1st, 2027).
- Implemented in this PR
- Despite what the documentation says, Solr does not handle restart recoveries well for our data.

Things that we should do (but are not covered in this PR):

Solr Performance with NewRelic
- We've talked about this before, but if we really want to inspect Solr, we'll probably have to do the work to get this set up.

There isn't a good CRON string to 'disable' restarts entirely. What this one says is: restart on Jan 1 at 12:00 AM whenever Jan 1 is a Friday. The next occurrence of that is Jan 1, 2027.
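The 2027 date can be sanity-checked with a few lines of Python. This is only a sketch of the calendar arithmetic, not of the scheduler itself: the helper name `next_jan1_on_weekday` is made up for illustration, and it assumes the scheduler requires both the day-of-month and the day-of-week fields to match. Classic cron implementations fire when either field matches if both are restricted, so that behaviour is worth confirming for whatever parses this schedule.

```python
from datetime import date

FRIDAY = 4  # date.weekday(): Monday is 0, Sunday is 6


def next_jan1_on_weekday(after: date, weekday: int) -> date:
    """Return the first Jan 1 strictly after `after` that falls on `weekday`.

    Hypothetical helper for illustration; it is not part of this PR.
    """
    year = after.year + 1
    while date(year, 1, 1).weekday() != weekday:
        year += 1
    return date(year, 1, 1)


# Starting from the day this PR was opened.
print(next_jan1_on_weekday(date(2022, 4, 27), FRIDAY))  # 2027-01-01
```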

This isn't an exact science and there are a lot of variables. The general idea is to commit documents to disk so that Solr can manage Java memory more appropriately and not run out of it. If a Solr index is going at 10 docs/sec, it will index 50 docs in 5 secs; if it's only going at 5 docs/sec, it will index 25 docs in 5 secs. Therefore, there isn't a good method for identifying how many documents will be indexed within a timeframe, nor can the size of those documents be known in advance. For our CKAN application, I'm more okay with this number, but it can be debated and revisited.
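The arithmetic above can be written out as a rough sketch. The function names and the 2 MB average document size are assumptions for illustration only; they are not measurements from our index, and this does not model how Solr actually buffers documents between commits.

```python
def docs_per_commit_window(docs_per_sec: float, window_secs: float) -> float:
    """Documents indexed during one commit window at a steady indexing rate."""
    return docs_per_sec * window_secs


def data_per_commit_window(docs_per_sec: float, window_secs: float,
                           avg_doc_bytes: float) -> float:
    """Rough amount of document data indexed between two commits, in bytes."""
    return docs_per_commit_window(docs_per_sec, window_secs) * avg_doc_bytes


ASSUMED_AVG_DOC_BYTES = 2 * 1024 * 1024  # hypothetical 2 MB per document

for rate in (10, 5):
    docs = docs_per_commit_window(rate, 5)
    megabytes = data_per_commit_window(rate, 5, ASSUMED_AVG_DOC_BYTES) / (1024 * 1024)
    print(f"{rate} docs/sec over 5 secs: {docs:.0f} docs, about {megabytes:.0f} MB")
```

Both the rate and the document size vary in practice, which is why the commit interval only bounds memory use loosely.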

nickumia-reisys · 2022-04-27T14:42:04Z

Also note that we need an EC2 instance with enough CPU/RAM to support this. The current EC2 type is c5.9xlarge.

nickumia-reisys added 2 commits April 27, 2022 09:53

new: delay solr restarts for a long time

a5f4a3d


nickumia-reisys changed the title ~~Solr improvements~~ ... More Solr improvements Apr 27, 2022

new: scale up, not out

b8cfeac

nickumia-reisys requested a review from a team April 27, 2022 15:17

mogul approved these changes Apr 27, 2022


mogul merged commit a762a8c into main Apr 27, 2022

mogul deleted the solr-improvements branch April 27, 2022 15:58
