... More Solr improvements #454

Merged
merged 3 commits into from
Apr 27, 2022

nickumia-reisys
Contributor

These changes cover our most recent, most successful test case for SolrCloud, which implements the following changes:

  • Zookeeper timeout options
  • Solr Java Memory/Garbage Collection tuning
    • Theory: If a very expensive operation in Solr causes its memory to spike or crash alongside garbage collection cleanup, Solr goes down. If CKAN then keeps retrying that operation against all of the nodes and no leader remains as a result, the collection crashes.
    • Action: Increase Memory from 14GB to 32GB (Note: I didn't change the type of garbage collection because the most optimal GC seems to be implemented already)
    • Implemented in this PR
  • Latency across AWS Availability Zones that causes sync issues between Solr/Zookeeper
    • Theory: Solr and ZK require a very low-latency connection to stay in sync and for ZK to do proper node management. The guidance is to not run ZK across multiple AZs.
    • Action: Deploy managed node group to just one availability zone
    • Single AZ Support GSA-TTS/datagov-brokerpak-eks#93
  • Solr Performance Specs for our Larger Document Sizes
    • Theory: Based on this guidance, there are different values for autoCommit and autoSoftCommit depending on whether the index holds large numbers of small documents or very large documents.
    • Action: I decreased these values to help keep Java Memory utilization low for our larger documents.
    • Implemented in this PR
  • Disable Solr Restarts

Things that we should do (but are not covered in this PR):

  • Solr Performance with NewRelic
    • We've talked about this before, but if we really want to inspect Solr, we'll probably have to do the work to get this set up.

There isn't a good CRON string to 'disable' restarts entirely.  What this says is: restart on Jan 1 at 12:00AM whenever Jan 1 is a Friday.  The next occurrence of that is Jan 1, 2027.
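The date in the comment above can be checked with a small sketch. It assumes the scheduler requires both the date and the weekday to match, as the comment describes; the function name `next_jan1_on_friday` is made up for illustration and is not part of this PR.

```python
from datetime import date

FRIDAY = 4  # date.weekday() counts Monday as 0, so Friday is 4


def next_jan1_on_friday(after: date) -> date:
    """Return the first Jan 1 strictly after `after` that falls on a Friday.

    Assumption: the schedule fires only when the date AND the weekday match.
    """
    year = after.year + 1  # the next Jan 1 is always in the following year
    while date(year, 1, 1).weekday() != FRIDAY:
        year += 1
    return date(year, 1, 1)


# Date this PR was merged, taken from the conversation.
print(next_jan1_on_friday(date(2022, 4, 27)))  # 2027-01-01
```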
This isn't an exact science and there are a lot of variables.  The general idea is to commit documents to disk so that Solr can manage Java Memory more appropriately and not run out of it.  If a Solr index is going at 10 docs/sec, it will index 50 docs in 5 secs; if it's only going at 5 docs/sec, it will index 25 docs in 5 secs.  Therefore, there isn't a good method for predicting how many documents will be indexed within a timeframe, nor is the size of those documents known in advance.  For our CKAN application, I'm more okay with this number, but it can be debated and retheorized.
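The arithmetic in the comment above, as a minimal sketch. The 5-second window and the two indexing rates are the ones used in the comment; the function name is hypothetical, and document size is not modelled because it is not known in advance.

```python
def docs_per_commit_window(docs_per_sec: float, window_secs: float) -> float:
    """Documents indexed between two commits at a constant indexing rate."""
    return docs_per_sec * window_secs


WINDOW_SECS = 5  # window used in the comment's example

for rate in (10, 5):
    count = docs_per_commit_window(rate, WINDOW_SECS)
    print(f"{rate} docs/sec -> {count} docs per {WINDOW_SECS} secs")
```

Because the count scales linearly with the indexing rate, the same commit interval can hold back very different amounts of uncommitted data from one run to the next.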
@nickumia-reisys changed the title from "Solr improvements" to "... More Solr improvements" Apr 27, 2022
@nickumia-reisys
Contributor Author

Also note that we need an EC2 instance with enough CPU/RAM to support this. The current EC2 type is c5.9xlarge.

@nickumia-reisys nickumia-reisys requested a review from a team April 27, 2022 15:17
@mogul mogul merged commit a762a8c into main Apr 27, 2022
@mogul mogul deleted the solr-improvements branch April 27, 2022 15:58