Solr: Index all performance is too slow with full production data. #50
Comments
Original Redmine Comment I was convinced that my recursive findPathSegments() method in IndexServiceBean (used to index the "subtree" facet) was the problem but I just commented it out and indexing of 34 dataverses and 554 datasets was not dramatically faster:
I'll have to dig into this some more...
@pdurbin, is this still an issue? Or did those commits fix it? If it isn't an issue, feel free to close this ticket.
@eaquigley this is still an issue. Or at least it hasn't been confirmed to be fixed. Now that we have API calls to create both dataverses and datasets (and upload files), someone should try to put lots of data into the system and see how long an "index all" takes. It's really a matter of prioritizing this ticket. Anyone who is comfortable with APIs could write a script to load up lots of data.
On dvn-build it just took 58.6 seconds to index 42 dataverses and 107 datasets.
I just pushed this to the "final" milestone, since as I mentioned to @eaquigley indexing will get even slower as we start indexing based on permissions in #734. After that, we should look at indexing performance.
Index performance improvements have been made and this is currently reasonable. Closing this ticket per @kcondon.
@eaquigley @kcondon if you say so. I think it's something like 11 hours to do a full re-index of https://dataverse.harvard.edu as of Dataverse 4.3. We can always open a new issue if we'd like to attempt to make improvements in this area. Also, please note that I reference this issue at http://guides.dataverse.org/en/4.3/installation/administration.html#full-reindex ("Please note that this operation may take hours depending on the amount of data in your system") so we might want to remove that reference from the guides.
Related: Investigate and fix a memory leak in IndexAll #4463
Author Name: Kevin Condon (@kcondon)
Original Redmine Issue: 3457, https://redmine.hmdc.harvard.edu/issues/3457
Original Date: 2014-01-29
Original Assignee: Philip Durbin
Preliminary testing shows index all is taking too long with full production data.
Indexing 1861 dataverses: 41 minutes
Indexing 1900 datasets: 2 hours, 15 minutes. There are 52,000+ datasets.
The above numbers were achieved on dvn-3 with full production data of public dv's and studies. Various glassfish heaps of 512MB and 10GB showed the same performance.
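To put those timings in perspective, here is a rough back-of-the-envelope projection. It is only a sketch: it assumes indexing time scales linearly with the number of objects (which may not hold in practice), it treats "52,000+" as exactly 52,000, and the function and constant names are made up for illustration.

```python
def seconds_per_item(items: int, elapsed_seconds: float) -> float:
    """Average indexing time per object, in seconds."""
    return elapsed_seconds / items


def projected_hours(total_items: int, sample_items: int, sample_seconds: float) -> float:
    """Linear extrapolation of a timed sample to a larger item count (assumption)."""
    return total_items * seconds_per_item(sample_items, sample_seconds) / 3600


# Figures reported in this issue (dvn-3, full production data).
DATAVERSE_SAMPLE = (1861, 41 * 60)            # 1861 dataverses in 41 minutes
DATASET_SAMPLE = (1900, (2 * 60 + 15) * 60)   # 1900 datasets in 2 hours, 15 minutes
TOTAL_DATASETS = 52_000                       # "52,000+", taken as a lower bound

per_dataverse = seconds_per_item(*DATAVERSE_SAMPLE)
per_dataset = seconds_per_item(*DATASET_SAMPLE)
full_run_hours = projected_hours(TOTAL_DATASETS, *DATASET_SAMPLE)

print(f"per dataverse: {per_dataverse:.2f} s")
print(f"per dataset:   {per_dataset:.2f} s")
print(f"projected time for all datasets: {full_run_hours:.1f} hours")
```

Under those assumptions, indexing every dataset at the observed rate would take on the order of 60 hours, which is why the current performance is a concern.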
We see "java -server -jar start.jar" at https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding
-server? What does that mean?
man java
says this...... and if you follow that link you see this:
"Starting with J2SE 5.0, when an application starts up, the launcher can attempt to detect whether the application is running on a "server-class" machine and, if so, use the Java HotSpot Server Virtual Machine (server VM) instead of the Java HotSpot Client Virtual Machine (client VM). The aim is to improve performance even if no one configures the VM to reflect the application it's running. In general, the server VM starts up more slowly than the client VM, but over time runs more quickly."
Maybe this can help performance?
Related issue(s): #623
Redmine related issue(s): 3430, 4062