Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

async indexing after update command #9558

Merged
merged 71 commits into from
Jun 15, 2023

Conversation

ErykKul
Copy link
Collaborator

@ErykKul ErykKul commented Apr 25, 2023

What this PR does / why we need it:
It improves the performance of updating datasets containing huge number of files (thousands of files). With this PR, the updates take seconds i.s.o. minutes (roughly the same as updates of dataset containing small number of files).

Which issue(s) this PR closes:

Closes #9557

Special notes for your reviewer:
Making indexing async was not sufficient to fix this problem, as the default flush mode is auto (on query): indexing in the background causes table locks when attempting another update operation while index is busy. Setting the flush behavior to "commit" solves this.

Also, async indexing starts immediately after an update, but when multiple updates happen to the same dataset while indexing is busy, only the last updated dataset object will get reindexed (we skip unnecessary indexing).

Normally, I would expect that the "commit" flush mode should not impact the functionality. For example, all commands executed on the internal engine are done on transactions, with commit at the end. All indexing, ingesting, post-processing, etc. happen after the transaction is committed, e.g., while executing the "onSuccess" method.

Also, not waiting for indexing to be ready before the operation returns should not impact the functionality neither since the index must be processed by Solr first: immediately querying Solr after the Java index operation is ready my result in a response not reflecting the last changes that were indexed. The difference is that it can take longer time now to see the results on Solr, and thus it can become more noticeable.

Suggestions on how to test this:
Upload thousands of files to a dataset and try several update operations. For example, uploading some files directly after each other.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No.

Is there a release notes update needed for this change?:
Possibly. Short mention of the performance improvement should be sufficient.

@coveralls
Copy link

coveralls commented Apr 25, 2023

Coverage Status

coverage: 20.419% (+0.09%) from 20.333% when pulling 0fe7238 on ErykKul:9557_dataset_update_api_performance into 2d29201 on IQSS:develop.

Copy link
Member

@qqmyers qqmyers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! I walked through the logic in getNextToIndex and don't see any problems. That said, comments in the code on how it works would be useful.

@qqmyers
Copy link
Member

qqmyers commented Apr 25, 2023

The CI failed after the unit tests and before the integration tests ran. @donsizemore - any ideas:

at TASK [dataverse : allow authenticated users to add to root] ********************
fatal: [localhost]: FAILED! => {"access_control_allow_headers": "Accept, Content-Type, X-Dataverse-Key, Range", "access_control_allow_methods": "PUT, GET, POST, DELETE, OPTIONS", "access_control_allow_origin": "*", "access_control_expose_headers": "Accept-Ranges, Content-Range, Content-Encoding", "changed": false, "connection": "close", "content_length": "74", "content_type": "application/json;charset=UTF-8", "elapsed": 0, "json": {"message": "Can't find dataverse with identifier='root'", "status": "ERROR"}, "msg": "Status code was 404 and not [200]: HTTP Error 404: Not Found", "redirected": false, "server": "Payara Server  5.2022.3 #badassfish", "status": 404, "url": "http://localhost:8080/api/dataverses/root/assignments?key=86a3f945-f565-4e54-a031-bc2551b98062", "x_frame_options": "SAMEORIGIN", "x_powered_by": "Servlet/4.0 JSP/2.3 (Payara Server  5.2022.3 #badassfish Java/Red Hat, Inc./11)"}

NO MORE HOSTS LEFT *************************************************************

@ErykKul
Copy link
Collaborator Author

ErykKul commented Apr 25, 2023

I have added some comments to the code. I hope it made it more readable. Also, I hope that this change did not break any integration tests. I do not see the results of the tests, so let me know if something breaks. Thanks!

@qqmyers
Copy link
Member

qqmyers commented Apr 25, 2023

Thanks. FWIW, it looks like build issues rather than the PR contents are what is causing the failures right now. Once that's fixed we can rerun and let you know if there are any IT failures.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Apr 26, 2023

@qqmyers , @donsizemore
I am not sure if this is the case here, but I had a similar problem with the integration tests when my Solr did not start up. I happened because the base image is "latest" rocky linux (for docker-aio image) and apparently procps is not there by default anymore. I added it in the yum line at the end like this:

RUN yum install -y java-11-openjdk-devel postgresql13-server sudo epel-release unzip curl httpd python2 diffutils procps

This fixed the Solr starting up. I hope it helps.

Last time I had integration tests working was when I played with payara 6, I had to change quite some things to make it work (#9426).

@ErykKul
Copy link
Collaborator Author

ErykKul commented Apr 26, 2023

@qqmyers , @donsizemore

I have added procps to my docker-aio image, and the tests run again. However, they're fail for this branch with missing root dataverse, while they did run almost successfully on the develop brunch (only two tests failed). Therefore, it is most likely that my changes cause the build to fail, I will investigate this in more detail.

@qqmyers
Copy link
Member

qqmyers commented Apr 26, 2023

@ErykKul - Great that you can investigate. FWIW: I think the failure to find root has been seen before - possibly a timing issue that you change to flush-mode is making more likely? Also, FWIW: I cherry-picked your commit to QDR and the call to create a role assignment works fine there (on a pre-existing database).

@ErykKul
Copy link
Collaborator Author

ErykKul commented Apr 26, 2023

It was a timing issue: the on success method is in the same transaction as the creating of root dataverse (submit method), where on success tries to index the newly created dv. Since the submit method does not call flush (the specific create dataverse command implementation), and the transaction was not yet committed (and thus changes are not flushed), the newly created root dataverse does not yet exist and cannot be found by the index bean that needs the root dataverse on initialization... I added a line of code in the command engine that flushes before attempting on success, this should fix this and similar timing issues. I will continue the testing tomorrow.

@qqmyers
Copy link
Member

qqmyers commented Apr 26, 2023

FYI: It looks like all of the unit tests are failing with a TransactionRequired exception. The failure on the unit tests job is related to the Run docker/login-action@v2 - Error: Username and password required, but all the unit tests appear to pass in the CI build.

@pdurbin
Copy link
Member

pdurbin commented Apr 26, 2023

Yes, please don't worry about the Docker stuff. @poikilotherm and I will fix eventually. 😅

@poikilotherm
Copy link
Contributor

Hear hear! 😬🙈

@ErykKul
Copy link
Collaborator Author

ErykKul commented Apr 26, 2023

The transaction required thing was because of the fush in Ejb Dataverse Engine that I added, I moved it to the specific Create Dataverse Command and the test look way better now:

[ERROR] Failures: 
[ERROR]   ConfirmEmailIT.testConfirm:112 Expected status code <200> doesn't match actual status code <500>.

[ERROR]   DatasetsIT.testCuratePublishedDatasetVersionCommand:2803 Expected status code <200> doesn't match actual status code <500>.

[ERROR]   FilesIT.testAccessFacet:1112 JSON path data.total_count doesn't match.
Expected: <1>
  Actual: 0

[ERROR]   HarvestingServerIT.testMultiRecordOaiSet:606 Wrong number of items on the first ListIdentifiers page expected:<2> but was:<5>
[ERROR]   LinkIT.testDeepLinks:170 JSON path data.total_count doesn't match.
Expected: <1>
  Actual: 0

[ERROR]   UsersIT.convertNonBcryptUserFromBuiltinToShib:314 Expected status code <200> doesn't match actual status code <500>.

[ERROR]   UsersIT.testMergeAccounts:250 expected:<200> but was:<401>
[ERROR] Errors: 
[ERROR]   DatasetsIT.testPrivateUrl:964 » Connect Connection refused (Connection refused...
[ERROR]   SignpostingIT.testSignposting:91 » Connect Connection refused (Connection refu...
[INFO] 
[ERROR] Tests run: 199, Failures: 7, Errors: 2, Skipped: 8

The two errors are exactly the same as I was getting on the develop branch, so I will ignore them for now. I will investigate the 7 failing tests in more detail. All units test passed on my machine, so the failing on git looks incorrect. I think the test running on github has no internet access to pull the xsd schema: https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd

@ErykKul
Copy link
Collaborator Author

ErykKul commented Apr 26, 2023

Good news: I have found what was causing the locking with flash-mode auto: the indexing updates the dataset while indexing it by setting last indexed time, that is why updating the dataset while the indexing was busy was not possible (deadlock situation). I have fixed the issue by making the update the last thing after indexing and making it asynchronous in a new transaction. It does not lock anymore, so I have reverted the flush-mode setting (back to default "auto"). This should fix some timing issues with integration tests.

Also, I will retest if asynchronous indexing is still necessary. If it still is, I will investigate how to improve the performance of the indexing itself. This may take some time, I think it will be a different pull request, if I find anything.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Apr 27, 2023

With my last commits, the CPU went nicely down in my tests. In my test scenario, indexing of large dataset takes now around 10 seconds instead of more than a minute. Further improvement of the indexing speed would require smart regeneration of the solar docs for files; only regenerating files that have changes would improve the performance.

It looks like this version can handle large datasets (containing thousands of files) very nicely. I will rerun the integration tests to see if there are still problems.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Apr 27, 2023

Most tests pass now. Some that are failing look not related to the changes I made. Some got fixed by adding the "wait for search", since the indexing happens in the background now. It looks safe to me; I will patch our test environment and do some more testing tomorrow. If everything goes well, we will consider patching production (this one is urgent for us since we have a researcher that is unable to upload a large dataset).

@qqmyers
Copy link
Member

qqmyers commented Apr 27, 2023

I'll look more later but FWIW: the api.UtilIT.sleepForReindex method might help with tests messed up by timing. That call waits until the lastindexedtime is updated.

@landreev
Copy link
Contributor

BTW, I grabbed this PR for review earlier today, without realizing that there had been an ongoing back-and-forth since April. @qqmyers, @poikilotherm if either of you are ready to approve it, please go ahead and don't wait for me. Otherwise I'm reading it with great interest.
Thank you, @ErykKul!

@landreev
Copy link
Contributor

P.S. I mean, I see that it was already approved back in April, but a lot has been done since then; and I'm assuming that's why it was moved back into "In Review".

@qqmyers
Copy link
Member

qqmyers commented Jun 12, 2023

I was just waiting on the commits from June 1 that resolved some minor requests and I see there are more commits from June 9th. Assuming those are all OK, I was going to approve and move it back.

@ErykKul
Copy link
Collaborator Author

ErykKul commented Jun 13, 2023

@landreev @qqmyers
The changes as requested are all there. The later commits (from June 9th) fixed the integration test issues. I think that all changes are OK now and can be merged. I will be still watching this closely, just in case something pops up (e.g., I am a little bit worried about re-indexing of large Dataverse collections, but normally it should be fine and even work better after this PR).

@landreev landreev removed their assignment Jun 13, 2023
@qqmyers
Copy link
Member

qqmyers commented Jun 13, 2023

OK - looks good! No more changes unless requested in QA!

@qqmyers qqmyers removed their assignment Jun 13, 2023
@qqmyers
Copy link
Member

qqmyers commented Jun 13, 2023

FWIW: It looks like the CI failure is the problem with the availability of payara-5.2022.3.zip that was discussed in #dv-tech today and apparently fixed. @kcondon - you may want to rerun to verify that the build now works before you dig in.

@pdurbin
Copy link
Member

pdurbin commented Jun 13, 2023

@kcondon kcondon self-assigned this Jun 14, 2023
@kcondon
Copy link
Contributor

kcondon commented Jun 14, 2023

Quick test so far shows adding 1 or more files to a dataset containing 1000 files is closer to adding a single file in performance: 6.71s for adding an additional file to 1000 ds before this pr, 2.94s to add a file with this pr. Adding 3 files to the 1000ds with this pr took 3.44s. Note these are rough times. Can repeat with a larger dataset.

Index all on test instance, before this pr with develop and using this pr:
With develop:
[2023-06-15T17:35:59.722+0000] [Payara 5.2022.3] [INFO] [] [edu.harvard.iq.dataverse.search.IndexBatchServiceBean] [tid: _ThreadID=227 _ThreadName=__ejb-thread-pool16] [timeMillis: 1686850559722] [levelValue: 800] [[
837 dataverses and 1954 datasets indexed. index all took 780740 milliseconds. Solr index was not cleared before indexing.
]]

with this pr:
[2023-06-15T17:59:38.474+0000] [Payara 5.2022.3] [INFO] [] [edu.harvard.iq.dataverse.search.IndexBatchServiceBean] [tid: _ThreadID=219 _ThreadName=__ejb-thread-pool8] [timeMillis: 1686851978474] [levelValue: 800] [[
837 dataverses and 1954 datasets indexed. index all took 473430 milliseconds. Solr index was not cleared before indexing.
]]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature: Performance & Stability Size: 10 A percentage of a sprint. 7 hours.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance issues when uploading files to a dataset containing thousands of files
7 participants