-
Notifications
You must be signed in to change notification settings - Fork 495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
async indexing after update command #9558
async indexing after update command #9558
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice! I walked through the logic in getNextToIndex and don't see any problems. That said, comments in the code on how it works would be useful.
The CI failed after the unit tests and before the integration tests ran. @donsizemore - any ideas:
|
I have added some comments to the code. I hope it made it more readable. Also, I hope that this change did not break any integration tests. I do not see the results of the tests, so let me know if something breaks. Thanks! |
Thanks. FWIW, it looks like build issues rather than the PR contents are what is causing the failures right now. Once that's fixed we can rerun and let you know if there are any IT failures. |
@qqmyers , @donsizemore
This fixed the Solr starting up. I hope it helps. Last time I had integration tests working was when I played with payara 6, I had to change quite some things to make it work (#9426). |
I have added procps to my docker-aio image, and the tests run again. However, they're fail for this branch with missing root dataverse, while they did run almost successfully on the develop brunch (only two tests failed). Therefore, it is most likely that my changes cause the build to fail, I will investigate this in more detail. |
@ErykKul - Great that you can investigate. FWIW: I think the failure to find root has been seen before - possibly a timing issue that you change to flush-mode is making more likely? Also, FWIW: I cherry-picked your commit to QDR and the call to create a role assignment works fine there (on a pre-existing database). |
It was a timing issue: the on success method is in the same transaction as the creating of root dataverse (submit method), where on success tries to index the newly created dv. Since the submit method does not call flush (the specific create dataverse command implementation), and the transaction was not yet committed (and thus changes are not flushed), the newly created root dataverse does not yet exist and cannot be found by the index bean that needs the root dataverse on initialization... I added a line of code in the command engine that flushes before attempting on success, this should fix this and similar timing issues. I will continue the testing tomorrow. |
FYI: It looks like all of the unit tests are failing with a TransactionRequired exception. The failure on the unit tests job is related to the Run docker/login-action@v2 - Error: Username and password required, but all the unit tests appear to pass in the CI build. |
Yes, please don't worry about the Docker stuff. @poikilotherm and I will fix eventually. 😅 |
Hear hear! 😬🙈 |
…ally before atempting indexing of the newly created dataverse
The transaction required thing was because of the fush in Ejb Dataverse Engine that I added, I moved it to the specific Create Dataverse Command and the test look way better now:
The two errors are exactly the same as I was getting on the develop branch, so I will ignore them for now. I will investigate the 7 failing tests in more detail. All units test passed on my machine, so the failing on git looks incorrect. I think the test running on github has no internet access to pull the xsd schema: https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd |
Good news: I have found what was causing the locking with flash-mode auto: the indexing updates the dataset while indexing it by setting last indexed time, that is why updating the dataset while the indexing was busy was not possible (deadlock situation). I have fixed the issue by making the update the last thing after indexing and making it asynchronous in a new transaction. It does not lock anymore, so I have reverted the flush-mode setting (back to default "auto"). This should fix some timing issues with integration tests. Also, I will retest if asynchronous indexing is still necessary. If it still is, I will investigate how to improve the performance of the indexing itself. This may take some time, I think it will be a different pull request, if I find anything. |
With my last commits, the CPU went nicely down in my tests. In my test scenario, indexing of large dataset takes now around 10 seconds instead of more than a minute. Further improvement of the indexing speed would require smart regeneration of the solar docs for files; only regenerating files that have changes would improve the performance. It looks like this version can handle large datasets (containing thousands of files) very nicely. I will rerun the integration tests to see if there are still problems. |
Most tests pass now. Some that are failing look not related to the changes I made. Some got fixed by adding the "wait for search", since the indexing happens in the background now. It looks safe to me; I will patch our test environment and do some more testing tomorrow. If everything goes well, we will consider patching production (this one is urgent for us since we have a researcher that is unable to upload a large dataset). |
I'll look more later but FWIW: the api.UtilIT.sleepForReindex method might help with tests messed up by timing. That call waits until the lastindexedtime is updated. |
BTW, I grabbed this PR for review earlier today, without realizing that there had been an ongoing back-and-forth since April. @qqmyers, @poikilotherm if either of you are ready to approve it, please go ahead and don't wait for me. Otherwise I'm reading it with great interest. |
P.S. I mean, I see that it was already approved back in April, but a lot has been done since then; and I'm assuming that's why it was moved back into "In Review". |
I was just waiting on the commits from June 1 that resolved some minor requests and I see there are more commits from June 9th. Assuming those are all OK, I was going to approve and move it back. |
@landreev @qqmyers |
OK - looks good! No more changes unless requested in QA! |
FWIW: It looks like the CI failure is the problem with the availability of payara-5.2022.3.zip that was discussed in #dv-tech today and apparently fixed. @kcondon - you may want to rerun to verify that the build now works before you dig in. |
Yes, fixed in gdcc/dataverse-ansible@b3a99b7 I just kicked off a new build: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-9558/60/ |
Quick test so far shows adding 1 or more files to a dataset containing 1000 files is closer to adding a single file in performance: 6.71s for adding an additional file to 1000 ds before this pr, 2.94s to add a file with this pr. Adding 3 files to the 1000ds with this pr took 3.44s. Note these are rough times. Can repeat with a larger dataset. Index all on test instance, before this pr with develop and using this pr: with this pr: |
What this PR does / why we need it:
It improves the performance of updating datasets containing huge number of files (thousands of files). With this PR, the updates take seconds i.s.o. minutes (roughly the same as updates of dataset containing small number of files).
Which issue(s) this PR closes:
Closes #9557
Special notes for your reviewer:
Making indexing async was not sufficient to fix this problem, as the default flush mode is auto (on query): indexing in the background causes table locks when attempting another update operation while index is busy. Setting the flush behavior to "commit" solves this.
Also, async indexing starts immediately after an update, but when multiple updates happen to the same dataset while indexing is busy, only the last updated dataset object will get reindexed (we skip unnecessary indexing).
Normally, I would expect that the "commit" flush mode should not impact the functionality. For example, all commands executed on the internal engine are done on transactions, with commit at the end. All indexing, ingesting, post-processing, etc. happen after the transaction is committed, e.g., while executing the "onSuccess" method.
Also, not waiting for indexing to be ready before the operation returns should not impact the functionality neither since the index must be processed by Solr first: immediately querying Solr after the Java index operation is ready my result in a response not reflecting the last changes that were indexed. The difference is that it can take longer time now to see the results on Solr, and thus it can become more noticeable.
Suggestions on how to test this:
Upload thousands of files to a dataset and try several update operations. For example, uploading some files directly after each other.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No.
Is there a release notes update needed for this change?:
Possibly. Short mention of the performance improvement should be sufficient.