async indexing after update command #9558

ErykKul · 2023-04-25T10:08:07Z

What this PR does / why we need it:
It improves the performance of updating datasets containing huge number of files (thousands of files). With this PR, the updates take seconds i.s.o. minutes (roughly the same as updates of dataset containing small number of files).

Which issue(s) this PR closes:

Closes #9557

Special notes for your reviewer:
Making indexing async was not sufficient to fix this problem, as the default flush mode is auto (on query): indexing in the background causes table locks when attempting another update operation while index is busy. Setting the flush behavior to "commit" solves this.

Also, async indexing starts immediately after an update, but when multiple updates happen to the same dataset while indexing is busy, only the last updated dataset object will get reindexed (we skip unnecessary indexing).

Normally, I would expect that the "commit" flush mode should not impact the functionality. For example, all commands executed on the internal engine are done on transactions, with commit at the end. All indexing, ingesting, post-processing, etc. happen after the transaction is committed, e.g., while executing the "onSuccess" method.

Also, not waiting for indexing to be ready before the operation returns should not impact the functionality neither since the index must be processed by Solr first: immediately querying Solr after the Java index operation is ready my result in a response not reflecting the last changes that were indexed. The difference is that it can take longer time now to see the results on Solr, and thus it can become more noticeable.

Suggestions on how to test this:
Upload thousands of files to a dataset and try several update operations. For example, uploading some files directly after each other.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No.

Is there a release notes update needed for this change?:
Possibly. Short mention of the performance improvement should be sufficient.

coveralls · 2023-04-25T10:10:21Z

coverage: 20.419% (+0.09%) from 20.333% when pulling 0fe7238 on ErykKul:9557_dataset_update_api_performance into 2d29201 on IQSS:develop.

qqmyers

Very nice! I walked through the logic in getNextToIndex and don't see any problems. That said, comments in the code on how it works would be useful.

qqmyers · 2023-04-25T13:22:26Z

The CI failed after the unit tests and before the integration tests ran. @donsizemore - any ideas:

at TASK [dataverse : allow authenticated users to add to root] ********************
fatal: [localhost]: FAILED! => {"access_control_allow_headers": "Accept, Content-Type, X-Dataverse-Key, Range", "access_control_allow_methods": "PUT, GET, POST, DELETE, OPTIONS", "access_control_allow_origin": "*", "access_control_expose_headers": "Accept-Ranges, Content-Range, Content-Encoding", "changed": false, "connection": "close", "content_length": "74", "content_type": "application/json;charset=UTF-8", "elapsed": 0, "json": {"message": "Can't find dataverse with identifier='root'", "status": "ERROR"}, "msg": "Status code was 404 and not [200]: HTTP Error 404: Not Found", "redirected": false, "server": "Payara Server  5.2022.3 #badassfish", "status": 404, "url": "http://localhost:8080/api/dataverses/root/assignments?key=86a3f945-f565-4e54-a031-bc2551b98062", "x_frame_options": "SAMEORIGIN", "x_powered_by": "Servlet/4.0 JSP/2.3 (Payara Server  5.2022.3 #badassfish Java/Red Hat, Inc./11)"}

NO MORE HOSTS LEFT *************************************************************

ErykKul · 2023-04-25T15:40:19Z

I have added some comments to the code. I hope it made it more readable. Also, I hope that this change did not break any integration tests. I do not see the results of the tests, so let me know if something breaks. Thanks!

qqmyers · 2023-04-25T16:52:33Z

Thanks. FWIW, it looks like build issues rather than the PR contents are what is causing the failures right now. Once that's fixed we can rerun and let you know if there are any IT failures.

ErykKul · 2023-04-26T11:59:46Z

@qqmyers , @donsizemore
I am not sure if this is the case here, but I had a similar problem with the integration tests when my Solr did not start up. I happened because the base image is "latest" rocky linux (for docker-aio image) and apparently procps is not there by default anymore. I added it in the yum line at the end like this:

RUN yum install -y java-11-openjdk-devel postgresql13-server sudo epel-release unzip curl httpd python2 diffutils procps

This fixed the Solr starting up. I hope it helps.

Last time I had integration tests working was when I played with payara 6, I had to change quite some things to make it work (#9426).

ErykKul · 2023-04-26T13:38:52Z

@qqmyers , @donsizemore

I have added procps to my docker-aio image, and the tests run again. However, they're fail for this branch with missing root dataverse, while they did run almost successfully on the develop brunch (only two tests failed). Therefore, it is most likely that my changes cause the build to fail, I will investigate this in more detail.

qqmyers · 2023-04-26T14:22:46Z

@ErykKul - Great that you can investigate. FWIW: I think the failure to find root has been seen before - possibly a timing issue that you change to flush-mode is making more likely? Also, FWIW: I cherry-picked your commit to QDR and the call to create a role assignment works fine there (on a pre-existing database).

ErykKul · 2023-04-26T15:36:02Z

It was a timing issue: the on success method is in the same transaction as the creating of root dataverse (submit method), where on success tries to index the newly created dv. Since the submit method does not call flush (the specific create dataverse command implementation), and the transaction was not yet committed (and thus changes are not flushed), the newly created root dataverse does not yet exist and cannot be found by the index bean that needs the root dataverse on initialization... I added a line of code in the command engine that flushes before attempting on success, this should fix this and similar timing issues. I will continue the testing tomorrow.

qqmyers · 2023-04-26T16:35:17Z

FYI: It looks like all of the unit tests are failing with a TransactionRequired exception. The failure on the unit tests job is related to the Run docker/login-action@v2 - Error: Username and password required, but all the unit tests appear to pass in the CI build.

pdurbin · 2023-04-26T16:45:31Z

Yes, please don't worry about the Docker stuff. @poikilotherm and I will fix eventually. 😅

poikilotherm · 2023-04-26T16:48:37Z

Hear hear! 😬🙈

…ally before atempting indexing of the newly created dataverse

ErykKul · 2023-04-26T18:25:19Z

The transaction required thing was because of the fush in Ejb Dataverse Engine that I added, I moved it to the specific Create Dataverse Command and the test look way better now:

[ERROR] Failures: 
[ERROR]   ConfirmEmailIT.testConfirm:112 Expected status code <200> doesn't match actual status code <500>.

[ERROR]   DatasetsIT.testCuratePublishedDatasetVersionCommand:2803 Expected status code <200> doesn't match actual status code <500>.

[ERROR]   FilesIT.testAccessFacet:1112 JSON path data.total_count doesn't match.
Expected: <1>
  Actual: 0

[ERROR]   HarvestingServerIT.testMultiRecordOaiSet:606 Wrong number of items on the first ListIdentifiers page expected:<2> but was:<5>
[ERROR]   LinkIT.testDeepLinks:170 JSON path data.total_count doesn't match.
Expected: <1>
  Actual: 0

[ERROR]   UsersIT.convertNonBcryptUserFromBuiltinToShib:314 Expected status code <200> doesn't match actual status code <500>.

[ERROR]   UsersIT.testMergeAccounts:250 expected:<200> but was:<401>
[ERROR] Errors: 
[ERROR]   DatasetsIT.testPrivateUrl:964 » Connect Connection refused (Connection refused...
[ERROR]   SignpostingIT.testSignposting:91 » Connect Connection refused (Connection refu...
[INFO] 
[ERROR] Tests run: 199, Failures: 7, Errors: 2, Skipped: 8

The two errors are exactly the same as I was getting on the develop branch, so I will ignore them for now. I will investigate the 7 failing tests in more detail. All units test passed on my machine, so the failing on git looks incorrect. I think the test running on github has no internet access to pull the xsd schema: https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd

ErykKul · 2023-04-26T21:10:07Z

Good news: I have found what was causing the locking with flash-mode auto: the indexing updates the dataset while indexing it by setting last indexed time, that is why updating the dataset while the indexing was busy was not possible (deadlock situation). I have fixed the issue by making the update the last thing after indexing and making it asynchronous in a new transaction. It does not lock anymore, so I have reverted the flush-mode setting (back to default "auto"). This should fix some timing issues with integration tests.

Also, I will retest if asynchronous indexing is still necessary. If it still is, I will investigate how to improve the performance of the indexing itself. This may take some time, I think it will be a different pull request, if I find anything.

ErykKul · 2023-04-27T13:48:33Z

With my last commits, the CPU went nicely down in my tests. In my test scenario, indexing of large dataset takes now around 10 seconds instead of more than a minute. Further improvement of the indexing speed would require smart regeneration of the solar docs for files; only regenerating files that have changes would improve the performance.

It looks like this version can handle large datasets (containing thousands of files) very nicely. I will rerun the integration tests to see if there are still problems.

ErykKul · 2023-04-27T17:50:49Z

Most tests pass now. Some that are failing look not related to the changes I made. Some got fixed by adding the "wait for search", since the indexing happens in the background now. It looks safe to me; I will patch our test environment and do some more testing tomorrow. If everything goes well, we will consider patching production (this one is urgent for us since we have a researcher that is unable to upload a large dataset).

qqmyers · 2023-04-27T21:26:26Z

I'll look more later but FWIW: the api.UtilIT.sleepForReindex method might help with tests messed up by timing. That call waits until the lastindexedtime is updated.

…com/ErykKul/dataverse into 9557_dataset_update_api_performance

landreev · 2023-06-12T19:49:36Z

BTW, I grabbed this PR for review earlier today, without realizing that there had been an ongoing back-and-forth since April. @qqmyers, @poikilotherm if either of you are ready to approve it, please go ahead and don't wait for me. Otherwise I'm reading it with great interest.
Thank you, @ErykKul!

landreev · 2023-06-12T19:54:12Z

P.S. I mean, I see that it was already approved back in April, but a lot has been done since then; and I'm assuming that's why it was moved back into "In Review".

qqmyers · 2023-06-12T20:04:56Z

I was just waiting on the commits from June 1 that resolved some minor requests and I see there are more commits from June 9th. Assuming those are all OK, I was going to approve and move it back.

ErykKul · 2023-06-13T07:05:48Z

@landreev @qqmyers
The changes as requested are all there. The later commits (from June 9th) fixed the integration test issues. I think that all changes are OK now and can be merged. I will be still watching this closely, just in case something pops up (e.g., I am a little bit worried about re-indexing of large Dataverse collections, but normally it should be fine and even work better after this PR).

qqmyers · 2023-06-13T18:23:04Z

OK - looks good! No more changes unless requested in QA!

qqmyers · 2023-06-13T18:28:38Z

FWIW: It looks like the CI failure is the problem with the availability of payara-5.2022.3.zip that was discussed in #dv-tech today and apparently fixed. @kcondon - you may want to rerun to verify that the build now works before you dig in.

pdurbin · 2023-06-13T21:11:58Z

Yes, fixed in gdcc/dataverse-ansible@b3a99b7

I just kicked off a new build: https://jenkins.dataverse.org/job/IQSS-Dataverse-Develop-PR/job/PR-9558/60/

kcondon · 2023-06-14T20:04:51Z

Quick test so far shows adding 1 or more files to a dataset containing 1000 files is closer to adding a single file in performance: 6.71s for adding an additional file to 1000 ds before this pr, 2.94s to add a file with this pr. Adding 3 files to the 1000ds with this pr took 3.44s. Note these are rough times. Can repeat with a larger dataset.

Index all on test instance, before this pr with develop and using this pr:
With develop:
[2023-06-15T17:35:59.722+0000] [Payara 5.2022.3] [INFO] [] [edu.harvard.iq.dataverse.search.IndexBatchServiceBean] [tid: _ThreadID=227 _ThreadName=__ejb-thread-pool16] [timeMillis: 1686850559722] [levelValue: 800] [[
837 dataverses and 1954 datasets indexed. index all took 780740 milliseconds. Solr index was not cleared before indexing.
]]

with this pr:
[2023-06-15T17:59:38.474+0000] [Payara 5.2022.3] [INFO] [] [edu.harvard.iq.dataverse.search.IndexBatchServiceBean] [tid: _ThreadID=219 _ThreadName=__ejb-thread-pool8] [timeMillis: 1686851978474] [levelValue: 800] [[
837 dataverses and 1954 datasets indexed. index all took 473430 milliseconds. Solr index was not cleared before indexing.
]]

async indexing after update command

a2adafc

qqmyers approved these changes Apr 25, 2023

View reviewed changes

added comments to the code

a6ff591

fixed creating dataverse timing issue

956488b

ErykKul added 2 commits April 26, 2023 19:35

flush in dataverse engine caused problems, it is now called specifiic…

31cd292

…ally before atempting indexing of the newly created dataverse

nullpointer fix in unit test

2ab7820

fixed the problem with locking -> reverted flush-mode setting

94e58c1

ErykKul added 4 commits April 27, 2023 12:06

async ingest of uploaded file

33e56d1

async datset indexing is by default now

8c23bf2

removed previously added logger lines for debugging

5006bce

simplified ingest one file method

f71d0a9

ErykKul added 2 commits April 27, 2023 18:56

reverted file ingesting change

e12ac79

added sleep for search to failing integration tests

baebc68

integration tests fixes

1d8b526

ErykKul and others added 3 commits June 9, 2023 17:21

Merge branch 'IQSS:develop' into 9557_dataset_update_api_performance

cc4a15f

restored findDeep in index service bean but only where find was used

4be0b5d

Merge branch '9557_dataset_update_api_performance' of https://github.…

0fe7238

…com/ErykKul/dataverse into 9557_dataset_update_api_performance

pdurbin assigned landreev Jun 12, 2023

Merge branch 'IQSS:develop' into 9557_dataset_update_api_performance

56c4a38

landreev removed their assignment Jun 13, 2023

pdurbin assigned qqmyers Jun 13, 2023

qqmyers removed their assignment Jun 13, 2023

kcondon self-assigned this Jun 14, 2023

kcondon merged commit 49f42ba into IQSS:develop Jun 15, 2023

This was referenced Jun 16, 2023

Filters for files in dataset page are gone #9647

Closed

9647 bugfix facets on files page #9649

Merged

This was referenced Jul 12, 2023

retry upload if it failed? #9696

Open

reduce logging and fix typo #9711

Merged

This was referenced Jul 20, 2023

Performance issues as a side effect of the file lookup optimizations #9725

Closed

9725 files lookup performance #9736

Merged

pdurbin mentioned this pull request Aug 24, 2023

Solr: Index all performance is too slow with full production data. #50

Closed

pdurbin added the Feature: Performance & Stability label Sep 25, 2023

pdurbin mentioned this pull request Dec 13, 2023

MoveIT.testMoveLinkedDataset fails intermittently #10189

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

async indexing after update command #9558

async indexing after update command #9558

ErykKul commented Apr 25, 2023

coveralls commented Apr 25, 2023 •

edited

Loading

qqmyers left a comment

qqmyers commented Apr 25, 2023

ErykKul commented Apr 25, 2023

qqmyers commented Apr 25, 2023

ErykKul commented Apr 26, 2023

ErykKul commented Apr 26, 2023

qqmyers commented Apr 26, 2023

ErykKul commented Apr 26, 2023

qqmyers commented Apr 26, 2023

pdurbin commented Apr 26, 2023

poikilotherm commented Apr 26, 2023

ErykKul commented Apr 26, 2023

ErykKul commented Apr 26, 2023

ErykKul commented Apr 27, 2023

ErykKul commented Apr 27, 2023

qqmyers commented Apr 27, 2023

landreev commented Jun 12, 2023

landreev commented Jun 12, 2023

qqmyers commented Jun 12, 2023

ErykKul commented Jun 13, 2023

qqmyers commented Jun 13, 2023

qqmyers commented Jun 13, 2023

pdurbin commented Jun 13, 2023

kcondon commented Jun 14, 2023 •

edited

Loading

async indexing after update command #9558

async indexing after update command #9558

Conversation

ErykKul commented Apr 25, 2023

coveralls commented Apr 25, 2023 • edited Loading

qqmyers left a comment

Choose a reason for hiding this comment

qqmyers commented Apr 25, 2023

ErykKul commented Apr 25, 2023

qqmyers commented Apr 25, 2023

ErykKul commented Apr 26, 2023

ErykKul commented Apr 26, 2023

qqmyers commented Apr 26, 2023

ErykKul commented Apr 26, 2023

qqmyers commented Apr 26, 2023

pdurbin commented Apr 26, 2023

poikilotherm commented Apr 26, 2023

ErykKul commented Apr 26, 2023

ErykKul commented Apr 26, 2023

ErykKul commented Apr 27, 2023

ErykKul commented Apr 27, 2023

qqmyers commented Apr 27, 2023

landreev commented Jun 12, 2023

landreev commented Jun 12, 2023

qqmyers commented Jun 12, 2023

ErykKul commented Jun 13, 2023

qqmyers commented Jun 13, 2023

qqmyers commented Jun 13, 2023

pdurbin commented Jun 13, 2023

kcondon commented Jun 14, 2023 • edited Loading

coveralls commented Apr 25, 2023 •

edited

Loading

kcondon commented Jun 14, 2023 •

edited

Loading