Support batch ingestion in bulk API (#12457) #13306

Merged Apr 30, 2024 (18 commits)

chishui (Contributor) commented Apr 19, 2024

Description

This PR enables batch ingestion in the _bulk API. See the RFC for the proposal and discussion. It includes three major changes:

  1. Add two parameters to the _bulk API:
    1. batch_ingestion_option: accepts two values, disabled (the default) and enabled.
    2. maximum_batch_size: the batch size. For example, if a _bulk request carries 100 documents for ingest and maximum_batch_size is 20, the documents are split into 5 batches of 20 documents each.
  2. Support batchExecute in Processor, Pipeline, and CompoundProcessor so that they can process documents in batches.
  3. When the user enables batch ingestion, IngestService processes documents through the batch flow and calls batchExecute on the Pipeline.
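
The batching arithmetic and the batch entry point described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `BatchSketch`, `DocProcessor`, and `partition` are hypothetical names, documents are plain strings, and the `batchExecute` signature is an assumption (only the method name comes from the description). Note also that later commits in this thread rename the parameter to `batch_size` and drop `batch_ingestion_option`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchSketch {

    /** Hypothetical stand-in for a processor; NOT the OpenSearch Processor interface. */
    interface DocProcessor {
        // Per-document path.
        String execute(String doc);

        // Batch path (assumed shape): by default, fall back to handling
        // the documents one at a time and hand the results to the callback.
        default void batchExecute(List<String> docs, Consumer<List<String>> handler) {
            List<String> results = new ArrayList<>(docs.size());
            for (String doc : docs) {
                results.add(execute(doc));
            }
            handler.accept(results);
        }
    }

    /** Splits docs into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < docs.size(); start += batchSize) {
            int end = Math.min(start + batchSize, docs.size());
            batches.add(new ArrayList<>(docs.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            docs.add("doc-" + i);
        }

        // 100 documents with a batch size of 20, as in the description.
        List<List<String>> batches = partition(docs, 20);
        System.out.println(batches.size() + " batches of " + batches.get(0).size());

        // A processor that only defines execute() still works per batch
        // through the default batchExecute.
        DocProcessor upper = String::toUpperCase;
        List<String> processed = new ArrayList<>();
        for (List<String> batch : batches) {
            upper.batchExecute(batch, processed::addAll);
        }
        System.out.println(processed.size() + " documents processed");
    }
}
```

A default method that loops over the per-document path is one way to let existing processors keep working unchanged while processors that benefit from batching override it; whether the merged code takes exactly this approach should be checked against the diff.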

Related Issues

Resolves #12457

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

❌ Gradle check result for e404b2c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for 6dd9e4d:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

✅ Gradle check result for 3c7d516: SUCCESS

codecov bot commented Apr 24, 2024

Codecov Report

Attention: Patch coverage is 83.23353%, with 56 lines in your changes missing coverage. Please review.

Project coverage is 71.45%. Comparing base (b15cb0c) to head (3cc7f41).
Report is 241 commits behind head on main.

Files                                                  Patch %  Lines
...main/java/org/opensearch/ingest/IngestService.java  79.81%   29 Missing and 14 partials ⚠️
...n/java/org/opensearch/action/bulk/BulkRequest.java  40.00%   5 Missing and 1 partial ⚠️
.../java/org/opensearch/ingest/CompoundProcessor.java  93.75%   0 Missing and 4 partials ⚠️
...a/org/opensearch/ingest/IngestDocumentWrapper.java  72.72%   3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13306      +/-   ##
============================================
+ Coverage     71.42%   71.45%   +0.03%     
- Complexity    59978    60862     +884     
============================================
  Files          4985     5046      +61     
  Lines        282275   286403    +4128     
  Branches      40946    41489     +543     
============================================
+ Hits         201603   204640    +3037     
- Misses        63999    64824     +825     
- Partials      16673    16939     +266     

Contributor

✅ Gradle check result for cb1ba09: SUCCESS

Contributor

✅ Gradle check result for 8d302cc: SUCCESS

dbwiddis (Member) left a comment

LGTM from a code perspective. I would still like @reta and/or @navneet1v to evaluate this and confirm whether it addresses the performance concerns raised in the RFC discussion.

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
dblock (Member) left a comment

(can be in a future PR)

Handle and add tests for batch_size = 0 and -1.
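
As a rough illustration of the cases named above, the sketch below checks `batch_size` values of 0 and -1 against a simple validator. Everything here is hypothetical: the class and method names, the exception type, and the rule that any value below 1 is rejected are assumptions made for illustration, not the behavior of the merged code (a later commit in this thread is titled "Throw invalid request exception for invalid batch_size", which is the authoritative source for the actual handling).

```java
public class BatchSizeValidation {

    /**
     * Hypothetical validator: assumes any batch size below 1 is invalid.
     * The exception type and message are illustrative only.
     */
    static int requireValidBatchSize(int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException(
                "batch_size must be greater than 0, got " + batchSize);
        }
        return batchSize;
    }

    /** Returns true when the validator rejects the given value. */
    static boolean isRejected(int batchSize) {
        try {
            requireValidBatchSize(batchSize);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // The two edge cases from the review comment, plus two valid values.
        for (int candidate : new int[] {0, -1, 1, 20}) {
            System.out.println(candidate + " rejected: " + isRejected(candidate));
        }
    }
}
```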

Contributor

❕ Gradle check result for e2fb585: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.remotestore.multipart.RemoteStoreMultipartIT.testOverriddenBufferInterval

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
Contributor

❌ Gradle check result for 334e0c1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

dblock (Member) commented Apr 29, 2024

❌ Gradle check result for 334e0c1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

org.opensearch.action.admin.indices.create.RemoteSplitIndexIT.testCreateSplitIndex
org.opensearch.action.admin.indices.create.RemoteSplitIndexIT.testSplitFromOneToN
org.opensearch.action.admin.indices.create.RemoteSplitIndexIT.classMethod

Looks like setup timed out, not sure it's a flake.

chishui and others added 2 commits April 30, 2024 07:57
Co-authored-by: Andriy Redko <drreta@gmail.com>
Signed-off-by: Liyun Xiu <chishui2@gmail.com>
Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
Contributor

❌ Gradle check result for 3cc7f41: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

chishui (Contributor Author) commented Apr 30, 2024

Contributor

❌ Gradle check result for 68cabe1: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

✅ Gradle check result for 3cc7f41: SUCCESS

chishui (Contributor Author) commented Apr 30, 2024

gradle check passed, @dblock / @dbwiddis could you help merge?

@dblock dblock merged commit 1219c56 into opensearch-project:main Apr 30, 2024
28 checks passed
opensearch-trigger-bot (Contributor) commented

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-13306-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 1219c568248fafa479d67a1eaa6e3e2d9748701e
# Push it to GitHub
git push --set-upstream origin backport/backport-13306-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-13306-to-2.x.

dblock (Member) commented Apr 30, 2024

@chishui this will need a manual backport pls

dblock pushed a commit that referenced this pull request Apr 30, 2024
…13462)

* Support batch ingestion in bulk API (#12457) (#13306)

* [PoC][issues-12457] Support Batch Ingestion

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

* Rewrite batch interface and handle error and metrics

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

* Remove unnecessary change

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

* Revert some unnecessary test change

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

* Keep executeBulkRequest main logic untouched

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

* Add UT

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

* Add UT & yamlRest test, fix BulkRequest se/deserialization

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

* Add missing java docs

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

* Remove Writable from BatchIngestionOption

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

* Add more UTs

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

* Fix spotlesscheck

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

* Rename parameter name to batch_size

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

* Add more rest yaml tests & update rest spec

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

* Remove batch_ingestion_option and only use batch_size

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

* Throw invalid request exception for invalid batch_size

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

* Update server/src/main/java/org/opensearch/action/bulk/BulkRequest.java

Co-authored-by: Andriy Redko <drreta@gmail.com>
Signed-off-by: Liyun Xiu <chishui2@gmail.com>

* Remove version constant

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

---------

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
Signed-off-by: Liyun Xiu <chishui2@gmail.com>
Co-authored-by: Andriy Redko <drreta@gmail.com>
(cherry picked from commit 1219c56)

* Adjust changelog item position to trigger CI

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>

---------

Signed-off-by: Liyun Xiu <xiliyun@amazon.com>
finnegancarroll pushed a commit to finnegancarroll/OpenSearch that referenced this pull request May 10, 2024
…earch-project#13306)

deshsidd pushed a commit to deshsidd/OpenSearch that referenced this pull request May 17, 2024
…earch-project#13306)

@@ -26,6 +26,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
- Add cluster setting to dynamically configure the buckets for filter rewrite optimization. ([#13179](https://github.com/opensearch-project/OpenSearch/pull/13179))
- [Tiered Caching] Add a dynamic setting to disable/enable disk cache. ([#13373](https://github.com/opensearch-project/OpenSearch/pull/13373))
- [Remote Store] Add capability of doing refresh as determined by the translog ([#12992](https://github.com/opensearch-project/OpenSearch/pull/12992))
- [Batch Ingestion] Add `batch_size` to `_bulk` API. ([#12457](https://github.com/opensearch-project/OpenSearch/issues/12457))
Member

Shouldn't this link point to the PR rather than the GitHub issue?

Member

PRs are usually what gets linked, but I have no problem with (and personally prefer) linking the issue in the changelog when a good issue exists.

8 participants