[Bulk] Add _index, _id, status to ERROR object #10015

Open
wants to merge 6 commits into
base: main

Conversation

aswath86

Description

One of the Bulk API best practices is to reduce the response size using filter_path. The AWS OpenSearch documentation says:

This response size might seem minimal, but if you index 1,000,000 documents per day—approximately 11.5 documents per second—339 bytes per response works out to 10.17 GB of download traffic per month.
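
As a quick sanity check of the quoted figures, the arithmetic can be reproduced as below. This is only a sketch: it assumes one 339-byte response per indexed document, a 30-day month, and decimal gigabytes (1 GB = 10^9 bytes). The class and method names are made up for illustration.

```java
// Back-of-the-envelope check of the quoted traffic figures.
// Assumptions (not stated in the quote): one response per document,
// a 30-day month, and 1 GB = 1,000,000,000 bytes.
class BulkResponseTraffic {
    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    /** Average number of documents indexed per second for a daily volume. */
    static double docsPerSecond(long docsPerDay) {
        return (double) docsPerDay / SECONDS_PER_DAY;
    }

    /** Response bytes downloaded over one month, in decimal gigabytes. */
    static double monthlyGigabytes(long docsPerDay, long bytesPerResponse, int daysPerMonth) {
        return docsPerDay * bytesPerResponse * daysPerMonth / 1e9;
    }

    public static void main(String[] args) {
        System.out.println(docsPerSecond(1_000_000) + " docs/s");
        System.out.println(monthlyGigabytes(1_000_000, 339, 30) + " GB/month");
    }
}
```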

Also, the HTTP response code of a Bulk request often cannot be trusted on its own, because document-level failures are reported only in the bulk response body.

For example, consider the below failed document

{
    "index": {
        "_index": "bulk_response",
        "_id": "2",
        "status": 400,
        "error": {
            "type": "strict_dynamic_mapping_exception",
            "reason": "mapping set to strict, dynamic introduction of [field2x] within [_doc] is not allowed"
        }
    }
}

A filter_path such as filter_path=items.index.error returns only the following, leaving no clue as to which document in which index failed.

  {
    "index": {
      "error": {
        "type": "strict_dynamic_mapping_exception",
        "reason": "mapping set to strict, dynamic introduction of [field2x] within [_doc] is not allowed"
      }
    }
  }

As it stands, one cannot both reduce the response size and identify the failed documents. The idea is to add _index, _id and status to the error object as well, so that the filtered response gives us this:

  {
    "index" : {
      "error" : {
        "_index" : "bulk_response",
        "_id" : "3",
        "status" : 400,
        "type" : "strict_dynamic_mapping_exception",
        "reason" : "mapping set to strict, dynamic introduction of [field2x] within [_doc] is not allowed"
      }
    }
  }

_index, _id and status would be repeated for those responses that end in an error. Are we OK with that?

This may not be very useful when _id is auto-generated, but it is useful when _id is client-generated.
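
To make the intended usage concrete, here is a rough client-side sketch of reading a response requested with filter_path=items.index.error once the error object carries _index, _id and status as proposed. It assumes the response body has already been parsed into nested maps by whatever JSON library the client uses. FailedDocs, Failure and collect are hypothetical names, not part of this PR or of any OpenSearch client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical client-side helper, for illustration only.
class FailedDocs {
    /** One failed bulk item, identified by the fields this PR proposes to add to "error". */
    static final class Failure {
        final String index;
        final String id;
        final int status;
        final String type;

        Failure(String index, String id, int status, String type) {
            this.index = index;
            this.id = id;
            this.status = status;
            this.type = type;
        }
    }

    /**
     * Collects failures from the "items" array of a bulk response, assuming the
     * proposed shape where each "error" object contains _index, _id and status.
     */
    @SuppressWarnings("unchecked")
    static List<Failure> collect(List<Map<String, Object>> items) {
        List<Failure> failures = new ArrayList<>();
        for (Map<String, Object> item : items) {
            Map<String, Object> action = (Map<String, Object>) item.get("index");
            if (action == null) {
                continue; // not an "index" action
            }
            Map<String, Object> error = (Map<String, Object>) action.get("error");
            if (error == null) {
                continue; // no error recorded for this item
            }
            failures.add(new Failure(
                (String) error.get("_index"),
                (String) error.get("_id"),
                ((Number) error.get("status")).intValue(),
                (String) error.get("type")));
        }
        return failures;
    }
}
```

A client could then retry or log each Failure by its index and id without asking for the full, unfiltered response.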

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Aswath <it.aswath@gmail.com>
This is to reduce the bulk response size with filter_path on items.index.error and capture failed documents
@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Contributor

Compatibility status:

Checks if related components are compatible with change cbc0a90

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git]

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Oct 13, 2023
@ticheng-aws
Contributor

Hi @aswath86, the PR is stalled. Is this being worked upon? Feel free to reach out to maintainers for further reviews.

@opensearch-trigger-bot opensearch-trigger-bot bot removed the stalled Issues that have stalled label Jan 9, 2024
@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Feb 12, 2024
@sohami sohami added enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing API Issues with external APIs labels Feb 14, 2024
@opensearch-trigger-bot opensearch-trigger-bot bot removed the stalled Issues that have stalled label Feb 17, 2024
@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Mar 24, 2024
@@ -94,6 +94,9 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
builder.field(_ID, failure.getId());
builder.field(STATUS, failure.getStatus().getStatus());
builder.startObject(ERROR);
builder.field(_INDEX, failure.getIndex());
builder.field(_ID, failure.getId());
Member

Is this always generated when the error is passed? What if the error was encountered even before the document id could be generated?

Author

Then the behaviour would be the same as for the existing builder.field(_ID, failure.getId()); call on the line above builder.startObject(ERROR);

Member

In that case, users who do not provide filter_path will get the _id and _index fields twice for each error, which adds payload by default. I wonder if there is a better way to solve this.

Member

@mgodwan mgodwan Jul 3, 2024

Also, if filter_path only contains error, how can users determine which document actually failed? Successful docs won't return any response, and with auto-generated ids it becomes difficult for clients to know which document failed. (Applicable only to auto-generated ids.)

@opensearch-trigger-bot opensearch-trigger-bot bot removed the stalled Issues that have stalled label Apr 28, 2024
@aswath86 aswath86 requested a review from ashking94 as a code owner June 28, 2024 14:55
Contributor

❌ Gradle check result for 693cea5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Jul 1, 2024

❌ Gradle check result for 140d25d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Jul 1, 2024

❌ Gradle check result for 620165f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@@ -96,6 +96,9 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
builder.field(_ID, failure.getId());
builder.field(STATUS, failure.getStatus().getStatus());
builder.startObject(ERROR);
builder.field(_INDEX, failure.getIndex());
Member

Can you add tests for this?

@@ -96,6 +96,9 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
builder.field(_ID, failure.getId());
builder.field(STATUS, failure.getStatus().getStatus());
builder.startObject(ERROR);
builder.field(_INDEX, failure.getIndex());
Member

Adding these new fields will increase bandwidth usage for existing users who are not using filter_path. I would suggest that we instead include this information in the error reason in a way that is backward compatible. You could also consider adding a new field controlled by a query parameter, similar to the _cat/nodes API, where the caller controls which fields are returned.

Member

By default we should not increase the response size; users should opt in if they need the additional information added here.
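
One possible shape for the opt-in behaviour suggested in this thread is sketched below. It is illustrative only: plain maps stand in for XContentBuilder, and the includeIdentityInError flag is a hypothetical switch (for example, driven by a request parameter) that does not exist in OpenSearch today.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch, not the PR's code: maps stand in for XContentBuilder,
// and includeIdentityInError is a hypothetical opt-in flag.
class BulkItemErrorSketch {
    /** Builds the body of one failed bulk item, optionally repeating identity fields inside "error". */
    static Map<String, Object> failedItem(String index, String id, int status,
                                          String type, String reason,
                                          boolean includeIdentityInError) {
        Map<String, Object> error = new LinkedHashMap<>();
        if (includeIdentityInError) {
            // Only callers who opt in pay for the duplicated fields.
            error.put("_index", index);
            error.put("_id", id);
            error.put("status", status);
        }
        error.put("type", type);
        error.put("reason", reason);

        // Top-level fields are always emitted, as in the existing response.
        Map<String, Object> item = new LinkedHashMap<>();
        item.put("_index", index);
        item.put("_id", id);
        item.put("status", status);
        item.put("error", error);
        return item;
    }
}
```

With the flag off, the item has the same fields as the failed-document example in the description, so clients that do not use filter_path would see no extra payload.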

@mgodwan
Member

mgodwan commented Jul 22, 2024

@aswath86 Are you planning to continue on this change?

Contributor

❌ Gradle check result for 0a71128: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Labels
API Issues with external APIs enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing

5 participants