Ingest pipeline supports modifying the op_type parameter of an indexing request #15031

gaobinlong · 2024-07-31T06:00:20Z

Description

This PR adds a new metadata field, _op_type, to IngestDocument so that an ingest processor can modify the op_type parameter of an indexing request. This is useful when users switch the target of their indexing requests from an ordinary index to a data stream. Because a data stream only accepts an op_type of create, the following bulk request fails with the exception "only write ops with an op_type of create are allowed in data streams":

PUT _index_template/template_2
{
  "index_patterns": [
    "ds*"
  ],
  "data_stream":{

  },
  "template": {
    "settings": {
      "number_of_replicas": 0
    },
    "mappings": {
    }
  },
  "priority": 500
}

PUT _data_stream/ds1

PUT /ds1/_bulk?refresh
{"index":{ }}
{ "@timestamp": "2024-03-08T11:04:05.000Z", "foo":"bar" }

Users have to change the op_type to create in the request body:

PUT /ds1/_bulk?refresh
{"create":{ }}
{ "@timestamp": "2024-03-08T11:04:05.000Z", "foo":"bar" }

The index API has the same issue.

So this PR gives users the option to set up an ingest pipeline that changes the op_type parameter to create, which avoids changing the client code. Usage:

PUT _ingest/pipeline/set_processor
{
  "processors": [
      {
        "set": {
          "field": "_op_type",
          "value": "create"
        }
      }
    ]
}

PUT ds1/_settings
{
  "index.default_pipeline":"set_processor"
}

PUT /ds1/_bulk?refresh
{"index":{ }}
{ "@timestamp": "2024-03-08T11:04:05.000Z", "foo":"bar" }

Related Issues

Resolves #2856.

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ng request Signed-off-by: Gao Binlong <gbinlong@amazon.com>

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

github-actions · 2024-07-31T06:23:59Z

❌ Gradle check result for 4025a02: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-07-31T06:48:19Z

❌ Gradle check result for 8f9f91a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

github-actions · 2024-07-31T10:36:56Z

❕ Gradle check result for dde3be1: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

codecov · 2024-07-31T10:47:53Z

Codecov Report

Attention: Patch coverage is 95.45455% with 1 line in your changes missing coverage. Please review.

Project coverage is 71.81%. Comparing base (a918530) to head (e9c8b32).
Report is 393 commits behind head on main.

Files with missing lines	Patch %	Lines
...main/java/org/opensearch/ingest/IngestService.java	80.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #15031      +/-   ##
============================================
- Coverage     71.84%   71.81%   -0.04%     
+ Complexity    62911    62897      -14     
============================================
  Files          5176     5176              
  Lines        295133   295149      +16     
  Branches      42676    42680       +4     
============================================
- Hits         212029   211951      -78     
- Misses        65709    65754      +45     
- Partials      17395    17444      +49

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

owaiskazi19

Looks good overall with minor suggestions

...les/ingest-common/src/yamlRestTest/resources/rest-api-spec/test/ingest/270_set_processor.yml

owaiskazi19 · 2024-08-06T17:01:51Z

@andrross another look?

owaiskazi19 · 2024-08-06T22:11:01Z

@gaobinlong can you resolve the conflicts?

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

github-actions · 2024-08-07T02:55:37Z

❌ Gradle check result for a8bd353: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-08-07T03:51:04Z

❕ Gradle check result for a8bd353: UNSTABLE

TEST FAILURES:

      1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testReadRange

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

github-actions · 2024-08-07T14:13:11Z

✅ Gradle check result for e9c8b32: SUCCESS

andrross · 2024-08-08T22:29:01Z

So this PR gives users an option that they can setup an ingest pipeline with modifying the op_type parameter to create to avoid changing the client code

@gaobinlong I'm not convinced this is a good idea. create has different semantics than the other operation types. In the general case you can't take a system that is ingesting documents with the index operation type and replace it with create and expect things to work because those operations do different things.
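To make the semantic difference concrete, here is a minimal toy model in Python. It is only an illustration, not OpenSearch code: `ToyIndex` and `VersionConflict` are hypothetical names, and the model covers just the put-if-absent behavior of create versus the overwrite behavior of index.

```python
class VersionConflict(Exception):
    """Raised when a create targets an ID that already exists."""


class ToyIndex:
    """In-memory stand-in for an index (hypothetical, for illustration only)."""

    def __init__(self):
        self.docs = {}

    def write(self, doc_id, source, op_type="index"):
        if op_type == "create":
            # create is put-if-absent: it refuses to overwrite a document.
            if doc_id in self.docs:
                raise VersionConflict(f"document {doc_id!r} already exists")
            self.docs[doc_id] = source
            return "created"
        if op_type == "index":
            # index adds the document or replaces the existing one.
            result = "updated" if doc_id in self.docs else "created"
            self.docs[doc_id] = source
            return result
        raise ValueError(f"unsupported op_type: {op_type}")


idx = ToyIndex()
first = idx.write("1", {"foo": "bar"}, op_type="index")
second = idx.write("1", {"foo": "baz"}, op_type="index")
try:
    idx.write("1", {"foo": "qux"}, op_type="create")
    third = "created"
except VersionConflict:
    third = "conflict"
```

A client that relies on re-sending a document with the same ID to overwrite it would start seeing conflicts if its requests were silently switched to create.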

gaobinlong · 2024-08-12T02:43:42Z

So this PR gives users an option that they can setup an ingest pipeline with modifying the op_type parameter to create to avoid changing the client code

@gaobinlong I'm not convinced this is a good idea. create has different semantics than the other operation types. In the general case you can't take a system that is ingesting documents with the index operation type and replace it with create and expect things to work because those operations do different things.

Thanks @andrross. This change doesn't target the general case; it targets users who want to write to a data stream without any code or configuration change in the client. For example, if Logstash is used to write to a data stream in OpenSearch, the action setting must be set to create, as in this example (https://opensearch.org/docs/latest/tools/logstash/ship-to-opensearch/):

output {
  opensearch {
    hosts => ["https://hostname:port"]
    auth_type => {
      type => 'basic'
      user => 'admin'
      password => 'admin'
    }
    index => "my-data-stream"
    action => "create"
  }
}

However, if the ingestion tool doesn't support the action parameter, users have no options. This change provides some flexibility: they can either configure the op_type on the client side or modify it on the server side. In addition, fields such as _index, _routing, if_seq_no, and if_primary_term can already be modified by an ingest pipeline, so I think it makes sense to support modifying the op_type parameter during ingest pipeline execution as well.

andrross · 2024-08-16T21:03:49Z

if the ingestion tool doesn't support the action parameter, users have no options

Are there users in this situation asking for this feature?

The reason I'm hesitant is that while it does solve the above use case, it seems like it could let users really shoot themselves in the foot, from the subtle (the different semantics of index vs create can cause hard-to-track-down errors in the system) to the absurd (always set op_type to delete). @msfroh what do you think?

opensearch-trigger-bot · 2024-09-17T15:22:12Z

This PR is stalled because it has been open for 30 days with no activity.

opensearch-trigger-bot · 2024-10-20T15:22:51Z

This PR is stalled because it has been open for 30 days with no activity.

msfroh · 2024-10-22T20:36:13Z

The reason I'm hesitant is that while it does solve the above use case, it seems like it could let users really shoot themselves in the foot,

In this case, given the other fields that can be overridden (e.g., if_seq_no and if_primary_term), I think users can already shoot themselves in the foot pretty hard by abusing metadata fields. That said, I can see how this might make it even easier.

I'm trying to think of alternative solutions. The obvious ones would be "Update the client to set the action" or "Set up a proxy that sets the action between the client and cluster". Of course, ingest pipelines are like a proxy that happens to run on the cluster. I'm wondering if it would make sense to add a dedicated processor for this task, rather than doing it through the set processor. With a dedicated processor, we could at least document the risks.
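As a rough illustration of the client-side or proxy alternative, the sketch below rewrites the action lines of a bulk body from index to create before the request is sent. It is an assumption-laden example, not part of this PR: `rewrite_index_to_create` is a hypothetical helper, and it assumes a well-formed NDJSON body in which every action except delete is followed by one source line.

```python
import json


def rewrite_index_to_create(bulk_body: str) -> str:
    """Return a copy of an NDJSON bulk body with index actions turned into create."""
    lines = [line for line in bulk_body.split("\n") if line.strip()]
    out = []
    i = 0
    while i < len(lines):
        action = json.loads(lines[i])
        # An action line is a single-key object such as {"index": {...}}.
        op, meta = next(iter(action.items()))
        if op == "index":
            action = {"create": meta}
        out.append(json.dumps(action))
        i += 1
        if op != "delete":
            # Copy the source line that follows the action line unchanged.
            out.append(lines[i])
            i += 1
    # A bulk body must end with a newline.
    return "\n".join(out) + "\n"


body = '{"index":{ }}\n{ "@timestamp": "2024-03-08T11:04:05.000Z", "foo":"bar" }\n'
rewritten = rewrite_index_to_create(body)
```

Like the ingest pipeline approach, this changes the operation without the original client knowing, so the same caveat about index and create behaving differently applies.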

andrross · 2024-10-23T19:37:35Z

I'm trying to think of alternative solutions. The obvious ones would be "Update the client to set the action" or "Set up a proxy that sets the action between the client and cluster".

@msfroh I'm in favor of "Update the client to set the action" because index and create op_types have different semantics so you have to be sure your client will continue to behave properly if changing from one type to the other. It's better to do this on the client side versus silently changing it in the server or in a proxy layer. I'm happy to be convinced that I'm being too pedantic about this though. If we do decide to support this then I think I'm in favor of doing it as implemented in this PR versus creating a dedicated processor type (that seems overkill).

A third option would be to allow data streams to support the index op type but just silently behave as if create was specified. I think this is a bad idea and in my opinion highlights why changing the op type in an ingest processor is also a bad idea.

msfroh · 2024-10-24T03:47:17Z

I think the real problem here may be that IndexRequest defaults to opType = OpType.INDEX:

OpenSearch/server/src/main/java/org/opensearch/action/index/IndexRequest.java

Line 116 in 0d54c16

private OpType opType = OpType.INDEX;

It sounds like what we really need to do is leave it null by default, and then set the default based on the target. On an index, it should be INDEX, while on a data stream, the default should be CREATE. @gaobinlong -- it sounds like you're saying that the real issue is clients that don't set the op_type at all, right?

@andrross, @gaobinlong, what do you think of that option?
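The proposed default resolution can be sketched as follows. This is a Python model of the idea only, not the Java code in IndexRequest: `resolve_op_type` and its `target_is_data_stream` parameter are hypothetical names, and the enum is reduced to the two values discussed here.

```python
from enum import Enum
from typing import Optional


class OpType(Enum):
    INDEX = "index"
    CREATE = "create"


def resolve_op_type(requested: Optional[OpType], target_is_data_stream: bool) -> OpType:
    """Pick the effective op type, defaulting by target only when none was requested."""
    if requested is not None:
        # An explicit choice from the client is never overridden.
        return requested
    return OpType.CREATE if target_is_data_stream else OpType.INDEX
```

Under this model an explicit index sent to a data stream would still be rejected later by the existing validation; only requests that leave the op type unset get the target-dependent default.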

gaobinlong · 2024-10-25T03:32:35Z

It sounds like what we really need to do is leave it null by default, and then set the default based on the target. On an index, it should be INDEX, while on a data stream, the default should be CREATE

Thanks @msfroh. That behavior already exists for the index API: when op_type is not specified, it defaults to create for data streams and to index for an ordinary index. Explicitly setting op_type=index on a data stream is rejected:

POST ds11/_doc?refresh&op_type=index
{
  "a":1,
  "@timestamp":"2024-07-08T09:28:48+00:00"
}

Response: 
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "only write ops with an op_type of create are allowed in data streams"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "only write ops with an op_type of create are allowed in data streams"
  },
  "status": 400
}

For the bulk API, the op_type must be specified explicitly in each action line, so only create is accepted for data streams:

## works well
PUT /ds11/_bulk?refresh
{"create":{}}
{ "@timestamp": "2024-08-13T11:04:05.000Z", "foo":"bar" }

## does not work
PUT /ds11/_bulk?refresh
{"index":{}}
{ "@timestamp": "2024-08-13T11:04:05.000Z", "foo":"bar" }

Response:
{
  "took": 0,
  "errors": true,
  "items": [
    {
      "index": {
        "_index": "ds11",
        "_id": null,
        "status": 400,
        "error": {
          "type": "illegal_argument_exception",
          "reason": "only write ops with an op_type of create are allowed in data streams"
        }
      }
    }
  ]
}
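Because a bulk request reports failures per item in the response body, a client has to inspect the items to notice this rejection. A minimal sketch of that check, using the response shown above as input (`failed_bulk_items` is a hypothetical helper name, not an OpenSearch client API):

```python
def failed_bulk_items(response):
    """Return one summary dict per failed item in a _bulk response."""
    if not response.get("errors"):
        return []
    failures = []
    for position, item in enumerate(response.get("items", [])):
        # Each item is a single-key object keyed by the action name.
        op, result = next(iter(item.items()))
        error = result.get("error")
        if error is not None:
            failures.append({
                "position": position,
                "op": op,
                "status": result.get("status"),
                "reason": error.get("reason"),
            })
    return failures


response = {
    "took": 0,
    "errors": True,
    "items": [
        {
            "index": {
                "_index": "ds11",
                "_id": None,
                "status": 400,
                "error": {
                    "type": "illegal_argument_exception",
                    "reason": "only write ops with an op_type of create are allowed in data streams",
                },
            }
        }
    ],
}
failures = failed_bulk_items(response)
```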

it sounds like you're saying that the real issue is clients that don't set the op_type at all, right?

@msfroh @andrross, I've checked Logstash, data-prepper, fluent-bit, and java-client. All of them currently support specifying the op_type parameter, but older versions, from before users reported this problem (opensearch-project/data-prepper#2038 (comment)), may not. While checking these tools, I found that most of them treat an ordinary index and a data stream as different things: op_type must be set explicitly for data streams. The tools could improve this by defaulting op_type to create for data streams. On the server side, though, maybe we can also provide a workaround for users who don't want to maintain different client-side configuration for data streams and ordinary indices? As the user said in the issue: "Alternatively I could not use a data stream or change the client (which I don't want to do in this case)."

Ingest pipeline supports modifying the op_type parameter of an indexi…

4025a02

…ng request Signed-off-by: Gao Binlong <gbinlong@amazon.com>

gaobinlong requested review from anasalkouz, andrross, ashking94, Bukhtawar, CEHENKLE, dblock, dbwiddis, gbbafna, kotwanikunal, mch2, msfroh, nknize, owaiskazi19, reta, Rishikesh1159, sachinpkale, saratvemulapalli, shwetathareja, sohami and VachaShah as code owners July 31, 2024 06:00

github-actions bot added enhancement Enhancement or improvement to existing feature or request Indexing & Search labels Jul 31, 2024

gaobinlong added the backport 2.x Backport to 2.x branch label Jul 31, 2024

Modify changelog

8f9f91a

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

gaobinlong added 2 commits July 31, 2024 17:48

Fix yaml test failure

dde3be1

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

Revert some change

94c374c

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

owaiskazi19 reviewed Aug 5, 2024


owaiskazi19 approved these changes Aug 6, 2024

merge main

a8bd353

Signed-off-by: Gao Binlong <gbinlong@amazon.com>

gaobinlong added Indexing Indexing, Bulk Indexing and anything related to indexing and removed Indexing & Search labels Aug 7, 2024

Merge remote-tracking branch 'upstream/main' into op_type

e9c8b32

github-actions bot added the Indexing & Search label Aug 7, 2024

opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Sep 17, 2024

opensearch-trigger-bot bot added the stalled Issues that have stalled label Oct 20, 2024

opensearch-trigger-bot bot removed the stalled Issues that have stalled label Oct 27, 2024
