Ingest pipeline supports modifying the op_type parameter of an indexing request #15031

Open
wants to merge 6 commits into
base: main

gaobinlong
Collaborator

@gaobinlong gaobinlong commented Jul 31, 2024

Description

This PR adds a new metadata field, _op_type, to IngestDocument so that an ingest processor can modify the op_type parameter of an indexing request. This is useful when users change the target of their indexing requests from an ordinary index to a data stream. Because a data stream only supports an op_type of create, the following bulk request fails with the exception "only write ops with an op_type of create are allowed in data streams":

PUT _index_template/template_2
{
  "index_patterns": [
    "ds*"
  ],
  "data_stream":{

  },
  "template": {
    "settings": {
      "number_of_replicas": 0
    },
    "mappings": {
    }
  },
  "priority": 500
}

PUT _data_stream/ds1

PUT /ds1/_bulk?refresh
{"index":{ }}
{ "@timestamp": "2024-03-08T11:04:05.000Z", "foo":"bar" }

To make the request succeed, users have to change the op_type to create in the request body:

PUT /ds1/_bulk?refresh
{"create":{ }}
{ "@timestamp": "2024-03-08T11:04:05.000Z", "foo":"bar" }

The index API has the same issue.

So this PR gives users the option of setting up an ingest pipeline that changes the op_type parameter to create, which avoids changing the client code. The usage is:

PUT _ingest/pipeline/set_processor
{
  "processors": [
    {
      "set": {
        "field": "_op_type",
        "value": "create"
      }
    }
  ]
}

PUT ds1/_settings
{
  "index.default_pipeline":"set_processor"
}

PUT /ds1/_bulk?refresh
{"index":{ }}
{ "@timestamp": "2024-03-08T11:04:05.000Z", "foo":"bar" }
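
The effect of the pipeline above can be modeled in a few lines of Python. This is only an illustrative sketch, not the Java change in this PR: the function names, the dict shapes, and the validation step are all hypothetical and chosen for the example.

```python
# Hypothetical model of the proposed behavior; not OpenSearch code.
VALID_OP_TYPES = {"index", "create", "update", "delete"}


def run_set_processor(metadata, field, value):
    """Mimic a `set` processor: return a copy of the metadata with `field` set."""
    updated = dict(metadata)
    updated[field] = value
    return updated


def apply_pipeline(request):
    """Run the example pipeline and copy `_op_type` back onto the request."""
    # Expose the request's op_type to the pipeline as a metadata field.
    metadata = {"_index": request["index"], "_op_type": request["op_type"]}
    metadata = run_set_processor(metadata, "_op_type", "create")
    op_type = metadata["_op_type"]
    # Validation is an assumption of this sketch.
    if op_type not in VALID_OP_TYPES:
        raise ValueError(f"unsupported op_type: {op_type}")
    return {**request, "index": metadata["_index"], "op_type": op_type}


request = {"index": "ds1", "op_type": "index", "source": {"foo": "bar"}}
result = apply_pipeline(request)
print(result["op_type"])
```

The client still sends an index action; only the copy of the request that leaves the pipeline carries create.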

Related Issues

Resolves #2856.

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ng request

Signed-off-by: Gao Binlong <gbinlong@amazon.com>
Signed-off-by: Gao Binlong <gbinlong@amazon.com>
Contributor

❌ Gradle check result for 4025a02: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for 8f9f91a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Gao Binlong <gbinlong@amazon.com>
Signed-off-by: Gao Binlong <gbinlong@amazon.com>
Contributor

❕ Gradle check result for dde3be1: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.


codecov bot commented Jul 31, 2024

Codecov Report

Attention: Patch coverage is 95.45455% with 1 line in your changes missing coverage. Please review.

Project coverage is 71.81%. Comparing base (a918530) to head (e9c8b32).
Report is 393 commits behind head on main.

Files with missing lines Patch % Lines
...main/java/org/opensearch/ingest/IngestService.java 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #15031      +/-   ##
============================================
- Coverage     71.84%   71.81%   -0.04%     
+ Complexity    62911    62897      -14     
============================================
  Files          5176     5176              
  Lines        295133   295149      +16     
  Branches      42676    42680       +4     
============================================
- Hits         212029   211951      -78     
- Misses        65709    65754      +45     
- Partials      17395    17444      +49     


Member

@owaiskazi19 owaiskazi19 left a comment


Looks good overall with minor suggestions

@owaiskazi19
Member

@andrross another look?

@owaiskazi19
Member

@gaobinlong can you resolve the conflicts?

Signed-off-by: Gao Binlong <gbinlong@amazon.com>
Contributor

github-actions bot commented Aug 7, 2024

❌ Gradle check result for a8bd353: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@gaobinlong added the Indexing label (Indexing, Bulk Indexing and anything related to indexing) and removed the Indexing & Search label on Aug 7, 2024
Contributor

github-actions bot commented Aug 7, 2024

❕ Gradle check result for a8bd353: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testReadRange

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Contributor

github-actions bot commented Aug 7, 2024

✅ Gradle check result for e9c8b32: SUCCESS

@andrross
Member

andrross commented Aug 8, 2024

So this PR gives users the option of setting up an ingest pipeline that changes the op_type parameter to create, which avoids changing the client code.

@gaobinlong I'm not convinced this is a good idea. create has different semantics than the other operation types. In the general case you can't take a system that is ingesting documents with the index operation type and replace it with create and expect things to work because those operations do different things.
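
The semantic difference referred to here can be shown with a toy in-memory store: index adds or overwrites a document with the given ID, while create fails if that ID already exists. The sketch below is a simplified Python model with hypothetical names (`write`, `ConflictError`); it is not OpenSearch code.

```python
class ConflictError(Exception):
    """Stands in for the conflict returned when create hits an existing ID."""


def write(store, doc_id, source, op_type):
    """Apply a single write to a dict-backed store using index or create semantics."""
    if op_type == "create":
        if doc_id in store:
            raise ConflictError(f"document {doc_id} already exists")
        store[doc_id] = source
    elif op_type == "index":
        store[doc_id] = source  # add the document or overwrite the existing one
    else:
        raise ValueError(f"unsupported op_type: {op_type}")
    return store


store = {}
write(store, "1", {"foo": "bar"}, "index")
write(store, "1", {"foo": "baz"}, "index")  # second index overwrites the first

try:
    write(store, "1", {"foo": "qux"}, "create")
    outcome = "created"
except ConflictError:
    outcome = "conflict"

print(store["1"], outcome)
```

A client that relies on re-sending a document with the same ID to update it would start seeing conflicts if its index operations were silently turned into create.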

@gaobinlong
Collaborator Author

So this PR gives users the option of setting up an ingest pipeline that changes the op_type parameter to create, which avoids changing the client code.

@gaobinlong I'm not convinced this is a good idea. create has different semantics than the other operation types. In the general case you can't take a system that is ingesting documents with the index operation type and replace it with create and expect things to work because those operations do different things.

Thanks @andrross. This change doesn't target the general case. It targets users who want to write to a data stream without making any code or configuration change in the client. For example, if Logstash is used to write to a data stream in OpenSearch, the action setting must be set to create, as in the example here (https://opensearch.org/docs/latest/tools/logstash/ship-to-opensearch/):

output {
  opensearch {
    hosts => ["https://hostname:port"]
    auth_type => {
      type => 'basic'
      user => 'admin'
      password => 'admin'
    }
    index => "my-data-stream"
    action => "create"
  }
}

However, if the ingestion tool doesn't support the action parameter, users have no options. This change provides some flexibility: they can choose to configure the op_type on the client side or modify it on the server side. In addition, fields such as _index, _routing, if_seq_no and if_primary_term can already be modified by an ingest pipeline, so I think it makes sense to support modifying the op_type parameter during ingest pipeline execution as well.

@andrross
Member

if the ingestion tool doesn't support the action parameter, users have no options

Are there users in this situation asking for this feature?

The reason I'm hesitant is that while it does solve the above use case, it seems like it could let users really shoot themselves in the foot, in ways ranging from subtle (the different semantics of index vs create can cause hard-to-track-down errors in the system) to absurd (always set op_type to delete). @msfroh what do you think?

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot (bot) added and removed the stalled label (Issues that have stalled) on Sep 17, 2024
@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot (bot) added the stalled label (Issues that have stalled) on Oct 20, 2024
@msfroh
Collaborator

msfroh commented Oct 22, 2024

The reason I'm hesitant is that while it does solve the above use case, it seems like it could let users really shoot themselves in the foot,

In this case, given the other fields that can be overridden (e.g., if_seq_no and if_primary_term), I think users can already shoot themselves in the foot pretty hard by abusing metadata fields. That said, I can see how this might make it even easier.

I'm trying to think of alternative solutions. The obvious ones would be "Update the client to set the action" or "Set up a proxy that sets the action between the client and cluster". Of course, ingest pipelines are like a proxy that happens to run on the cluster. I'm wondering if it would make sense to add a dedicated processor for this task, rather than doing it through the set processor. With a dedicated processor, we could at least document the risks.
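
For comparison, the proxy alternative mentioned above boils down to rewriting the action lines of a bulk body before forwarding it. The following Python sketch shows only that rewriting step, under the assumption that the body is well-formed NDJSON; `rewrite_bulk_actions` is a hypothetical helper, and a real proxy would also need to handle HTTP forwarding, authentication, and malformed input.

```python
import json


def rewrite_bulk_actions(ndjson_body):
    """Rewrite every `index` action line in a bulk body to `create`."""
    lines = [line for line in ndjson_body.splitlines() if line.strip()]
    out = []
    i = 0
    while i < len(lines):
        action = json.loads(lines[i])
        op, meta = next(iter(action.items()))
        if op == "index":
            action = {"create": meta}
        out.append(json.dumps(action))
        i += 1
        # Every action except delete is followed by a source (or update) line.
        if op != "delete":
            out.append(lines[i])
            i += 1
    # The bulk API expects the body to end with a newline.
    return "\n".join(out) + "\n"


body = (
    '{"index":{ }}\n'
    '{ "@timestamp": "2024-03-08T11:04:05.000Z", "foo":"bar" }\n'
)
rewritten = rewrite_bulk_actions(body)
print(rewritten)
```

Like the ingest pipeline approach, this changes the operation without the client knowing about it, so the same caveats about index and create semantics apply.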

@andrross
Member

I'm trying to think of alternative solutions. The obvious ones would be "Update the client to set the action" or "Set up a proxy that sets the action between the client and cluster".

@msfroh I'm in favor of "Update the client to set the action" because index and create op_types have different semantics so you have to be sure your client will continue to behave properly if changing from one type to the other. It's better to do this on the client side versus silently changing it in the server or in a proxy layer. I'm happy to be convinced that I'm being too pedantic about this though. If we do decide to support this then I think I'm in favor of doing it as implemented in this PR versus creating a dedicated processor type (that seems overkill).

A third option would be to allow data streams to support the index op type but just silently behave as if create was specified. I think this is a bad idea and in my opinion highlights why changing the op type in an ingest processor is also a bad idea.

@msfroh
Collaborator

msfroh commented Oct 24, 2024

I think the real problem here may be that IndexRequest defaults to opType = OpType.INDEX.

It sounds like what we really need to do is leave it null by default, and then set the default based on the target. On an index, it should be INDEX, while on a data stream, the default should be CREATE. @gaobinlong -- it sounds like you're saying that the real issue is clients that don't set the op_type at all, right?

@andrross, @gaobinlong, what do you think of that option?
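
As a rough illustration of this proposal, the default could be resolved from the target only when the client did not set an op_type. The Python sketch below is hypothetical (the function name and signature are invented for the example) and is not how IndexRequest is implemented; the error message is the one quoted elsewhere in this thread.

```python
from typing import Optional


def resolve_op_type(requested: Optional[str], target_is_data_stream: bool) -> str:
    """Pick the effective op_type, defaulting by target when none was requested."""
    if requested is None:
        # No explicit op_type: data streams default to create, indexes to index.
        return "create" if target_is_data_stream else "index"
    if target_is_data_stream and requested != "create":
        raise ValueError(
            "only write ops with an op_type of create are allowed in data streams"
        )
    return requested


print(resolve_op_type(None, True))
print(resolve_op_type(None, False))
```

An explicit index on a data stream is still rejected in this model, so clients that always send index would not be helped by a target-based default alone.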

@gaobinlong
Collaborator Author

It sounds like what we really need to do is leave it null by default, and then set the default based on the target. On an index, it should be INDEX, while on a data stream, the default should be CREATE

Thanks @msfroh, your solution already exists. For the index API, we don't need to specify op_type: the default op_type is create for data streams and index for an ordinary index. Explicitly setting op_type=index on a data stream fails:

POST ds11/_doc?refresh&op_type=index
{
  "a":1,
  "@timestamp":"2024-07-08T09:28:48+00:00"
}

Response: 
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "only write ops with an op_type of create are allowed in data streams"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "only write ops with an op_type of create are allowed in data streams"
  },
  "status": 400
}

For the bulk API, op_type must be specified explicitly, so only create is allowed for data streams:

## works well
PUT /ds11/_bulk?refresh
{"create":{}}
{ "@timestamp": "2024-08-13T11:04:05.000Z", "foo":"bar" }

## does not work
PUT /ds11/_bulk?refresh
{"index":{}}
{ "@timestamp": "2024-08-13T11:04:05.000Z", "foo":"bar" }

Response:
{
  "took": 0,
  "errors": true,
  "items": [
    {
      "index": {
        "_index": "ds11",
        "_id": null,
        "status": 400,
        "error": {
          "type": "illegal_argument_exception",
          "reason": "only write ops with an op_type of create are allowed in data streams"
        }
      }
    }
  ]
}
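
Note that the bulk request itself succeeds at the HTTP level and the rejection is reported per item, so a client has to inspect the response body. A minimal Python sketch of that check, using the response shown above as input (`failed_items` is a hypothetical helper, not part of any client library):

```python
def failed_items(bulk_response):
    """Return (action, status, reason) for every failed item in a bulk response."""
    failures = []
    if not bulk_response.get("errors"):
        return failures
    for item in bulk_response.get("items", []):
        # Each item is keyed by the action that was attempted.
        for action, result in item.items():
            if "error" in result:
                failures.append((action, result["status"], result["error"]["reason"]))
    return failures


response = {
    "took": 0,
    "errors": True,
    "items": [
        {
            "index": {
                "_index": "ds11",
                "_id": None,
                "status": 400,
                "error": {
                    "type": "illegal_argument_exception",
                    "reason": "only write ops with an op_type of create are allowed in data streams",
                },
            }
        }
    ],
}

failures = failed_items(response)
print(failures)
```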

it sounds like you're saying that the real issue is clients that don't set the op_type at all, right?

@msfroh @andrross, I've checked Logstash, data-prepper, fluent-bit and java-client. All of them currently support specifying the op_type parameter, but older versions, from before users reported this problem, may not: opensearch-project/data-prepper#2038 (comment). While checking these tools, I found that most of them treat an ordinary index and a data stream as different things, so op_type must be set explicitly for data streams. This could be improved by having these tools default op_type to create for data streams. On the server side, though, maybe we can also provide a workaround for users who don't want to maintain different client-side configuration for data streams and ordinary indexes? As the user said in the issue: "Alternatively I could not use a data stream or change the client (which I don't want to do in this case)."

@opensearch-trigger-bot (bot) removed the stalled label (Issues that have stalled) on Oct 27, 2024
Labels
  • backport 2.x: Backport to 2.x branch
  • enhancement: Enhancement or improvement to existing feature or request
  • Indexing & Search
  • Indexing: Indexing, Bulk Indexing and anything related to indexing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Changing op_type in ingest pipeline in case of _bulk operation
4 participants