Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulk insert with script behaves incorrectly #48670

Closed
AlexeyRaga opened this issue Oct 30, 2019 · 7 comments · Fixed by #49578
Closed

Bulk insert with script behaves incorrectly #48670

AlexeyRaga opened this issue Oct 30, 2019 · 7 comments · Fixed by #49578
Labels
>bug :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search.

Comments

@AlexeyRaga
Copy link

AlexeyRaga commented Oct 30, 2019

I am using an official ES docker container: docker.elastic.co/elasticsearch/elasticsearch:7.4.1

Elasticsearch version (bin/elasticsearch --version):
Version: 7.4.1, Build: default/docker/fc0eeb6e2c25915d63d871d344e3d0b45ea0ea1e/2019-10-22T17:16:35.176724Z, JVM: 13

I also tried 7.3.2 and 7.2.1, they all experience this issue.

Plugins installed: []

JVM version (java -version):
openjdk version "13" 2019-09-17
OpenJDK Runtime Environment AdoptOpenJDK (build 13+33)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 13+33, mixed mode, sharing)

OS version (uname -a if on a Unix-like system):
Linux fa18f15fe8f0 4.9.184-linuxkit #1 SMP Tue Jul 2 22:58:16 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

When using bulk upsert with a "painless" script, first document in a batch seems to be handled incorrectly.

Steps to reproduce:

Perform this bulk insert into an index that doesn't yet contain documents with these ids:

curl -X POST "localhost:9200/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "update" : { "_id" : "1", "_index" : "index3"} }
{ "script" : { "source": "ctx._source.counter += params.param1", "lang" : "painless", "params" : {"param1" : 2}}, "scripted_upsert": true, "upsert" : {"counter" : 1}}
{ "update" : { "_id" : "2", "_index" : "index3"} }
{ "script" : { "source": "ctx._source.counter += params.param1", "lang" : "painless", "params" : {"param1" : 2}}, "scripted_upsert": true, "upsert" : {"counter" : 1}}
'

In this example I am inserting the same thing twice, only _id value is different, so I expect two identical documents to be inserted into the ES.

Query the index:

$ curl -X GET localhost:9200/index3/_search

{
  "took": 547,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "index3",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "counter": 7
        }
      },
      {
        "_index": "index3",
        "_type": "_doc",
        "_id": "2",
        "_score": 1,
        "_source": {
          "counter": 3
        }
      }
    ]
  }
}

Note that the counter field for the first document is incorrect and is different from the counter field value from the second document.

3 is a correct value and 7 is not a correct value according to my expectations.

@jimczi jimczi added the :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache label Oct 30, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Scripting)

@jimczi jimczi added :Core/Infra/Core Core issues without another label >bug labels Oct 30, 2019
@rjernst
Copy link
Member

rjernst commented Nov 25, 2019

Thanks for simple reproduction instructions. I was able to reproduce this and track down the problem. A scripted upsert for the first doc in a bulk request may run many times (for me it ran only once, and thus I got 5 for the counter) depending on how long the underlying mapping update takes when needed. We have a loop in TransportShardBulkAction.performOnPrimary that tries executing the bulk action but returns early if an async mapping update was kicked off (as it was in my case, since I did not create the index or mappings ahead of time).

I'm moving this to the distrib team to determine the best fix. Seems like we need to either make the operation idempotent, so the potential subsequent runs start with a fresh doc again, or start and wait on mapping updates before even attempting to index the doc on the primary.

@rjernst rjernst added :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. and removed :Core/Infra/Core Core issues without another label :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache labels Nov 25, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/CRUD)

@ywelsch
Copy link
Contributor

ywelsch commented Nov 26, 2019

I've created a fix here: #49578

ywelsch added a commit that referenced this issue Nov 27, 2019
Fixes a bug where a scripted upsert that causes a dynamic mapping update is retried (because
mapping update is still in-flight), and the request is mutated multiple times.

Closes #48670
ywelsch added a commit that referenced this issue Nov 27, 2019
Fixes a bug where a scripted upsert that causes a dynamic mapping update is retried (because
mapping update is still in-flight), and the request is mutated multiple times.

Closes #48670
ywelsch added a commit that referenced this issue Nov 27, 2019
Fixes a bug where a scripted upsert that causes a dynamic mapping update is retried (because
mapping update is still in-flight), and the request is mutated multiple times.

Closes #48670
@jeejeeone
Copy link

This problem seems to be in 6.5 also. Is there a reliable workaround? Would storing painless script into elastic help? Anything else? Any help appreciated! @ywelsch @rjernst

@henningandersen
Copy link
Contributor

One possible workaround is to ensure the mappings are in place on the index before doing the update, avoiding the dynamic mapping update.

@jeejeeone
Copy link

@henningandersen Thanks! This was very helpful

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants