OpenDistro Security Parallel Writes #206

timcosta · 2021-07-23T19:58:37Z

Hello! I've been debugging an issue I run into sometimes with the AWS managed ES where multiple concurrent operations result in 500 errors. I reproduced it with debug logging enabled, and saw the following in the output:

2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.853-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Failed to create OpenDistroKibanaTenant: error creating tenant: elastic: Error 500 (Internal Server Error): [] 
2021-07-23T14:26:07.853-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Failed to create OpenDistroKibanaTenant: error creating tenant: elastic: Error 500 (Internal Server Error): []

I've opened a support case with AWS about this as I don't think 8 concurrent pings should result in 500s, but I was a little surprised to see that the provider creates a new ES Client for every operation, which triggers a ping for every single operation.

I'm considering making a PR to refactor this to either cache the client or create it earlier in the process, but am wondering if this is a purposeful design choice for a reason I'm not aware of. Again, I don't think it's a problem with the provider implementation that's causing the 500s which is why I opened the case with AWS, but wanted to see if optimizing this a bit was desirable as it would speed up a lot of operations theoretically.

phillbaker · 2021-08-02T02:25:27Z

Hello - sorry to hear this is causing issues, this is a consequence of #119, which intended to make the client instantiation lazy.

I believe it could be cached, but we'd need to make sure that (a) the caching is concurrency safe (as you noticed in #186, this was a failed attempt to cache the client), and (b) it's cached dependent on the config settings so that if the config changes, the client will re-instantiate.

timcosta · 2021-08-02T13:09:53Z

Interesting, okay thank you for that information. It absolutely makes sense why it's deferred, and it makes sense that #186 was a concurrency issue caused by this.

For some extra context, I've heard back again on the AWS case that I opened about the symptom we were seeing that was caused by this and they had a couple of extra thoughts.

The fact that the error we are seeing is Error: error creating tenant: elastic: Error 500 (Internal Server Error): [] seems to mean the ping isn't what is failing, so the original premise of this issue is likely wrong. That error would have to come from

terraform-provider-elasticsearch/es/resource_elasticsearch_opendistro_kibana_tenant.go

Line 218 in c28d2de

return response, fmt.Errorf("error creating tenant: %+v: %+v", err, body)

. If it was a client fetching issue, L202 would short circuit before reaching L218. It appears that the PUT to create a tenant is returning a 500 and empty array rather than a 200 and whatever body would be expected.

How do you feel about adding retry logic to

terraform-provider-elasticsearch/es/resource_elasticsearch_opendistro_kibana_tenant.go

Line 210 in c28d2de

Body: string(tenantJSON),

that would match the retry logic in

terraform-provider-elasticsearch/es/resource_elasticsearch_opendistro_role.go

Line 316 in c81621a

Retrier: elastic7.NewBackoffRetrier(

and

terraform-provider-elasticsearch/es/resource_elasticsearch_opendistro_user.go

Line 224 in c28d2de

Retrier: elastic7.NewBackoffRetrier(

? This feels reasonable given the recent addition of that retry logic, doesnt require the client cache, and might solve these 500 issues by just retrying?

timcosta · 2021-08-03T13:22:02Z

Okay, I've updated the title of this issue based on another round of chats with AWS support.

The OpenDistro Security module maintains a single index for storing the configurations of all tenants, users, roles and roles mappings, and concurrent writes to this index are disallowed in order to maintain idempotency. They offer a bulk write API, but simultaneous calls to the single-write endpoints are apparently error prone.

I can't find this documented everywhere, but I think it's probably super helpful information to you @phillbaker when it comes to this provider and interacting with OpenDistro Security.

phillbaker · 2021-08-04T02:01:35Z

Hi @timcosta thanks for the updates. It looks like you got to the answer, but this seems very similar to #154 which was confusing because upstream OpenDistro returns a 500 instead of the canonical 409 status (issue upstream about this: opensearch-project/security#1095).

I'd recommend confirming that this is the issue by passing -parallelism=1 to terraform which should have no issues. If this is the issue, then the above retries and example code you saw are definitely a good model to apply to the kibana tenant resource, a PR for that would be great.

timcosta changed the title ~~Elasticsearch client isn't reused~~ OpenDistro Security Parallel Writes Aug 3, 2021

phillbaker closed this as completed in 807047b Aug 9, 2021

phillbaker mentioned this issue Aug 12, 2021

[BUG] Return HTTP 409 on version confict when creating & updating tenants opensearch-project/security#1402

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OpenDistro Security Parallel Writes #206

OpenDistro Security Parallel Writes #206

timcosta commented Jul 23, 2021

phillbaker commented Aug 2, 2021

timcosta commented Aug 2, 2021

timcosta commented Aug 3, 2021

phillbaker commented Aug 4, 2021

OpenDistro Security Parallel Writes #206

OpenDistro Security Parallel Writes #206

Comments

timcosta commented Jul 23, 2021

phillbaker commented Aug 2, 2021

timcosta commented Aug 2, 2021

timcosta commented Aug 3, 2021

phillbaker commented Aug 4, 2021