Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenDistro Security Parallel Writes #206

Closed
timcosta opened this issue Jul 23, 2021 · 4 comments
Closed

OpenDistro Security Parallel Writes #206

timcosta opened this issue Jul 23, 2021 · 4 comments

Comments

@timcosta
Copy link

Hello! I've been debugging an issue I run into sometimes with the AWS managed ES where multiple concurrent operations result in 500 errors. I reproduced it with debug logging enabled, and saw the following in the output:

2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.329-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Pinging url to determine version https://vpc-redacted.eu-west-1.es.amazonaws.com 
2021-07-23T14:26:07.853-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Failed to create OpenDistroKibanaTenant: error creating tenant: elastic: Error 500 (Internal Server Error): [] 
2021-07-23T14:26:07.853-0500 [DEBUG] provider.terraform-provider-elasticsearch_v1.6.0: 2021/07/23 14:26:07 [INFO] Failed to create OpenDistroKibanaTenant: error creating tenant: elastic: Error 500 (Internal Server Error): [] 

I've opened a support case with AWS about this as I don't think 8 concurrent pings should result in 500s, but I was a little surprised to see that the provider creates a new ES Client for every operation, which triggers a ping for every single operation.

I'm considering making a PR to refactor this to either cache the client or create it earlier in the process, but am wondering if this is a purposeful design choice for a reason I'm not aware of. Again, I don't think it's a problem with the provider implementation that's causing the 500s which is why I opened the case with AWS, but wanted to see if optimizing this a bit was desirable as it would speed up a lot of operations theoretically.

@phillbaker
Copy link
Owner

Hello - sorry to hear this is causing issues, this is a consequence of #119, which intended to make the client instantiation lazy.

I believe it could be cached, but we'd need to make sure that (a) the caching is concurrency safe (as you noticed in #186, this was a failed attempt to cache the client), and (b) it's cached dependent on the config settings so that if the config changes, the client will re-instantiate.

@timcosta
Copy link
Author

timcosta commented Aug 2, 2021

Interesting, okay thank you for that information. It absolutely makes sense why it's deferred, and it makes sense that #186 was a concurrency issue caused by this.

For some extra context, I've heard back again on the AWS case that I opened about the symptom we were seeing that was caused by this and they had a couple of extra thoughts.

The fact that the error we are seeing is Error: error creating tenant: elastic: Error 500 (Internal Server Error): [] seems to mean the ping isn't what is failing, so the original premise of this issue is likely wrong. That error would have to come from

return response, fmt.Errorf("error creating tenant: %+v: %+v", err, body)
. If it was a client fetching issue, L202 would short circuit before reaching L218. It appears that the PUT to create a tenant is returning a 500 and empty array rather than a 200 and whatever body would be expected.

How do you feel about adding retry logic to

that would match the retry logic in and ? This feels reasonable given the recent addition of that retry logic, doesnt require the client cache, and might solve these 500 issues by just retrying?

@timcosta timcosta changed the title Elasticsearch client isn't reused OpenDistro Security Parallel Writes Aug 3, 2021
@timcosta
Copy link
Author

timcosta commented Aug 3, 2021

Okay, I've updated the title of this issue based on another round of chats with AWS support.

The OpenDistro Security module maintains a single index for storing the configurations of all tenants, users, roles and roles mappings, and concurrent writes to this index are disallowed in order to maintain idempotency. They offer a bulk write API, but simultaneous calls to the single-write endpoints are apparently error prone.

I can't find this documented everywhere, but I think it's probably super helpful information to you @phillbaker when it comes to this provider and interacting with OpenDistro Security.

@phillbaker
Copy link
Owner

Hi @timcosta thanks for the updates. It looks like you got to the answer, but this seems very similar to #154 which was confusing because upstream OpenDistro returns a 500 instead of the canonical 409 status (issue upstream about this: opensearch-project/security#1095).

I'd recommend confirming that this is the issue by passing -parallelism=1 to terraform which should have no issues. If this is the issue, then the above retries and example code you saw are definitely a good model to apply to the kibana tenant resource, a PR for that would be great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants