Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow in-place upgrades of ES domains #6243

Merged
merged 4 commits into from
Jan 4, 2019

Conversation

tomelliff
Copy link
Contributor

@tomelliff tomelliff commented Oct 23, 2018

AWS has announced support for in place upgrades of some version of ES domains.

Currently silently passes without making any changes with an invalid upgrade path.
A refresh will show it with the correct, unaltered version on the next plan/apply.
For the list of all valid upgrade paths see https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-version-migration.html

Fixes #5554

Results from running the included acceptance test:

$ make testacc TESTARGS="-run=TestAccAWSElasticSearchDomain_update_version"
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./... -v -parallel 20 -run=TestAccAWSElasticSearchDomain_update_version -timeout 120m
?   	github.com/terraform-providers/terraform-provider-aws	[no test files]
=== RUN   TestAccAWSElasticSearchDomain_update_version
--- PASS: TestAccAWSElasticSearchDomain_update_version (3688.54s)
PASS
ok  	github.com/terraform-providers/terraform-provider-aws/aws	3688.554s

So very slow 😢

@ghost ghost added size/S Managed by automation to categorize the size of a PR. service/elasticsearch Issues and PRs that pertain to the elasticsearch service. tests PRs: expanded test coverage. Issues: expanded coverage, enhancements to test infrastructure. labels Oct 23, 2018
@bflad bflad added enhancement Requests to existing resources that expand the functionality or scope. waiting-response Maintainers are waiting on response from community or contributor. labels Oct 25, 2018
@ghost ghost added size/M Managed by automation to categorize the size of a PR. and removed size/S Managed by automation to categorize the size of a PR. labels Oct 26, 2018
@tomelliff tomelliff changed the title WIP: Allow in place upgrades of ES domains Allow in-place upgrades of ES domains Oct 26, 2018
@tomelliff
Copy link
Contributor Author

I think this is probably fine for a first pass but longer term there needs to be a configurable timeout block for Elasticsearch domains as I'm wary that upgrades of large ES domains could take a fair bit longer than 60 minutes and we should also think about validating the upgrade path using PerformCheckOnly or GetCompatibleElasticsearchVersions instead of just blindly accepting anything and then silently ignoring it.

Not sure if this needs documenting anywhere but if so we should probably at least link to the valid upgrade paths documentation.

AWS now allows in place upgrades of some version of ES domains.

Currently silently passes without making any changes with an invalid upgrade path.
For the list of all valid upgrade paths see https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-version-migration.html
@rifelpet
Copy link
Contributor

I was messing around with GetCompatbileElasticsearchVersions to determine at plan time whether a version change should ForceNew or not. It is untested but it may be worth integrating into this PR. Thanks for working on it!

rifelpet@a94b46e

@tomelliff
Copy link
Contributor Author

tomelliff commented Oct 26, 2018

I'm not sure it's necessary to be have the migration path detection initially but it would be a welcome addition to this PR. Unfortunately after cherry picking your change across onto my work it doesn't seem to work and the plan is always showing it can upgrade in place:

$ aws es get-compatible-elasticsearch-versions --domain-name tf-test-2
{
    "CompatibleElasticsearchVersions": [
        {
            "SourceVersion": "5.5", 
            "TargetVersions": [
                "5.6"
            ]
        }
    ]
}
$ TF_VAR_version=6.2 terraform plan
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.

aws_elasticsearch_domain.example2: Refreshing state... (ID: arn:aws:es:eu-west-1:340242599322:domain/tf-test-2)

------------------------------------------------------------------------

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  ~ aws_elasticsearch_domain.example2
      elasticsearch_version: "5.5" => "6.2"


Plan: 0 to add, 1 to change, 0 to destroy.

------------------------------------------------------------------------

Note: You didn't specify an "-out" parameter to save this plan, so Terraform
can't guarantee that exactly these actions will be performed if
"terraform apply" is subsequently run.

I'm finished for the day now so can't take a longer look until probably next week if you're still stuck but if you do get it working I'd be more than happy to cherry pick your commits on to this change.

@rifelpet
Copy link
Contributor

rifelpet commented Oct 26, 2018

I just updated that commit based on your feedback and forked your branch and it appears to be working as expected. https://github.com/rifelpet/terraform-provider-aws/tree/es_upgrade

  ~ aws_elasticsearch_domain.valid_in_place
      elasticsearch_version: "5.5" => "5.6"
-/+ aws_elasticsearch_domain.force_new (new resource required)
      id:                    "arn:aws:es:us-east-1:0000000000:domain/abc123" => <computed> (forces new resource)
      elasticsearch_version: "5.5" => "6.3" (forces new resource)
-/+ aws_elasticsearch_domain.invalid_version (new resource required)
      id:                    "arn:aws:es:us-east-1:0000000000:domain/abcdefg" => <computed> (forces new resource)
      elasticsearch_version: "6.3" => "6.4" (forces new resource)

Agreed that it would be a nice to have and not necessary for the initial implementation

@tomelliff
Copy link
Contributor Author

tomelliff commented Oct 29, 2018

Awesome, thanks @rifelpet, had a quick play with the branch and it seems to work now. I've left some feedback on your commit rifelpet@0ad0cc7 that it would be cool to see if you agree with it or not. I think we can probably add your changes on to this branch as they do simplify things a bit and make it a more obvious way to look at upgrade paths from a user perspective.

I'd also like to think about the acceptance test(s) a little bit more and if we can test that it actually does do in place/not in place upgrades for expected paths as right now all it proves is that the version does in fact change but doesn't tell you if it did an in place upgrade or destroyed and rebuilt the ES domain.

@bflad is the acceptance test issue mentioned above something you'd want to see added before this is merged? Without spending much time thinking about it I'm not sure what the best way is to detect this right now but can take a longer look at it or if you happen to know of a good way to do that then a pointer in the right direction would be much appreciated. I know the aws_instance resource has a similar test but that's made easier by EC2 instances having a unique ID that would change on each rebuild (not yet sure if the endpoint on the ES domain changes on rebuild or if it's like RDS and the "random looking" part of the endpoint is actually seeded by account ID and region or something).

@bflad
Copy link
Contributor

bflad commented Oct 29, 2018

@tomelliff generally by preference, we can check if:

  • ID changed (e.g. testAccCheckAwsSubnetNotRecreated())
  • Created/updated timestamp changed (e.g. testAccCheckAWSEc2FleetNotRecreated())
  • Some other unique, computed attribute changed

For Elasticsearch domains, it looks like DomainId might be a good candidate: https://docs.aws.amazon.com/sdk-for-go/api/service/elasticsearchservice/#ElasticsearchDomainStatus

Here's an example implementation of one:

https://github.com/terraform-providers/terraform-provider-aws/blob/5aaf64f80342d36f55185b778a372bf50175813d/aws/resource_aws_ec2_fleet_test.go#L1105-L1113

@tomelliff
Copy link
Contributor Author

DomainId is sadly just the account ID and the domain name concatenated together with a / separator so that won't work. I just checked and it looks like the endpoint also comes back as the same when recreating the ES domain just like RDS does too 😢

I can't see anything that exposes anything unique in the calls we're making right now but we could look at the DescribeElasticsearchDomainConfig call which has a CreationDate under out.DomainConfig.ElasticsearchClusterConfig.Status.

Want to block this until I've had a chance to work through necessary changes for the tests?

@rifelpet
Copy link
Contributor

@tomelliff I addressed most of your feedback in a new commit to my branch. I can squash it when we're ready. Regarding returning an error if the GetCompatibleElasticsearchVersions call fails, I'm not sure we can do that. The function signature doesn't return an error type, and none of the other instances of customdiff.ForceNewIf in the provider use any error handling. I wasnt sure whether in that case it should default to forcing a new resource or not, or if theres a different function we can use that does return an error.

Regarding the GetCompatibleElasticsearchVersions response:

If a version has no compatible upgrades, it returns an empty array. This is a 2.3 cluster:

{
    "CompatibleElasticsearchVersions": []
}

If it does have compatible upgrades, the response looks like:

{
    "CompatibleElasticsearchVersions": [
        {
            "SourceVersion": "5.5",
            "TargetVersions": [
                "5.6"
            ]
        }
    ]
}

I believe CompatibleElasticsearchVersions is a list because the DomainName parameter is optional which results in returning a list of all source versions and their possible target versions. Since we're always specifying a specific DomainName, I believe the CompatibleElasticsearchVersions list will always have a length 0 or 1, though no documentation confirms this. The for loop in the ForceNewIffunction should handle possible future scenarios where a SourceVersion can have many TargetVersions.

@tomelliff
Copy link
Contributor Author

Derp, misread the thing you were length checking, that's fine, sorry.

Changes look good to me if you want to squash and I'll cherry pick those changes on to this branch and actually push it up to my fork.

@rifelpet
Copy link
Contributor

I squashed my branch, feel free to cherry pick the commit.

@tomelliff
Copy link
Contributor Author

@rifelpet done, thanks for that, really nice addition.

I'll take a look at the acceptance test(s) to test when we do/don't delete the cluster when updating the version when I get a chance later this week.

@rifelpet
Copy link
Contributor

Do we know how terraform will behave if an upgrade takes longer than the 1h timeout? Will it try to re-upgrade, which will presumably fail? There is a GetUpgradeStatus API call to get the intermediate state of the cluster, maybe we can utilize that somehow?

Most of the clusters I manage with terraform are over a terrabyte of data so I think I'll be hitting the timeout but I wont be able to test it until this gets merged and released.

@tomelliff
Copy link
Contributor Author

I don't have a good way to test that without writing a ton of useless data to an index that I'm happy to murder during testing but if you could snap an existing large cluster and restore it into another domain that you are happy to play with that would be good.

I think Terraform will time out and throw an error. If the user then attempts to run an upgrade while the existing upgrade is in process then the AWS API will throw an error. If they wait until the upgrade is complete then the refresh will then show nothing to do on the plan.

Probably worth checking that assumption though and yeah we do need to think about configurable timeouts but I'd rather leave that to a separate, smaller PR for now to minimise the change scope.

@rifelpet
Copy link
Contributor

@bflad what work is remaining to get this merged?

@ghost ghost removed the waiting-response Maintainers are waiting on response from community or contributor. label Dec 12, 2018
@tomelliff
Copy link
Contributor Author

I need to make the acceptance test actually verify we didn't delete the ES cluster when upgrading. It requires a bit of rework of how we're testing things and unfortunately I've only had time to contribute to open source projects on my commute where I don't have a stable enough internet connection to be building and deleting ES clusters because they take so damn long to apply and destroy.

Hopefully I'll get some time over the Christmas break or possibly even this weekend.

@tomelliff
Copy link
Contributor Author

Acceptance test after most recent commits:

make testacc TESTARGS="-run=TestAccAWSElasticSearchDomain_update_version"
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./... -v -parallel 20 -run=TestAccAWSElasticSearchDomain_update_version -timeout 120m
?   	github.com/terraform-providers/terraform-provider-aws	[no test files]
=== RUN   TestAccAWSElasticSearchDomain_update_version
=== PAUSE TestAccAWSElasticSearchDomain_update_version
=== CONT  TestAccAWSElasticSearchDomain_update_version
--- PASS: TestAccAWSElasticSearchDomain_update_version (3798.75s)
PASS
ok  	github.com/terraform-providers/terraform-provider-aws/aws	3798.758s

…supported upgrade path

Not currently checking that unsupported upgrade paths leads to them being recreated.

Unfortunately we can't just switch all of the tests and helpers over to using the ElasticSearchDomainConfig because the testAccLoadESTags helper needs the ARN which isn't available or easily retrievable when using the ElasticSearchDomainConfig struct.
This PR was originally created before the move to parallel acceptance tests so needed fixing up.
@tomelliff
Copy link
Contributor Author

Sigh. Just realised I was comparing the same created time to itself by making a copy/paste error.

Force pushed fix back over the top and rerunning the insanely long acceptance test now.

@tomelliff
Copy link
Contributor Author

Acceptance test completed fine:

make testacc TESTARGS="-run=TestAccAWSElasticSearchDomain_update_version"
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./... -v -parallel 20 -run=TestAccAWSElasticSearchDomain_update_version -timeout 120m
?   	github.com/terraform-providers/terraform-provider-aws	[no test files]
=== RUN   TestAccAWSElasticSearchDomain_update_version
=== PAUSE TestAccAWSElasticSearchDomain_update_version
=== CONT  TestAccAWSElasticSearchDomain_update_version
--- PASS: TestAccAWSElasticSearchDomain_update_version (3789.38s)
PASS
ok  	github.com/terraform-providers/terraform-provider-aws/aws	3789.394s

Copy link
Contributor

@bflad bflad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @tomelliff and @rifelpet! LGTM 🚀

--- PASS: TestAccAWSElasticSearchDomain_encrypt_at_rest_default_key (753.89s)
--- PASS: TestAccAWSElasticSearchDomain_importBasic (762.44s)
--- PASS: TestAccAWSElasticSearchDomain_LogPublishingOptions (819.86s)
--- PASS: TestAccAWSElasticSearchDomain_policy (825.95s)
--- PASS: TestAccAWSElasticSearchDomain_v23 (868.39s)
--- PASS: TestAccAWSElasticSearchDomain_complex (987.62s)
--- PASS: TestAccAWSElasticSearchDomain_encrypt_at_rest_specify_key (1144.27s)
--- PASS: TestAccAWSElasticSearchDomain_vpc (1226.03s)
--- PASS: TestAccAWSElasticSearchDomain_NodeToNodeEncryption (1317.03s)
--- PASS: TestAccAWSElasticSearchDomain_duplicate (1478.16s)
--- PASS: TestAccAWSElasticSearchDomain_basic (1617.45s)
--- PASS: TestAccAWSElasticSearchDomain_tags (1671.03s)
--- PASS: TestAccAWSElasticSearchDomain_internetToVpcEndpoint (2259.24s)
--- PASS: TestAccAWSElasticSearchDomain_CognitoOptionsCreateAndRemove (2278.74s)
--- PASS: TestAccAWSElasticSearchDomain_update (2343.72s)
--- PASS: TestAccAWSElasticSearchDomain_vpc_update (2650.67s)
--- PASS: TestAccAWSElasticSearchDomain_CognitoOptionsUpdate (2894.05s)
--- PASS: TestAccAWSElasticSearchDomain_update_volume_type (3670.57s)
--- PASS: TestAccAWSElasticSearchDomain_update_version (3812.46s)
--- PASS: TestAccAWSElasticSearchDomain_withDedicatedMaster (3842.67s)

@bflad bflad added this to the v1.55.0 milestone Jan 4, 2019
@bflad bflad merged commit 49bb0ae into hashicorp:master Jan 4, 2019
bflad added a commit that referenced this pull request Jan 4, 2019
@bflad
Copy link
Contributor

bflad commented Jan 10, 2019

This has been released in version 1.55.0 of the AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

@ghost
Copy link

ghost commented Apr 1, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Apr 1, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement Requests to existing resources that expand the functionality or scope. service/elasticsearch Issues and PRs that pertain to the elasticsearch service. size/M Managed by automation to categorize the size of a PR. tests PRs: expanded test coverage. Issues: expanded coverage, enhancements to test infrastructure.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for Elasticsearch in place upgrades
3 participants