
Multi-endpoint design for cloudstack provider #2399

Merged: 18 commits into aws:main on Jun 17, 2022

Conversation

@maxdrib (Contributor) commented Jun 13, 2022

Issue #, if available:
#2406

Description of changes:
This design proposes an approach for refactoring the CloudStack provider to support creating eks-a clusters across multiple CloudStack endpoints with different sets of credentials.

Testing (if applicable):

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@eks-distro-bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) on Jun 13, 2022
@maxdrib requested a review from wanyufe on Jun 13, 2022 at 20:19
@codecov (bot) commented Jun 13, 2022

Codecov Report

Merging #2399 (8445b50) into main (dc30f16) will increase coverage by 0.71%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #2399      +/-   ##
==========================================
+ Coverage   56.07%   56.78%   +0.71%     
==========================================
  Files         303      306       +3     
  Lines       24645    24937     +292     
==========================================
+ Hits        13819    14160     +341     
+ Misses       9533     9462      -71     
- Partials     1293     1315      +22     
| Impacted Files | Coverage Δ |
|---|---|
| pkg/providers/tinkerbell/assert.go | 77.77% <0.00%> (-17.82%) ⬇️ |
| pkg/providers/tinkerbell/validate.go | 78.51% <0.00%> (-8.99%) ⬇️ |
| pkg/providers/tinkerbell/template.go | 28.36% <0.00%> (-4.10%) ⬇️ |
| pkg/providers/tinkerbell/tinkerbell.go | 34.93% <0.00%> (-3.43%) ⬇️ |
| .../api/v1alpha1/tinkerbelltemplateconfig_defaults.go | 99.00% <0.00%> (-1.00%) ⬇️ |
| pkg/api/v1alpha1/tinkerbellmachineconfig_types.go | 3.07% <0.00%> (-0.32%) ⬇️ |
| pkg/providers/tinkerbell/stack/stack.go | 82.03% <0.00%> (-0.09%) ⬇️ |
| pkg/task/task.go | 87.50% <0.00%> (ø) |
| pkg/providers/snow/fetch.go | 100.00% <0.00%> (ø) |
| pkg/api/v1alpha1/cluster_types.go | 74.42% <0.00%> (ø) |
| ... and 24 more | |

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@maxdrib changed the title from "Initial draft of multi-endpoint design for cloudstack" to "Multi-endpoint design for cloudstack provider" on Jun 13, 2022
//
Domain string `json:"domain"`
// Zones is a list of one or more zones that are managed by a single CloudStack management endpoint.
Zone CloudStackZone `json:"zone"`
Member

shouldn't this be `Zones []CloudstackZone`? Each domain can have multiple zones, right?

Contributor Author

That's a good question. I believe not. Each failure domain can only have a single zone. If a customer wants a cluster spread across multiple zones, those would be two separate Failure Domains

Contributor

Yes, exactly. The customer ask is to enhance the CAPC Failure Domain, which is currently equivalent to an ACS Zone, to be Endpoint+Zone+Domain+Account.
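For concreteness, here is a minimal Go sketch of what such an Endpoint+Zone+Domain+Account failure domain could look like. The type and field names are illustrative assumptions, not the final API; `CloudStackZone` is the zone type already defined in the provider.

```go
// FailureDomain pairs a single ACS zone with the endpoint, domain, and
// account used to reach it. Illustrative sketch only; names are assumptions,
// not the final API.
type FailureDomain struct {
	// Name identifies this failure domain within the datacenter config.
	Name string `json:"name"`
	// Zone is the single ACS zone this failure domain maps to.
	Zone CloudStackZone `json:"zone"`
	// Domain and Account scope the resources created in the zone.
	Domain  string `json:"domain"`
	Account string `json:"account,omitempty"`
	// ManagementApiEndpoint is the URL of the ACS management server that owns this zone.
	ManagementApiEndpoint string `json:"managementApiEndpoint"`
}
```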

```
ManagementApiEndpoint string `json:"managementApiEndpoint"`
```

We would instead propose to remove all the existing attributes and simply include a list of FailureDomain objects, where each FailureDomain object looks like
Member

Would it make sense to treat each FailureDomain as a datacenter? If so, is the reason we don't have multiple CloudstackDatacenterConfig objects purely that the engineering effort is too high?

Contributor Author

This topic is discussed below in "Other approaches considered". There are many instances in the EKS-A codebase where only a single datacenterconfig object is expected to be present, and decisions are based on the Kind of that one datacenterconfig object. To refactor the entire code base only for the benefit of a single provider feels wasteful of engineering resources and more likely to introduce bugs.

With the current approach, we already effectively have multiple failure domains under a CloudstackDatacenterConfig object, only they're limited to the "zone" dimension. We want to expand that to also cover the account, domain, and endpoint dimensions.

Contributor

By way of comparison, CAPV supports CAPI Failure Domains via node labeling (see Node Zones/Regions Topology). Nonetheless, EKS-A expects those zones/regions to be managed through a single Datacenter.

```
ManagementApiEndpoint string `json:"managementApiEndpoint"`
```

and we would parse these resources and pass them into CAPC by modifying the templates we have currently implemented. We can then use this new model to read in credentials, perform pre-flight checks, plumb data to CAPC, and support upgrades in the controller. The goal would be to make these new resources backwards compatible via code
Member

curious how to map this failureDomain list to CAPC? Can we maybe have some examples here?

Contributor Author

Currently, we pass data to CAPC for zones in https://github.com/aws/eks-anywhere/blob/main/pkg/providers/cloudstack/config/template-cp.yaml#L38. We will do the same thing for Failure Domains. A failure domain should be the same as a zone is now, but also include an account/domain and a URL for the management host.

### Cloudstack credentials


In a multi-endpoint Cloudstack cluster, each endpoint may have its own credentials. We propose that Cloudstack credentials will be passed in via environment variable in the same way as they are currently,
Member

What about having the user specify a file path to the .ini and pass it as an ENV var?

Contributor Author

I think we could do that as well if you prefer. Although it might require some changes in the e2e test framework. I think there we're using the encoded value instead of any specific file. I'm not sure how/if we could download the file locally and then use it in the e2e tests but it does seem easier for the customer.

@vivek-koppuru any thoughts on this? I know we've discussed how to have customers pass data in at length.

Member

Both options could work right? What are we doing for snow again?

Member

For snow, we pass in the file path as an ENV var.

Contributor Author

From some internal discussions, it feels like storing credentials in a transient environment variable is more secure than writing them to a file on disk. That is why we prefer the env var route, although it is slightly more effort for the customer.
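As a rough illustration of the env var route, here is a sketch that assumes the credentials keep arriving as a base64-encoded ini blob, with one ini section per management endpoint. The environment variable name, section layout, and key names are assumptions for illustration, not the finalized contract.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"os"

	ini "gopkg.in/ini.v1"
)

// CloudStackCredentials holds the keys needed to talk to one management endpoint.
type CloudStackCredentials struct {
	APIKey    string
	SecretKey string
	APIURL    string
}

// parseCredentials decodes a base64-encoded ini blob where each section names a
// credential profile (one per management endpoint). The ini layout is an
// assumption for illustration.
func parseCredentials(b64 string) (map[string]CloudStackCredentials, error) {
	raw, err := base64.StdEncoding.DecodeString(b64)
	if err != nil {
		return nil, fmt.Errorf("decoding credentials: %v", err)
	}
	cfg, err := ini.Load(raw)
	if err != nil {
		return nil, fmt.Errorf("parsing ini: %v", err)
	}
	creds := map[string]CloudStackCredentials{}
	for _, section := range cfg.Sections() {
		if section.Name() == ini.DefaultSection {
			continue // skip the implicit DEFAULT section
		}
		creds[section.Name()] = CloudStackCredentials{
			APIKey:    section.Key("api-key").String(),
			SecretKey: section.Key("secret-key").String(),
			APIURL:    section.Key("api-url").String(),
		}
	}
	return creds, nil
}

func main() {
	// Hypothetical variable name; the CLI would read whatever name is agreed on.
	creds, err := parseCredentials(os.Getenv("EKSA_CLOUDSTACK_B64ENCODED_SECRET"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for profile, c := range creds {
		fmt.Printf("profile %s -> endpoint %s\n", profile, c.APIURL)
	}
}
```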

```
ManagementApiEndpoint string `json:"managementApiEndpoint"`
```

We would instead propose to remove all the existing attributes and simply include a list of FailureDomain objects, where each FailureDomain object looks like
Member

These types of changes are breaking changes. Especially if we don't have a feature flag for Cloudstack, what about introducing a new object for this use case?

Member

What would this list look like here?

Contributor Author

> These types of changes are breaking changes. Especially if we don't have a feature flag for Cloudstack, what about introducing a new object for this use case?

We do have a feature flag for Cloudstack currently. What sort of new object would you introduce?

> What would this list look like here?

Currently the CloudstackDatacenterConfig object contains account, domain, zone, managementApiEndpoint, and zones. Zones are a proxy for the old concept of failure domains. Instead, we propose that the CloudstackDatacenterConfig contain just a single entry: a list of FailureDomain objects. Each FailureDomain is then independent and contains an account, domain, zone, and managementApiEndpoint.
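Sketched in Go, the trimmed-down spec could look roughly like this; names are illustrative assumptions, not the actual definitions.

```go
// Illustrative sketch: the datacenter spec shrinks to a list of failure
// domains, and the per-endpoint details move into each entry. Names are
// assumptions, not the final API.
type CloudStackDatacenterConfigSpec struct {
	// FailureDomains replaces the old account, domain, zones, and
	// managementApiEndpoint fields; each entry is independent.
	FailureDomains []FailureDomain `json:"failureDomains"`
}
```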

Contributor Author

Check out https://github.com/aws/eks-anywhere/pull/2412/files for the new type definitions. We'll also have a transformer to convert the old attributes to the new ones, under AvailabilityZone.

1. Make all the fields optional and see if customers have the old fields set or the new ones
2. Introduce an eks-a version bump with conversion webhooks

Between these two approaches, I would take the first and then deprecate the legacy fields in a subsequent release to simplify the code paths.
Member

I think we can still throw a warning if new clusters are created with the old fields.

Contributor Author

That makes sense to me. In summary, we will continue to support the old fields until we can deprecate them.
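To make the first option concrete, here is a minimal sketch of how the legacy flat fields could be folded into the new list with a deprecation warning. The types and field names follow the illustrative sketches above and are assumptions, not the implementation in the linked PR.

```go
// LegacyFields stands in for the old flat attributes on the datacenter spec.
type LegacyFields struct {
	Account               string
	Domain                string
	ManagementApiEndpoint string
	Zones                 []CloudStackZone
}

// convertLegacySpec folds the deprecated flat fields into failure domain
// entries when the new list is empty, and surfaces a warning so users know
// to migrate. Sketch only; names follow the illustrative types above.
func convertLegacySpec(spec *CloudStackDatacenterConfigSpec, legacy LegacyFields, warn func(string)) {
	if len(spec.FailureDomains) > 0 {
		return // already using the new shape; nothing to convert
	}
	if legacy.ManagementApiEndpoint == "" && len(legacy.Zones) == 0 {
		return // no legacy configuration present either
	}
	warn("account/domain/zones/managementApiEndpoint are deprecated; use failure domains instead")
	for _, z := range legacy.Zones {
		spec.FailureDomains = append(spec.FailureDomains, FailureDomain{
			Name:                  z.Name, // assumes the existing zone type exposes a Name
			Zone:                  z,
			Domain:                legacy.Domain,
			Account:               legacy.Account,
			ManagementApiEndpoint: legacy.ManagementApiEndpoint,
		})
	}
}
```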

## Other approaches explored

Another direction we can go to support this feature is to refactor the entire EKS-A codebase so that, instead of all the failure domains existing inside the CloudstackDatacenterConfig
object, each CloudstackDatacenterConfig itself corresponds to a single failure domain. Then, the top-level EKS-A Cluster object could be refactored to have a list of DatacenterRefs instead
Member

It may actually provide value as we talked about wanting to support multiple environments in the future. However, I can see why it's a bigger step with less data for a specific provider.

Contributor Author

Thanks for the comments. Let's discuss in more detail tomorrow.

Member

I like that you listed the alternative solution here.

However, this seems like it would be bleeding a provider-specific topology into the main cluster object.
If we want to make a more holistic API change that includes other providers, we would need to consider what pattern makes sense for them and find the common denominator among all of them. And I don't think we have done that.


The management API endpoint for CloudStack is a potential single point of failure. If one endpoint goes down, control of everything it manages goes down with it. So we want to spread our services across many CloudStack endpoints and hosts to protect against that.

For scalability, multiple CloudStack endpoints will likely be required to provide enough storage and API throughput for our customer. A single cluster creation generates roughly a thousand API calls to ACS. There are many ways to mitigate this, but adding more CloudStack hosts and endpoints is a fairly foolproof one. There is also the size and performance of the underlying database that each CloudStack instance runs on to consider.

Is it worth instrumenting API calls to ACS during cluster lifecycle events to have precise volumetric data?

maxdrib and others added 2 commits June 14, 2022 17:00
Co-authored-by: Guillermo Gaston <guillermogastonlorente@gmail.com>
@g-gaston (Member)

/approve

@eks-distro-bot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: g-gaston

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@eks-distro-bot merged commit 5974edc into aws:main on Jun 17, 2022
wongni pushed a commit to wongni/eks-anywhere that referenced this pull request Jun 21, 2022
* Initial draft of multi-endpoint design for cloudstack

* Adding additional details

* Update cloudstack-multiple-endpoints.md

* Update cloudstack-multiple-endpoints.md

* Update cloudstack-multiple-endpoints.md

* Update cloudstack-multiple-endpoints.md

* Update cloudstack-multiple-endpoints.md

* Update cloudstack-multiple-endpoints.md

* Apply suggestions from code review

Co-authored-by: Guillermo Gaston <guillermogastonlorente@gmail.com>

* Update cloudstack-multiple-endpoints.md

* Update cloudstack-multiple-endpoints.md

* Update cloudstack-multiple-endpoints.md

* Update cloudstack-multiple-endpoints.md

* Update cloudstack-multiple-endpoints.md

* Update cloudstack-multiple-endpoints.md

* Update cloudstack-multiple-endpoints.md

* Update cloudstack-multiple-endpoints.md

* Update cloudstack-multiple-endpoints.md

Co-authored-by: Guillermo Gaston <guillermogastonlorente@gmail.com>