Multi-endpoint design for cloudstack provider #2399
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##             main    #2399      +/-   ##
==========================================
+ Coverage   56.07%   56.78%   +0.71%
==========================================
  Files         303      306       +3
  Lines       24645    24937     +292
==========================================
+ Hits        13819    14160     +341
+ Misses       9533     9462      -71
- Partials     1293     1315      +22
```

Continue to review full report at Codecov.
```go
//
Domain string `json:"domain"`
// Zones is a list of one or more zones that are managed by a single CloudStack management endpoint.
Zone CloudStackZone `json:"zone"`
```
Shouldn't this be `Zones []CloudstackZone`? Each domain can have multiple zones, right?
That's a good question. I believe not. Each failure domain can only have a single zone. If a customer wants a cluster spread across multiple zones, those would be modeled as separate Failure Domains.
Yes, exactly. The customer ask is to enhance the CAPC Failure Domain, which is currently equivalent to an ACS Zone, to be Endpoint+Zone+Domain+Account.
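To make that shape concrete, here is a rough sketch of what such a failure domain could look like as a Go type. All names here (`FailureDomain`, `CloudStackZone`, the exact field set) are illustrative assumptions, not the final EKS-A API:

```go
// Illustrative sketch only; type and field names are assumptions, not the final EKS-A API.
package cloudstack

// CloudStackZone identifies a single ACS zone and the network used within it.
type CloudStackZone struct {
	Name    string `json:"name"`
	Network string `json:"network,omitempty"`
}

// FailureDomain couples one zone with the endpoint, domain, and account that manage it,
// i.e. Endpoint+Zone+Domain+Account as described above.
type FailureDomain struct {
	Name                  string         `json:"name"`
	Zone                  CloudStackZone `json:"zone"`
	Domain                string         `json:"domain"`
	Account               string         `json:"account"`
	ManagementApiEndpoint string         `json:"managementApiEndpoint"`
}
```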
```go
ManagementApiEndpoint string `json:"managementApiEndpoint"`
```

We would instead propose to remove all the existing attributes and instead, simply include a list of FailureDomain objects, where each FailureDomain object looks like
Would it make sense to treat each FailureDomain as a datacenter? If so, is the reason we don't have multiple CloudstackDatacenterConfig objects purely that the engineering effort would be too high?
This topic is discussed below in "Other approaches considered". There are many instances in the EKS-A codebase where only a single datacenterconfig object is expected to be present, and decisions are based on the Kind of that one datacenterconfig object. To refactor the entire code base only for the benefit of a single provider feels wasteful of engineering resources and more likely to introduce bugs.
With the current approach, we already kind of have multiple failure domains under a CloudstackDatacenterConfig object, only they're limited to the "zone" dimension. We want to expand that to also support the account, domain, and endpoint dimension
By way of comparison, CAPV supports CAPI Failure Domains via node labeling (see Node Zones/Regions Topology). Nonetheless, EKS-A expects those zones/regions to be managed through a single Datacenter.
```go
ManagementApiEndpoint string `json:"managementApiEndpoint"`
```

and we would parse these resources and pass them into CAPC by modifying the templates we have currently implemented. We can then use this new model to read in credentials, perform pre-flight checks, plumb data to CAPC, and support upgrades in the controller. The goal would be to make these new resources backwards compatible via code
curious how to map this failureDomain list to CAPC? Can we maybe have some examples here?
Currently, we pass data to CAPC for zones in https://github.com/aws/eks-anywhere/blob/main/pkg/providers/cloudstack/config/template-cp.yaml#L38. We will do the same thing for Failure Domains. A failure domain should be the same as a zone is now, but also include the account/domain and a URL for the management host.
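As a rough, hypothetical illustration of that mapping (the function and key names below are assumptions, not the actual template builder), the per-failure-domain values handed to the CAPC templates could be built the same way the per-zone values are today, just with the extra account/domain/endpoint entries:

```go
// Hypothetical sketch; not the actual EKS-A template builder.
package cloudstack

// capcFailureDomainValues assembles the template values for one failure domain,
// mirroring the per-zone values currently passed into template-cp.yaml but adding
// the account, domain, and management endpoint dimensions.
func capcFailureDomainValues(name, zone, network, account, domain, endpoint string) map[string]string {
	return map[string]string{
		"failureDomainName":     name,
		"zone":                  zone,
		"network":               network,
		"account":               account,
		"domain":                domain,
		"managementApiEndpoint": endpoint,
	}
}
```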
### Cloudstack credentials

In a multi-endpoint Cloudstack cluster, each endpoint may have its own credentials. We propose that Cloudstack credentials will be passed in via environment variable in the same way as they are currently,
What about having the user specify a file path to the .ini and pass it in as an ENV var?
I think we could do that as well if you prefer. Although it might require some changes in the e2e test framework. I think there we're using the encoded value instead of any specific file. I'm not sure how/if we could download the file locally and then use it in the e2e tests but it does seem easier for the customer.
@vivek-koppuru any thoughts on this? I know we've discussed how to have customers pass data in at length.
Both options could work right? What are we doing for snow again?
For snow we pass in file paths as ENV vars.
From some discussions internally, it feels like storing credentials in a transient environment variable is more secure than in a file written to disk. That is why we prefer to go the env var route, although it is slightly more effort for the customer.
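Purely as an illustration of the env-var route (the variable name, ini section layout, and key names below are all assumptions, not an agreed-upon format), per-endpoint credentials could be carried in one base64-encoded ini blob with a section per failure domain:

```go
// Standalone sketch; the env var name, section layout, and key names are assumptions.
package main

import (
	"encoding/base64"
	"fmt"
	"os"

	"gopkg.in/ini.v1"
)

// endpointCredentials holds the secrets for one CloudStack management endpoint.
type endpointCredentials struct {
	APIKey    string
	SecretKey string
	APIURL    string
}

func main() {
	// Hypothetical env var carrying a base64-encoded ini file with one section per failure domain.
	raw, err := base64.StdEncoding.DecodeString(os.Getenv("EKSA_CLOUDSTACK_B64ENCODED_SECRET"))
	if err != nil {
		panic(err)
	}
	cfg, err := ini.Load(raw)
	if err != nil {
		panic(err)
	}

	creds := map[string]endpointCredentials{}
	for _, section := range cfg.Sections() {
		if section.Name() == ini.DefaultSection {
			continue // skip the implicit DEFAULT section
		}
		creds[section.Name()] = endpointCredentials{
			APIKey:    section.Key("api-key").String(),
			SecretKey: section.Key("secret-key").String(),
			APIURL:    section.Key("api-url").String(),
		}
	}
	fmt.Printf("loaded credentials for %d failure domains\n", len(creds))
}
```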
```go
ManagementApiEndpoint string `json:"managementApiEndpoint"`
```

We would instead propose to remove all the existing attributes and instead, simply include a list of FailureDomain objects, where each FailureDomain object looks like
These types of changes are breaking changes. Especially if we don't have a feature flag for Cloudstack, what about introducing a new object for this use case?
What would this list look like here?
> These types of changes are breaking changes. Especially if we don't have a feature flag for Cloudstack, what about introducing a new object for this use case?

We do have a feature flag for Cloudstack currently. What sort of new object would you introduce?

> What would this list look like here?

Currently the CloudstackDatacenterConfig object contains account, domain, zone, managementApiEndpoint, and zones. Zones are a proxy for the old concept of failure domains. Instead, we would propose that the CloudstackDatacenterConfig contain just a single entry, which is a list of FailureDomain objects. Then, each FailureDomain is independent and contains an account, domain, zone, and managementApiEndpoint.
Check out https://github.com/aws/eks-anywhere/pull/2412/files for the new type definitions. We'll also have a transformer to convert the old attributes to the new ones, under AvailabilityZone.
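For illustration, a backwards-compatibility transformer along those lines might look roughly like this; the type and function names are placeholders rather than the shapes actually merged in that PR, and the zone is simplified to a name here:

```go
// Rough sketch of the kind of transformer described above; names are placeholders.
package cloudstack

// legacySpec approximates today's flat CloudstackDatacenterConfig fields.
type legacySpec struct {
	Account               string
	Domain                string
	ManagementApiEndpoint string
	Zones                 []string
}

// FailureDomain is the per-endpoint shape proposed above (zone + account + domain + endpoint),
// with the zone simplified to its name for this sketch.
type FailureDomain struct {
	Name                  string
	Zone                  string
	Account               string
	Domain                string
	ManagementApiEndpoint string
}

// convertLegacySpec fans the single account/domain/endpoint out across the old zone list,
// producing one failure domain per zone so existing configs keep working unchanged.
func convertLegacySpec(old legacySpec) []FailureDomain {
	fds := make([]FailureDomain, 0, len(old.Zones))
	for _, zone := range old.Zones {
		fds = append(fds, FailureDomain{
			Name:                  zone,
			Zone:                  zone,
			Account:               old.Account,
			Domain:                old.Domain,
			ManagementApiEndpoint: old.ManagementApiEndpoint,
		})
	}
	return fds
}
```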
1. Make all the fields optional and see if customers have the old fields set or the new ones
2. Introduce an eks-a version bump with conversion webhooks

Between these two approaches, I would take the first and then deprecate the legacy fields in a subsequent release to simplify the code paths.
I think we can still throw a warning if new clusters are created with the old fields.
That makes sense to me. In summary, we will continue to support the old fields until we can deprecate them.
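Sketching option 1 with a deprecation warning (field and function names below are hypothetical, and the real checks would live in the existing EKS-A validations rather than a free-standing function):

```go
// Hypothetical validation sketch; field and function names are assumptions.
package cloudstack

import (
	"errors"
	"log"
)

type datacenterSpec struct {
	// Legacy flat fields, kept optional for backwards compatibility.
	Account, Domain, ManagementApiEndpoint string
	Zones                                  []string
	// New-style failure domains, simplified to names for this sketch.
	FailureDomains []string
}

// validateDatacenterSpec accepts either the legacy fields or the new list, warns when
// the deprecated fields are used, and rejects configs that mix the two styles.
func validateDatacenterSpec(spec datacenterSpec) error {
	legacySet := spec.Account != "" || spec.Domain != "" || spec.ManagementApiEndpoint != "" || len(spec.Zones) > 0
	newSet := len(spec.FailureDomains) > 0
	switch {
	case legacySet && newSet:
		return errors.New("cannot set both the legacy datacenter fields and failureDomains")
	case legacySet:
		log.Println("warning: the flat account/domain/zones/managementApiEndpoint fields are deprecated; please migrate to failureDomains")
		return nil
	case newSet:
		return nil
	default:
		return errors.New("either the legacy datacenter fields or failureDomains must be set")
	}
}
```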
## Other approaches explored

Another direction we can go to support this feature is to refactor the entire EKS-A codebase so that instead of all the failure domains existing inside the CloudstackDatacenterConfig object, each CloudstackDatacenterConfig itself corresponds with a single failure domain. Then, the top level EKS-A Cluster object could be refactored to have a list of DatacenterRefs instead
It may actually provide value as we talked about wanting to support multiple environments in the future. However, I can see why it's a bigger step with less data for a specific provider.
Thanks for the comments. Let's discuss in more detail tomorrow.
I like that you listed the alternative solution here.
However, this seems like it would be bleeding provider-specific topology into the main cluster object.
If we want to make a more holistic API change that includes other providers, we would need to consider what pattern makes sense for each of them and find the common denominator among all. And I don't think we have done that.
The management API endpoint for Cloudstack is a potential single point of failure. If one endpoint goes down, then control of everything behind it goes down. So, we want to spread our services across many Cloudstack endpoints and hosts to protect against that.

For scalability, multiple Cloudstack endpoints will likely be required for storage and API endpoint throughput for our customer. Just one cluster creation generates probably a thousand API calls to ACS. There are many ways to mitigate this, but adding more Cloudstack hosts and endpoints is a fairly foolproof way to do so. Then, there's the size and performance of the underlying database that each Cloudstack instance runs on.
Is it worth instrumenting API calls to ACS during cluster lifecycle events to have precise volumetric data?
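If we do want that data, one lightweight way to get it (purely a sketch, not something wired into the provider today; the URL and names are placeholders) is to wrap the HTTP transport used by the CloudStack client and count requests during create/upgrade/delete:

```go
// Standalone sketch of counting ACS API calls; not part of the EKS-A codebase.
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// countingTransport wraps an http.RoundTripper and counts every request sent through it,
// which is one lightweight way to measure API-call volume against an ACS endpoint
// during cluster lifecycle events.
type countingTransport struct {
	next  http.RoundTripper
	calls int64
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	atomic.AddInt64(&t.calls, 1)
	return t.next.RoundTrip(req)
}

func main() {
	transport := &countingTransport{next: http.DefaultTransport}
	client := &http.Client{Transport: transport}

	// Any CloudStack SDK that accepts a custom *http.Client could be pointed at `client`;
	// the request below is only a stand-in for real ACS traffic.
	if resp, err := client.Get("https://cloudstack.example.com/client/api?command=listZones"); err == nil {
		resp.Body.Close()
	}

	fmt.Printf("ACS API calls observed: %d\n", atomic.LoadInt64(&transport.calls))
}
```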
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: g-gaston. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
* Initial draft of multi-endpoint design for cloudstack
* Adding additional details
* Update cloudstack-multiple-endpoints.md (×6)
* Apply suggestions from code review (Co-authored-by: Guillermo Gaston <guillermogastonlorente@gmail.com>)
* Update cloudstack-multiple-endpoints.md (×9)

Co-authored-by: Guillermo Gaston <guillermogastonlorente@gmail.com>
Issue #, if available:
#2406
Description of changes:
This design encapsulates a proposed approach for refactoring the CloudStack provider to support creating eks-a clusters across multiple Cloudstack endpoints with different sets of credentials.
Testing (if applicable):
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.