Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separating provider-specific code in the installer #566

Closed
wants to merge 1 commit into from
Closed

Separating provider-specific code in the installer #566

wants to merge 1 commit into from

Conversation

ghost
Copy link

@ghost ghost commented Dec 15, 2020

This change introduces an internal API in the installer code to move provider-specific code into a separate folder and avoid having large switch-case structures in the installer code.

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign danehans after the PR has been reviewed.
You can assign the PR to them by writing /assign @danehans in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mdbooth
Copy link
Contributor

mdbooth commented Dec 15, 2020

I'm in favour of this. The details of each 'API' can be hashed out separately one by one as cloud provider code is refactored into it. I don't think it's worth getting into that detail here, except to agree it in principal.

@sdodson
Copy link
Member

sdodson commented Dec 15, 2020

/uncc @stevekuznetsov
/cc @sdodson @staebler @crawford

@openshift-ci-robot openshift-ci-robot requested review from crawford, sdodson and staebler and removed request for stevekuznetsov December 15, 2020 15:11
Copy link
Member

@pierreprinetti pierreprinetti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for proposing this!

We've been working in this direction in the OpenStack provider already, and I think that it would bring many advantages: clarity, more focus on the orchestration in common parts, a better integration of optional steps (eg pre-flight validation), and moreover, well-defined expectations, written in code.

A big 👍 from me!

}
```

Not all platforms implement all features. Optional features should be hidden behind a feature flag and a separate interface:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional platform-specific features, like validations, could be detected via type assertion.

This is what the orchestrator code could resemble to:

type InstallConfig struct{}

func validateCloud(ic InstallConfig, platform interface{}) {

	if quotaValidator, ok := platform.(interface {
		ValidateQuota(InstallConfig) error
	}); ok {
		if err := quotaValidator.ValidateQuota(ic); err != nil {
			log.Fatal(err)
		}
	}
}

Each platform can them choose to implement some methods and not others, and the orchestrator code is aware what methods are available.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, there is a much better way to do this as I found out. We can create a struct that has receivers for the default functionality and providers can embed this default struct. (Basically, OOP inheritance.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand. The way I'm figuring this, Installer-core defines a struct with optional methods defined as no-ops, then the Platform embeds it in a struct where it can override the optional noops with actual logic, then the Installer-core imports that... I see a circular dependency

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other places where I've created driver-like APIs of this sort, repeating some code always gave us a result that was easier to maintain over time because a developer working on one driver does not have to untangle an inheritance hierarchy or track down the real implementation that may live somewhere else. In places where we could gain from code reuse, utility functions or types that are instantiated rather than inherited are simpler to test and more flexible.

Along the same lines, no-op implementations, even if they have to be provided by every platform, are simpler to understand than features flags and multiple interfaces.


## Proposal

### Step 1: Create a Go interface
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of defining a single all-encompassing interface, each "installer target" (install-config, manifests, machines etc) could define its own required interface, ideally with just one method each. This way, each logic part can be easily tested and we get cleanly segregated interfaces.


### Step 2: Implement registry

In this step we should implement a registry pattern where providers can register with their names.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the OpenStack provider, we have started the effort of defining a clean provider interface for Installer steps. For example, for cloudproviderconfig we have come up with this signature:

func GenerateCloudProviderConfig(installConfig types.InstallConfig) (cloudProviderConfigData, cloudProviderConfigCABundleData string, err error)

We would be happy to be guinea pigs for this RFE and help defining and testing interface signatures in-vivo for the other steps as well! We also have that in our backlog.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There ought to be a saying about two guinea pigs being better than one, we (oVirt) are also wanting to have this. Let's do it!


### Version Skew Strategy

As the API develops it presents the problem of version skew. As new API calls are introduced they need to either be added to all providers. If that is not possible a new interface should be introduced and a compatibility wrapper should be added. Once all providers are migrated the temporary interface can be removed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New function signatures could be implemented as new "optional" interfaces a provider object can implement, with the mechanism of interface assertion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above, we don't even need to do interface assertion. (Interface assertion is not nice to read and incurs somewhat of a performance penalty.)

Copy link
Member

@pierreprinetti pierreprinetti Dec 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn't it be nice? 😁 Conceptually, it's really what we want here: "dear Platform type, do you implement this method?"

As for the performance penalty: I tend to value readability higher, especially when the total penalty is not expected to pass the 1-second threshold in a 40-minutes process. I would even value aesthetics higher than performance here 😛

}
```

Not all platforms implement all features. Optional features should be hidden behind a feature flag and a separate interface:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other places where I've created driver-like APIs of this sort, repeating some code always gave us a result that was easier to maintain over time because a developer working on one driver does not have to untangle an inheritance hierarchy or track down the real implementation that may live somewhere else. In places where we could gain from code reuse, utility functions or types that are instantiated rather than inherited are simpler to test and more flexible.

Along the same lines, no-op implementations, even if they have to be provided by every platform, are simpler to understand than features flags and multiple interfaces.


### Step 4: Migrate providers

Providers need to be migrated to the new API.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the biggest sources of friction we have had with "everyone needs to..." projects is prioritizing the work on the teams' backlogs. How many different teams will be involved in this part of the work? Are they all aware of this plan?

Copy link
Author

@ghost ghost Dec 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dhellmann I have sent out this plan to all teamleads and managers affected as far as I could tell. The benefit of this plan is that no team needs to do anything. In addition, I'm working on a way to make sure that API calls added in the future don't create an immediate requirement to do anything for other teams. PoC will still take a while (probably until new year). The document will need to be partially updated.

Copy link
Contributor

@staebler staebler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still do not see enough value in the monolithic interface to outweigh the effort and churn required to implement it. Perhaps with a PoC where I could see the proposed structure more clearly, I could be convinced. With that said, I am whole-heartedly in favor of the owners of specific providers making an effort to better separate their provider-specific code from the core installer code.


## Motivation

Several parts of the installer code have large, [unwieldy switch-case structures](https://github.com/openshift/installer/blob/254680f6268a7b8e57091d788256c67b81755b5d/pkg/asset/cluster/tfvars.go#L166) that contain provider-specific codes. This makes it hard to keep track of the code that serves a provider, and it makes it impossible to write component-level tests. Creating a separated API reduces friction between teams and makes changes easier to merge because the risk of accidental side effects between providers is reduced.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see it, there are three motivations.

  1. You want to be able to find code that serves a provider more easily.
  2. You want to be able to write component-level tests for a specific provider.
  3. You want to be able to merge code specific to a provider without having to get reviews from people on other teams.

Creating a separate API is a solution not a motivation. I contend that you can satisfy all three motivations without have a separate API. As a maintainer of a platform, you are at liberty to extract whatever platform-specific code there is from the unwieldy switch statement and put it into a function in a directory that you own. You can then run whatever component-level tests you like on that function.


### Step 1: Create a Go interface

We should create a Go interface that describes the functionality each provider must implement. For example, from the prototype:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This interface will have 20-30 functions. Some of these functions are going to be difficult to describe if the interface is far from the asset where it is used. As an example, the MCS Certificate asset has different logic for specifying the IP addresses and DNS names for the cert based on the platform. I personally prefer to have the logic for each asset located close to the asset, particularly where the logic is more closely coupled to the asset than the provider.

The logic for some assets that are very specific to a platform would do better farther away from the asset. One example of this is the Platform Credentials Check asset. The credentials check could stand on its own.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an example, the MCS Certificate asset has different logic for specifying the IP addresses and DNS names for the cert based on the platform. I personally prefer to have the logic for each asset located close to the asset, particularly where the logic is more closely coupled to the asset than the provider.

This makes sense for readability. A counter-argument in favor of the interface, which is not included in this enhancement, is that the current implementation creates a dependency issue. We can see in the code that creating an asset depends on all platforms, creating an oVirt MCS Certificate asset depends on the AWS platform. If we inject the dependencies through an interface, we would resolve these dependency issues.

These dependencies are "theoretically wrong," but I'm not sure if they pose a practical problem.


### Version Skew Strategy

As the API develops it presents the problem of version skew. As new API calls are introduced they need to either be added to all providers. If that is not possible a new interface should be introduced and a compatibility wrapper should be added. Once all providers are migrated the temporary interface can be removed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This point is a big concern for me. Let's say that a provider needs a new parameter. Either we update every provider at once to add the new parameter. Or we support a new function with the new parameter. In the latter case, I would expect the providers that do not need the new parameter will never update to the new version of the function. Then, later, when a new parameter is added again, we get a third function. As time goes on, I could see every provider ending up with their own function. I look at the machine assets as an example. Every provider has a MachineSets and Machines function, but each one has its own unique signature.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue has been addressed in the PoC (see below) I will update the document later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As time goes on, I could see every provider ending up with their own function. I look at the machine assets as an example. Every provider has a MachineSets and Machines function, but each one has its own unique signature.

The parameters in an interface's function signature to generate an asset should just match the dependencies for that asset. Each platform/provider should have a signature that takes in all the dependencies and then returns whatever is needed to generate the asset (i.e. whatever is coming out of the switch statement). The PoC could be improved in this regard, I will leave a comment there.

Copy link
Author

@ghost ghost Jan 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The machine config is pretty easy to standardize as far as I can tell. In my previous PoC I did this:

	// GetMasterMachines returns a pool of machine configurations to create for the master nodes.
	GetMasterMachines(
		ic *types.InstallConfig,
		pool *types.MachinePool,
		rhcosImage *rhcos.Image,
		clusterID *installconfig.ClusterID,
	) ([]machineapi.Machine, error)

	// AddWorkerMachines adds a set of machines to the worker pool.
	AddWorkerMachines(
		ic *types.InstallConfig,
		pool *types.MachinePool,
		rhcosImage *rhcos.Image,
		clusterID *installconfig.ClusterID,
		machineSets []runtime.Object,
	) error

The API extraction is a rather simple process: there are a number of variables available there the provider-specific code is located and those make it into the API.

@ghost
Copy link
Author

ghost commented Jan 4, 2021

@mdbooth @staebler @dhellmann @pierreprinetti thank you for your input. I have a PoC ready that addresses some of the issues raised here: openshift/installer#4517


## Motivation

Several parts of the installer code have large, [unwieldy switch-case structures](https://github.com/openshift/installer/blob/254680f6268a7b8e57091d788256c67b81755b5d/pkg/asset/cluster/tfvars.go#L166) that contain provider-specific codes. This makes it hard to keep track of the code that serves a provider, and it makes it impossible to write component-level tests. Creating a separated API reduces friction between teams and makes changes easier to merge because the risk of accidental side effects between providers is reduced.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes it impossible to write component-level tests

What do you mean by "component-level tests"? Many platforms have platform-specific unit tests. For example, here are the GCP tests for validating install configs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We (oVirt) have several parts of our code where half of the functionality is in a switch-case, the other half is in a switch-case. It would be helpful to have tests that run our code only without needing to test the whole installer. That makes test runs quicker and less flaky.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't really call them unit tests because they still require external API's, which is a separate issue to be addressed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at the GCP example I linked above you can see how we mocked out the GCP APIs.

@openshift-bot
Copy link

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2021
@openshift-bot
Copy link

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 5, 2021
@ghost ghost closed this May 5, 2021
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants