Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster Peering #1273

Merged
merged 11 commits into from
Jun 16, 2022
Merged

Cluster Peering #1273

merged 11 commits into from
Jun 16, 2022

Conversation

ndhanushkodi
Copy link
Contributor

Changes proposed in this PR:

  • Best reviewed by commit, plan is to rebase and merge (not squash) this branch into main
  • Adds PeeringAcceptor controller run as part of the connect-inject deployment
  • Adds PeeringDialer controller run as part of the connect-inject deployment
  • Adds PeeringAcceptor and Peering Dialer CRs
  • Adds Peer field to Exported Services CR and Intentions CR
  • Allows ExportedServices CR to be used in OSS (previously it was used only for cross-partition in ENT)
  • Adds Helm value global.peering.enabled which will deploy the controllers and CRs for peering, if connectInject.enabled is also true.

How I've tested this PR:
unit tests and manual testing using steps here

How I expect reviewers to test this PR:
👀

Checklist:

  • Tests added
  • CHANGELOG entry added

    HashiCorp engineers only, community PRs should not add a changelog entry.
    Entries should use present tense (e.g. Add support for...)

@@ -142,14 +142,13 @@ jobs:
- run: mkdir -p ${{env.TEST_RESULTS}}
- run: echo "$HOME/bin" >> $GITHUB_PATH

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this file will need to be reverted before merging-- but we need the Consul release to happen first. Then we need to bump the CONSUL_VERSION variable to be the latest Consul release

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we won't be able to merge this until after Consul's release?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. yup

@@ -142,14 +142,13 @@ jobs:
- run: mkdir -p ${{env.TEST_RESULTS}}
- run: echo "$HOME/bin" >> $GITHUB_PATH

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we won't be able to merge this until after Consul's release?

charts/consul/templates/connect-inject-clusterrole.yaml Outdated Show resolved Hide resolved
control-plane/api/v1alpha1/exportedservices_types.go Outdated Show resolved Hide resolved
@@ -69,6 +68,8 @@ type ExportedService struct {
type ServiceConsumer struct {
// Partition is the admin partition to export the service to.
Partition string `json:"partition,omitempty"`
// [Experimental] PeerName is the name of the peer to export the service to.
PeerName string `json:"peerName,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should this be Peer to match partition above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have matched the name from the consul config entry so far but I actually do prefer this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think its ok to have a different field name from the consul config entry/have we done this before? Should we consider changing the config entry field in core to also match?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we could change this in core as well TBH. I think peer is verbose enough.

control-plane/connect-inject/peering_dialer_controller.go Outdated Show resolved Hide resolved
Comment on lines +104 to +99
if specSecret == nil {
err = errors.New("PeeringDialer spec.peer.secret does not exist")
r.updateStatusError(ctx, dialer, err)
return ctrl.Result{}, err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should getSecret return an error for this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future, this step will move into a validation webhook. and trying to get a secret that does not exist is a valid use case for when we want to get the secret from the status.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 In this case though we're getting the secret from the spec not status. And the call above on line 88 gets the secret. Then here we're checking if secret is nil and returning an error, so I'm wondering why not combine these two calls into one and just make getSecret return an error when the secret doesn't exist instead of returning an error here.

Maybe I'm misunderstanding something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah! sorry. I should have been specific. We use the method to get the status secret (L110) too. And if it is the first reconcile, it is ok if the secret doesn't exist, so it would be ok for getSecret to not return an error if the secret does not exist.

control-plane/connect-inject/peering_dialer_controller.go Outdated Show resolved Hide resolved
@ishustava
Copy link
Contributor

@ndhanushkodi will changelog be added separately or in this PR?

Copy link
Contributor

@ishustava ishustava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! A couple of follow-up comments, but won't block approval as everything looks great!

❤️ ❤️ ❤️

@@ -11,13 +11,13 @@ import (
// the metrics merging server when metrics merging feature is enabled.
// It always disables service registration because for connect we no longer
// need to keep services registered as this is handled in the endpoints-controller.
func (h *Handler) consulSidecar(pod corev1.Pod) (corev1.Container, error) {
metricsPorts, err := h.MetricsConfig.mergedMetricsServerConfiguration(pod)
func (w *MeshWebhook) consulSidecar(pod corev1.Pod) (corev1.Container, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Comment on lines +469 to +470
if source.Peer != "" && source.Partition != "" {
errs = append(errs, field.Invalid(path.Child("sources").Index(i), source, `Both source.peer and source.partition cannot be set.`))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes I do mean that! Thank you for reading my mind!


service := parts[0]

pieces := strings.Split(service, ".")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, sounds good!

Comment on lines 77 to 84
if err := r.deletePeering(ctx, req.Name); err != nil {
return ctrl.Result{}, err
}
if acceptor.Secret().Backend == "kubernetes" {
if err := r.deleteK8sSecret(ctx, acceptor); err != nil {
return ctrl.Result{}, err
}
}
controllerutil.RemoveFinalizer(acceptor, FinalizerName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah that is surprising, but I suppose is a good thing for this case since this will just work.

Comment on lines +104 to +99
if specSecret == nil {
err = errors.New("PeeringDialer spec.peer.secret does not exist")
r.updateStatusError(ctx, dialer, err)
return ctrl.Result{}, err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 In this case though we're getting the secret from the spec not status. And the call above on line 88 gets the secret. Then here we're checking if secret is nil and returning an error, so I'm wondering why not combine these two calls into one and just make getSecret return an error when the secret doesn't exist instead of returning an error here.

Maybe I'm misunderstanding something.


// After reconciliation, Consul should have a peering with its DeletedAt set.
peering, _, err := consulClient.Peerings().Read(context.Background(), "dialer-deleted", nil)
require.NotNil(t, peering.DeletedAt)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing the tests fail with a nil pointer exception here. I think its because the deletion logic was implemented in Consul, and the peering is sometimes being deleted before we can read it here. So I don't think we can make this assertion anymore. Maybe we can do a retry to check that the peering is nil? Since it's a deferred delete?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to build a newer Consul image. I ran into the same issue but turns out, I had an older Consul binary.

Copy link
Contributor Author

@ndhanushkodi ndhanushkodi Jun 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the new binaries for consul/consulent and saw the same issue in the pipeline over multiple runs with the new binaries. But locally with those same binaries I was seeing different results. I think it is the deferred deletion logic that was added since I added a retry that waits 5 seconds and retries asserting that the peering is nil instead of looking at the DeletedAt stamp, and now it doesn't intermittently fail with nil pointer exceptions. I'll try pushing it now and seeing if it passes in the pipeline. Edit: I see it pass the pipeline now!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THE DEFERRED DELETION!!! ok ok. that makes sense. thank you so much Nitya! that was an excellent catch.

Copy link
Contributor

@thisisnotashwin thisisnotashwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woot woot! lets get this puppy merged today!!

Ashwin Venkatesh and others added 9 commits June 16, 2022 11:25
Adds new PeeringAcceptor CRD and controller with unit tests.

This controller determines whether to generate a new peering token in Consul by reconciling the PeeringAcceptor resource. It currently supports writing the peering token to a K8s secret.

Generating the new CRD and API:
After updating control-plane/PROJECT L6 to reference the control plane
folder, ran the command:
`operator-sdk create api --group consul --version v1alpha1 --kind Peering --controller --namespaced=true --make=false --resource=true`
PeeringDialer CRD and controller, not including unit tests.
- Clean up some of the logic in peering dialler and acceptor
- Rename handler -> connectWebhook
update source intentions with peer name and bump api module and change initiate -> establish
- Remove ownerRef from secret created by acceptor.
- when global.peering.enabled is true, and connectInject.enabled is true, enable the peering controllers and the peering CRDs.
thisisnotashwin and others added 2 commits June 16, 2022 11:27
- rename exported service CRD consumer.peerName field from peerName -->
peer
- rename ConnectWebhook --> MeshWebhook
- add some missing unit test cases
- only add permissions to connect-inject service account for peering CRs
when peering is enabled
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants