Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Manage routing wildcard domain #59

Merged
merged 2 commits into from
Nov 19, 2018

Conversation

ironcladlou
Copy link
Contributor

@ironcladlou ironcladlou commented Nov 9, 2018

Add routing wildcard DNS management for AWS. This should be replaced ASAP by something more generalized (e.g. external-dns).

The records are managed in a way that's compatible with the installer and are correctly cleaned up during a cluster destroy.

Does not yet handle updates or deletes.

@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 9, 2018
@ironcladlou
Copy link
Contributor Author

/hold

@openshift-ci-robot openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 9, 2018
@@ -49,7 +49,7 @@ func (f *Factory) DefaultClusterIngress(ic *util.InstallConfig) (*ingressv1alpha
if err != nil {
return nil, err
}
ingressDomain := fmt.Sprintf("%s.%s", ic.Metadata.Name, ic.BaseDomain)
ingressDomain := fmt.Sprintf("*.apps.%s.%s", ic.Metadata.Name, ic.BaseDomain)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break TestDefaultClusterIngress. Also, do we want to publish the "*." in the canonical name in routes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, canonical name is the name of the DNS wildcard root domain and API consumers should assume that it is ("%s.%s", myName, canonicalHostname). I.e. canonical should always be reported as a.b.c and consumers must assume their name is d.a.b.c

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we should add the "*." in ensureDNSForLoadBalancer. Not sure about the "apps." though—probably should add "apps." here and adjust TestDefaultClusterIngress to keep it passing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't yet have any cluster config for the app domain, so I'm just making things up here. Do agree that the *. should be prefixed at runtime in ensureDNSForLoadBalancer().

@ironcladlou
Copy link
Contributor Author

Extracted #60 and #61. Will rebase this soon.

@ironcladlou
Copy link
Contributor Author

@smarterclayton @abhinavdahiya thoughts on how we can approach hosted zone config/discovery here? We need to know:

  1. The zone in which the alias should live
  2. The zone of the alias target

I'm not entirely clear how 1 is managed in the first place, and for 2 all we have to start with is the hostname attached to the Service status.

@abhinavdahiya
Copy link
Contributor

@smarterclayton
Copy link
Contributor

smarterclayton commented Nov 9, 2018 via email

@ironcladlou ironcladlou force-pushed the external-dns branch 2 times, most recently from 03443f3 to 1ad40b3 Compare November 13, 2018 22:36
@ironcladlou
Copy link
Contributor Author

Did some more work on this to auto-discover all the inputs... probably worth another review at this point before I go any further. PTAL @openshift/sig-network-edge @smarterclayton

@ironcladlou
Copy link
Contributor Author

@dgoodwin @joelddiaz @smarterclayton @abhinavdahiya

One cleanup problem here I just thought of. The hive DNS cleanup algorithm is basically to delete public zone records which match the cluster's private zone records. So, to get hive to clean up public zone A records created by the ingress operator, I think matching private zone records would also need to also be created. I don't have the private hosted zone ID in cluster config, and all clusters' zones share a domain name, leaving me with some less than ideal lookup options (e.g. "find the private zone containing an A record for $cluster_name-api .$basedomain").

Thoughts?

@@ -242,6 +242,12 @@ func (h *Handler) ensureRouterForIngress(ci *ingressv1alpha1.ClusterIngress) err
} else if !errors.IsAlreadyExists(err) {
return fmt.Errorf("failed to create router service %s/%s: %v", service.Namespace, service.Name, err)
}
if ci.Spec.IngressDomain != nil {
err = h.ensureDNSForLoadBalancer(ci, service)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we definitely need to think about how to rate limit how often this can be ensured (i.e. should we have a rate limiter that requeues us when the limit expires, should we have a periodic timer, etc).

pkg/stub/dns.go Outdated
creds := credentials.NewStaticCredentials(string(awsCreds.Data["aws_access_key_id"]), string(awsCreds.Data["aws_secret_access_key"]), "")
sess, err := session.NewSession(&aws.Config{
Credentials: creds,
Region: aws.String(h.InstallConfig.Platform.AWS.Region),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO here to pull this from the Infrastructure config, not InstallConfig.

@joelddiaz
Copy link
Contributor

The private zones do get some tagging applied to them (eg tectonicClusterID)... Not sure how helpful that is (or whether we want to embed this kind of tag lookup in the operator/controllers).

I'd feel more comfortable with the tags if the hosted zones were being tagged with the kubernetes.io/cluster/$clustername tag. There's an old discussion on this topic here openshift/installer#458

The uninstall code can also be updated to not require the matching private zone, but the question then becomes how would it unambiguously identify the records that need removal in the shared public zone?

@ironcladlou
Copy link
Contributor Author

Thanks, @joelddiaz. I didn't realize the zone was tagged. I think lookup using the tectonicClusterID tag on the private zone and creating a private zone record will have to be good enough for now.

@ironcladlou
Copy link
Contributor Author

@openshift/sig-network-edge @smarterclayton okay, I think this is ready for near final review... I've been using some uncommitted tests for it; we probably need some e2e coverage.

Overall I'm extremely uncomfortable maintaining this code and it's probably unsustainable beyond like 2 platforms (aws/gcp), but I'm hoping it's good enough to keep us moving...

@ironcladlou ironcladlou force-pushed the external-dns branch 2 times, most recently from fc7292e to 37601fb Compare November 16, 2018 19:06
@ironcladlou
Copy link
Contributor Author

I added some cheesy logic to cache updates operator-side to stop spamming AWS with UPSERT calls. Can come up with a better way later.

@ironcladlou
Copy link
Contributor Author

@smarterclayton

error: error reading /tmp/cluster/telemeter-token: no such file or directory
error: no objects passed to apply
deployment.apps/telemeter-client created
/usr/bin/openshift-tests
which: no extended.test in (/usr/libexec/origin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin)
must provide TEST_SUITE variable

Looks like something's up with our job

@ironcladlou
Copy link
Contributor Author

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 16, 2018
@ironcladlou
Copy link
Contributor Author

@csrwng you might want to review what we're doing here as part of the external-dns use case analysis.

// Find the target hosted zone of the load balancer attached to the service.
// TODO: cache it?
var targetHostedZoneID string
loadBalancers, err := m.ELB.DescribeLoadBalancers(&elb.DescribeLoadBalancersInput{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need the elbv2 version of this call to catch the new network load balancers we're using

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain further? I'm having no issues w/ the ELBs created by the k8s cloud provider for our services.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, sorry I was thinking of the nlbs that get created by the installer for masters, but those are not the ones created by the cloud provider.

}

// findClusterPublicZone finds the public zone given the base domain.
func (m *Manager) findClusterPublicZone() *route53.HostedZone {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping to send something like this upstream to external-dns ... add the ability to lookup zones by tags. In the meantime, the code will live in the operator.

// Create or update an alias from the wildcard domain to the service load
// balancer hostname in both zones.
// TODO: only make this call if the records don't exist.
updateRecord := func(zoneID string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the part that external-dns can do for you right now. Btw, if I set the --registry=noop flag, I don't have it create the ugly TXT records.

@ironcladlou
Copy link
Contributor Author

P.S., current code won't handle deletes or updates to the ClusterIngress. Can address that in a followup; need to get some sort of e2e coverage on this.

@ironcladlou
Copy link
Contributor Author

/retest

Namespace: namespace,
ManifestFactory: manifests.NewFactory(),
}
handler, err := createHandler(namespace)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should check err.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


ic, err := util.GetInstallConfig(kubeClient)
if err != nil {
return nil, fmt.Errorf("could't get installconfig: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "coudn't" (granted, it was already present before your refactoring).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

return nil, fmt.Errorf("Failed to get cvoClient: %v", err)
}

ic, err := util.GetInstallConfig(kubeClient)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why separate declaration of kubeClient from use? Really, because there is only one use, we could do away with the variable entirely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

// records Manager creates in the public zone will also be created in the
// private zone to ensure public zone records for the cluster can be identified
// and cleaned up later.
type Manager struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The names seem backwards: in the aws package, you define Manager, so we have aws.Manager, and in the dns package, you define DNSManager, so we have dns.DNSManager. It would make better sense to have aws.DNSManager and dns.Manager.

Alternatively, if we anticipate that the aws package will be imported as "awsdns" and if it doesn't make the code too confusing, we could change the name only in the aws package so that we will have dns.Manager and awsdns.Manager.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thoughts... I went with option 2 (dns.Manager and awsdns.Manager).

// and cleaned up later.
type Manager struct {
ELB *elb.ELB
Route53 *route53.Route53
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these two fields public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

sdk.Handle(handler)
sdk.Run(context.TODO())
}

func createHandler(namespace string) (*stub.Handler, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be moved to pkg/stub/handler.go and maybe renamed to "NewHandler".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost didn't even refactor this -- I think we have more to think about here in terms of setup. Okay to revisit after the sdk upgrade?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it's only a suggestion.

@@ -0,0 +1,14 @@
# Binds the operator role to its Service Account.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/operator role to its/aws-creds-secret-reader role to the operator's/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

// topology established by the OpenShift installer. Specifically, this implies:
//
// 1. A public zone shared by all clusters with <domain-name>
// 2. A private zone for the cluster with <domain-name>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of "<domain-name>", I think it would be clearer to say "domain name equal to the BaseDomain".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

// ensureDNSForLoadBalancer configures a wildcard DNS alias for a ClusterIngress
// targeting the given service.
func (h *Handler) ensureDNSForLoadBalancer(ci *ingressv1alpha1.ClusterIngress, service *corev1.Service) error {
// Ensure DNS is configured for the load balancer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment doesn't seem helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

},
ObjectMeta: metav1.ObjectMeta{
Name: "aws-creds",
Namespace: "kube-system",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use metav1.NamespaceSystem here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@ironcladlou ironcladlou force-pushed the external-dns branch 3 times, most recently from 2fe56f3 to 8c13306 Compare November 16, 2018 23:36

var dnsManager dns.Manager
switch {
case ic.Platform.AWS != nil:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have an InstallConfig with a nil Platform, so we should check that here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Platform is a value type, can't be nil

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right, sorry. Never mind!

AccessID: string(awsCreds.Data["aws_access_key_id"]),
AccessKey: string(awsCreds.Data["aws_secret_access_key"]),
Region: ic.Platform.AWS.Region,
BaseDomain: ic.BaseDomain + ".",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, BaseDomain is allowed to have a final dot, so it might be prudent to do a TrimSuffix to avoid ending up with two dots.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


var _ dns.Manager = &Manager{}

// Manager is a DNSManager for AWS which is tightly coupled to the DNS zone
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that should be "is a dns.Manager" now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised comment

},
})
if err != nil {
return fmt.Errorf("couldn't update DNS record for in zone %s: %v", zoneID, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have two prepositions and only one prepositional object: "for in zone %s".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Add routing wildcard DNS management for AWS. This should be replaced ASAP by
something more generalized (e.g. external-dns).

The records are managed in a way that's compatible with the installer and are
correctly cleaned up during a cluster destroy.

*Does not yet handle updates or deletes.*
@ironcladlou
Copy link
Contributor Author

/retest

1 similar comment
@ironcladlou
Copy link
Contributor Author

/retest

@Miciah
Copy link
Contributor

Miciah commented Nov 17, 2018

Are the changes under hack/ intended to be included in this PR? If so, let me know, and I'll go ahead and tag this PR.

@ironcladlou
Copy link
Contributor Author

Yeah, the hack/ changes are just some bonus fixes/features

@Miciah
Copy link
Contributor

Miciah commented Nov 19, 2018

/lgtm

1 similar comment
@Miciah
Copy link
Contributor

Miciah commented Nov 19, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 19, 2018
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ironcladlou, Miciah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ironcladlou
Copy link
Contributor Author

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@Miciah
Copy link
Contributor

Miciah commented Nov 19, 2018

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants