Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NET-10567] Fix namespace normalization on external registration/ACL Setup for Terminating Gateways #4224

Merged
merged 40 commits into from
Aug 20, 2024

Conversation

jm96441n
Copy link
Member

@jm96441n jm96441n commented Jul 31, 2024

Changes proposed in this PR

  • Fix namespace normalization on external registration (when namespaces are enabled the empty namespace is converted to "default" for acl rules)
  • Fix how acls are setup for terminating gateways -> moved from the registrations controller/cache to the terminating gateway controller to avoid race conditions with persisting the terminating gateway config entry (TODO: clean up extra policies when a terminating gateway is removed/services are removed and are not referenced by any other terminating gateways)
  • Fixed bug in cache where values weren't being persisted with the correct key
  • updated terminating gateway tests so they would catch this in the future

How I've tested this PR

  • wrote tests
  • ran in kind cluster

How I expect reviewers to test this PR

  • clone this branch and build the container using make docker-dev
  • clone down the following gist: https://gist.github.com/jm96441n/1be432dd2b6b5ade7bc1b058539f6095
  • modify the CONSUL_K8S_CHART_LOCATION variable in the start.sh file to point to the helm charts in your local version of consul-
  • run the start.sh file (this requires you to have kind and yq on your machine, and you'll need to run chmod +x ./start.sh)
  • after that completes run the following to set up your consul CLI to talk to consul:
export CONSUL_HTTP_ADDR="127.0.0.1:8501"
export CONSUL_HTTP_TOKEN=$(kubectl get --namespace consul secrets/consul-consul-bootstrap-acl-token --template={{.data.token}} | base64 -d)
export CONSUL_HTTP_SSL=true
export CONSUL_HTTP_SSL_VERIFY=false
  • enable a port forward to the consul server on port 8501
  • run consul acl role list and you will see both terminating gateways with the term gateway policy and the zoidberg and nibbler policies
  • run consul acl policy read -name zoidberg-write-policy to see the policy include the namespace
  • shell into the bender service and run curl localhost:1234 to see the request to zoidberg go through, run curl localhost:5678 to see the request to nibbler go through
  • now we'll test that acl policies are removed correctly
  • edit the termgw.yaml file to no longer reference the zoidberg service and apply the file
  • run consul acl role list and see the first terminating gateway no longer references the zoidberg policy
  • run consul acl policy list and you'll see the zoidberg policy still exists
  • now edit the termgw2.yaml file to no longer reference the zoidberg service and apply that file
  • now run consul acl role list and see that none of the terminating gateways reference that policy
  • now run consul acl policy list and see that the zoidberg policy no longer exists because no gateway is referencing it

Checklist

@jm96441n jm96441n added pr/no-changelog PR does not need a corresponding .changelog entry backport/1.5.x labels Jul 31, 2024
@@ -165,10 +165,10 @@ func CheckStaticServerConnectionSuccessfulWithMessage(t *testing.T, options *k8s

// CheckStaticServerConnectionSuccessful is just like CheckStaticServerConnection
// but it always expects a successful connection.
func CheckStaticServerConnectionSuccessful(t *testing.T, options *k8s.KubectlOptions, sourceApp string, curlArgs ...string) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modified this argument to make it more clear what options it should be for (in multi-namespace scenarios)

// WaitForInput starts a http server on a random port (which is output in the logs) and waits until you
// issue a request to that endpoint to continue the tests. This is useful for debugging tests that require
// inspecting the current state of a running cluster and you don't need to use long sleeps
func WaitForInput(t *testing.T) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was super useful for debugging the state of the cluster during tests and not needing to put a random sleep time in the test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved this into it's own PR #4225

@@ -231,9 +227,9 @@ func (c *RegistrationCache) updateTermGWACLRole(log logr.Logger, registration *v
var data bytes.Buffer
if err := gatewayTpl.Execute(&data, templateArgs{
EnablePartitions: c.partitionsEnabled,
Partition: registration.Spec.Service.Partition,
Partition: defaultIfEmpty(registration.Spec.Service.Partition),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these bits here fix the bug, the rest is just updating tests

@jm96441n jm96441n force-pushed the NET-10567-fix-namespace-normalization-registration branch from d05eb97 to bd7508f Compare August 5, 2024 14:42
@jm96441n jm96441n removed the pr/no-changelog PR does not need a corresponding .changelog entry label Aug 5, 2024
@jm96441n jm96441n requested review from a team, NiniOak and xwa153 and removed request for a team August 5, 2024 15:07
@nathancoleman nathancoleman self-requested a review August 5, 2024 19:11
@@ -90,6 +90,10 @@ func TestControllerNamespaces(t *testing.T) {

"global.acls.manageSystemACLs": strconv.FormatBool(c.secure),
"global.tls.enabled": strconv.FormatBool(c.secure),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test exercises the terminating gateway path and those don't function correctly without this in your helm chart

@jm96441n jm96441n changed the title [NET-10567] Fix namespace normalization on external registration [NET-10567] Fix namespace normalization on external registration/ACL Setup for Terminating Gateways Aug 12, 2024
"sigs.k8s.io/controller-runtime/pkg/client"
)

const NotInServiceMeshFilter = "ServiceMeta[\"managed-by\"] != \"consul-k8s-endpoints-controller\""

func init() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all this moved to the term gateway controller


if err := c.k8sClient.Get(c.ctx, types.NamespacedName{Name: svc, Namespace: namespace}, registration); err != nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the service name here doesn't necessarily map directly to the registration name

@@ -44,12 +44,12 @@ type RegistrationsController struct {
Log logr.Logger
}

// +kubebuilder:rbac:groups=consul.hashicorp.com,resources=servicerouters,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=consul.hashicorp.com,resources=servicerouters/status,verbs=get;update;patch
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

found this copy/pasta mistake


func (r *RegistrationsController) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
log := r.Log.V(1).WithValues("registration", req.NamespacedName)
log.Info("Reconciling Registaration")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixing typo

t.Run(c.kubeKind, func(t *testing.T) {
req := require.New(t)
ctx := context.Background()
for _, secure := range []bool{true, false} {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated these tests to run in secure and non secure mode to ensure config entries are created when acls are enabled

}

t.Log("input received, continuing test")
go func() {
Copy link
Member Author

@jm96441n jm96441n Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was causing the caller to hang if we didn't do it async

@@ -60,18 +60,6 @@ func (r *RegistrationsController) Reconcile(ctx context.Context, req ctrl.Reques
return ctrl.Result{}, client.IgnoreNotFound(err)
}

cachedRegistration, ok := r.Cache.get(registration.Spec.Service.Name)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved this down so we handle deletion correctly

@@ -143,48 +143,9 @@ func (r *RegistrationsController) handleRegistration(ctx context.Context, log lo
return result
}

if r.Cache.aclsEnabled() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the acl updates are done on the terminating gateway now to remove this controller racing with that one

@jm96441n jm96441n requested review from a team and removed request for a team August 16, 2024 19:57
Copy link
Member

@nathancoleman nathancoleman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts after walking through the PR with @jm96441n on a call a few minutes ago

jm96441n and others added 4 commits August 20, 2024 14:04
@jm96441n jm96441n enabled auto-merge (squash) August 20, 2024 20:04
@jm96441n jm96441n merged commit b138a4a into main Aug 20, 2024
50 checks passed
@jm96441n jm96441n deleted the NET-10567-fix-namespace-normalization-registration branch August 20, 2024 23:44
jm96441n added a commit that referenced this pull request Aug 21, 2024
…Setup for Terminating Gateways (#4224)

* fix bug in external service registration ACL creation where namespace is
left emtpy in acl policy if not specified in the CRD which results in an
invalid acl policy

* Remove check for timestamp

* update tests!

* update to use helper function

* all non default working

* test cases all working

* move wait for it to separate PR

* use replace for consul-k8s control-plane

* update single namespace test

* updated namespaces and destinations test

* remove usage of creating terminating gateway config entry creation and
external service config entry registration from tests

* fix typo

* update comment

* comment out broken test for the time being

* remove unused import and add period to comment

* add changelog

* fix bug in cache creation for registrations, still debugging issue with
termianting gateways and acl roles

* fix issue with terminating gateway acl role by moving role modification from registrations controller to terminating gateway controller

* appease the linter

* add acl status condition to terminating gateways

* linter

* update config entry terminating gateway tests

* Use more robust method of checking if acls are enabled

* update config entries controller unit tests to run with acls and without

* fix config entries namespaces test setup

* fix unused import

* fix config entries main test

* remove block for deregistering service

* fix comment

* fix acceptance test registration

* handle removing policies when no other gateways reference  them

* fix terminating gateway configuration for peering connect test

* remove unnecessary nodeMeta on fixture, remove unused yaml files from
fixtures

* fix wildcard service names

* use more specific matchers to avoid potential substring collisions

* Update .changelog/4224.txt

Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>

* cleaning up from PR review: moving template execution to where it's
needed and updating variable names to be more consistent

* add comment

* fix typo

---------

Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
jm96441n added a commit that referenced this pull request Aug 21, 2024
…tration/ACL Setup for Terminating Gateways into release/1.5.x (#4259)

[NET-10567] Fix namespace normalization on external registration/ACL Setup for Terminating Gateways (#4224)

* fix bug in external service registration ACL creation where namespace is
left emtpy in acl policy if not specified in the CRD which results in an
invalid acl policy

* Remove check for timestamp

* update tests!

* update to use helper function

* all non default working

* test cases all working

* move wait for it to separate PR

* use replace for consul-k8s control-plane

* update single namespace test

* updated namespaces and destinations test

* remove usage of creating terminating gateway config entry creation and
external service config entry registration from tests

* fix typo

* update comment

* comment out broken test for the time being

* remove unused import and add period to comment

* add changelog

* fix bug in cache creation for registrations, still debugging issue with
termianting gateways and acl roles

* fix issue with terminating gateway acl role by moving role modification from registrations controller to terminating gateway controller

* appease the linter

* add acl status condition to terminating gateways

* linter

* update config entry terminating gateway tests

* Use more robust method of checking if acls are enabled

* update config entries controller unit tests to run with acls and without

* fix config entries namespaces test setup

* fix unused import

* fix config entries main test

* remove block for deregistering service

* fix comment

* fix acceptance test registration

* handle removing policies when no other gateways reference  them

* fix terminating gateway configuration for peering connect test

* remove unnecessary nodeMeta on fixture, remove unused yaml files from
fixtures

* fix wildcard service names

* use more specific matchers to avoid potential substring collisions

* Update .changelog/4224.txt



* cleaning up from PR review: moving template execution to where it's
needed and updating variable names to be more consistent

* add comment

* fix typo

---------

Co-authored-by: John Maguire <john.maguire@hashicorp.com>
Co-authored-by: Nathan Coleman <nathan.coleman@hashicorp.com>
@natemollica-nm
Copy link
Contributor

natemollica-nm commented Aug 22, 2024

@jm96441n -- Awesome update to the TGW workflow!

Can you ensure that this gets updated though to account for enabling Admin Partitions -- Testing this I had to manually (tproxy destinations method) update the ACL policy for one of my external services from:

namespace "default" {
  service "example" {
    policy = "write"
  }
}

to

partition "default" {
  namespace "default" {
    node_prefix "" {
      policy = "read"
    }
    service "example" {
      policy    = "write"
      intention = "read"
    }
  }
}

Additionally, I'm positive the node:read and intention:read permissions are required for these policies (which is definitely an inconsistency in our docs):

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants