Allow node-to-node TCP traffic on all ports #54

Merged
jake-low merged 1 commit into main from jlow/allow-internode-tcp on Jan 7, 2025

Conversation

@jake-low (Contributor) commented Jan 7, 2025

When I merged #52 today, it caused an outage, so I quickly reverted it (#53). However, the outage wasn't directly caused by my configuration change. What I determined before the rollback was that the ingress-nginx and osmcha-app pods had been deployed to different k8s nodes, and couldn't communicate.

After much debugging today, I found that the default security group settings configured by the Terraform AWS EKS module we use permit inter-node traffic only on unprivileged ports (> 1024). The ingress-nginx pod needs to connect to osmcha-app on port 80, and when the two happened to be deployed on different nodes, the firewall settings were blocking this traffic.
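
For context, the module's built-in node-to-node rule shows up in the plan's refresh log below as `ingress_nodes_ephemeral`. A rough sketch of its shape (paraphrased from the terraform-aws-eks defaults as I understand them; the exact form may vary by module version) shows why port 80 was blocked: nothing opens ports 1-1024 between nodes.

```terraform
# Rough sketch of the module's default self-ingress rule (assumed shape,
# per terraform-aws-eks defaults): only unprivileged ports are permitted.
ingress_nodes_ephemeral = {
  description = "Node to node ingress on ephemeral ports"
  protocol    = "tcp"
  from_port   = 1025 # ports <= 1024 are not covered, so port 80 is blocked
  to_port     = 65535
  type        = "ingress"
  self        = true
}
```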

My rollback happened to cause the two pods to once again be scheduled on the same node, restoring service. I suspect we haven't hit this issue in the past because the cluster was in a degraded state (which I noticed and fixed last week) where only one of the two allocated nodes was actually joined to the cluster.

This PR adds a security group rule to allow all TCP traffic between worker nodes (including on privileged ports), which I believe will let the app run even when the ingress-nginx and osmcha-app pods are scheduled on different nodes.
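
Concretely, the new rule is passed through the EKS module's `node_security_group_additional_rules` input; a minimal sketch follows (the `ingress_self_all` key and its attributes match the Terraform plan in the CI comment below; the enclosing input name is an assumption about how this module wires additional rules):

```terraform
# Sketch: extra self-referencing ingress rule for the node security group.
# Attributes mirror the planned aws_security_group_rule.node["ingress_self_all"].
node_security_group_additional_rules = {
  ingress_self_all = {
    description = "Node to node ingress on all ports (default only permits ingress on unprivileged ports)"
    protocol    = "tcp"
    from_port   = 1
    to_port     = 65535
    type        = "ingress"
    self        = true
  }
}
```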

@jake-low force-pushed the jlow/allow-internode-tcp branch from 931cd5e to 232572b on January 7, 2025 00:27
github-actions bot commented Jan 7, 2025

Terraform Format and Style 🖌 success

Terraform Initialization ⚙️ success

Terraform Plan 📖 success

Terraform Validation 🤖 success

Show Plan

```terraform
module.resources.random_password.django_secret_key: Refreshing state... [id=none]
module.resources.module.eks.module.kms.data.aws_partition.current[0]: Reading...
module.resources.data.aws_caller_identity.current: Reading...
module.resources.module.eks.aws_cloudwatch_log_group.this[0]: Refreshing state... [id=/aws/eks/osmcha-production-cluster/cluster]
module.resources.data.aws_availability_zones.available: Reading...
module.resources.module.eks.module.kms.data.aws_partition.current[0]: Read complete after 0s [id=aws]
module.resources.module.eks.module.eks_managed_node_group["regular"].data.aws_caller_identity.current: Reading...
module.resources.module.eks.data.aws_caller_identity.current: Reading...
module.resources.module.eks.module.eks_managed_node_group["default"].data.aws_caller_identity.current: Reading...
module.resources.module.vpc.aws_vpc.this[0]: Refreshing state... [id=vpc-0ff6ba7829e56e010]
module.resources.module.eks.module.eks_managed_node_group["default"].data.aws_partition.current: Reading...
module.resources.module.eks.module.eks_managed_node_group["regular"].data.aws_partition.current: Reading...
module.resources.module.eks.data.aws_partition.current: Reading...
module.resources.module.eks.module.eks_managed_node_group["regular"].data.aws_partition.current: Read complete after 0s [id=aws]
module.resources.module.eks.module.eks_managed_node_group["default"].data.aws_partition.current: Read complete after 0s [id=aws]
module.resources.module.eks.data.aws_partition.current: Read complete after 0s [id=aws]
module.resources.module.eks.module.kms.data.aws_caller_identity.current[0]: Reading...
module.resources.module.eks.module.eks_managed_node_group["regular"].data.aws_iam_policy_document.assume_role_policy[0]: Reading...
module.resources.module.eks.module.eks_managed_node_group["default"].data.aws_iam_policy_document.assume_role_policy[0]: Reading...
module.resources.module.eks.module.eks_managed_node_group["regular"].data.aws_iam_policy_document.assume_role_policy[0]: Read complete after 0s [id=2560088296]
module.resources.module.eks.module.eks_managed_node_group["default"].data.aws_iam_policy_document.assume_role_policy[0]: Read complete after 0s [id=2560088296]
module.resources.module.eks.data.aws_iam_policy_document.assume_role_policy[0]: Reading...
module.resources.module.eks.module.eks_managed_node_group["regular"].aws_iam_role.this[0]: Refreshing state... [id=regular-eks-node-group-20231107054922197700000001]
module.resources.module.eks.module.eks_managed_node_group["default"].data.aws_caller_identity.current: Read complete after 0s [id=003081160852]
module.resources.module.eks.data.aws_iam_policy_document.assume_role_policy[0]: Read complete after 0s [id=2764486067]
module.resources.data.aws_caller_identity.current: Read complete after 0s [id=003081160852]
module.resources.module.eks.data.aws_caller_identity.current: Read complete after 0s [id=003081160852]
module.resources.module.eks.module.kms.data.aws_caller_identity.current[0]: Read complete after 0s [id=003081160852]
module.resources.module.eks.module.eks_managed_node_group["default"].aws_iam_role.this[0]: Refreshing state... [id=default-eks-node-group-20231107054922198000000002]
module.resources.module.eks.aws_iam_role.this[0]: Refreshing state... [id=osmcha-production-cluster-cluster-20231107054922198200000003]
module.resources.module.eks.data.aws_iam_session_context.current: Reading...
module.resources.module.eks.module.eks_managed_node_group["regular"].data.aws_caller_identity.current: Read complete after 0s [id=003081160852]
module.resources.module.eks.data.aws_iam_session_context.current: Read complete after 0s [id=arn:aws:iam::003081160852:user/devseed]
module.resources.data.aws_availability_zones.available: Read complete after 0s [id=us-east-1]
module.resources.module.eks.module.eks_managed_node_group["default"].aws_iam_role_policy_attachment.additional["AmazonEBSCSIDriverPolicy"]: Refreshing state... [id=default-eks-node-group-20231107054922198000000002-20231112185603604700000001]
module.resources.module.eks.module.eks_managed_node_group["regular"].aws_iam_role_policy_attachment.additional["AmazonEBSCSIDriverPolicy"]: Refreshing state... [id=regular-eks-node-group-20231107054922197700000001-20231112185603642800000002]
module.resources.module.eks.module.eks_managed_node_group["default"].aws_iam_role_policy_attachment.this["arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"]: Refreshing state... [id=default-eks-node-group-20231107054922198000000002-20231107054925140200000009]
module.resources.module.eks.module.eks_managed_node_group["regular"].aws_iam_role_policy_attachment.this["arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"]: Refreshing state... [id=regular-eks-node-group-20231107054922197700000001-20231107054924744600000005]
module.resources.module.eks.module.eks_managed_node_group["regular"].aws_iam_role_policy_attachment.this["arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"]: Refreshing state... [id=regular-eks-node-group-20231107054922197700000001-20231107054924700700000004]
module.resources.module.eks.module.eks_managed_node_group["default"].aws_iam_role_policy_attachment.this["arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"]: Refreshing state... [id=default-eks-node-group-20231107054922198000000002-20231107054925026700000007]
module.resources.module.eks.module.eks_managed_node_group["default"].aws_iam_role_policy_attachment.this["arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"]: Refreshing state... [id=default-eks-node-group-20231107054922198000000002-20231107054925077200000008]
module.resources.module.eks.module.eks_managed_node_group["regular"].aws_iam_role_policy_attachment.this["arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"]: Refreshing state... [id=regular-eks-node-group-20231107054922197700000001-20231107054924804200000006]
module.resources.module.eks.aws_iam_role_policy_attachment.this["AmazonEKSClusterPolicy"]: Refreshing state... [id=osmcha-production-cluster-cluster-20231107054922198200000003-2023110705492595170000000b]
module.resources.module.eks.aws_iam_role_policy_attachment.this["AmazonEKSVPCResourceController"]: Refreshing state... [id=osmcha-production-cluster-cluster-20231107054922198200000003-2023110705492594830000000a]
module.resources.module.eks.module.kms.data.aws_iam_policy_document.this[0]: Reading...
module.resources.module.eks.module.kms.data.aws_iam_policy_document.this[0]: Read complete after 0s [id=2924970584]
module.resources.module.eks.module.kms.aws_kms_key.this[0]: Refreshing state... [id=64f7898b-3ee0-4a8b-a0d9-f9861be71e5a]
module.resources.module.eks.module.kms.aws_kms_alias.this["cluster"]: Refreshing state... [id=alias/eks/osmcha-production-cluster]
module.resources.module.eks.aws_iam_policy.cluster_encryption[0]: Refreshing state... [id=arn:aws:iam::003081160852:policy/osmcha-production-cluster-cluster-ClusterEncryption2023110705494813260000000f]
module.resources.module.eks.aws_iam_role_policy_attachment.cluster_encryption[0]: Refreshing state... [id=osmcha-production-cluster-cluster-20231107054922198200000003-20231107054949417900000010]
module.resources.module.vpc.aws_default_security_group.this[0]: Refreshing state... [id=sg-035e596106073cf62]
module.resources.module.eks.aws_security_group.cluster[0]: Refreshing state... [id=sg-0192096028bce8b9c]
module.resources.module.vpc.aws_default_route_table.default[0]: Refreshing state... [id=rtb-09de68292979ea4bc]
module.resources.module.vpc.aws_route_table.public[0]: Refreshing state... [id=rtb-09b6383788590fd4c]
module.resources.module.vpc.aws_subnet.public[0]: Refreshing state... [id=subnet-05462743417c78890]
module.resources.module.vpc.aws_default_network_acl.this[0]: Refreshing state... [id=acl-096d4204f9df81729]
module.resources.module.vpc.aws_subnet.public[2]: Refreshing state... [id=subnet-027a255ffd5487e5b]
module.resources.module.vpc.aws_subnet.public[1]: Refreshing state... [id=subnet-0444a55ffd754d386]
module.resources.module.vpc.aws_route_table.private[0]: Refreshing state... [id=rtb-08f840c254d408f66]
module.resources.module.eks.aws_security_group.node[0]: Refreshing state... [id=sg-0fa5f0d23a3464a65]
module.resources.module.vpc.aws_subnet.private[0]: Refreshing state... [id=subnet-0225432bfc41c3460]
module.resources.module.vpc.aws_subnet.private[1]: Refreshing state... [id=subnet-01379fd977a915c30]
module.resources.module.vpc.aws_subnet.private[2]: Refreshing state... [id=subnet-0b0cbed43c2c6d28b]
module.resources.module.vpc.aws_internet_gateway.this[0]: Refreshing state... [id=igw-001ac74d2eda24d7c]
module.resources.module.vpc.aws_route_table_association.public[1]: Refreshing state... [id=rtbassoc-0d7852c3624f5fa5c]
module.resources.module.vpc.aws_route_table_association.public[2]: Refreshing state... [id=rtbassoc-02d547b8c1ca2216c]
module.resources.module.vpc.aws_route_table_association.public[0]: Refreshing state... [id=rtbassoc-00993017897174a09]
module.resources.module.vpc.aws_route.public_internet_gateway[0]: Refreshing state... [id=r-rtb-09b6383788590fd4c1080289494]
module.resources.module.vpc.aws_eip.nat[0]: Refreshing state... [id=eipalloc-00acc931ff3ef4897]
module.resources.module.eks.aws_security_group_rule.node["ingress_cluster_9443_webhook"]: Refreshing state... [id=sgrule-3093590831]
module.resources.module.eks.aws_security_group_rule.node["ingress_cluster_kubelet"]: Refreshing state... [id=sgrule-2156400892]
module.resources.module.eks.aws_security_group_rule.node["ingress_self_coredns_tcp"]: Refreshing state... [id=sgrule-399378335]
module.resources.module.eks.aws_security_group_rule.node["ingress_cluster_4443_webhook"]: Refreshing state... [id=sgrule-115022675]
module.resources.module.eks.aws_security_group_rule.node["ingress_cluster_6443_webhook"]: Refreshing state... [id=sgrule-3482942661]
module.resources.module.eks.aws_security_group_rule.node["ingress_nodes_ephemeral"]: Refreshing state... [id=sgrule-3774878186]
module.resources.module.eks.aws_security_group_rule.node["ingress_cluster_8443_webhook"]: Refreshing state... [id=sgrule-3703914724]
module.resources.module.eks.aws_security_group_rule.node["ingress_self_coredns_udp"]: Refreshing state... [id=sgrule-3012666480]
module.resources.module.eks.aws_security_group_rule.node["ingress_cluster_443"]: Refreshing state... [id=sgrule-4155399891]
module.resources.module.eks.aws_security_group_rule.node["egress_all"]: Refreshing state... [id=sgrule-2348016402]
module.resources.module.eks.aws_security_group_rule.cluster["ingress_nodes_443"]: Refreshing state... [id=sgrule-274735370]
module.resources.module.vpc.aws_route_table_association.private[2]: Refreshing state... [id=rtbassoc-032b6d59b2151c32c]
module.resources.module.vpc.aws_route_table_association.private[1]: Refreshing state... [id=rtbassoc-0ebaaeebe09f17002]
module.resources.module.vpc.aws_route_table_association.private[0]: Refreshing state... [id=rtbassoc-01908ea82222da23e]
module.resources.module.vpc.aws_nat_gateway.this[0]: Refreshing state... [id=nat-0fcff0cb111a56930]
module.resources.module.vpc.aws_route.private_nat_gateway[0]: Refreshing state... [id=r-rtb-08f840c254d408f661080289494]
module.resources.module.eks.aws_eks_cluster.this[0]: Refreshing state... [id=osmcha-production-cluster]
module.resources.module.eks.time_sleep.this[0]: Refreshing state... [id=2025-01-03T00:29:26Z]
module.resources.module.eks.data.aws_eks_addon_version.this["aws-ebs-csi-driver"]: Reading...
module.resources.module.eks.data.tls_certificate.this[0]: Reading...
module.resources.module.eks.data.tls_certificate.this[0]: Read complete after 0s [id=99d41e43229a4cdaf4141f3e8310e6d95c31dab9]
module.resources.module.eks.aws_iam_openid_connect_provider.oidc_provider[0]: Refreshing state... [id=arn:aws:iam::003081160852:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/C8E12DA86808CD1A27FA43DCE7E709B2]
module.resources.kubernetes_secret.django_secret_key: Refreshing state... [id=default/django-secret-key]
module.resources.helm_release.osmcha-cert-manager: Refreshing state... [id=cert-manager]
module.resources.helm_release.osmcha-redis: Refreshing state... [id=redis]
module.resources.helm_release.osmcha-ingress-nginx: Refreshing state... [id=ingress-nginx]
module.resources.module.eks.module.eks_managed_node_group["regular"].aws_launch_template.this[0]: Refreshing state... [id=lt-06fc194f152986836]
module.resources.module.eks.module.eks_managed_node_group["default"].aws_launch_template.this[0]: Refreshing state... [id=lt-0e8461b46962b2afb]
module.resources.module.eks.data.aws_eks_addon_version.this["aws-ebs-csi-driver"]: Read complete after 0s [id=aws-ebs-csi-driver]
module.resources.module.eks.module.eks_managed_node_group["default"].aws_eks_node_group.this[0]: Refreshing state... [id=osmcha-production-cluster:default-20231107060014586500000017]
module.resources.module.eks.module.eks_managed_node_group["regular"].aws_eks_node_group.this[0]: Refreshing state... [id=osmcha-production-cluster:regular-20231107060014586200000015]
module.resources.module.eks.aws_eks_addon.this["aws-ebs-csi-driver"]: Refreshing state... [id=osmcha-production-cluster:aws-ebs-csi-driver]

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # module.resources.module.eks.aws_security_group_rule.node["ingress_self_all"] will be created
  + resource "aws_security_group_rule" "node" {
      + description              = "Node to node ingress on all ports (default only permits ingress on unprivileged ports)"
      + from_port                = 1
      + id                       = (known after apply)
      + prefix_list_ids          = []
      + protocol                 = "tcp"
      + security_group_id        = "sg-0fa5f0d23a3464a65"
      + security_group_rule_id   = (known after apply)
      + self                     = true
      + source_security_group_id = (known after apply)
      + to_port                  = 65535
      + type                     = "ingress"
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Warning: Argument is deprecated

  with module.resources.module.eks.aws_eks_addon.this["aws-ebs-csi-driver"],
  on .terraform/modules/resources.eks/main.tf line 392, in resource "aws_eks_addon" "this":
 392:   resolve_conflicts        = try(each.value.resolve_conflicts, "OVERWRITE")

The "resolve_conflicts" attribute can't be set to "PRESERVE" on initial
resource creation. Use "resolve_conflicts_on_create" and/or
"resolve_conflicts_on_update" instead

─────────────────────────────────────────────────────────────────────────────

Note: You didn't use the -out option to save this plan, so Terraform can't
guarantee to take exactly these actions if you run "terraform apply" now.
```

Pusher: @jake-low, Action: pull_request
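
A side note on the deprecation warning in the plan: it comes from inside the terraform-aws-eks module itself (`.terraform/modules/resources.eks/main.tf`), so resolving it means upgrading the module rather than patching this repo. For reference, the non-deprecated form on the underlying resource looks roughly like this sketch (the `resolve_conflicts_on_create`/`resolve_conflicts_on_update` arguments are the AWS provider's replacements named in the warning; the resource label and values here are illustrative):

```terraform
# Sketch: aws_eks_addon using the split arguments that replace the
# deprecated resolve_conflicts attribute.
resource "aws_eks_addon" "ebs_csi" {
  cluster_name                = "osmcha-production-cluster"
  addon_name                  = "aws-ebs-csi-driver"
  resolve_conflicts_on_create = "OVERWRITE"
  resolve_conflicts_on_update = "OVERWRITE"
}
```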

@jake-low merged commit e25b7e2 into main on Jan 7, 2025
1 check passed