OOM after moving to consul-k8 0.16.0 #283

Closed
fredstanley opened this issue Jun 26, 2020 · 10 comments · Fixed by #291
Labels
area/connect Related to Connect service mesh, e.g. injection type/bug Something isn't working

Comments

fredstanley commented Jun 26, 2020

After moving to the latest consul-k8s version, 0.16.0, I see that resource limits have been added to the init containers. This causes an OOM kill in the "cp" step of the init container. I have attached the related logs.

Error:
Last State: Terminated
Reason: OOMKilled
Exit Code: 137

Controlled By: ReplicaSet/apiserver-6dbcb68646
Init Containers:
consul-connect-inject-init:
Container ID: docker://13f0315c75d84b4b7c58409174974bf9f4f876abc234fce595d14af1efecc4b9
Image: consul:1.8.0
Image ID: docker-pullable://consul@sha256:0e660ca8ae28d864e3eaaed0e273b2f8cd348af207e2b715237e869d7a8b5dcc
Port:
Host Port:
Command:
/bin/sh
-ec

  export CONSUL_HTTP_ADDR="${HOST_IP}:8500"
  export CONSUL_GRPC_ADDR="${HOST_IP}:8502"

  # Register the service. The HCL is stored in the volume so that
  # the preStop hook can access it to deregister the service.
  cat <<EOF >/consul/connect-inject/service.hcl
  services {
    id   = "${PROXY_SERVICE_ID}"
    name = "apiserver-sidecar-proxy"
    kind = "connect-proxy"
    address = "${POD_IP}"
    port = 20000
    meta = {
      pod-name = "${POD_NAME}"
    }

    proxy {
      destination_service_name = "apiserver"
      destination_service_id = "${SERVICE_ID}"
      local_service_address = "127.0.0.1"
      local_service_port = 9086
      }

    checks {
      name = "Proxy Public Listener"
      tcp = "${POD_IP}:20000"
      interval = "10s"
      deregister_critical_service_after = "10m"
    }

    checks {
      name = "Destination Alias"
      alias_service = "${SERVICE_ID}"
    }
  }

  services {
    id   = "${SERVICE_ID}"
    name = "apiserver"
    address = "${POD_IP}"
    port = 9086
    meta = {
      pod-name = "${POD_NAME}"
    }
  }
  EOF
  # Create the service-defaults config for the service
  cat <<EOF >/consul/connect-inject/service-defaults.hcl
  kind = "service-defaults"
  name = "apiserver"
  protocol = "grpc"
  EOF
  /bin/consul config write -cas -modify-index 0 \
    /consul/connect-inject/service-defaults.hcl || true

  /bin/consul services register \
    /consul/connect-inject/service.hcl

  # Generate the envoy bootstrap code
  /bin/consul connect envoy \
    -proxy-id="${PROXY_SERVICE_ID}" \
    -bootstrap > /consul/connect-inject/envoy-bootstrap.yaml

  # Copy the Consul binary
  cp /bin/consul /consul/connect-inject/consul
State:          Running
  Started:      Fri, 26 Jun 2020 08:22:17 -0700
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Fri, 26 Jun 2020 08:21:39 -0700
  Finished:     Fri, 26 Jun 2020 08:22:01 -0700
Ready:          False
Restart Count:  2
Limits:
  cpu:     50m
  memory:  25Mi
Requests:
  cpu:     50m
  memory:  25Mi
Environment:
  HOST_IP:            (v1:status.hostIP)
  POD_IP:             (v1:status.podIP)
  POD_NAME:          apiserver-6dbcb68646-569z4 (v1:metadata.name)
  POD_NAMESPACE:     default (v1:metadata.namespace)
  SERVICE_ID:        $(POD_NAME)-apiserver
  PROXY_SERVICE_ID:  $(POD_NAME)-apiserver-sidecar-proxy
Mounts:
  /consul/connect-inject from consul-connect-inject-data (rw)
  /var/run/secrets/kubernetes.io/serviceaccount from default-token-cdwdf (ro)
dmesg
[ 4162.609188] cp invoked oom-killer: gfp_mask=0x101cca(GFP_HIGHUSER_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=999
[ 4162.609194] CPU: 3 PID: 179518 Comm: cp Not tainted 5.4.0-37-generic #41-Ubuntu
[ 4162.609196] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[ 4162.609197] Call Trace:
[ 4162.609224]  dump_stack+0x6d/0x9a
[ 4162.609229]  dump_header+0x4f/0x1eb
[ 4162.609231]  oom_kill_process.cold+0xb/0x10
[ 4162.609233]  out_of_memory.part.0+0x1df/0x3d0
[ 4162.609236]  out_of_memory+0x6d/0xd0
[ 4162.609243]  mem_cgroup_out_of_memory+0xbd/0xe0
[ 4162.609249]  try_charge+0x775/0x800
[ 4162.609255]  ? prep_new_page+0x128/0x160
[ 4162.609260]  mem_cgroup_try_charge+0x71/0x190
[ 4162.609266]  __add_to_page_cache_locked+0x265/0x340
[ 4162.609270]  ? scan_shadow_nodes+0x30/0x30
[ 4162.609284]  add_to_page_cache_lru+0x4d/0xd0
[ 4162.609288]  pagecache_get_page+0x101/0x300
[ 4162.609292]  grab_cache_page_write_begin+0x21/0x40
[ 4162.609300]  ext4_da_write_begin+0x10d/0x460
[ 4162.609304]  generic_perform_write+0xc2/0x1c0
[ 4162.609311]  ? file_update_time+0x62/0x140
[ 4162.609315]  __generic_file_write_iter+0x107/0x1d0
[ 4162.609319]  ext4_file_write_iter+0xb9/0x360
[ 4162.609326]  ? common_file_perm+0x5e/0x110
[ 4162.609330]  do_iter_readv_writev+0x14f/0x1d0
[ 4162.609334]  do_iter_write+0x84/0x1a0
[ 4162.609337]  vfs_iter_write+0x19/0x30
[ 4162.609353]  iter_file_splice_write+0x24d/0x390
[ 4162.609359]  direct_splice_actor+0x39/0x40
[ 4162.609361]  splice_direct_to_actor+0xf5/0x240
[ 4162.609364]  ? do_splice_from+0x30/0x30
[ 4162.609366]  do_splice_direct+0x89/0xd0
[ 4162.609371]  do_sendfile+0x1b1/0x3e0
[ 4162.609375]  __x64_sys_sendfile64+0xa6/0xc0
[ 4162.609383]  do_syscall_64+0x57/0x190
[ 4162.609391]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 4162.609395] RIP: 0033:0x7f936af5182c
[ 4162.609401] Code: c3 48 85 ff 74 0c 48 c7 c7 f4 ff ff ff e9 9e eb ff ff b8 0c 00 00 00 0f 05 c3 49 89 ca 48 63 f6 48 63 ff b8 28 00 00 00 0f 05 <48> 89 c7 e9 7e eb ff ff 89 ff 50 b8 7b 00 00 00 0f 05 48 89 c7 e8
[ 4162.609403] RSP: 002b:00007ffe88fdb788 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
[ 4162.609440] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f936af5182c
[ 4162.609443] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000004
[ 4162.609445] RBP: 00007ffe88fdb7f0 R08: 0000000001000000 R09: 0000000000000000
[ 4162.609446] R10: 0000000001000000 R11: 0000000000000246 R12: 0000000000000001
[ 4162.609448] R13: 0000000001000000 R14: 0000000000000004 R15: 0000000000000000
[ 4162.609453] memory: usage 25600kB, limit 25600kB, failcnt 1027
[ 4162.609455] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 4162.609457] kmem: usage 2056kB, limit 9007199254740988kB, failcnt 0
[ 4162.609458] Memory cgroup stats for /kubepods/burstable/pod2aaa68eb-4c98-4abf-955b-a0df278771d4/b754b839cb4a92512cc95a5d33375a9383ce6750e6757ce19ef2647ef0cbd096:
[ 4162.609503] anon 0
               file 23912448
               kernel_stack 36864
               slab 1515520
               sock 0
               shmem 0
               file_mapped 0
               file_dirty 23924736
               file_writeback 135168
               anon_thp 0
               inactive_anon 0
               active_anon 0
               inactive_file 11878400
               active_file 12267520
               unevictable 0
               slab_reclaimable 815104
               slab_unreclaimable 700416
               pgfault 13629
               pgmajfault 0
               workingset_refault 0
               workingset_activate 0
               workingset_nodereclaim 0
               pgrefill 92374
               pgscan 98976
               pgsteal 3029
               pgactivate 95898
               pgdeactivate 92374
               pglazyfree 297
               pglazyfreed 0
               thp_fault_alloc 0
               thp_collapse_alloc 0
[ 4162.609505] Tasks state (memory values in pages):
[ 4162.609507] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 4162.609514] [ 179518]     0 179518      381        3    32768        0           999 cp
[ 4162.609518] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=b754b839cb4a92512cc95a5d33375a9383ce6750e6757ce19ef2647ef0cbd096,mems_allowed=0,oom_memcg=/kubepods/burstable/pod2aaa68eb-4c98-4abf-955b-a0df278771d4/b754b839cb4a92512cc95a5d33375a9383ce6750e6757ce19ef2647ef0cbd096,task_memcg=/kubepods/burstable/pod2aaa68eb-4c98-4abf-955b-a0df278771d4/b754b839cb4a92512cc95a5d33375a9383ce6750e6757ce19ef2647ef0cbd096,task=cp,pid=179518,uid=0
[ 4162.609552] Memory cgroup out of memory: Killed process 179518 (cp) total-vm:1524kB, anon-rss:12kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:32kB oom_score_adj:999
[ 4162.615609] oom_reaper: reaped process 179518 (cp), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
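
One way to read the cgroup stats above: the killed `cp` process holds almost no memory itself (anon 0, anon-rss 12kB), while the `file` and `file_dirty` counters sit close to the 25600kB limit. That is consistent with the call trace, where the write path (`ext4_da_write_begin`) is charging page-cache pages to the container's memory cgroup. The following is a minimal Go sketch of that arithmetic, using only values copied from the log. The helper names `parseStats` and `share` are illustrative assumptions and are not part of consul-k8s.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// cgroupStats holds a few counters copied from the dmesg output above (bytes).
const cgroupStats = `anon 0
file 23912448
kernel_stack 36864
slab 1515520
file_dirty 23924736
file_writeback 135168`

// limitBytes is the 25Mi container limit, which the kernel reports as 25600kB.
const limitBytes = 25 * 1024 * 1024

// parseStats turns "name value" lines into a map of byte counts.
func parseStats(s string) (map[string]int64, error) {
	stats := make(map[string]int64)
	sc := bufio.NewScanner(strings.NewReader(s))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip anything that is not a "name value" pair
		}
		v, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			return nil, err
		}
		stats[fields[0]] = v
	}
	return stats, sc.Err()
}

// share returns the percentage of the limit that n bytes represents.
func share(n int64) float64 {
	return 100 * float64(n) / float64(limitBytes)
}

func main() {
	stats, err := parseStats(cgroupStats)
	if err != nil {
		panic(err)
	}
	fmt.Printf("limit: %d bytes\n", int64(limitBytes))
	for _, name := range []string{"anon", "file", "file_dirty", "slab"} {
		fmt.Printf("%-11s %9d bytes (%.1f%% of limit)\n", name, stats[name], share(stats[name]))
	}
}
```

With these numbers, page cache accounts for roughly 91% of the limit and anonymous memory for none of it, which suggests the limit is being hit by the file copy rather than by the process heap.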
@lkysow lkysow added area/connect Related to Connect service mesh, e.g. injection type/bug Something isn't working labels Jun 26, 2020
@lkysow
Member

lkysow commented Jun 26, 2020

Hi Fred, thanks for the issue.

If you need a workaround immediately you'll need to modify these values: https://github.com/hashicorp/consul-k8s/blob/master/connect-inject/container_init.go#L14-L19 and build a custom consul-k8s image.

To be clear this is a high priority issue for us and we're working on it right now.

@fredstanley
Author

Sure, I will wait for your fix.
If possible, can you also make it configurable via the Helm chart?

@fredstanley
Author

@lkysow Can you please provide an ETA for this fix?

@lkysow
Member

lkysow commented Jul 7, 2020

Hey, we're actively working on it but can't provide an accurate ETA. We can get you a custom Docker image with our planned fix if that would help?

@fredstanley
Author

> Hey, we're actively working on it but can't provide an accurate ETA. We can get you a custom Docker image with our planned fix if that would help?

Yes, a Docker image with this fix would be appreciated.

@lkysow
Member

lkysow commented Jul 7, 2020

I built lkysow/consul-k8s-dev:jul07-resource-settings that bumps up the limit to 125Mi. Can you try that?

@fredstanley
Author

> I built lkysow/consul-k8s-dev:jul07-resource-settings that bumps up the limit to 125Mi. Can you try that?

Thanks. This helps us, and it works.

@fredstanley
Author

@lkysow and @kschoche Can you please let me know which official version of consul-k8s will have this fix?

@lkysow
Member

lkysow commented Jul 9, 2020

Yes we'll update this thread once the releases are out!

@fredstanley
Author

@lkysow

I started using your private build,
lkysow/consul-k8s-dev:jul07-resource-settings

Now, in some cases, I see consul hitting an OOM issue. My hunch is that the lifecycle container is now hitting its resource limits.
Can you please increase the limit, or remove the limit and keep only the request?

const (
	lifecycleContainerCPULimit      = "20m"
	lifecycleContainerCPURequest    = "20m"
	lifecycleContainerMemoryLimit   = "25Mi"
	lifecycleContainerMemoryRequest = "25Mi"
)

I would appreciate it if you could provide a private fix.

[66666.794649] RSP: 002b:00007fff487b8e50 EFLAGS: 00010202
[66666.794652] RAX: 000000c0001f8c00 RBX: 00007ff6138e3000 RCX: 000000c00088c000
[66666.794654] RDX: 000000c0001f8c00 RSI: 0000000006000446 RDI: 0000800000000000
[66666.794656] RBP: 00007fff487b8e80 R08: 000000c00088c000 R09: 0000000000000004
[66666.794657] R10: 00007ff6135f3bd8 R11: 0000000000000449 R12: 0000000000000003
[66666.794659] R13: 0000000000000000 R14: 000000000375e6d4 R15: 0000000000000000
[66666.794689] memory: usage 25600kB, limit 25600kB, failcnt 3986
[66666.794691] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[66666.794692] kmem: usage 8792kB, limit 9007199254740988kB, failcnt 0
[66666.794693] Memory cgroup stats for /kubepods/burstable/pod04e7f3b6-de44-476f-a139-1e2ac473c5bf/0cf41faa207b73839cb76d608582691a29b4f5de7d4148f8242891bbcb9703a2:
[66666.794718] anon 16138240
file 0
kernel_stack 294912
slab 5472256
sock 0
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
anon_thp 0
inactive_anon 405504
active_anon 16084992
inactive_file 0
active_file 0
unevictable 135168
slab_reclaimable 434176
slab_unreclaimable 5038080
pgfault 728640
pgmajfault 0
workingset_refault 0
workingset_activate 0
workingset_nodereclaim 0
pgrefill 0
pgscan 9532
pgsteal 2009
pgactivate 3498
pgdeactivate 0
pglazyfree 25575
pglazyfreed 1947
thp_fault_alloc 0
thp_collapse_alloc 0
[66666.794720] Tasks state (memory values in pages):
[66666.794722] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[66666.794730] [2998520] 100 2998520 190368 7650 200704 0 999 consul-k8s
[66666.794736] [3351864] 100 3351864 196558 13132 266240 0 999 consul
[66666.794738] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=0cf41faa207b73839cb76d608582691a29b4f5de7d4148f8242891bbcb9703a2,mems_allowed=0-1,oom_memcg=/kubepods/burstable/pod04e7f3b6-de44-476f-a139-1e2ac473c5bf/0cf41faa207b73839cb76d608582691a29b4f5de7d4148f8242891bbcb9703a2,task_memcg=/kubepods/burstable/pod04e7f3b6-de44-476f-a139-1e2ac473c5bf/0cf41faa207b73839cb76d608582691a29b4f5de7d4148f8242891bbcb9703a2,task=consul,pid=3351864,uid=100
[66666.794823] Memory cgroup out of memory: Killed process 3351864 (consul) total-vm:786232kB, anon-rss:11028kB, file-rss:41660kB, shmem-rss:0kB, UID:100 pgtables:260kB oom_score_adj:999
[66666.811642] oom_reaper: reaped process 3351864 (consul), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
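
To compare the two OOM events in this thread, it can help to pull the process name and RSS figures out of the kernel's "Killed process" summary lines. The following is a rough Go sketch that does this with a regular expression written against the exact line format shown in the logs above; other kernel versions may format the line differently. The names `oomKill` and `parseKill` are hypothetical helpers, not part of consul-k8s or any library.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// killLine matches the "Killed process" summary as it appears in this thread.
// Assumption: the field order is pid, name, total-vm, anon-rss, file-rss.
var killLine = regexp.MustCompile(
	`Killed process (\d+) \(([^)]+)\) total-vm:(\d+)kB, anon-rss:(\d+)kB, file-rss:(\d+)kB`)

// oomKill holds the fields extracted from one summary line (sizes in kB).
type oomKill struct {
	PID       int64
	Name      string
	TotalVMKB int64
	AnonRSSKB int64
	FileRSSKB int64
}

// parseKill extracts an oomKill from a dmesg line, or returns an error
// if the line does not contain a kill summary.
func parseKill(line string) (oomKill, error) {
	m := killLine.FindStringSubmatch(line)
	if m == nil {
		return oomKill{}, fmt.Errorf("no OOM kill summary in line")
	}
	nums := make([]int64, 0, 4)
	for _, i := range []int{1, 3, 4, 5} {
		n, err := strconv.ParseInt(m[i], 10, 64)
		if err != nil {
			return oomKill{}, err
		}
		nums = append(nums, n)
	}
	return oomKill{
		PID: nums[0], Name: m[2],
		TotalVMKB: nums[1], AnonRSSKB: nums[2], FileRSSKB: nums[3],
	}, nil
}

func main() {
	// Both lines are copied from the dmesg output quoted in this issue.
	lines := []string{
		"[ 4162.609552] Memory cgroup out of memory: Killed process 179518 (cp) total-vm:1524kB, anon-rss:12kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:32kB oom_score_adj:999",
		"[66666.794823] Memory cgroup out of memory: Killed process 3351864 (consul) total-vm:786232kB, anon-rss:11028kB, file-rss:41660kB, shmem-rss:0kB, UID:100 pgtables:260kB oom_score_adj:999",
	}
	for _, l := range lines {
		k, err := parseKill(l)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%s (pid %d): anon-rss %d kB, file-rss %d kB\n",
			k.Name, k.PID, k.AnonRSSKB, k.FileRSSKB)
	}
}
```

In the second event the killed `consul` process has a non-trivial anon-rss and the cgroup reports anon 16138240 with file 0 against the same 25600kB limit, so this one looks like process memory rather than page cache. That fits the hunch that the lifecycle container's 25Mi limit is the one being hit.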
