ActionLimitsTests memory test failing #3246
Failure seen in https://scans.gradle.com/s/xq6t6op7ybifc/tests/xaqugzlgs2zw6-wlc37iylaskqa
Error logs from the activation record:

"entityType": "activation",
"logs": [
"2018-11-12T08:32:03.61510596Z stdout: helloEatMemory memory 256MB",
"2018-11-12T08:32:03.648659155Z stderr: RangeError: Array buffer allocation failed",
"2018-11-12T08:32:03.648689686Z stderr: at Buffer.Uint8Array (native)",
"2018-11-12T08:32:03.64869488Z stderr: at FastBuffer (buffer.js:11:5)",
"2018-11-12T08:32:03.648698642Z stderr: at createUnsafeBuffer (buffer.js:38:12)",
"2018-11-12T08:32:03.648701912Z stderr: at new Buffer.alloc (buffer.js:130:12)",
"2018-11-12T08:32:03.648705242Z stderr: at eat (eval at <anonymous> (/nodejsAction/runner.js:79:109), <anonymous>:6:18)",
"2018-11-12T08:32:03.648711671Z stderr: at NodeActionRunner.main [as userScriptMain] (eval at <anonymous> (/nodejsAction/runner.js:79:109), <anonymous>:14:5)",
"2018-11-12T08:32:03.648716081Z stderr: at /nodejsAction/runner.js:98:45",
"2018-11-12T08:32:03.648719542Z stderr: at NodeActionRunner.run (/nodejsAction/runner.js:92:16)",
"2018-11-12T08:32:03.648723251Z stderr: at doRun (/nodejsAction/src/service.js:168:31)",
"2018-11-12T08:32:03.648726681Z stderr: at runCode (/nodejsAction/src/service.js:120:20)"
]
On my local system, I see the controller container restarting when running these tests, leading to 502 responses to the test CLI and test failures. I'm looking at whether this is causing the problems...
In my case (macOS + Docker for Mac) I had the Docker preferences set to use 2 GB; increasing to 4 GB helped. I guess this is unrelated to the Node.js RangeError case, but unexpected heap settings are also bad.
Another failure seen at https://scans.gradle.com/s/qedg2uhf76u3s/tests/xaqugzlgs2zw6-wlc37iylaskqa
Yet another failure https://scans.gradle.com/s/57yozr36s3y7i/tests/xaqugzlgs2zxa-wlc37iylaskqa
@markusthoemmes This test is failing multiple times for me in my Travis builds. I see it was added for #2827. From the error, it appears that a large array allocation failure is now handled gracefully by the Node.js runtime (maybe newer versions are more resilient). Should we adapt the test case to cause a true OOM in some other way?
I guess the problem lies somewhere in the details of physical / virtual memory, container (cgroup) rss / swap memory limits, and overall system memory. The exception stack starting with RangeError: Array buffer allocation failed suggests that the Node.js runtime could not allocate the buffer in the first place, rather than the container being killed for exceeding its memory limit. To explain why this error happens, let's compare what the memory subsystem reports inside the container with what it reports on the host.
There is barely any difference between the global system and the container; let's ignore the small differences in output for now. Even though there is a memory limit defined on the container, it seems to have 15 GB of RAM.

Containers (i.e. namespaces) do not influence the results of memory management kernel calls because there is no memory namespace [1]. The container "thinks" it has lots of memory even if there is a limit on its physical memory. If you request an amount of (virtual) memory - regardless of whether you are in a container or not - you will receive a positive result if the kernel thinks it will be able to contain the requested memory. It won't reserve any physical memory pages for your request. [2] When receiving a RangeError: Array buffer allocation failed, the request for virtual memory itself was rejected; the container's memory limit is not even involved at that point.

As already mentioned, when you receive the desired amount of virtual memory, it's simply page table entries, not backed by physical memory. If you start filling the memory, the memory management system receives page faults whenever you touch a virtual memory page for the first time and provides a physical page backing the virtual page. This is where the action container limits apply: the memory subsystem keeps an account of how much physical memory is owned by the container. Once the actual physical memory reaches the defined limit, the memory subsystem tries to reduce the physical memory pressure of the container (cgroup). The final action is to invoke the out-of-memory (OOM) killer, which stops the process with the highest score; the process consuming the most physical memory usually has the highest score.

BTW: if your action code starts a child process which consumes more physical memory than your main process, the OOM killer will stop the child process. At the same time, the main action process holding the communication socket to the invoker will continue to run normally, and you won't see an OOM error for the action.

Another interesting detail: some operating systems and VMs perform memory de-duplication [3]. If you allocate a large buffer and fill all virtual memory pages with the same values, it may end up with a single physical page representing all required virtual memory pages. That's why our current action [4] may fail to provoke the desired OOM kill in some environments.

Based on these details, I suggest the following strategy: actually touch every page of the allocated buffer and write different values into the pages so that de-duplication cannot collapse them.
Nevertheless, if the system has too little memory overall, this strategy won't help either.

[1] http://man7.org/linux/man-pages/man7/namespaces.7.html
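To make the virtual-vs-physical distinction concrete, here is a minimal Node.js sketch (an illustration, not the OpenWhisk test code; the 256 MB size, the 4096-byte stride, and the use of Buffer.allocUnsafe are assumptions chosen for clarity):

```javascript
// Minimal sketch: reserve a large chunk of virtual memory, then force the
// kernel to commit physical pages by writing a different value into each page.
const MB = 1024 * 1024;
const PAGE_SIZE = 4096; // typical Linux page size

// Step 1: reserve 256 MB of virtual memory. Because allocUnsafe() does not
// zero-fill, this should not yet fault in every page, so resident memory (and
// the cgroup accounting) barely changes.
const buf = Buffer.allocUnsafe(256 * MB);
console.log('after allocation, rss =', process.memoryUsage().rss);

// Step 2: write a different value into every page so each virtual page must be
// backed by its own physical page (identical pages could otherwise be
// de-duplicated). Once resident memory exceeds the container limit, the cgroup
// OOM killer should terminate the process.
for (let off = 0; off + 8 <= buf.length; off += PAGE_SIZE) {
  buf.writeDoubleLE(Math.random(), off);
}
console.log('after touching all pages, rss =', process.memoryUsage().rss);
```

In an action container with a 128 MB limit, the expectation under this reasoning is that step 1 succeeds while step 2 is what actually drives the container over its limit.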
Great analysis @sven-lange-last!! I'll try to come up with a PR matching the proposed approach (unless you are working on the same... in that case let me know).
@chetanmeh feel free to prepare a PR. I can review. At the moment, I'm busy looking into other areas.
On my local Ubuntu box the test case does not pass, i.e. no memory error is seen at all, no matter what memory limit I use. I changed the eat function as follows:

function eat(memoryMB) {
    var bytes = 1 * 1024 * 1024 * memoryMB;
    var pageSize = 1024;
    // allocate the buffer (virtual memory), then touch it page by page
    var buffer = new Buffer.alloc(bytes);
    for (var i = 0; i < bytes - 4; i += pageSize) {
        buffer.writeFloatBE(Math.random(), i)
    }
    console.log('done.' + buffer.length);
}

Below is the docker launch command, which shows that the memory option being passed is 128 MB.
I will now check the cgroup stats to see why the limits are not being enforced.
Below is the activation record:

{
"activationId": "e75b69aea9fe4dc09b69aea9fe5dc0ae",
"annotations": [
{
"key": "limits",
"value": {
"concurrency": 1,
"logs": 10,
"memory": 128,
"timeout": 60000
}
},
{
"key": "path",
"value": "guest/TestNodeJsMemoryExceeding"
},
{
"key": "kind",
"value": "nodejs:6"
},
{
"key": "waitTime",
"value": 242
}
],
"duration": 2909,
"end": 1543655616203,
"logs": [
"2018-12-01T09:13:33.59781017Z stdout: helloEatMemory memory 256MB",
"2018-12-01T09:13:35.993134294Z stdout: <Buffer 3f 5e f5 3c 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ... >"
],
"name": "TestNodeJsMemoryExceeding",
"namespace": "guest",
"publish": false,
"response": {
"result": {
"msg": "OK, buffer of size 256 MB has been filled."
},
"status": "success",
"success": true
},
"start": 1543655613294,
"subject": "guest",
"version": "0.0.1"
}

As can be seen, the memory limit is set to 128 MB, yet the action reports successfully filling a 256 MB buffer. Below are the cgroup stats.

$ docker ps | grep Memory
3662a758926e openwhisk/nodejs6action:latest "/bin/sh -c 'node --…" 13 minutes ago Up 13 minutes (Paused) wsk00_20_guest_TestNodeJsMemoryExceeding

And here are the various stats:

/sys/fs/cgroup/memory/docker/3662a758926ec4013648222c865996af2e0146140e7ee3b75999b534bc46a4d4$ tail -n +1 *
==> cgroup.clone_children <==
0
tail: cannot open 'cgroup.event_control' for reading: Permission denied
==> cgroup.procs <==
698
751
==> memory.failcnt <==
28917
tail: cannot open 'memory.force_empty' for reading: Permission denied
==> memory.kmem.failcnt <==
0
==> memory.kmem.limit_in_bytes <==
9223372036854771712
==> memory.kmem.max_usage_in_bytes <==
2498560
==> memory.kmem.slabinfo <==
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc-1024 32 32 1024 32 8 : tunables 0 0 0 : slabdata 1 1 0
kmalloc-192 21 21 192 21 1 : tunables 0 0 0 : slabdata 1 1 0
kmalloc-8 0 0 8 512 1 : tunables 0 0 0 : slabdata 0 0 0
radix_tree_node 84 84 584 28 4 : tunables 0 0 0 : slabdata 3 3 0
inode_cache 78 78 608 26 4 : tunables 0 0 0 : slabdata 3 3 0
shmem_inode_cache 69 69 712 23 4 : tunables 0 0 0 : slabdata 3 3 0
eventpoll_pwq 168 168 72 56 1 : tunables 0 0 0 : slabdata 3 3 0
eventpoll_epi 128 128 128 32 1 : tunables 0 0 0 : slabdata 4 4 0
kmalloc-512 128 128 512 32 4 : tunables 0 0 0 : slabdata 4 4 0
kmalloc-256 128 128 256 32 2 : tunables 0 0 0 : slabdata 4 4 0
mm_struct 60 60 2112 15 8 : tunables 0 0 0 : slabdata 4 4 0
signal_cache 64 64 1024 32 8 : tunables 0 0 0 : slabdata 2 2 0
sighand_cache 30 30 2112 15 8 : tunables 0 0 0 : slabdata 2 2 0
files_cache 69 69 704 23 4 : tunables 0 0 0 : slabdata 3 3 0
mqueue_inode_cache 0 0 960 34 8 : tunables 0 0 0 : slabdata 0 0 0
cred_jar 210 210 192 21 1 : tunables 0 0 0 : slabdata 10 10 0
task_struct 30 30 5952 5 8 : tunables 0 0 0 : slabdata 6 6 0
sock_inode_cache 115 115 704 23 4 : tunables 0 0 0 : slabdata 5 5 0
proc_inode_cache 168 168 680 24 4 : tunables 0 0 0 : slabdata 7 7 0
dentry 349 420 192 21 1 : tunables 0 0 0 : slabdata 20 20 0
filp 352 352 256 32 2 : tunables 0 0 0 : slabdata 11 11 0
anon_vma 414 414 88 46 1 : tunables 0 0 0 : slabdata 9 9 0
pid 576 576 64 64 1 : tunables 0 0 0 : slabdata 9 9 0
vm_area_struct 390 390 208 39 2 : tunables 0 0 0 : slabdata 10 10 0
==> memory.kmem.tcp.failcnt <==
0
==> memory.kmem.tcp.limit_in_bytes <==
9223372036854771712
==> memory.kmem.tcp.max_usage_in_bytes <==
0
==> memory.kmem.tcp.usage_in_bytes <==
0
==> memory.kmem.usage_in_bytes <==
2400256
==> memory.limit_in_bytes <==
134217728
==> memory.max_usage_in_bytes <==
134217728
==> memory.move_charge_at_immigrate <==
0
==> memory.numa_stat <==
total=32126 N0=32126
file=0 N0=0
anon=32126 N0=32126
unevictable=0 N0=0
hierarchical_total=32126 N0=32126
hierarchical_file=0 N0=0
hierarchical_anon=32126 N0=32126
hierarchical_unevictable=0 N0=0
==> memory.oom_control <==
oom_kill_disable 0
under_oom 0
oom_kill 0
tail: cannot open 'memory.pressure_level' for reading: Permission denied
==> memory.soft_limit_in_bytes <==
9223372036854771712
==> memory.stat <==
cache 0
rss 131588096
rss_huge 0
shmem 0
mapped_file 0
dirty 0
writeback 0
pgpgin 140710
pgpgout 108584
pgfault 145356
pgmajfault 2051
inactive_anon 65839104
active_anon 65748992
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 134217728
total_cache 0
total_rss 131588096
total_rss_huge 0
total_shmem 0
total_mapped_file 0
total_dirty 0
total_writeback 0
total_pgpgin 140710
total_pgpgout 108584
total_pgfault 145356
total_pgmajfault 2051
total_inactive_anon 65839104
total_active_anon 65748992
total_inactive_file 0
total_active_file 0
total_unevictable 0
==> memory.swappiness <==
60
==> memory.usage_in_bytes <==
133988352
==> memory.use_hierarchy <==
1
==> notify_on_release <==
0
==> tasks <==
698
751
753
754
755
756
Per the stats, memory.usage_in_bytes stays just below the 128 MiB limit (memory.limit_in_bytes = 134217728), memory.failcnt is 28917, and oom_kill remains 0, so the container is being held at the limit without the OOM killer ever being invoked.
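For reference, a quick conversion of the key cgroup values above into MiB (plain arithmetic on the numbers printed by tail, nothing else assumed):

```javascript
const MiB = 1024 * 1024;
console.log(134217728 / MiB); // memory.limit_in_bytes     -> 128 MiB
console.log(134217728 / MiB); // memory.max_usage_in_bytes -> 128 MiB (at the limit)
console.log(131588096 / MiB); // memory.stat rss           -> ~125.5 MiB
console.log(133988352 / MiB); // memory.usage_in_bytes     -> ~127.8 MiB
```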
I ran a simple Travis test based on this test repo, which executes the nodejs docker image with the same script that we pass in the test, and it fails as expected with exit code 137 and "OOMKilled": true.

language: python
group: deprecated-2017Q3
services:
- docker
script:
- grep CGROUP /boot/config-$(uname -r)
- docker run --name oom-check-node --memory 128m --memory-swap 128m -v "$PWD":/usr/src/app -w /usr/src/app openwhisk/nodejs6action node memoryWithGC.js
- docker inspect oom-check-node
- docker logs oom-check-node

So I'm not sure why it fails in some cases, as this Travis run was using the same docker version.
Attempting a possible fix in #4155
@chetanmeh in your code snippet above, you used 1 KiB as the page size. On Linux, the page size is typically 4 KiB.
The code should work regardless of this problem because it will just fill more memory locations than absolutely required. I'm surprised that your test code stays exactly within the rss memory limit... that's why you don't see the OOM kill. It seems that all Node.js binary code is accounted for as shared memory.

Limit: memory.limit_in_bytes = 134217728 (128 MiB)
Max usage: memory.max_usage_in_bytes = 134217728 (128 MiB)
Current usage: memory.usage_in_bytes = 133988352 (~127.8 MiB)
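If one wants to double-check the page size before picking the write stride, a small sketch like the following can be used (an illustrative snippet, not part of the test; it assumes a Linux environment where the POSIX getconf utility is available, e.g. inside the action image):

```javascript
// Illustrative check of the system page size via getconf.
const { execSync } = require('child_process');

const pageSize = parseInt(execSync('getconf PAGESIZE').toString().trim(), 10);
console.log('page size in bytes:', pageSize); // typically 4096 on Linux
```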
@sven-lange-last I tried with 4 KiB as well and went beyond the limit, but the usage still stays at 128 MB. On my desktop I do see a warning, so maybe it does not support the cgroup limit, though it's 16.04, which should have that support by default.

The Travis build does enforce the memory limit, and that works for simple runs (as seen in the linked Travis run). However, for some reason it fails in test runs, so it may be due to some form of memory crunch which causes the RangeError before hitting the OOM. For now, multiple runs of the new approach in #4155 have passed, so that may fix this issue.
#4155 was merged and should address this heisenbug. Please re-open if the test fails again occasionally.
I've seen that test failing a couple of times. The failing assertion was "limits should be aborted when exceeding its memory limits": the test tries to verify that an action is aborted when it hits the configured memory limit, but the abort does not happen, so when the test then checks that an error exists in the response, it fails.

This was discussed in a Slack conversation; here is a transcript of what was agreed on as the solution to this issue.