Out of Memory
On embedded devices we regularly have to work with very limited amounts of flash and RAM. This page is meant to help with debugging RAM issues and to track their status.
The motivation of this page is ticket #1243 in particular.
Collect any external resources here that might help others understand and debug OOM issues.
- [Add some links here, explaining userspace vs. kernelspace allocations, VIRT/RES/SHR, pages, slabs, kmalloc(), kmem_cache_alloc(), vmalloc(), /proc/slabinfo, /proc/vmallocinfo, /proc/vmstat, echo 'm' > /proc/sysrq-trigger,...]
- Build Gluon with
- After an OOM crash and the subsequent reboot, retrieve the crash report from /sys/kernel/debug/crashlog
- Try to find a reproducible, isolated setup!
- Observe:
- /proc/slabinfo
- /proc/vmallocinfo
- /proc/vmstat
- echo 'm' > /proc/sysrq-trigger; dmesg
- /sys/kernel/debug/ieee80211/phy0/aqm
- Helpful tools:
- Traffic monitoring: tcpdump, wireshark, etc.
- Traffic generators: mausezahn, iperf, tcpreplay, etc.
- ...
- Profit
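The files listed under "Observe" can be collected in one go. The following is a minimal sketch and not part of Gluon: the function name `snapshot_mem` and the variable `SNAPSHOT_FILES` are hypothetical, and a POSIX shell (such as BusyBox ash) is assumed. The sysrq trigger is left out because its output ends up in the kernel log rather than in a file.

```shell
#!/bin/sh
# Hypothetical helper: copy every readable memory statistics file into a
# timestamped snapshot directory and print the path of that directory.
# The default file list mirrors the "Observe" items above.
SNAPSHOT_FILES="${SNAPSHOT_FILES:-/proc/slabinfo /proc/vmallocinfo /proc/vmstat /sys/kernel/debug/ieee80211/phy0/aqm}"

snapshot_mem() {
    out_dir="$1/$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$out_dir" || return 1
    for f in $SNAPSHOT_FILES; do
        # Skip files that are missing or unreadable (e.g. debugfs not mounted)
        [ -r "$f" ] || continue
        # Flatten the path into a file name: /proc/vmstat -> proc_vmstat
        name=$(echo "${f#/}" | tr '/' '_')
        cat "$f" > "$out_dir/$name" 2>/dev/null || true
    done
    echo "$out_dir"
}
```

For example, `snapshot_mem /tmp/oom` creates a new directory below /tmp/oom and prints its path. Keep in mind that /tmp on a node is RAM-backed, so snapshots stored there add to memory pressure and do not survive a reboot; copying them off the node is advisable.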
Status: Unsolved
Issue: OOM due to allocations in kernelspace.
Related tickets: #1243, #1306, #1197
How to trigger: In networks with a high number of nodes?
Observations so far:
- First observed after the first Gluon releases based on LEDE
- Nothing suspicious in /proc/slabinfo on crash
- This seems to rule out the Linux bridge and batman-adv as potential causes
- Running 'echo fq_memory_limit 200 > /sys/kernel/debug/ieee80211/phy0/aqm' seems to have had a positive effect (not confirmed)
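On nodes with more than one radio, the same write would have to be repeated for every phy. A minimal sketch of that, assuming each radio exposes an aqm file under /sys/kernel/debug/ieee80211/phyN/ in the same way phy0 does; the function name `set_fq_memory_limit` is hypothetical and this is not the script Gluon ships:

```shell
#!/bin/sh
# Hypothetical helper: write the same fq_memory_limit to the aqm file of
# every phy found below BASE.
# Usage: set_fq_memory_limit LIMIT [BASE]
set_fq_memory_limit() {
    limit="$1"
    # BASE defaults to the mac80211 debugfs directory named in this page
    base="${2:-/sys/kernel/debug/ieee80211}"
    for aqm in "$base"/phy*/aqm; do
        # Skip phys without a writable aqm file (or an unmatched glob)
        [ -w "$aqm" ] || continue
        echo "fq_memory_limit $limit" > "$aqm"
    done
}
```

Calling `set_fq_memory_limit 200` would apply the value from the observation above to all radios. The setting is not persistent and has to be reapplied after a reboot or after the wireless device is re-created.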
Tasks:
- Finding a setup to reproduce the issue in an isolated configuration.
NeoRaider's test wishlist:
- Observe reported memory usage in /sys/kernel/debug/ieee80211/phy0/aqm before crash
- Check if OOM is reproducible with all WLAN disabled (both mesh and AP), but active VPN (and possibly wired mesh)
- On a node with WLAN mesh only:
- Check if crash is reproducible with disabled AP
- Test different values for fq_memory_limit, which can be set in /etc/hotplug.d/ieee80211/01-gluon-core-codel-memusage. What is the highest value that fixes crashes reliably?
- Like the previous test (ii.), but with disabled AP
All of these tests should be done on both the master and the next branch of Gluon.
Status: Solved (unreleased)
Issue: The IPv4 and IPv6 fragmentation buffers may hold up to 8 MB of packets in total (4 MB per address family)
Related tickets: -
How to trigger: An OOM was easily triggered by running iperf3 on a node with packets that had to be fragmented ($ iperf3 -l 1500). However, it can probably also be triggered by external traffic alone, without any extra tools on the node.
Mitigated in latest master and v2017.1.x. Additional firewall rules might be considered, too.
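To check how much memory the fragmentation buffers may use on a given node, the kernel's limits can be read from /proc/sys/net. A minimal sketch, assuming the standard Linux sysctls `net.ipv4.ipfrag_high_thresh` and `net.ipv6.ip6frag_high_thresh` are the relevant limits; the function name `frag_buffer_total` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print the sum of the IPv4 and IPv6 fragment buffer
# limits in bytes. BASE defaults to /proc/sys/net; a missing file (e.g.
# IPv6 not available) counts as 0.
# Usage: frag_buffer_total [BASE]
frag_buffer_total() {
    base="${1:-/proc/sys/net}"
    v4=$(cat "$base/ipv4/ipfrag_high_thresh" 2>/dev/null || echo 0)
    v6=$(cat "$base/ipv6/ip6frag_high_thresh" 2>/dev/null || echo 0)
    echo $((v4 + v6))
}
```

With 4 MB (4194304 bytes) per address family, `frag_buffer_total` prints 8388608, matching the 8 MB stated in the issue description. A smaller result would indicate that lowered limits are in place.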
Status: Solved
Issue: In setups involving ~2500 client devices, nodes crashed frequently. The cause was alfred and respondd accessing the global batman-adv translation table via debugfs, which triggered high-order memory allocations due to the large table size.
Related tickets: #753
How to trigger: Spawn >2500 client devices, then 'cat /sys/kernel/debug/batman_adv/bat0/transtable_global'
The issue was fixed by implementing a netlink-based interface in batman-adv, which alfred and respondd now use to access the global batman-adv translation table.