Cue OOM killer #147
It might be the way …. A quick fix might be: don't use …. Is there still a memory ballooning problem if each holos component is rendered as an individual command instead of one catch-all?

The idea is to enumerate all of the instances so they can be run in separate processes with xargs, instead of being run in the for loop in the Builder `Run` method.
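Concretely, the idea is to replace the single catch-all invocation with one short-lived process per component directory, so each process frees its memory when it exits. A rough sketch of the two shapes (the paths and the `instances.txt` list are illustrative; Jeff shows a working xargs pipeline in a later comment):

```bash
# Catch-all: one process evaluates every component and the heap balloons.
holos render --cluster-name=k2 docs/examples/platforms/reference/clusters/foundation/cloud/...

# Per-component: one process per instance directory, run sequentially by xargs.
# instances.txt is a hypothetical file listing one instance directory per line.
xargs -t -P1 -I% holos render --cluster-name=k2 % < instances.txt
```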
I'll leave it here with you, Nate. I think my hypothesis in the above comment is probably the quick band-aid fix. The hypothesis holds. When running the catch-all render with `GODEBUG=gctrace=1`, the final GC trace line reports a heap goal of roughly 9 GiB (the goal field is the runtime's target heap size for that GC cycle):

```
gc 152 9244 MB goal
```
However, we can use the `--print-instances` flag to render each instance in its own process:

```bash
holos render --cluster-name=k2 /home/jeff/workspace/holos-run/holos/docs/examples/platforms/reference/clusters/foundation/cloud/... --print-instances \
  | GODEBUG=gctrace=1 xargs -t -P1 -I% holos render --cluster-name=k2 % 2>&1 | tee foo.txt
```

https://gist.github.com/jeffmccune/bf8f634f7462b1916e7a0d3383e3d354

A quick scan of this gist provides some insight:

- Some components don't take much memory at all.
- `prod-mesh-gateway` takes a lot: a 2463 MB goal at the end.
- `prod-platform-obs` is only a 58 MB goal; maybe it completely bypasses the project structures?
- etc.

Overall, though, it is a quick win to spread the components out: the largest balloon is around 2 GiB instead of 10 GiB.
In Slack, Jeff mentioned that https://github.com/holos-run/holos/blob/v0.70.0/docs/examples/platforms/reference/clusters/foundation/cloud/mesh/mesh.cue#L11 might be a big contributor to memory requirements.
I processed the GC logs while rendering each cluster's individual instances and found the render memory usage per instance per cluster.
Render memory-usage logs were generated with this script:

```bash
#!/bin/bash
: "${HOLOS_REPO:=${HOME}/workspace/holos-run/holos}"
: "${LOG_DIR:=${HOME}/Desktop/render-logs}"
[[ -d $LOG_DIR ]] || mkdir -p "$LOG_DIR"
# Provisioner
for cluster in provisioner; do
for platform in reference holos-saas; do
holos render --print-instances --cluster-name=$cluster "${HOLOS_REPO}/docs/examples/platforms/${platform}/clusters/provisioner/..." \
| GODEBUG=gctrace=1 xargs -P1 -t -L1 time holos render --cluster-name=$cluster 2>&1 \
| tee "${LOG_DIR}/render-log-instances-${platform}-${cluster}.txt"
done
done
# Workload clusters
for cluster in k1 k2; do
for cluster_type in foundation workload; do
holos render --print-instances --cluster-name=$cluster "${HOLOS_REPO}/docs/examples/platforms/reference/clusters/${cluster_type}/..." \
| GODEBUG=gctrace=1 xargs -P1 -t -L1 time holos render --cluster-name=$cluster 2>&1 \
| tee "${LOG_DIR}/render-log-instances-${cluster_type}-${cluster}.txt"
done
done
# core1 and core2
for cluster in core1 core2; do
for cluster_type in accounts foundation workload optional; do
holos render --print-instances --cluster-name=$cluster "${HOLOS_REPO}/docs/examples/platforms/reference/clusters/${cluster_type}/..." \
| GODEBUG=gctrace=1 xargs -P1 -t -L1 time holos render --cluster-name=$cluster 2>&1 \
| tee "${LOG_DIR}/render-log-instances-${cluster_type}-${cluster}.txt"
done
done
# Holos Saas
for cluster in k2; do
for platform in holos-saas; do
holos render --print-instances --cluster-name=$cluster "${HOLOS_REPO}/docs/examples/platforms/${platform}/clusters/workload/..." \
| GODEBUG=gctrace=1 xargs -P1 -t -L1 time holos render --cluster-name=$cluster 2>&1 \
| tee "${LOG_DIR}/render-log-instances-${platform}-${cluster}.txt"
done
done
```

The logs were then summarized with:

```bash
./hack/find-mem-hogs.py ~/Desktop/render-logs/*.txt | sort -rn | column -t
```

`hack/find-mem-hogs.py`:

```python
#!/usr/bin/env python3
import sys


# Example line to get the section name from, which is a combination of the
# cluster name and the file path:
# time holos render --cluster-name=core1 /Users/nate/src/holos-run/holos/docs/examples/platforms/reference/clusters/accounts/iam
def extract_section_name(line):
    # Extract cluster name from line
    cluster_name = line.split()[3].split("=")[1]
    # Extract file path from line
    file_path = line.split()[4]
    # Remove the leading paths up until "platforms/"
    file_path = file_path[file_path.find("platforms/") :]
    return f"{cluster_name} {file_path}"


# Extract goal value from log line. Example line:
# gc 1 @0.005s 2%: 0.029+0.54+0.026 ms clock, 0.35+0/0.87/0.17+0.32 ms cpu, 3->3->1 MB, 4 MB goal, 0 MB stacks, 1 MB globals, 12 P
def extract_goal_value(line):
    # Extract goal value from line
    goal = int(line.split(",")[3].split()[0])
    return goal


largest_goals = {}

# Get the log files to parse as command line arguments. There could be 1 or more log files.
log_files = sys.argv[1:]

# Read log file line by line
for log_file in log_files:
    with open(log_file, "r") as file:
        for line in file:
            if line.startswith("time holos render"):
                section = extract_section_name(line)
                # Update largest goal for section if necessary
                if section not in largest_goals:
                    largest_goals[section] = 0
            if line.startswith("gc "):
                if not largest_goals:
                    # Ignore gc lines seen before the first command echo.
                    continue
                # Section is the most recent key added to largest_goals.
                section = list(largest_goals)[-1]
                goal = extract_goal_value(line)
                # Update largest goal for section if necessary
                if section not in largest_goals or goal > largest_goals[section]:
                    largest_goals[section] = goal

# Print the largest goal for each section
for section, largest_goal in largest_goals.items():
    cluster, path = section.split()
    print(f"{largest_goal} {cluster} {path}")
```
Referenced commit: (#147) Add `holos render --print-instances` flag
**Problem commits**

Using Git bisect, I found the following two commits to be the main causes of the memory issues.
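The bisect session itself isn't shown in this comment. A sketch of how such a bisect can be driven automatically is below; the good/bad revisions, the `hack/check-mem.sh` helper, the main package path, the instance path, and the 4 GiB threshold are all illustrative assumptions, not details taken from this issue.

```bash
git bisect start v0.70.0 v0.60.0   # bad revision first, then a known-good revision (illustrative)
git bisect run ./hack/check-mem.sh
git bisect reset
```

A hypothetical `hack/check-mem.sh` that fails when rendering one known-heavy instance needs more than 4 GiB of resident memory (requires GNU time for `-v`):

```bash
#!/bin/bash
set -euo pipefail
# Skip commits that don't build (exit 125 tells git bisect to skip).
go build -o /tmp/holos ./cmd/holos || exit 125
# Measure peak resident memory of one render; GNU time reports it in kilobytes.
max_rss_kb=$(/usr/bin/time -v /tmp/holos render --cluster-name=k2 \
  docs/examples/platforms/reference/clusters/foundation/cloud/mesh 2>&1 >/dev/null \
  | awk '/Maximum resident set size/ {print $NF}')
echo "max RSS: ${max_rss_kb} kB" >&2
# Exit non-zero (bad commit) when the render exceeded 4 GiB.
(( max_rss_kb < 4 * 1024 * 1024 ))
```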
I wasn't able to make headway on improving memory usage in the Go or Cue code. I tried a few suggestions, like removing unneeded …, and I did update …. I'm going to stop here as this is good enough for now.

While researching this, I found that Cuelang has a lot of open issues about performance and memory leaks, and that this is an active area of development and interest for the Cue developers. It seems like future versions of Cue might end up fixing the memory problems for us, so I recommend we try new alpha and beta versions of Cue as they are released.
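A minimal sketch of trying a Cue prerelease against this repository, assuming Cue is consumed as the `cuelang.org/go` Go module (the version shown is illustrative; substitute whatever alpha or beta is current):

```bash
# Swap in a CUE prerelease, then re-run the tests and a heavy render to compare memory.
go get cuelang.org/go@v0.9.0-alpha.1
go mod tidy
go test ./...
```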
Another interesting possible follow-up to this issue is looking at Unity, Cue's automated performance and regression testing framework.
Problem:
On my 32 GiB workstation with 1G of swap, the following command results in multiple processes consuming over 30% of system memory. The Linux OOM killer kicks in and starts sending kill -9s to random processes.
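As a generic way to confirm it really is the kernel OOM killer doing the killing (not specific to holos), the kernel log can be checked after a kill:

```bash
# Show recent OOM-killer activity with human-readable timestamps.
sudo dmesg -T | grep -i -E 'out of memory|oom-kill|killed process'
```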
Solution:
???
Result:
We have a way to limit memory usage. It's acceptable to run `holos` in parallel, but we need to get usage under 4Gi, otherwise we won't be able to run it inside of pods with reasonable resource limits in place.

Where to start
Note: See `CUE_STATS_FILE`, which is undocumented but may be useful.
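A minimal way to experiment with it, assuming the `cue` CLI honors the variable (it is undocumented, so the behavior and output may change between releases; the package path is illustrative):

```bash
# Write evaluator statistics (e.g. unification and disjunct counts) to a file
# while evaluating the configuration, then inspect them.
CUE_STATS_FILE=/tmp/cue-stats.txt cue eval ./docs/examples/platforms/reference/... >/dev/null
cat /tmp/cue-stats.txt
```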
- `#ProjectHosts` is brutal and enumerates all hosts for a project
- `#EnvHosts` is brutal, but it shouldn't be used much since it was the first stab for `httpbin`
- `project-template.cue`
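Not a fix for the underlying Cue memory use, but for experimenting with the 4Gi target above, the Go runtime's soft memory limit can be set when invoking holos (this assumes a holos binary built with Go 1.19 or newer; GOMEMLIMIT only makes the garbage collector work harder to stay near the limit, it is not a hard cap like a pod memory limit):

```bash
# Observe GC behavior while asking the runtime to keep the heap near 4 GiB.
GOMEMLIMIT=4GiB GODEBUG=gctrace=1 \
  holos render --cluster-name=k2 docs/examples/platforms/reference/clusters/foundation/cloud/... 2>&1 \
  | grep '^gc '
```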