Add mon_group support for resctrl. #2793

Merged 53 commits on Sep 21, 2021.
Showing changes from 14 commits.

Commits
0f93f1a
Add mon_group support for resctrl.
Jan 28, 2021
3452cf3
Do not try to setup the root container.
Mar 15, 2021
18d297d
Update klog version to avoid errors from golangci-lint.
Mar 15, 2021
3feb264
Update klog version in cmd to avoid errors from golangci-lint.
Mar 15, 2021
7aaaff7
Fix go.sum
Mar 15, 2021
58a2e01
Check if container moved between control groups only if its running.
Mar 15, 2021
5561bd9
Get NUMA nodes from MachineInfo.Topology.
Mar 16, 2021
9a1e6cb
Make code thread safe again.
Mar 16, 2021
873cc47
Fix typo.
Mar 16, 2021
d07084c
Refactor resctrl collector setup.
Mar 17, 2021
801e883
Refactor resctrl utilies.
Mar 17, 2021
75fabd7
Better name vars.
Mar 17, 2021
a1619a7
Add missing python3 in Dockerfile.
Mar 18, 2021
0f274e4
Add missing procps in Dockerfile.
Mar 18, 2021
73c71b6
Merge branch 'master' of github.com:google/cadvisor into creatone/res…
May 10, 2021
03f8571
Use const instead of magic value.
May 10, 2021
29e1e19
Delete an unnecessary setting of c.running to false.
May 10, 2021
fc3b8ce
Do not wrap the error from cAdvisor.
May 10, 2021
ecb0156
Use path in error message.
May 10, 2021
e1f5e9b
Avoid goroutine looping.
May 11, 2021
3951953
Do not use fscommon package from runc/libcontainer.
May 11, 2021
7d36305
Fix const ASCII names.
May 11, 2021
ee42de2
Use same operator in func.
May 12, 2021
7c1eeb2
Introduce const variables.
May 12, 2021
a1ce724
Merge branch 'master' of github.com:google/cadvisor into creatone/res…
May 12, 2021
3ffe422
Introduce vendor_id in MachineInfo.
Jun 18, 2021
ab3ee30
Extend files which should be omitted when searching control group.
Jun 18, 2021
a986999
Add info about possible bug when reading resctrl values on AMD.
Jun 18, 2021
88b6b7c
Use empty struct map instead of boolean.
Aug 16, 2021
8bf947a
Move reading file logic.
Aug 16, 2021
c54e007
Use scanner to read tasks file.
Aug 16, 2021
dbb54d2
Change the way of searching for the control group.
Aug 16, 2021
731606d
Add comments. Use const value.
Aug 16, 2021
15be02e
Comment function.
Aug 16, 2021
d24ece6
Fix typo.
Aug 18, 2021
c9da6fb
Refactor getAllProcessThreads.
Aug 18, 2021
8162197
Refactor GetVendorID.
Aug 18, 2021
9c675e4
Rename VendorID.
Aug 18, 2021
a76478c
Resctrl collector should be aware of existing mon groups.
Aug 19, 2021
852f755
Optimization for finding control/monitoring group.
Aug 20, 2021
97987d8
Avoid having ugly errors.
Aug 20, 2021
c675d56
Merge branch 'master' of github.com:google/cadvisor into creatone/res…
Aug 24, 2021
e7628e7
Use strings.HasPrefix().
Aug 25, 2021
1db6ca2
Add comments.
Aug 25, 2021
fa9b5db
Rename variables.
Aug 25, 2021
b3f311b
Fix test.
Aug 25, 2021
bbde636
Use string map instead of int.
Aug 25, 2021
ea39458
Now there is no need to use procps in Dockerfile.
Sep 9, 2021
0ec4da2
Merge branch 'master' of github.com:google/cadvisor into creatone/res…
Sep 9, 2021
b76ee15
Update to go 1.17.
Sep 9, 2021
8c734a0
Add information about possible race condition.
Sep 9, 2021
fb4b9c8
Add warning when docker_only is not set.
Sep 10, 2021
00b42cf
Fix typo.
Sep 10, 2021
4 changes: 3 additions & 1 deletion cmd/cadvisor.go
@@ -77,6 +77,8 @@ var rawCgroupPrefixWhiteList = flag.String("raw_cgroup_prefix_whitelist", "", "A

var perfEvents = flag.String("perf_events_config", "", "Path to a JSON file containing configuration of perf events to measure. Empty value disabled perf events measuring.")

var resctrlInterval = flag.Duration("resctrl_interval", 0, "Resctrl mon groups updating interval. Zero value disables updating mon groups.")
Collaborator

What do you mean by "updating mon groups"? Are you going to add and/or remove pids from them?

Collaborator

Why do you think that --housekeeping_interval is not enough and that we need another flag? We can still exclude resctrl metrics using disable_metrics.

Collaborator Author

I mean watching mon groups and making sure that they have the correct pids.
Thanks to that flag we can monitor containers that changed their control group during the cAdvisor run.

Setting --resctrl_interval to 0 disables updating mon groups.
This is good behavior for users who want to use only the monitoring features of resctrl.

We cannot depend on --housekeeping_interval because cAdvisor will be updating containers instantly:

...
I0315 12:54:47.268474   22054 container.go:593] [/some_cgroup] Housekeeping took 331.572µs
I0315 12:54:47.268948   22054 container.go:593] [/some_cgroup] Housekeeping took 429.79µs
I0315 12:54:47.269348   22054 container.go:593] [/some_cgroup] Housekeeping took 356.188µs
I0315 12:54:47.269363   22054 container.go:593] [/machine.slice/libpod-conmon-e361822894402b35f7230ccd133725e8d52e97c8eddd3ba77cd0dc3fc9ee46ad.scope] Housekeeping took 1.506114ms
I0315 12:54:47.269385   22054 container.go:593] [/system.slice/thermos.service] Housekeeping took 1.48471ms
I0315 12:54:47.269919   22054 container.go:593] [/some_cgroup] Housekeeping took 528.066µs
I0315 12:54:47.270333   22054 container.go:593] [/some_cgroup] Housekeeping took 371.664µs
I0315 12:54:47.270611   22054 container.go:593] [/some_cgroup] Housekeeping took 220.11µs
I0315 12:54:47.270855   22054 container.go:593] [/some_cgroup] Housekeeping took 188.26µs
I0315 12:54:47.271074   22054 container.go:593] [/machine.slice/libpod-conmon-df33bd5e05ec42afd4c39cc0a1417c227553de2bd7fc510de00fb7341d59d263.scope] Housekeeping took 1.682142ms
I0315 12:54:47.271163   22054 container.go:593] [/some_cgroup] Housekeeping took 250.002µs
I0315 12:54:47.271407   22054 container.go:593] [/some_cgroup] Housekeeping took 185.349µs
...

Could you elaborate more on this:

Setting --resctrl_interval to 0 disables updating mon groups.
This is good behavior for users who want to use only the monitoring features of resctrl.

I'm afraid that it won't be enough when other components (cri-rm, runc, or someone working by hand) manage control/mon groups. At least this comment suggests that "enabling rdt" + resctrl_interval=0 is enough, which is not true.

We need to consider whether the --resctrl_interval flag is necessary: it adds another command-line option that is hard to understand, but it gives a chance to limit overhead when there is no other manager of the resctrl hierarchy. I'm not yet fully convinced it is worth it. Pawel, can you provide a rough estimate of the overhead, e.g. the number of syscalls (listdir, readfile, cpuusage) in a typical setup (e.g. 20 pods + system cgroups)?

Collaborator Author

If something changes the control group, cAdvisor will continue measuring in the new group.
cAdvisor manipulates only mon groups. It is now aware of existing groups, and the groups it prepares itself have a "cadvisor" prefix,
e.g. cadvisor-system.slice-docker-0f07bb8ed38c0f40c9424875f2cf417d19b27ca93b631640a2039b3478ee5858.scope
So it will manipulate only the groups with that prefix.
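
The prefix rule described in this reply can be sketched roughly as follows. This is only an illustration inferred from the example group name quoted above: `monGroupPrefix`, `monGroupName` and `ownedByCadvisor` are hypothetical names, and the real helper in the PR may build and check names differently.

```go
package main

import (
	"fmt"
	"strings"
)

// monGroupPrefix is taken from the example name quoted in the comment above;
// the exact constant used by cAdvisor is an assumption here.
const monGroupPrefix = "cadvisor"

// monGroupName is a hypothetical helper: it flattens a container name such as
// "/system.slice/docker-abc.scope" into a single directory name.
func monGroupName(containerName string) string {
	return monGroupPrefix + strings.ReplaceAll(containerName, "/", "-")
}

// ownedByCadvisor reports whether a mon group directory name carries the
// prefix, i.e. whether this sketch would treat it as safe to modify or remove.
func ownedByCadvisor(dirName string) bool {
	return strings.HasPrefix(dirName, monGroupPrefix)
}

func main() {
	name := monGroupName("/system.slice/docker-abc.scope")
	fmt.Println(name, ownedByCadvisor(name), ownedByCadvisor("some-other-group"))
}
```

Checking the prefix before any write or removal is what would keep groups created by other tools (or by hand) untouched.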


var (
// Metrics to be ignored.
// Tcp metrics are ignored by default.
@@ -173,7 +175,7 @@ func main() {

collectorHttpClient := createCollectorHttpClient(*collectorCert, *collectorKey)

resourceManager, err := manager.New(memoryStorage, sysFs, housekeepingConfig, includedMetrics, &collectorHttpClient, strings.Split(*rawCgroupPrefixWhiteList, ","), *perfEvents)
resourceManager, err := manager.New(memoryStorage, sysFs, housekeepingConfig, includedMetrics, &collectorHttpClient, strings.Split(*rawCgroupPrefixWhiteList, ","), *perfEvents, *resctrlInterval)
if err != nil {
klog.Fatalf("Failed to create a manager: %s", err)
}
2 changes: 1 addition & 1 deletion cmd/go.mod
@@ -27,6 +27,6 @@ require (
golang.org/x/oauth2 v0.0.0-20200902213428-5d25da1a8d43
google.golang.org/api v0.34.0
gopkg.in/olivere/elastic.v2 v2.0.12
k8s.io/klog/v2 v2.2.0
k8s.io/klog/v2 v2.8.0
k8s.io/utils v0.0.0-20201110183641-67b214c5f920
)
8 changes: 4 additions & 4 deletions cmd/go.sum
@@ -156,8 +156,8 @@ github.com/go-logfmt/logfmt v0.3.0/go.mod h1:Qt1PoO58o5twSAckw1HlFXLmHsOX5/0LbT9
github.com/go-logfmt/logfmt v0.4.0/go.mod h1:3RMwSq7FuexP4Kalkev3ejPJsZTpXXBr9+V4qmtdjCk=
github.com/go-logfmt/logfmt v0.5.0/go.mod h1:wCYkCAKZfumFQihp8CzCvQ3paCTfi41vtzG1KdI/P7A=
github.com/go-logr/logr v0.1.0/go.mod h1:ixOQHD9gLJUVQQ2ZOR7zLEifBX6tGkNJF4QyIY7sIas=
github.com/go-logr/logr v0.2.0 h1:QvGt2nLcHH0WK9orKa+ppBPAxREcH364nPUedEpK0TY=
github.com/go-logr/logr v0.2.0/go.mod h1:z6/tIYblkpsD+a4lm/fGIIU9mZ+XfAiaFtq7xTgseGU=
github.com/go-logr/logr v0.4.0 h1:K7/B1jt6fIBQVd4Owv2MqGQClcgf0R266+7C/QjRcLc=
github.com/go-logr/logr v0.4.0/go.mod h1:z6/tIYblkpsD+a4lm/fGIIU9mZ+XfAiaFtq7xTgseGU=
github.com/go-sql-driver/mysql v1.4.0/go.mod h1:zAC/RDZ24gD3HViQzih4MyKcchzm+sOG5ZlKdlhCg5w=
github.com/go-stack/stack v1.8.0/go.mod h1:v0f6uXyyMGvRgIKkXu+yp6POWl0qKG85gN/melR3HDY=
github.com/godbus/dbus/v5 v5.0.3 h1:ZqHaoEF7TBzh4jzPmqVhE/5A1z9of6orkAe5uHoAeME=
@@ -822,8 +822,8 @@ honnef.co/go/tools v0.0.1-2019.2.3/go.mod h1:a3bituU0lyd329TUQxRnasdCoJDkEUEAqEt
honnef.co/go/tools v0.0.1-2020.1.3/go.mod h1:X/FiERA/W4tHapMX5mGpAtMSVEeEUOyHaw9vFzvIQ3k=
honnef.co/go/tools v0.0.1-2020.1.4/go.mod h1:X/FiERA/W4tHapMX5mGpAtMSVEeEUOyHaw9vFzvIQ3k=
k8s.io/klog/v2 v2.0.0/go.mod h1:PBfzABfn139FHAV07az/IF9Wp1bkk3vpT2XSJ76fSDE=
k8s.io/klog/v2 v2.2.0 h1:XRvcwJozkgZ1UQJmfMGpvRthQHOvihEhYtDfAaxMz/A=
k8s.io/klog/v2 v2.2.0/go.mod h1:Od+F08eJP+W3HUb4pSrPpgp9DGU4GzlpG/TmITuYh/Y=
k8s.io/klog/v2 v2.8.0 h1:Q3gmuM9hKEjefWFFYF0Mat+YyFJvsUyYuwyNNJ5C9Ts=
k8s.io/klog/v2 v2.8.0/go.mod h1:hy9LJ/NvuK+iVyP4Ehqva4HxZG/oXyIS3n3Jmire4Ec=
k8s.io/utils v0.0.0-20201110183641-67b214c5f920 h1:CbnUZsM497iRC5QMVkHwyl8s2tB3g7yaSHkYPkpgelw=
k8s.io/utils v0.0.0-20201110183641-67b214c5f920/go.mod h1:jPW/WVKK9YHAvNhRxK0md/EJ228hCsBRufyofKtW8HA=
rsc.io/binaryregexp v0.2.0/go.mod h1:qTv7/COck+e2FymRvadv62gMdZztPaShugOCi3I+8D8=
4 changes: 2 additions & 2 deletions deploy/Dockerfile
@@ -1,6 +1,6 @@
FROM alpine:3.12 AS build

RUN apk --no-cache add libc6-compat device-mapper findutils zfs build-base linux-headers go bash git wget cmake pkgconfig ndctl-dev && \
RUN apk --no-cache add libc6-compat device-mapper findutils zfs build-base linux-headers go python3 bash git wget cmake pkgconfig ndctl-dev && \
apk --no-cache add thin-provisioning-tools --repository http://dl-3.alpinelinux.org/alpine/edge/main/ && \
echo 'hosts: files mdns4_minimal [NOTFOUND=return] dns mdns4' >> /etc/nsswitch.conf && \
rm -rf /var/cache/apk/*
@@ -34,7 +34,7 @@ RUN ./build/build.sh
FROM alpine:3.12
MAINTAINER dengnan@google.com vmarmol@google.com vishnuk@google.com jimmidyson@gmail.com stclair@google.com

RUN apk --no-cache add libc6-compat device-mapper findutils zfs ndctl && \
RUN apk --no-cache add libc6-compat device-mapper findutils zfs ndctl procps && \
apk --no-cache add thin-provisioning-tools --repository http://dl-3.alpinelinux.org/alpine/edge/main/ && \
echo 'hosts: files mdns4_minimal [NOTFOUND=return] dns mdns4' >> /etc/nsswitch.conf && \
rm -rf /var/cache/apk/*
5 changes: 5 additions & 0 deletions docs/runtime_options.md
@@ -419,6 +419,11 @@ should be a human readable string that will become a metric name.
* `cas_count_read` will be measured as uncore non-grouped event on all Integrated Memory Controllers Performance Monitoring Units because of unset `type` field and
`uncore_imc` prefix.

## Resctrl

```
--resctrl_interval=0: Resctrl mon groups updating interval. Zero value disables updating mon groups.
```

## Storage driver specific instructions:

2 changes: 1 addition & 1 deletion go.mod
@@ -46,6 +46,6 @@ require (
google.golang.org/grpc v1.27.1
google.golang.org/protobuf v1.25.0 // indirect
gotest.tools/v3 v3.0.3 // indirect
k8s.io/klog/v2 v2.2.0
k8s.io/klog/v2 v2.8.0
k8s.io/utils v0.0.0-20201110183641-67b214c5f920
)
8 changes: 4 additions & 4 deletions go.sum
@@ -89,8 +89,8 @@ github.com/go-kit/kit v0.9.0/go.mod h1:xBxKIO96dXMWWy0MnWVtmwkA9/13aqxPnvrjFYMA2
github.com/go-logfmt/logfmt v0.3.0/go.mod h1:Qt1PoO58o5twSAckw1HlFXLmHsOX5/0LbT9GBnD5lWE=
github.com/go-logfmt/logfmt v0.4.0/go.mod h1:3RMwSq7FuexP4Kalkev3ejPJsZTpXXBr9+V4qmtdjCk=
github.com/go-logr/logr v0.1.0/go.mod h1:ixOQHD9gLJUVQQ2ZOR7zLEifBX6tGkNJF4QyIY7sIas=
github.com/go-logr/logr v0.2.0 h1:QvGt2nLcHH0WK9orKa+ppBPAxREcH364nPUedEpK0TY=
github.com/go-logr/logr v0.2.0/go.mod h1:z6/tIYblkpsD+a4lm/fGIIU9mZ+XfAiaFtq7xTgseGU=
github.com/go-logr/logr v0.4.0 h1:K7/B1jt6fIBQVd4Owv2MqGQClcgf0R266+7C/QjRcLc=
github.com/go-logr/logr v0.4.0/go.mod h1:z6/tIYblkpsD+a4lm/fGIIU9mZ+XfAiaFtq7xTgseGU=
github.com/go-stack/stack v1.8.0/go.mod h1:v0f6uXyyMGvRgIKkXu+yp6POWl0qKG85gN/melR3HDY=
github.com/godbus/dbus/v5 v5.0.3 h1:ZqHaoEF7TBzh4jzPmqVhE/5A1z9of6orkAe5uHoAeME=
github.com/godbus/dbus/v5 v5.0.3/go.mod h1:xhWf0FNVPg57R7Z0UbKHbJfkEywrmjJnf7w5xrFpKfA=
@@ -489,8 +489,8 @@ honnef.co/go/tools v0.0.0-20190523083050-ea95bdfd59fc/go.mod h1:rf3lG4BRIbNafJWh
honnef.co/go/tools v0.0.1-2019.2.3/go.mod h1:a3bituU0lyd329TUQxRnasdCoJDkEUEAqEt0JzvZhAg=
honnef.co/go/tools v0.0.1-2020.1.3/go.mod h1:X/FiERA/W4tHapMX5mGpAtMSVEeEUOyHaw9vFzvIQ3k=
k8s.io/klog/v2 v2.0.0/go.mod h1:PBfzABfn139FHAV07az/IF9Wp1bkk3vpT2XSJ76fSDE=
k8s.io/klog/v2 v2.2.0 h1:XRvcwJozkgZ1UQJmfMGpvRthQHOvihEhYtDfAaxMz/A=
k8s.io/klog/v2 v2.2.0/go.mod h1:Od+F08eJP+W3HUb4pSrPpgp9DGU4GzlpG/TmITuYh/Y=
k8s.io/klog/v2 v2.8.0 h1:Q3gmuM9hKEjefWFFYF0Mat+YyFJvsUyYuwyNNJ5C9Ts=
k8s.io/klog/v2 v2.8.0/go.mod h1:hy9LJ/NvuK+iVyP4Ehqva4HxZG/oXyIS3n3Jmire4Ec=
k8s.io/utils v0.0.0-20201110183641-67b214c5f920 h1:CbnUZsM497iRC5QMVkHwyl8s2tB3g7yaSHkYPkpgelw=
k8s.io/utils v0.0.0-20201110183641-67b214c5f920/go.mod h1:jPW/WVKK9YHAvNhRxK0md/EJ228hCsBRufyofKtW8HA=
rsc.io/binaryregexp v0.2.0/go.mod h1:qTv7/COck+e2FymRvadv62gMdZztPaShugOCi3I+8D8=
3 changes: 2 additions & 1 deletion manager/container.go
@@ -130,6 +130,7 @@ func (cd *containerData) Stop() error {
}
close(cd.stop)
cd.perfCollector.Destroy()
cd.resctrlCollector.Destroy()
return nil
}

@@ -727,7 +728,7 @@ func (cd *containerData) updateStats() error {
return perfStatsErr
}
if resctrlStatsErr != nil {
klog.Errorf("error occurred while collecting resctrl stats for container %s: %s", cInfo.Name, err)
klog.Errorf("error occurred while collecting resctrl stats for container %s: %s", cInfo.Name, resctrlStatsErr)
return resctrlStatsErr
}
return customStatsErr
24 changes: 10 additions & 14 deletions manager/manager.go
@@ -49,7 +49,6 @@ import (

"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fs2"
"github.com/opencontainers/runc/libcontainer/intelrdt"

"k8s.io/klog/v2"
"k8s.io/utils/clock"
@@ -147,7 +146,7 @@ type HouskeepingConfig = struct {
}

// New takes a memory storage and returns a new manager.
func New(memoryCache *memory.InMemoryCache, sysfs sysfs.SysFs, houskeepingConfig HouskeepingConfig, includedMetricsSet container.MetricSet, collectorHTTPClient *http.Client, rawContainerCgroupPathPrefixWhiteList []string, perfEventsFile string) (Manager, error) {
func New(memoryCache *memory.InMemoryCache, sysfs sysfs.SysFs, houskeepingConfig HouskeepingConfig, includedMetricsSet container.MetricSet, collectorHTTPClient *http.Client, rawContainerCgroupPathPrefixWhiteList []string, perfEventsFile string, resctrlInterval time.Duration) (Manager, error) {
if memoryCache == nil {
return nil, fmt.Errorf("manager requires memory storage")
}
@@ -218,7 +217,7 @@ func New(memoryCache *memory.InMemoryCache, sysfs sysfs.SysFs, houskeepingConfig
return nil, err
}

newManager.resctrlManager, err = resctrl.NewManager(selfContainer)
newManager.resctrlManager, err = resctrl.NewManager(resctrlInterval, resctrl.Setup)
if err != nil {
klog.V(4).Infof("Cannot gather resctrl metrics: %v", err)
}
@@ -263,7 +262,7 @@ type manager struct {
collectorHTTPClient *http.Client
nvidiaManager stats.Manager
perfManager stats.Manager
resctrlManager stats.Manager
resctrlManager resctrl.Manager
// List of raw container cgroup path prefix whitelist.
rawContainerCgroupPathPrefixWhiteList []string
}
@@ -328,7 +327,7 @@ func (m *manager) Start() error {

func (m *manager) Stop() error {
defer m.nvidiaManager.Destroy()
defer m.destroyPerfCollectors()
defer m.destroyCollectors()
// Stop and wait on all quit channels.
for i, c := range m.quitChannels {
// Send the exit signal and wait on the thread to exit (by closing the channel).
@@ -346,9 +345,10 @@ func (m *manager) Stop() error {
return nil
}

func (m *manager) destroyPerfCollectors() {
func (m *manager) destroyCollectors() {
for _, container := range m.containers {
container.perfCollector.Destroy()
container.resctrlCollector.Destroy()
}
}

@@ -957,14 +957,11 @@ func (m *manager) createContainerLocked(containerName string, watchSource watche
}

if m.includedMetrics.Has(container.ResctrlMetrics) {
resctrlPath, err := intelrdt.GetIntelRdtPath(containerName)
cont.resctrlCollector, err = m.resctrlManager.GetCollector(containerName, func() ([]string, error) {
return cont.getContainerPids(true)
}, len(m.machineInfo.Topology))
if err != nil {
klog.V(4).Infof("Error getting resctrl path: %q", err)
} else {
cont.resctrlCollector, err = m.resctrlManager.GetCollector(resctrlPath)
if err != nil {
klog.V(4).Infof("resctrl metrics will not be available for container %s: %s", cont.info.Name, err)
}
klog.V(4).Infof("resctrl metrics will not be available for container %s: %s", cont.info.Name, err)
}
}

@@ -1006,7 +1003,6 @@ func (m *manager) createContainerLocked(containerName string, watchSource watche
if err != nil {
return err
}

// Start the container's housekeeping.
return cont.Start()
}
153 changes: 119 additions & 34 deletions resctrl/collector.go
@@ -1,6 +1,6 @@
// +build linux

// Copyright 2020 Google Inc. All Rights Reserved.
// Copyright 2021 Google Inc. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
@@ -18,57 +18,142 @@
package resctrl

import (
info "github.com/google/cadvisor/info/v1"
"github.com/google/cadvisor/stats"
"fmt"
"os"
"sync"
"time"

"k8s.io/klog/v2"

"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runc/libcontainer/intelrdt"
info "github.com/google/cadvisor/info/v1"
)

type collector struct {
resctrl intelrdt.Manager
stats.NoopDestroy
id string
interval time.Duration
getContainerPids func() ([]string, error)
resctrlPath string
running bool
numberOfNUMANodes int
mu sync.Mutex
}

func newCollector(id string, resctrlPath string) *collector {
collector := &collector{
resctrl: intelrdt.NewManager(
&configs.Config{
IntelRdt: &configs.IntelRdt{},
},
id,
resctrlPath,
),
func newCollector(id string, getContainerPids func() ([]string, error), interval time.Duration, numberOfNUMANodes int) *collector {
return &collector{id: id, interval: interval, getContainerPids: getContainerPids, numberOfNUMANodes: numberOfNUMANodes,
mu: sync.Mutex{}}
}

func (c *collector) setup() error {
var err error
c.resctrlPath, err = prepareMonitoringGroup(c.id, c.getContainerPids)

if c.interval != 0 {
if err != nil {
c.running = false
klog.Errorf("Failed to setup container %q resctrl collector: %w \n Trying again in next intervals.", c.id, err)
} else {
c.running = true
}
go func() {
for {
time.Sleep(c.interval)
c.mu.Lock()
klog.V(5).Infof("Trying to check %q containers control group.", c.id)
if c.running {
err = c.checkMonitoringGroup()
if err != nil {
c.running = false
klog.Errorf("Failed to check %q resctrl collector control group: %w \n Trying again in next intervals.", c.id, err)
}
} else {
c.resctrlPath, err = prepareMonitoringGroup(c.id, c.getContainerPids)
if err != nil {
c.running = false
klog.Errorf("Failed to setup container %q resctrl collector: %w \n Trying again in next intervals.", c.id, err)
} else {
c.running = true
}
}
c.mu.Unlock()
}
}()
} else {
// There is no interval set, if setup fail, stop.
if err != nil {
return fmt.Errorf("failed to setup container %q resctrl collector: %w", c.id, err)
}
c.running = true
}

return collector
return nil
}

func (c *collector) UpdateStats(stats *info.ContainerStats) error {
stats.Resctrl = info.ResctrlStats{}

resctrlStats, err := c.resctrl.GetStats()
func (c *collector) checkMonitoringGroup() error {
newPath, err := prepareMonitoringGroup(c.id, c.getContainerPids)
if err != nil {
return err
return fmt.Errorf("couldn't obtain mon_group path: %v", err)
}

numberOfNUMANodes := len(*resctrlStats.MBMStats)
// Check if container moved between control groups.
if newPath != c.resctrlPath {
err = c.clear()
if err != nil {
return fmt.Errorf("couldn't clear previous monitoring group: %w", err)
}
c.resctrlPath = newPath
}

stats.Resctrl.MemoryBandwidth = make([]info.MemoryBandwidthStats, 0, numberOfNUMANodes)
stats.Resctrl.Cache = make([]info.CacheStats, 0, numberOfNUMANodes)
return nil
}

for _, numaNodeStats := range *resctrlStats.MBMStats {
stats.Resctrl.MemoryBandwidth = append(stats.Resctrl.MemoryBandwidth,
info.MemoryBandwidthStats{
TotalBytes: numaNodeStats.MBMTotalBytes,
LocalBytes: numaNodeStats.MBMLocalBytes,
})
func (c *collector) UpdateStats(stats *info.ContainerStats) error {
c.mu.Lock()
defer c.mu.Unlock()
if c.running {
stats.Resctrl = info.ResctrlStats{}

resctrlStats, err := getIntelRDTStatsFrom(c.resctrlPath)
if err != nil {
return err
}

stats.Resctrl.MemoryBandwidth = make([]info.MemoryBandwidthStats, 0, c.numberOfNUMANodes)
stats.Resctrl.Cache = make([]info.CacheStats, 0, c.numberOfNUMANodes)

for _, numaNodeStats := range *resctrlStats.MBMStats {
stats.Resctrl.MemoryBandwidth = append(stats.Resctrl.MemoryBandwidth,
info.MemoryBandwidthStats{
TotalBytes: numaNodeStats.MBMTotalBytes,
LocalBytes: numaNodeStats.MBMLocalBytes,
})
}

for _, numaNodeStats := range *resctrlStats.CMTStats {
stats.Resctrl.Cache = append(stats.Resctrl.Cache,
info.CacheStats{LLCOccupancy: numaNodeStats.LLCOccupancy})
}
}

for _, numaNodeStats := range *resctrlStats.CMTStats {
stats.Resctrl.Cache = append(stats.Resctrl.Cache,
info.CacheStats{LLCOccupancy: numaNodeStats.LLCOccupancy})
return nil
}

func (c *collector) Destroy() {
c.mu.Lock()
defer c.mu.Unlock()
c.running = false
err := c.clear()
if err != nil {
klog.Errorf("trying to destroy %q resctrl collector but: %v", c.id, err)
}
}

func (c *collector) clear() error {
// Not allowed to remove root or undefined resctrl directory.
if c.id != rootContainer && c.resctrlPath != "" {
err := os.RemoveAll(c.resctrlPath)
if err != nil {
return fmt.Errorf("couldn't clear mon_group: %v", err)
}
}
return nil
}