metrics: collect the metrics of nydusd events #263

Merged
merged 1 commit into from
Nov 29, 2022

Member

@sctb512 sctb512 commented Nov 25, 2022

Collect the metrics of nydus daemon events, including INIT, RUNNING and DIED.

Result:

curl -s --unix-socket /var/lib/containerd-nydus/api.sock http://unix/metrics | grep "lifetime_event"

# HELP nydusd_lifetime_event_times nydusd lifetime event times.
# TYPE nydusd_lifetime_event_times counter
nydusd_lifetime_event_times{daemon_id="cdvh2ds58hm9ebu61990",event="DIED",time="2022-11-25 17:55:45.051"} 1
nydusd_lifetime_event_times{daemon_id="cdvh2ds58hm9ebu61990",event="INIT",time="2022-11-25 18:05:28.848"} 1
nydusd_lifetime_event_times{daemon_id="cdvh2ds58hm9ebu61990",event="READY",time="2022-11-25 18:05:28.849"} 1
nydusd_lifetime_event_times{daemon_id="cdvh2ds58hm9ebu61990",event="RUNNING",time="2022-11-25 17:55:11.463"} 1
nydusd_lifetime_event_times{daemon_id="cdvh2ds58hm9ebu61990",event="RUNNING",time="2022-11-25 17:55:26.676"} 1
nydusd_lifetime_event_times{daemon_id="cdvh2ds58hm9ebu61990",event="RUNNING",time="2022-11-25 17:55:26.704"} 1
nydusd_lifetime_event_times{daemon_id="ce0920c58hm43as257lg",event="DIED",time="2022-11-25 17:55:48.402"} 1
nydusd_lifetime_event_times{daemon_id="ce0920c58hm43as257lg",event="RUNNING",time="2022-11-25 17:55:14.121"} 1
nydusd_lifetime_event_times{daemon_id="ce0920c58hm43as257lg",event="RUNNING",time="2022-11-25 17:55:14.322"} 1
nydusd_lifetime_event_times{daemon_id="ce0921s58hm43as257m0",event="DIED",time="2022-11-25 17:55:51.832"} 1
nydusd_lifetime_event_times{daemon_id="ce0921s58hm43as257m0",event="RUNNING",time="2022-11-25 17:55:20.141"} 1
nydusd_lifetime_event_times{daemon_id="ce0921s58hm43as257m0",event="RUNNING",time="2022-11-25 17:55:20.159"} 1

Signed-off-by: Bin Tang <tangbin.bin@bytedance.com>
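
For readers who want to consume this output programmatically, here is a minimal sketch that tallies the samples per event type. It assumes the exposition text looks exactly like the result above; the `countEvents` helper and the `sampleLine` pattern are illustrative names and are not part of this PR.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sampleLine matches one sample line of the metric shown in the PR description:
// nydusd_lifetime_event_times{daemon_id="...",event="...",time="..."} <value>
var sampleLine = regexp.MustCompile(
	`^nydusd_lifetime_event_times\{daemon_id="([^"]*)",event="([^"]*)",time="([^"]*)"\} (\d+)$`)

// countEvents returns how many samples each event type has in the given text.
func countEvents(text string) map[string]int {
	counts := make(map[string]int)
	for _, line := range strings.Split(text, "\n") {
		m := sampleLine.FindStringSubmatch(strings.TrimSpace(line))
		if m == nil {
			continue // skip "# HELP", "# TYPE", and unrelated lines
		}
		counts[m[2]]++ // m[2] is the event label
	}
	return counts
}

// sample is copied from the output in the PR description (one daemon only).
const sample = `# HELP nydusd_lifetime_event_times nydusd lifetime event times.
# TYPE nydusd_lifetime_event_times counter
nydusd_lifetime_event_times{daemon_id="ce0920c58hm43as257lg",event="DIED",time="2022-11-25 17:55:48.402"} 1
nydusd_lifetime_event_times{daemon_id="ce0920c58hm43as257lg",event="RUNNING",time="2022-11-25 17:55:14.121"} 1
nydusd_lifetime_event_times{daemon_id="ce0920c58hm43as257lg",event="RUNNING",time="2022-11-25 17:55:14.322"} 1
`

func main() {
	counts := countEvents(sample)
	fmt.Println("RUNNING:", counts["RUNNING"], "DIED:", counts["DIED"])
}
```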

@sctb512 sctb512 force-pushed the nydusd-event-metric branch 2 times, most recently from 431824b to b2fc494 Compare November 25, 2022 09:42
@sctb512 sctb512 changed the title metrics: collect the nydusd events metrics: collect the metrics of nydusd events Nov 25, 2022
@codecov-commenter

Codecov Report

Base: 33.30% // Head: 33.61% // Increases project coverage by +0.31% 🎉

Coverage data is based on head (b2fc494) compared to base (84d9ba3).
Patch coverage: 33.33% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #263      +/-   ##
==========================================
+ Coverage   33.30%   33.61%   +0.31%     
==========================================
  Files          29       29              
  Lines        3099     3106       +7     
==========================================
+ Hits         1032     1044      +12     
+ Misses       1958     1953       -5     
  Partials      109      109              
Impacted Files                   Coverage Δ
pkg/manager/daemon_adaptor.go    0.00% <0.00%> (ø)
pkg/manager/manager.go           20.39% <0.00%> (-0.16%) ⬇️
pkg/metrics/ttl/gauge.go         100.00% <ø> (ø)
pkg/manager/monitor.go           60.28% <100.00%> (+0.28%) ⬆️
pkg/auth/kubesecret.go           36.58% <0.00%> (+8.94%) ⬆️


@sctb512 sctb512 force-pushed the nydusd-event-metric branch 2 times, most recently from e76fe24 to b1e3197 Compare November 25, 2022 10:53
* Copyright (c) 2021. Alibaba Cloud. All rights reserved.
*
* SPDX-License-Identifier: Apache-2.0
*/
Member

We should use a license header like this:

/*
 * Copyright (c) 2022. Nydus Developers. All rights reserved.
 *
 * SPDX-License-Identifier: Apache-2.0
 */

Member Author

Sorry, already fixed it.

Help: "nydusd lifetime event times.",
},
[]string{daemonIDLabel, timeLabel, eventLabel},
)
Member

The snapshot package is an implementation of the containerd remote snapshotter. It does not know about the nydusd daemons backing the nydus meta layer snapshot. Can we move this code to the daemon package?

Member Author

Yes, done.

NydusdEvent = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "nydusd_lifetime_event_times",
Help: "nydusd lifetime event times.",
Member

How about renaming it to nydusd_lifetime_events?

Member Author

Ok, thanks.

@@ -113,6 +114,10 @@ func NewFileSystem(ctx context.Context, opt ...NewFSOpt) (*Filesystem, error) {
	if err := d.WaitUntilState(types.DaemonStateRunning); err != nil {
		return nil, errors.Wrapf(err, "wait for daemon %s", d.ID())
	}
	if err := exporter.ExportNydusdEventMetric(d.States.ID, string(types.DaemonStateRunning)); err != nil {
		log.L.Errorf("export nydusd event metric failed, daemon ID: %s, event: %s, error: %v", d.States.ID, string(types.DaemonStateRunning), err)
	}
Member

How about recording the nydusd lifetime event inside WaitUntilState, so that callers do not have to record it right after calling WaitUntilState?

Member Author

Done.

}

func (mexp *MutexExporter) Get() *Exporter {
mexp.mu.Lock()
Member

Why do we need a lock here? Is there a race condition?

Member Author

@sctb512 sctb512 Nov 28, 2022

There is no race condition now. I have already removed the lock.

Member

@changweige changweige left a comment

Can you help check nydus-snapshotter resource consumption if enabling metrics after running more than 10 minutes with 5 container images?

@sctb512
Member Author

sctb512 commented Nov 28, 2022

Can you help check nydus-snapshotter resource consumption if enabling metrics after running more than 10 minutes with 5 container images?

containers:

nerdctl ps -a
CONTAINER ID    IMAGE                                              COMMAND         CREATED           STATUS    PORTS    NAMES
3a45e9a3c9b1    alpine:nydusv6     "sleep 1200"    16 minutes ago    Up                 alpine-3a45e
6e21ffbcb33f    python:nydusv6     "sleep 1200"    16 minutes ago    Up                 python-6e21f
c116554c0584    debian:nydusv6     "sleep 1200"    16 minutes ago    Up                 debian-c1165
c23ca6ff0674    busybox:nydusv6    "sleep 1200"    16 minutes ago    Up                 busybox-c23ca
c7068ea2882c    golang:nydusv6     "sleep 1200"    16 minutes ago    Up                 golang-c7068

enable metrics:
start container time: 2022-11-28 03:09

{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:28"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:29"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:30"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:31"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:32"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:33"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:34"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:35"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:36"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:37"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:38"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:39"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 24, "num_threads": 11, "time": "2022-11-28 03:20:40"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:41"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:42"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:43"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:44"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:45"}
{"cpu_times": {"user": 0.22, "system": 0.07}, "cpu_percent": 0.0, "memory": {"rss": 32247808, "vms": 1511743488}, "num_fds": 19, "num_threads": 11, "time": "2022-11-28 03:20:46"}

disable metrics:
start container time: 2022-11-28 03:28

{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:42:57"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:42:58"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:42:59"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:00"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:01"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:02"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:03"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:04"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:05"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:06"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:07"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:08"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:09"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:10"}
{"cpu_times": {"user": 0.16, "system": 0.05}, "cpu_percent": 0.0, "memory": {"rss": 30568448, "vms": 1435717632}, "num_fds": 18, "num_threads": 10, "time": "2022-11-28 03:43:11"}
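
To make the comparison concrete, the following sketch computes the difference between the two runs from the steady-state figures logged above. The `sample` type and the helper names are illustrative only, and the figures come from a single run of each configuration, so treat the result as indicative rather than as a benchmark.

```go
package main

import "fmt"

// sample holds the steady-state figures reported in the two runs above.
type sample struct {
	rssBytes uint64
	vmsBytes uint64
	threads  int
}

// Values copied from the logs: metrics enabled vs. metrics disabled.
var (
	enabled  = sample{rssBytes: 32247808, vmsBytes: 1511743488, threads: 11}
	disabled = sample{rssBytes: 30568448, vmsBytes: 1435717632, threads: 10}
)

// overhead returns how many more bytes the first measurement uses than the second.
func overhead(with, without uint64) uint64 { return with - without }

// mib converts a byte count to mebibytes.
func mib(b uint64) float64 { return float64(b) / (1 << 20) }

func main() {
	rss := overhead(enabled.rssBytes, disabled.rssBytes)
	vms := overhead(enabled.vmsBytes, disabled.vmsBytes)
	fmt.Printf("RSS overhead: %d bytes (%.2f MiB, %.1f%% of the disabled run)\n",
		rss, mib(rss), 100*float64(rss)/float64(disabled.rssBytes))
	fmt.Printf("VMS overhead: %d bytes (%.2f MiB)\n", vms, mib(vms))
	fmt.Printf("extra threads: %d\n", enabled.threads-disabled.threads)
}
```

With these inputs the RSS difference is 1679360 bytes, roughly 1.6 MiB, and the process runs one extra thread when metrics are enabled.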

@sctb512 sctb512 force-pushed the nydusd-event-metric branch 2 times, most recently from 404c2d3 to 03d467d Compare November 28, 2022 04:00
@@ -16,6 +16,7 @@ import (

var (
	defaultCleanUpPeriod = 10 * time.Minute
	DefaultTTL           = 3 * time.Minute
)
Member

For the snapshotter's daemon events, it is better to keep the metrics for 2 days or longer. They do not occupy much memory.

Member Author

Yes, I do not use the TTL for daemon events. It only applies to FS metrics.

@sctb512 sctb512 force-pushed the nydusd-event-metric branch 2 times, most recently from 19f0a66 to d79e469 Compare November 28, 2022 06:24
@@ -201,6 +202,10 @@ func (d *Daemon) WaitUntilState(expected types.DaemonState) error {
_, err, shared := d.stateGetterGroup.Do(d.ID(), stateGetter)
log.L.Debugf("Get daemon %s with shared result: %v ", d.ID(), shared)

if exportErr := exporter.ExportNydusdEventMetric(d.States.ID, string(expected)); exportErr != nil {
Member

We should check whether err is nil, and only then record the metric.

Member Author

Fixed.
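
A minimal sketch of the behavior requested in this thread: the event is recorded inside the wait function, and only when the wait succeeded. The `daemon` type, its `record` callback, and `waitUntilState` are simplified stand-ins invented for illustration; they are not the snapshotter's actual `Daemon` or `WaitUntilState`.

```go
package main

import "fmt"

// DaemonState is a simplified stand-in for types.DaemonState.
type DaemonState string

const DaemonStateRunning DaemonState = "RUNNING"

// daemon is a hypothetical, stripped-down daemon handle.
type daemon struct {
	id      string
	current DaemonState
	record  func(daemonID, event string) // hypothetical event recorder
}

// waitUntilState reports an error if the daemon is not in the expected state.
// The lifetime event is recorded only on the success path, so a failed wait
// never produces a misleading sample.
func (d *daemon) waitUntilState(expected DaemonState) error {
	if d.current != expected {
		return fmt.Errorf("daemon %s is in state %s, expected %s", d.id, d.current, expected)
	}
	d.record(d.id, string(expected))
	return nil
}

func main() {
	var events []string
	d := &daemon{
		id:      "d1",
		current: "INIT",
		record:  func(id, ev string) { events = append(events, id+":"+ev) },
	}

	fmt.Println(d.waitUntilState(DaemonStateRunning), events) // error, nothing recorded

	d.current = DaemonStateRunning
	fmt.Println(d.waitUntilState(DaemonStateRunning), events) // nil error, one event
}
```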


daemon.NydusdEvent.WithLabelValues(daemonID, time.Now().Format("2006-01-02 15:04:05.000"), event).Inc()

return e.output()
Member

It's a bad idea to write metrics to files each time this function is called. It will harm snapshotter performance.
I suggest writing them out only when metrics are queried via the HTTP API.

Member Author

Done.
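
The suggestion in this thread can be sketched as follows: recording touches only memory, and the text output is produced when a client requests it. This is a standard-library illustration under assumed names (`eventStore`, `Record`, `render`); it omits the time label for brevity and is not the code that was merged.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"strings"
	"sync"
)

// eventStore keeps event counters in memory, keyed by their label set.
type eventStore struct {
	mu     sync.Mutex
	counts map[string]int
}

func newEventStore() *eventStore {
	return &eventStore{counts: make(map[string]int)}
}

// Record increments a counter in memory; it performs no file or network I/O.
func (s *eventStore) Record(daemonID, event string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[fmt.Sprintf("daemon_id=%q,event=%q", daemonID, event)]++
}

// render formats all counters as text, sorted by label set for stable output.
func (s *eventStore) render() string {
	s.mu.Lock()
	defer s.mu.Unlock()
	keys := make([]string, 0, len(s.counts))
	for k := range s.counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var sb strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&sb, "nydusd_lifetime_events{%s} %d\n", k, s.counts[k])
	}
	return sb.String()
}

// ServeHTTP writes the metrics out only when a client queries them.
func (s *eventStore) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	fmt.Fprint(w, s.render())
}

func main() {
	store := newEventStore()
	store.Record("d1", "RUNNING")

	// Simulate a scrape without opening a real socket.
	rec := httptest.NewRecorder()
	store.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

The mutex here guards the map because HTTP handlers may run concurrently with recorders; whether the real collector needs one depends on how it stores its data.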

@@ -69,6 +88,17 @@ func (e *Exporter) ExportFsMetrics(m *types.FsMetrics, imageRef string) error {
return e.output()
}

func ExportNydusdEventMetric(daemonID string, event string) error {
Member

Perhaps we can rename this function to RecordDaemonEvent.

Member

"Export" implies that the metrics are consumed and passed to another subsystem.

Member Author

Yes, that is right. I have fixed it.

@changweige
Member

I get two duplicated event records with this PR applied:

nydusd_lifetime_events{daemon_id="ce25h4k58hmdn3qbm6ng",event="RUNNING",time="2022-11-28 14:43:30.624"} 1
nydusd_lifetime_events{daemon_id="ce25h4k58hmdn3qbm6ng",event="RUNNING",time="2022-11-28 14:43:30.803"} 1

@changweige
Member

Can we enable nydus-snapshotter metrics by default? --enable-metrics only controls nydusd metrics.

@changweige
Member

It seems you forgot to record the event when nydusd exits normally in DestroyDaemon.

@@ -40,32 +40,33 @@ func WithOutputFile(metricsFile string) Opt {
}
}

-func NewExporter(opts ...Opt) (*Exporter, error) {
+func NewExporter(opts ...Opt) error {
Member

It would be better to initialize the exporter in the package init() function, which guarantees that it is initialized before use.
Then RecordDaemonEvent() cannot fail, which makes callers neater because they no longer need to check its result. If the exporter is absent, we can simply ignore that.

Member Author

Done.

func RecordDaemonEvent(daemonID string, event string) error {
	if _, err := getExporter(); err != nil {
		return err
	}
Member

If it fails to get the global exporter, it should just ignore the exporter's absence and return success to the caller.
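
The two comments above point at the same shape: a package-level collector set up in `init()`, and a recording function with no error result. Here is a rough sketch of that shape; the `collector` type and its fields are assumptions made for illustration, and only the name `RecordDaemonEvent` comes from the review.

```go
package main

import (
	"fmt"
	"sync"
)

// collector is a hypothetical in-memory store for daemon events.
type collector struct {
	mu     sync.Mutex
	events []string
}

// snapshot returns a copy of the recorded events.
func (c *collector) snapshot() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return append([]string(nil), c.events...)
}

// globalCollector is set in init(), so it is non-nil before main or any
// other code in this package calls RecordDaemonEvent.
var globalCollector *collector

func init() {
	globalCollector = &collector{}
}

// RecordDaemonEvent has no error result: with the collector created in
// init(), there is no failure for callers to handle.
func RecordDaemonEvent(daemonID, event string) {
	globalCollector.mu.Lock()
	defer globalCollector.mu.Unlock()
	globalCollector.events = append(globalCollector.events, daemonID+":"+event)
}

func main() {
	RecordDaemonEvent("d1", "INIT")
	RecordDaemonEvent("d1", "RUNNING")
	fmt.Println(globalCollector.snapshot())
}
```

Call sites then shrink to a single statement, for example `RecordDaemonEvent(id, "RUNNING")`, with no error branch or log line around it.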

@@ -0,0 +1,41 @@
package exporter
Member

This file should have a license header.

Member Author

Done.

@@ -188,6 +189,10 @@ func (d *Daemon) WaitUntilState(expected types.DaemonState) error {
d.ID(), expected, state)
}

if exportErr := exporter.RecordDaemonEvent(d.ID(), string(expected)); exportErr != nil {
Member

Can we rename exporter to collector? That name indicates that the metrics have not been exported to another system yet.

Member Author

Yes, finished.

@@ -26,6 +24,8 @@ type Exporter struct {
outputFile string
}

var globalExporter *Exporter
Member

The exporter should be defined as an interface that lets other systems fetch metrics via HTTP or RPC.

Member Author

Done, please take a look.
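
One possible reading of this request, sketched with the standard library only: collection stays in memory, and an `Exporter` interface is the single point where another system pulls the data. The interface shape, `textExporter`, and `exportToString` are assumptions for illustration and may differ from what the PR finally defined.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// Exporter is a hypothetical interface through which another system
// (an HTTP handler, an RPC server, a file dumper) fetches collected metrics.
type Exporter interface {
	Export(w io.Writer) error
}

// textExporter writes a snapshot of already-collected samples as text lines.
type textExporter struct {
	snapshot func() []string // supplied by the collector side
}

// Export writes one sample per line and stops at the first write error.
func (e textExporter) Export(w io.Writer) error {
	for _, line := range e.snapshot() {
		if _, err := fmt.Fprintln(w, line); err != nil {
			return err
		}
	}
	return nil
}

// exportToString is a small helper that captures an export in memory.
func exportToString(e Exporter) (string, error) {
	var sb strings.Builder
	err := e.Export(&sb)
	return sb.String(), err
}

func main() {
	var e Exporter = textExporter{snapshot: func() []string {
		return []string{`nydusd_lifetime_events{daemon_id="d1",event="RUNNING"} 1`}
	}}
	if err := e.Export(os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, "export failed:", err)
	}
}
```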

Member

@changweige changweige left a comment

Otherwise, looks good to me.

@sctb512 sctb512 force-pushed the nydusd-event-metric branch 3 times, most recently from 05c4878 to ef8dc62 Compare November 29, 2022 03:45
Collect the metrics of nydus daemon events, including INIT, RUNNING and DIED.

Signed-off-by: Bin Tang <tangbin.bin@bytedance.com>