Print more log messages to enable tracking of SLIs #110

ochosi · 2022-11-15T18:18:01Z

In order to track service level indicators, we need to print more logs.

For the Koji Hub plugin, this means we print a message each time the plugin receives a request. A second log message is printed once the task has been successfully created in Koji Hub's database. This will also allow us to measure the time it takes for Koji Hub to insert a task into its database.
Example messages:

[Tue Nov 15 19:43:05.996844 2022] [wsgi:error] [pid 17:tid 55] [remote 10.89.0.1:38666] 996 [INFO] m=None u=None p=17 r=?:? koji.plugins: Create osbuildImage task
[...]
[Tue Nov 15 19:43:06.003731 2022] [wsgi:error] [pid 17:tid 55] [remote 10.89.0.1:38666] 2022-11-15 19:43:06,003 [INFO] m=osbuildImage u=kojiadmin p=17 r=10.89.0.1:38666 koji.plugins: osbuildImage task 3 added to database
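
A minimal sketch of how these two messages could be emitted, assuming a named logger called `koji.plugins` (the name visible in the example output above). `create_task` and the `make_task` callable are hypothetical stand-ins, not the plugin's actual functions.

```python
import logging

# Assumption: the logger name is taken from the example output above.
logger = logging.getLogger("koji.plugins")


def create_task(make_task, args):
    """Log once on entry and once after the task has been created.

    `make_task` is a hypothetical callable standing in for whatever
    inserts the task into Koji Hub's database and returns its id.
    """
    logger.info("Create osbuildImage task")
    task_id = make_task(args)
    logger.info("osbuildImage task %i added to database", task_id)
    return task_id
```

The second message is only reached if `make_task` returns without raising, so a request that has a first message but no second one would indicate a failed insert.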

For the Koji Builder plugin we print a log message once a task has been received. We print a second log message once composer returns a compose id. This means the plugin completed one iteration successfully.
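
Roughly, the builder side could look like the sketch below. The class name, the `submit_compose` callable, the logger name, and the exact message wording are all assumptions made for illustration; only the idea of logging the task id on entry and the compose id on return comes from this PR.

```python
import logging


class BuilderSketch:
    """Hypothetical stand-in for the builder task handler."""

    def __init__(self, submit_compose):
        # `submit_compose` is an assumed callable that sends the request
        # to composer and returns a compose id.
        self.logger = logging.getLogger("builder-sketch")  # placeholder name
        self.submit_compose = submit_compose

    def handler(self, task_id, request):
        # First message: the task has been received by the plugin.
        self.logger.info("Task id: %s", task_id)
        compose_id = self.submit_compose(request)
        # Second message: composer returned a compose id.
        self.logger.info("Compose id: %s", compose_id)
        return compose_id
```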

These logs are ingested into Splunk. In a dashboard we can then track the two subsequent requests and the duration between them.
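
To illustrate the duration measurement outside of Splunk, here is a small sketch that parses the leading Apache timestamp of the two example hub messages above and computes the time between them. The helper names are made up for this example, and parsing `%a`/`%b` assumes an English (C) locale.

```python
import re
from datetime import datetime

# Leading Apache timestamp, e.g. "[Tue Nov 15 19:43:05.996844 2022]".
TIMESTAMP = re.compile(r"^\[([^\]]+)\]")


def parse_timestamp(line):
    """Return the leading bracketed timestamp of a log line as a datetime."""
    match = TIMESTAMP.match(line)
    if match is None:
        raise ValueError("no leading timestamp in log line")
    return datetime.strptime(match.group(1), "%a %b %d %H:%M:%S.%f %Y")


def seconds_between(first_line, second_line):
    """Seconds elapsed between two log lines."""
    delta = parse_timestamp(second_line) - parse_timestamp(first_line)
    return delta.total_seconds()


received = (
    "[Tue Nov 15 19:43:05.996844 2022] [wsgi:error] [pid 17:tid 55] "
    "[remote 10.89.0.1:38666] 996 [INFO] m=None u=None p=17 r=?:? "
    "koji.plugins: Create osbuildImage task"
)
added = (
    "[Tue Nov 15 19:43:06.003731 2022] [wsgi:error] [pid 17:tid 55] "
    "[remote 10.89.0.1:38666] 2022-11-15 19:43:06,003 [INFO] "
    "m=osbuildImage u=kojiadmin p=17 r=10.89.0.1:38666 "
    "koji.plugins: osbuildImage task 3 added to database"
)

print(seconds_between(received, added))  # 0.006887
```

For the two example messages this comes out at just under 7 ms.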

Because the newer version of pylint shipped in the just-released Fedora 37 fails on the current code base, I pinned the container so that this can be addressed in a separate, subsequent PR.

thozza

Changes look good to me. I added one nitpick and have one more general comment, although I think that you were trying to keep the changes as minimal as possible.

Since you added the named logger in the hub plugin, I wonder whether it wouldn't be better to rework the builder plugin to also get and use a logger with the same name, instead of using self.logger.info() directly... 🤔

plugins/hub/osbuild.py

ochosi · 2022-11-16T09:00:35Z

Changes look good to me. I added one nitpick and have one more general comment, although I think that you were trying to keep the changes as minimal as possible.

Indeed.

Since you added the named logger in the hub plugin, I wonder whether it wouldn't be better to rework the builder plugin to also get and use a logger with the same name, instead of using self.logger.info() directly... 🤔

That might make sense!
At the same time we are most likely the only ones ever interested in these logs and as far as I can tell they are unique enough to enable easy matching.

thozza

LGTM, although I would still use the named logger in the builder plugin just for the sake of consistency 😇

teg · 2022-11-16T11:18:13Z

Love this, thanks @ochosi !

Will this allow us to measure end to end success ratio in the builder (and the hub for that matter), i.e. not only that the build was started successfully but that it completed with success?

ochosi · 2022-11-16T13:30:26Z

Love this, thanks @ochosi !

Will this allow us to measure end to end success ratio in the builder (and the hub for that matter), i.e. not only that the build was started successfully but that it completed with success?

If a compose id is returned we assume that the compose worked, yes. There is no further evidence about the state of the compose in koji-osbuild, as far as I can tell. The rest would be in composer. We can hopefully somehow plot this side-by-side though, even if we cannot combine the metrics.

In the hub we can only measure whether the (valid) requests that come in were correctly added to Koji's database, because that's really all the hub plugin does. (It validates the JSON and then creates a task in the db.)

plugins/hub/osbuild.py

ochosi · 2022-11-16T14:58:02Z

I apologize for the many force pushes. I hope the PR is ok now.

teg · 2022-11-16T15:38:40Z

If a compose id is returned we assume that the compose worked, yes. There is no further evidence about the state of the compose in koji-osbuild, as far as I can tell. The rest would be in composer. We can hopefully somehow plot this side-by-side though, even if we cannot combine the metrics.

Well, there is this line (koji-osbuild/plugins/builder/osbuild.py, line 722 in 741be47):

self.logger.info("Compose result: %s", status.status)

no? That would allow us to track the successful composes (rather than the successful starts). That's already logged though, so not relevant to this PR.

ochosi · 2022-11-16T16:22:55Z

Well, there is this line (koji-osbuild/plugins/builder/osbuild.py, line 722 in 741be47):

self.logger.info("Compose result: %s", status.status)

no? That would allow us to track the successful composes (rather than the successful starts). That's already logged though, so not relevant to this PR.

Very true, that's already appearing in the Splunk logs. The idea of tracking the compose id as a service level indicator separately is that we cannot guarantee that the compose will be successful. It seems interesting and worth tracking, but it is (as far as I understand this) not directly related to the plugin's functionality.
What we can track for the performance of the plugin itself is whether the builder plugin submits composes correctly to composer.

plugins/builder/osbuild.py

Each 'Task id' corresponds to a 'Compose id' in case everything works as expected. In order to be able to track both in Splunk to measure our first service level indicator (SLI) we need to explicitly log the 'Task id' when it is received by the plugin.


gicmo

👍

teg · 2022-11-17T09:13:18Z

Very true, that's already appearing in the Splunk logs.

Perfect :)

The idea of tracking the compose id as a service level indicator separately is that we cannot guarantee that the compose will be successful. It seems interesting and worth tracking, but it is (as far as I understand this) not directly related to the plugin's functionality.
What we can track for the performance of the plugin itself is whether the builder plugin submits composes correctly to composer.

I think we should track both. If the "outer" SLI fails it will be down to either the "inner" one or composer so good to have the possibility of digging deeper. But I think the interesting thing to measure is the overall success rate, whether issues are due to the plugin or our dependencies.
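
As a rough illustration of tracking both, the sketch below computes the "inner" ratio (tasks for which composer returned a compose id) and the "outer" ratio (tasks whose compose also finished successfully) from per-task records. The `TaskRecord` type and its fields are hypothetical; in practice the numbers would come from a Splunk query over the log messages discussed above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class TaskRecord:
    """Hypothetical per-task summary assembled from the logs."""
    task_id: int
    compose_id: Optional[str]  # None if composer never returned an id
    compose_succeeded: bool    # final compose result, as logged by the builder


def sli_ratios(records: Iterable[TaskRecord]) -> Tuple[Optional[float], Optional[float]]:
    """Return (inner, outer) success ratios, or (None, None) without data."""
    records = list(records)
    if not records:
        return None, None
    total = len(records)
    # Inner SLI: the plugin submitted the compose and got an id back.
    submitted = sum(1 for r in records if r.compose_id is not None)
    # Outer SLI: the compose was submitted and also completed successfully.
    succeeded = sum(
        1 for r in records if r.compose_id is not None and r.compose_succeeded
    )
    return submitted / total, succeeded / total
```

A gap between the two ratios would point at composer rather than at the plugin.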

gicmo · 2022-11-17T09:16:45Z

We also have a Retry adaptor on the http class, I was wondering if it might make sense to track the number of retries here too.

ochosi · 2022-11-17T09:19:21Z

We also have a Retry adaptor on the http class, I was wondering if it might make sense to track the number of retries here too.

That sounds like a very good idea. I wonder if the retries are already logged in some way or if we have to log them explicitly.
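
If it turns out we have to log them explicitly, the general idea might look like the sketch below. This is a plain retry loop written only for illustration, not the Retry adaptor the plugin actually uses; the function name, the logger name, and the messages are all hypothetical.

```python
import logging

logger = logging.getLogger("retry-sketch")  # placeholder name


def call_with_retries(func, max_retries=3, exceptions=(ConnectionError,)):
    """Call `func`, retrying on the given exceptions and logging each retry."""
    retries = 0
    while True:
        try:
            result = func()
        except exceptions as error:
            if retries >= max_retries:
                logger.error("Giving up after %d retries: %s", retries, error)
                raise
            retries += 1
            logger.warning(
                "Retry %d of %d after error: %s", retries, max_retries, error
            )
        else:
            # One summary line per request makes the retry count easy to chart.
            logger.info("Request succeeded after %d retries", retries)
            return result
```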

teg · 2022-11-17T11:03:44Z

We also have a Retry adaptor on the http class, I was wondering if it might make sense to track the number of retries here too.

That sounds like a very good idea. I wonder if the retries are already logged in some way or if we have to log them explicitly.

@diaasami was tracking retries at some point, not sure if these are the same ones though.

diaasami · 2022-11-17T12:20:41Z

@diaasami was tracking retries at some point.

Yes, these are still being tracked and the frequency has decreased a lot: every few weeks we have a whole week without a single retry.

not sure if these are the same ones though.

They are not. The retries in koji-osbuild are retries in the connection to composer, while the ones in the workers (that I added) are retries while authenticating or uploading the image to koji.

ochosi requested a review from gicmo November 15, 2022 18:18

ochosi force-pushed the additional-logs-for-SLIs branch from 023ba48 to 30689cc Compare November 15, 2022 19:28

thozza previously approved these changes Nov 16, 2022


ochosi dismissed thozza’s stale review via 8e80eb2 November 16, 2022 08:58

ochosi force-pushed the additional-logs-for-SLIs branch from 883a0fc to 8e80eb2 Compare November 16, 2022 08:58

thozza previously approved these changes Nov 16, 2022


gicmo reviewed Nov 16, 2022


gicmo reviewed Nov 16, 2022


ochosi dismissed thozza’s stale review via 74f6530 November 16, 2022 14:00

ochosi force-pushed the additional-logs-for-SLIs branch 3 times, most recently from 5a60399 to aa5e8b4 Compare November 16, 2022 14:57

gicmo reviewed Nov 16, 2022


ochosi added 4 commits November 16, 2022 17:37

hub: Log adding tasks to Koji's db

a7e40d4

Log both the entrypoint and the return value from adding a task to Koji's database. We can measure both to ensure a task has been successfully added to the database as a service level indicator.

ci: Pin Fedora container for pylint

ee2d81a

This is so that new pylint errors with the version in Fedora 37 can be fixed in a separate, subsequent PR.

builder: Fix typo

70ab44c

ochosi force-pushed the additional-logs-for-SLIs branch from aa5e8b4 to 70ab44c Compare November 16, 2022 16:38

gicmo approved these changes Nov 16, 2022


gicmo enabled auto-merge (rebase) November 16, 2022 17:13

gicmo merged commit 292e8c9 into osbuild:main Nov 16, 2022
