Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Print more log messages to enable tracking of SLIs #110

Merged
merged 4 commits into from
Nov 16, 2022

Conversation

ochosi
Copy link
Collaborator

@ochosi ochosi commented Nov 15, 2022

In order to track service level indicators, we need to print more logs.

For the Koji Hub plugin, this means we print a message each time a request is received by the plugin. A second log message is printed once the task has been successfully created in Koji Hub's database. This also will allow us to measure the time it takes for Koji Hub to insert a task into its database.
Example messages:

[Tue Nov 15 19:43:05.996844 2022] [wsgi:error] [pid 17:tid 55] [remote 10.89.0.1:38666] 996 [INFO] m=None u=None p=17 r=?:? koji.plugins: Create osbuildImage task
[...]
[Tue Nov 15 19:43:06.003731 2022] [wsgi:error] [pid 17:tid 55] [remote 10.89.0.1:38666] 2022-11-15 19:43:06,003 [INFO] m=osbuildImage u=kojiadmin p=17 r=10.89.0.1:38666 koji.plugins: osbuildImage task 3 added to database

For the Koji Builder plugin we print a log message once a task has been received. We print a second log message once composer returns a compose id. This means the plugin completed one iteration successfully.

These logs are ingested into Splunk. In a dashboard we can then track the two subsequent requests and the duration between them.

As the newer version of pylint available in Fedora 37 that was just released fails with the current code base, I pinned the container so this can be addressed in a separate, subsequent PR.

thozza
thozza previously approved these changes Nov 16, 2022
Copy link
Member

@thozza thozza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me. I added one nitpick and have one more general comment, although I think that you were trying to keep the changes as minimal as possible.

Since you added the names logger in the hub plugin, I winder whether it wouldn't be better to rework the builder plugin to also get and use a logger with the same name, instead of using self.logger.info() directly... 🤔

plugins/hub/osbuild.py Outdated Show resolved Hide resolved
@ochosi
Copy link
Collaborator Author

ochosi commented Nov 16, 2022

Changes look good to me. I added one nitpick and have one more general comment, although I think that you were trying to keep the changes as minimal as possible.

Indeed.

Since you added the names logger in the hub plugin, I winder whether it wouldn't be better to rework the builder plugin to also get and use a logger with the same name, instead of using self.logger.info() directly... thinking

That might make sense!
At the same time we are most likely the only ones ever interested in these logs and as far as I can tell they are unique enough to enable easy matching.

thozza
thozza previously approved these changes Nov 16, 2022
Copy link
Member

@thozza thozza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, although I would still use the named logger in the builder plugin just for the sake of consistency 😇

@teg
Copy link
Member

teg commented Nov 16, 2022

Love this, thanks @ochosi !

Will this allow us to measure end to end success ratio in the builder (and the hub for that matter), i.e. not only that the build was started successfully but that it completed with success?

@ochosi
Copy link
Collaborator Author

ochosi commented Nov 16, 2022

Love this, thanks @ochosi !

Will this allow us to measure end to end success ratio in the builder (and the hub for that matter), i.e. not only that the build was started successfully but that it completed with success?

If a compose id is returned we assume that the compose worked, yes. There is no further evidence about the state of the compose in koji-osbuild, as far as I can tell. The rest would be in composer. We can hopefully somehow plot this side-by-side though, even if we cannot combine the metrics.

In the hub we can only measure if the (valid) requests that come in were correctly added to Koji's database, cause that's really all the hub plugin does. (It validates the json and then creates a task in the db.)

plugins/hub/osbuild.py Outdated Show resolved Hide resolved
@ochosi ochosi force-pushed the additional-logs-for-SLIs branch 3 times, most recently from 5a60399 to aa5e8b4 Compare November 16, 2022 14:57
@ochosi
Copy link
Collaborator Author

ochosi commented Nov 16, 2022

I apologize for the many force pushes. I hope the PR is ok now.

@teg
Copy link
Member

teg commented Nov 16, 2022

If a compose id is returned we assume that the compose worked, yes. There is no further evidence about the state of the compose in koji-osbuild, as far as I can tell. The rest would be in composer. We can hopefully somehow plot this side-by-side though, even if we cannot combine the metrics.

Well, there is

self.logger.info("Compose result: %s", status.status)
, no? That would allow us to track the successful composes (rather than the successful starts). That's already logged though, so not relevant to this PR.

@ochosi
Copy link
Collaborator Author

ochosi commented Nov 16, 2022

Well, there is

self.logger.info("Compose result: %s", status.status)

, no? That would allow us to track the successful composes (rather than the successful starts). That's already logged though, so not relevant to this PR.

Very true, that's already appearing in the Splunk logs. The idea of tracking the compose id as service level indicator separately is that we cannot guarantee that the compose will be successful. It seems interesting and worth tracking, but it is (as far as I understand this) not directly related to the plugin's functionality.
What we can track for the performance of the plugin itself is whether the builder plugin submits composes correctly to composer.

plugins/builder/osbuild.py Outdated Show resolved Hide resolved
Each 'Task id' corresponds to a 'Compose id' in case everything works as
expected. In order to be able to track both in Splunk to measure our
first service level indicator (SLI) we need to explicitly log the 'Task
id' when it is received by the plugin.
Log both the entrypoint and the return value from adding a task to
Koji's database. We can measure both to ensure a task has been
successfully added to the database as a service level indicator.
This is so that new pylint errors with the version in Fedora 37 can be
fixed in a separate, subsequent PR.
Copy link
Contributor

@gicmo gicmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@gicmo gicmo enabled auto-merge (rebase) November 16, 2022 17:13
@gicmo gicmo merged commit 292e8c9 into osbuild:main Nov 16, 2022
@teg
Copy link
Member

teg commented Nov 17, 2022

Very true, that's already appearing in the Splunk logs.

Perfect :)

The idea of tracking the compose id as service level indicator separately is that we cannot guarantee that the compose will be successful. It seems interesting and worth tracking, but it is (as far as I understand this) not directly related to the plugin's functionality.
What we can track for the performance of the plugin itself is whether the builder plugin submits composes correctly to composer.

I think we should track both. If the "outer" SLI fails it will be down to either the "inner" one or composer so good to have the possibility of digging deeper. But I think the interesting thing to measure is the overall success rate, whether issues are due to the plugin or our dependencies.

@gicmo
Copy link
Contributor

gicmo commented Nov 17, 2022

We also have a Retry adaptor on the http class, I was wondering if it might make sense to track the number of retries here too.

@ochosi
Copy link
Collaborator Author

ochosi commented Nov 17, 2022

We also have a Retry adaptor on the http class, I was wondering if it might make sense to track the number of retries here too.

That sounds like a very good idea. I wonder if the retries are already logged in some way or if we have to log them explicitly.

@teg
Copy link
Member

teg commented Nov 17, 2022

We also have a Retry adaptor on the http class, I was wondering if it might make sense to track the number of retries here too.

That sounds like a very good idea. I wonder if the retries are already logged in some way or if we have to log them explicitly.

@diaasami was tracking retries at some point, not sure if these are the same ones though.

@diaasami
Copy link

diaasami commented Nov 17, 2022

@diaasami was tracking retries at some point.

Yes, these are still being tracked and the frequency has decreased a lot, we have a week without a single retry every few weeks.

not sure if these are the same ones though.

They are not, the retries in koji-osbuild are retries in the connection to composer, while the ones in the workers (that I added) are retries while authenticating or uploading the image to koji.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants