6 scheduling observer extraction #38
Merged (6 commits) · Nov 19, 2024

Conversation

@vlerkin (Collaborator) commented Nov 15, 2024:

This PR was extracted from the 6-scheduling branch to encapsulate the changes to the resource watcher and the log handler. A couple of changes were added to make the watcher more resilient, and the comment about cleaning api.py of the k8s resource watcher was taken into account.

What's new?

  1. The watcher was refactored into an Observer that notifies subscribers (the log handler) when a new event is received.
  2. The observer is wrapped in logic that gracefully reconnects if the connection to the k8s API is lost. It has backoff_time, backoff_coefficient and reconnection_attempts settings, configurable via the config file, to avoid flooding the k8s API indefinitely; a sketch of this structure follows the list. A couple of exceptions are handled to cover cases that have already been encountered or can reasonably be expected.
  3. api.py was refactored to remove the resource-watcher logic and encapsulate it within the k8s launcher. A new method _init_resource_watcher was created on the K8s class to handle the initialization of the resource watcher and the joblogs feature, and joblogs/__init__.py was refactored and cleaned up since the current implementation of _init_resource_watcher and enable_joblogs is sufficient to do the job.
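
For orientation, here is a minimal sketch of the observer-with-reconnect structure described above, using the kubernetes Python client. Class and method names mirror the description; the body is an assumption for illustration, not the PR's actual implementation.

```python
import logging
import time

import urllib3
from kubernetes import client, watch

logger = logging.getLogger(__name__)


class ResourceWatcher:
    """Watches pod events in one namespace and notifies subscribers."""

    def __init__(self, namespace, reconnection_attempts=5, backoff_time=5, backoff_coefficient=2):
        self.namespace = namespace
        self.reconnection_attempts = reconnection_attempts
        self.backoff_time = backoff_time
        self.backoff_coefficient = backoff_coefficient
        self.subscribers = []

    def subscribe(self, callback):
        # e.g. the joblogs handler registers a callback that reacts to pod events
        self.subscribers.append(callback)

    def notify(self, event):
        for callback in self.subscribers:
            callback(event)

    def watch_pods(self):
        v1 = client.CoreV1Api()
        w = watch.Watch()
        attempts_left = self.reconnection_attempts
        backoff = self.backoff_time
        while attempts_left > 0:
            try:
                for event in w.stream(v1.list_namespaced_pod, namespace=self.namespace):
                    self.notify(event)
                    # a healthy stream resets the failure budget and the backoff
                    attempts_left = self.reconnection_attempts
                    backoff = self.backoff_time
                # the stream ended normally (e.g. server-side timeout); reconnect
            except urllib3.exceptions.ProtocolError as e:
                # the connection to the k8s API dropped; wait before retrying,
                # growing the wait so the API is not flooded
                attempts_left -= 1
                logger.warning("watch connection lost (%s), retrying in %ss", e, backoff)
                time.sleep(backoff)
                backoff *= self.backoff_coefficient
        logger.error("giving up after %d sequential failures", self.reconnection_attempts)
```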

…logic for observer in a ResourceWatcher class; added method to stop a thread gracefully
…source watcher instance to enable_joblogs to subscribe to the event watcher if the log feature is configured; delete logic about event watcher from main; pass container for list objects function instead of container name; remove start method from log handler class; modify joblogs init to subscribe to event watcher
…nnect loop for event watcher; make number of reconnect attempts, backoff time and a coefficient for exponential growth configurable via config; add backoff_time, reconnection_attempts and backoff_coefficient as attributes to the resource watcher init; add resource_version as a param to w.stream so a failed stream can read from the last resource it was able to catch; add urllib3.exceptions.ProtocolError and handle reconnection after some exponential backoff time to avoid api flooding; add config as a param to init for resource watcher; modify config in kubernetes.yaml and k8s config to add backoff_time, reconnection_attempts and backoff_coefficient
… connection to the k8s was achieved so only sequential failures are detected; add exception handling to watch_pods to handle failure in urllib3, when resource version is old and not available anymore, and when stream is ended; remove k8s resource watcher initialization from run function in api.py and move it to k8s.py launcher as _init_resource_watcher; refactor existing logic from joblogs/__init__.py to keep it in _init_resource_watcher and enable_joblogs in k8s launcher
@vlerkin requested a review from @wvengen on November 15, 2024, 13:48
@wvengen (Member) left a comment:

Great! Did a first review pass. Some small (I think) questions.
Very nice, seems well handled.

Would you be able to explain a bit more how the resource_version check works? I'm not familiar with it. If this is "common k8s knowledge", then I need to RTFM, otherwise a comment block in the code would be useful.

kubernetes.yaml (outdated) · Comment on lines 90 to 94

max_proc = 2
reconnection_attempts = 5
backoff_time = 5
backoff_coefficient = 2
Member:

Can we have sensible defaults, and not have to think about this yet?
Then we can add more detailed configuration documentation for this. I think ideally no one would need to touch these (unless you have very different network requirements) - I'd like the defaults to work for most cases.

Member:

I think a section in the README would make sense here. Though might it be time to create a new file explaining the configuration options in more detail? That would give some more freedom to explain them.

Collaborator (Author):

There are defaults in the code, but they need to be "tested" to see if they're enough for our prod cluster; I suspect we might need to increase the number of attempts. I tried to make it resilient in the sense that every time the connection is successfully re-established, the number of attempts is reset to the configured default, so we really only catch cases where there are several connection breaks in a row.

Collaborator (Author):

Yes, I do agree that a section about the config file is needed. How do you see this new file, just CONFIG.md or something else as part of the repo?

Member:

Yes, CONFIG.md sounds fine 👍

@wvengen (Member) commented Nov 19, 2024:

Super, thanks! Pending:

  • Remove the variables from the sample config files (defaults are meant to be ok, and are documented)
  • Move config info from README.md to CONFIG.md
  • Make sure CONFIG.md is linked to from README.md at a sensible place.

Collaborator (Author):

Thank you for the feedback, done!

Comment on lines 25 to 34
# Number of attempts to reconnect with k8s API to watch events, default is 5
reconnection_attempts = 5

# Minimum time in seconds to wait before reconnecting to k8s API to watch events, default is 5
backoff_time = 5

# Coefficient that is multiplied by backoff_time to provide exponential backoff to prevent k8s API from being overwhelmed
# default is 2, every reconnection attempt will take backoff_time*backoff_coefficient
backoff_coefficient = 2
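
As a rough illustration of how these three settings interact (assuming the wait is multiplied by the coefficient after each failed attempt; the exact formula used by the watcher is not shown in this excerpt):

```python
backoff_time = 5
backoff_coefficient = 2
reconnection_attempts = 5

# assumed schedule: the wait grows by the coefficient after every failed attempt
waits = [backoff_time * backoff_coefficient ** i for i in range(reconnection_attempts)]
print(waits)  # [5, 10, 20, 40, 80]
```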

Member:

(see previous comment)

@wvengen (Member) commented Nov 18, 2024:

p.s. thank you so much for separating this PR off from #36 🙏

@vlerkin (Collaborator, Author) commented Nov 18, 2024:

About resource versions. If you just want to look at the docs, it is described here:
https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions

Otherwise, I will explain a bit, if that's ok.
Every object in k8s has a metadata.resourceVersion field, and when we receive a stream of events from the k8s API it is available by default. Throughout an object's lifecycle its resource version is updated with every change to the resource. When we use the k8s watcher to track objects and suddenly lose the connection, we can pass the last resource version we saw to say "hey, give me the stream of events starting with this particular object change", so we don't have to go through the whole stream again and we don't lose any updates.
If the resource version is too old, we get a 410 Gone error; this happens because the retention window is only several minutes, so if we don't manage to establish a new connection to the k8s API within that window, we handle it by setting the resource version to None and reconnecting to receive the data available within the current window.
And yes, resource versions are a common thing in k8s, but using them is optional.

I hope my explanation is not too lame:D
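
To illustrate the mechanism above, here is a minimal sketch (not the PR's code) of resuming a watch from the last seen resourceVersion with the kubernetes Python client, falling back to a fresh watch on 410 Gone; handle_event is a hypothetical callback.

```python
from kubernetes import client, watch
from kubernetes.client import ApiException


def handle_event(event):
    # hypothetical subscriber callback
    print(event['type'], event['object'].metadata.name)


def watch_pods(namespace):
    v1 = client.CoreV1Api()
    w = watch.Watch()
    resource_version = None
    while True:
        kwargs = {'namespace': namespace}
        if resource_version is not None:
            # resume from the last change we saw instead of replaying everything
            kwargs['resource_version'] = resource_version
        try:
            for event in w.stream(v1.list_namespaced_pod, **kwargs):
                resource_version = event['object'].metadata.resource_version
                handle_event(event)
        except ApiException as e:
            if e.status == 410:
                # 410 Gone: the stored version fell outside the server's
                # retention window, so start again from the current state
                resource_version = None
            else:
                raise
```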

…ed to re-establish connection to the Kubernetes watcher
…k to the CONFIG.md in the README.md; remove variables for reconnection_attempts, backoff_time and backoff_coefficient from the sample config since default values are provided in the code.
@vlerkin mentioned this pull request on Nov 19, 2024
@@ -87,6 +87,8 @@ data:
launcher = scrapyd_k8s.launcher.K8s

namespace = default

max_proc = 2
Member:

Why is max_proc here 2, and in scrapyd_k8s.sample-k8s.conf 10?

@wvengen (Member) commented Nov 19, 2024:

Thanks! I'll merge and start testing, then we can tie up any loose doc ends as the project progresses.

@wvengen merged commit f36de79 into q-m:main on Nov 19, 2024 · 5 checks passed