6 scheduling observer extraction #38
Merged (6 commits) · Nov 19, 2024

Conversation

@vlerkin (Collaborator) commented Nov 15, 2024:

This PR was extracted from the 6-scheduling branch to encapsulate the changes to the resource watcher and the log handler. A couple of changes were added to make the watcher more resilient, and the comment about cleaning api.py of the k8s resource watcher was taken into account.

What's new?

  1. The watcher was refactored into an Observer that notifies subscribers (the log handler) when a new event is received.
  2. The observer is wrapped in logic that gracefully reconnects if the connection to the k8s API is lost. It has backoff_time, backoff_coefficient and reconnection_attempts settings, configurable via the config file, to avoid flooding the k8s API indefinitely; a sketch of this structure follows the list. A couple of exceptions are handled to cover cases that have already been encountered or can reasonably be expected.
  3. api.py was refactored to remove the resource-watcher logic and encapsulate it within the k8s launcher. A new method _init_resource_watcher was created on the K8s class to handle the initialization of the resource watcher and the joblogs feature, and joblogs/__init__.py was refactored and cleaned up since the current implementation of _init_resource_watcher and enable_joblogs is sufficient to do the job.
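
For orientation, here is a minimal sketch of the observer-with-reconnect structure described above, using the kubernetes Python client. Class and method names mirror the description; the body is an assumption for illustration, not the PR's actual implementation.

```python
import logging
import time

import urllib3
from kubernetes import client, watch

logger = logging.getLogger(__name__)


class ResourceWatcher:
    """Watches pod events in one namespace and notifies subscribers."""

    def __init__(self, namespace, reconnection_attempts=5, backoff_time=5, backoff_coefficient=2):
        self.namespace = namespace
        self.reconnection_attempts = reconnection_attempts
        self.backoff_time = backoff_time
        self.backoff_coefficient = backoff_coefficient
        self.subscribers = []

    def subscribe(self, callback):
        # e.g. the joblogs handler registers a callback that reacts to pod events
        self.subscribers.append(callback)

    def notify(self, event):
        for callback in self.subscribers:
            callback(event)

    def watch_pods(self):
        v1 = client.CoreV1Api()
        w = watch.Watch()
        attempts_left = self.reconnection_attempts
        backoff = self.backoff_time
        while attempts_left > 0:
            try:
                for event in w.stream(v1.list_namespaced_pod, namespace=self.namespace):
                    self.notify(event)
                    # a healthy stream resets the failure budget and the backoff
                    attempts_left = self.reconnection_attempts
                    backoff = self.backoff_time
                # the stream ended normally (e.g. server-side timeout); reconnect
            except urllib3.exceptions.ProtocolError as e:
                # the connection to the k8s API dropped; wait before retrying,
                # growing the wait so the API is not flooded
                attempts_left -= 1
                logger.warning("watch connection lost (%s), retrying in %ss", e, backoff)
                time.sleep(backoff)
                backoff *= self.backoff_coefficient
        logger.error("giving up after %d sequential failures", self.reconnection_attempts)
```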

…logic for observer in a ResourceWatcher class; added method to stop a thread gracefully
…source watcher instance to enable_joblogs to subscribe to the event watcher if the log feature is configured; delete logic about event watcher from main; pass container for list objects function instead of container name; remove start method from log handler class; modify joblogs init to subscribe to event watcher
…nnect loop for event watcher; make number of reconnect attempts, backoff time and a coefficient for exponential growth configurable via config; add backoff_time, reconnection_attempts and backoff_coefficient as attributes to the resource watcher init; add resource_version as a param to w.stream so a failed stream can read from the last resource it was able to catch; add urllib3.exceptions.ProtocolError and handle reconnection after some exponential backoff time to avoid api flooding; add config as a param to init for resource watcher; modify config in kubernetes.yaml and k8s config to add backoff_time, reconnection_attempts and backoff_coefficient
… connection to the k8s was achieved so only sequential failures are detected; add exception handling to watch_pods to handle failure in urllib3, when resource version is old and not available anymore, and when stream is ended; remove k8s resource watcher initialization from run function in api.py and move it to k8s.py launcher as _init_resource_watcher; refactor existing logic from joblogs/__init__.py to keep it in _init_resource_watcher and enable_joblogs in k8s launcher
@vlerkin requested a review from @wvengen on November 15, 2024, 13:48
@wvengen (Member) left a comment:

Great! Did a first review pass. Some small (I think) questions.
Very nice, seems well handled.

Would you be able to explain a bit more how the resource_version check works? I'm not familiar with it. If this is "common k8s knowledge", then I need to RTFM, otherwise a comment block in the code would be useful.

kubernetes.yaml (outdated) · Comment on lines 90 to 94

max_proc = 2
reconnection_attempts = 5
backoff_time = 5
backoff_coefficient = 2
Member:

Can we have sensible defaults, and not have to think about this yet?
Then we can add more detailed configuration documentation for this. I think ideally no one would need to touch these (unless you have very different network requirements) - I'd like the defaults to work for most cases.

Member:

I think a section in the README would make sense here. Though might it be time to create a new file explaining the configuration options in more detail? That would give some more freedom to explain them.

Collaborator (Author):

There are defaults in the code, but they need to be "tested" to see if they're enough for our prod cluster; I suspect we might need to increase the number of attempts. I tried to make it resilient in the sense that every time the connection is successfully re-established, the number of attempts is reset to the configured default, so we really only catch cases where there are several connection breaks in a row.

Collaborator (Author):

Yes, I do agree that a section about the config file is needed. How do you see this new file, just CONFIG.md or something else as part of the repo?

Member:

Yes, CONFIG.md sounds fine 👍

@wvengen (Member) commented Nov 19, 2024:

Super, thanks! Pending:

  • Remove the variables from the sample config files (defaults are meant to be ok, and are documented)
  • Move config info from README.md to CONFIG.md
  • Make sure CONFIG.md is linked to from README.md at a sensible place.

Collaborator (Author):

Thank you for the feedback, done!

Comment on lines 25 to 34
# Number of attempts to reconnect with k8s API to watch events, default is 5
reconnection_attempts = 5

# Minimum time in seconds to wait before reconnecting to k8s API to watch events, default is 5
backoff_time = 5

# Coefficient that is multiplied by backoff_time to provide exponential backoff to prevent k8s API from being overwhelmed
# default is 2, every reconnection attempt will take backoff_time*backoff_coefficient
backoff_coefficient = 2
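
As a rough illustration of how these three settings interact (assuming the wait is multiplied by the coefficient after each failed attempt; the exact formula used by the watcher is not shown in this excerpt):

```python
backoff_time = 5
backoff_coefficient = 2
reconnection_attempts = 5

# assumed schedule: the wait grows by the coefficient after every failed attempt
waits = [backoff_time * backoff_coefficient ** i for i in range(reconnection_attempts)]
print(waits)  # [5, 10, 20, 40, 80]
```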

Member:

(see previous comment)

@wvengen (Member) commented Nov 18, 2024:

p.s. thank you so much for separating this PR off from #36 🙏

@vlerkin (Collaborator, Author) commented Nov 18, 2024:

About resource versions. If you just want to look at the docs, it is described here:
https://kubernetes.io/docs/reference/using-api/api-concepts/#resource-versions

Otherwise, I will explain a bit, if that's ok.
Every object in k8s has a metadata.resourceVersion field, and when we receive a stream of events from the k8s API it is available by default. Throughout an object's lifecycle its resource version is updated with every change to the resource. When we use the k8s watcher to track objects and suddenly lose the connection, we can pass the last resource version we saw to say "hey, give me the stream of events starting with this particular object change", so we don't have to go through the whole stream again and we don't lose any updates.
If the resource version is too old, we get a 410 Gone error; this happens because the retention window is only several minutes, so if we don't manage to establish a new connection to the k8s API within that window, we handle it by setting the resource version to None and reconnecting to receive the data available within the current window.
And yes, resource versions are a common thing in k8s, but using them is optional.

I hope my explanation is not too lame:D
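
To illustrate the mechanism above, here is a minimal sketch (not the PR's code) of resuming a watch from the last seen resourceVersion with the kubernetes Python client, falling back to a fresh watch on 410 Gone; handle_event is a hypothetical callback.

```python
from kubernetes import client, watch
from kubernetes.client import ApiException


def handle_event(event):
    # hypothetical subscriber callback
    print(event['type'], event['object'].metadata.name)


def watch_pods(namespace):
    v1 = client.CoreV1Api()
    w = watch.Watch()
    resource_version = None
    while True:
        kwargs = {'namespace': namespace}
        if resource_version is not None:
            # resume from the last change we saw instead of replaying everything
            kwargs['resource_version'] = resource_version
        try:
            for event in w.stream(v1.list_namespaced_pod, **kwargs):
                resource_version = event['object'].metadata.resource_version
                handle_event(event)
        except ApiException as e:
            if e.status == 410:
                # 410 Gone: the stored version fell outside the server's
                # retention window, so start again from the current state
                resource_version = None
            else:
                raise
```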

…ed to re-establish connection to the Kubernetes watcher
…k to the CONFIG.md in the README.md; remove variables for reconnection_attempts, backoff_time and backoff_coefficient from the sample config since default values are provided in the code.
@vlerkin mentioned this pull request on Nov 19, 2024
@@ -87,6 +87,8 @@ data:
launcher = scrapyd_k8s.launcher.K8s

namespace = default

max_proc = 2
Member:

Why is max_proc here 2, and in scrapyd_k8s.sample-k8s.conf 10?

@wvengen (Member) commented Nov 19, 2024:

Thanks! I'll merge and start testing, then we can tie up any loose doc ends as the project progresses.

@wvengen merged commit f36de79 into q-m:main on Nov 19, 2024 · 5 checks passed