2686 add checkpoint with repoid #800

Merged
17 commits merged into main from 2686-add-checkpoint-with-repoid on Oct 23, 2024

Conversation

adrian-codecov
Contributor

We're extending our checkpoint metrics to add repo-specific metrics. The original approach considered adding a repo_id label to the existing checkpoint logger, but this is an ask for a specific repository. This approach isn't necessarily scalable with many repositories, as repo_id has high cardinality, which isn't well suited to Prometheus. The better approach would be to leverage SQL metrics and extend their functionality, but this approach was chosen for time's sake.
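
To make the cardinality concern concrete: in Prometheus, each distinct combination of label values becomes its own time series, so a repo_id label multiplies the series count by the number of repositories. A rough back-of-the-envelope sketch follows; the function name and all counts are made up for illustration and do not come from this PR.

```python
def series_count(label_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    total = 1
    for distinct in label_values.values():
        total *= distinct
    return total


# Hypothetical counts, for illustration only.
without_repo = series_count({"flow": 3, "checkpoint": 20})
with_repo = series_count({"flow": 3, "checkpoint": 20, "repo_id": 100_000})

print(without_repo)  # 60
print(with_repo)     # 6000000
```

Limiting the label to one specific repository keeps the multiplier small, which is presumably why the per-repository variant is tolerable as a stopgap.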

This PR

  • Adds new repository-based checkpoints that expect the repo_id
  • Adds a checkpoint context variable to supply a typed context to metrics
    • Adjusts files that instantiated a checkpoint class to provide said context
  • Removes statsd and tests

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. In 2022 this entity acquired Codecov and as result Sentry is going to need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.

codecov-qa bot commented Oct 18, 2024

Codecov Report

Attention: Patch coverage is 96.82540% with 2 lines in your changes missing coverage. Please review.

Project coverage is 97.99%. Comparing base (6b1f38e) to head (cd80968).
Report is 1 commits behind head on main.

✅ All tests successful. No failed tests found.

Files with missing lines                   Patch %   Lines
helpers/checkpoint_logger/prometheus.py    94.87%    2 Missing ⚠️

@@            Coverage Diff             @@
##             main     #800      +/-   ##
==========================================
- Coverage   97.99%   97.99%   -0.01%     
==========================================
  Files         446      446              
  Lines       36568    36630      +62     
==========================================
+ Hits        35835    35895      +60     
- Misses        733      735       +2     
Flag          Coverage Δ
integration   97.99% <96.82%> (-0.01%) ⬇️
unit          97.99% <96.82%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Components     Coverage Δ
NonTestCode    95.84% <95.00%> (-0.01%) ⬇️
OutsideTasks   97.96% <96.82%> (-0.01%) ⬇️

Files with missing lines                       Coverage Δ
helpers/tests/unit/test_checkpoint_logger.py   99.61% <100.00%> (+0.03%) ⬆️
rollouts/__init__.py                           100.00% <100.00%> (ø)
tasks/notify_error.py                          100.00% <ø> (ø)
helpers/checkpoint_logger/prometheus.py        96.77% <94.87%> (-3.23%) ⬇️

def log_counters(obj: T) -> None:
    metrics.incr(f"{klass.__name__}.events.{obj.name}")
    PROMETHEUS_HANDLER.log_checkpoints(flow=klass.__name__, checkpoint=obj.name)
def log_counters(obj: T, context: CheckpointContext = None) -> None:
Contributor

Suggested change
def log_counters(obj: T, context: CheckpointContext = None) -> None:
def log_counters(obj: T, context: CheckpointContext | None = None) -> None:

the same pattern repeats a bunch of times below. If the default is None, then None has to appear in the type as well.

):
    self.cls = cls
    self.data = data if data else {}
    self.kwargs_key = _kwargs_key(self.cls)
    self.strict = strict
    self.context = context if context else {}
Contributor

This should just be context in this case, as defaulting to {} does not really make sense.

@@ -67,28 +126,46 @@ class PrometheusCheckpointLoggerHandler:
methods in this class are mainly used by the CheckpointLogger class.
"""

def log_begun(self, flow: str):
def log_begun(self, flow: str, repo_id: int = None):
Contributor

Suggested change
def log_begun(self, flow: str, repo_id: int = None):
def log_begun(self, flow: str, repo_id: int | None = None):

same here, the argument type needs to match up with the default value.

Comment on lines 452 to 454
kwargs: MutableMapping[str, Any],
strict: bool = False,
context: CheckpointContext = None,
Contributor

for this particular case, you could just pick repo_id (aka repoid?) out of the kwargs.

Contributor Author

I don't think that would work off the bat: repoid is declared as a named keyword parameter, so it isn't part of **kwargs by the time kwargs is supplied to the checkpoints functionality. For instance, here:

worker/tasks/upload.py

Lines 300 to 301 in 5514b97

checkpoints = upload_context.get_checkpoints(kwargs)
log.info("Received upload task", extra=upload_context.log_extra())

Unless you meant re-adding it to the kwargs object before we supply it. As an add-on, I created the context to a) type the supplied params and b) serve as an object that holds items additional to the checkpoint flow. I'm open to a different approach; I chose this one for readability and separation of concerns. (Although I will rename repo_id to repoid to be consistent with our model definition.)
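
The Python behavior behind this reply can be shown in isolation. This is a sketch with hypothetical names (`run_task`, `commitid`), not the worker's actual task signature: an argument bound to a named parameter is never collected into `**kwargs`.

```python
def run_task(*, repoid: int, **kwargs) -> dict:
    # `repoid` is bound to its own parameter, so it never appears in `kwargs`.
    return kwargs


leftover = run_task(repoid=1, commitid="abc")
print(leftover)              # {'commitid': 'abc'}
print("repoid" in leftover)  # False
```

So picking repoid out of kwargs only works if the caller puts it back in before handing kwargs to the checkpoint logger.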

Contributor Author

Addressed your other suggestions, thanks 🙏

helpers/checkpoint_logger/__init__.py (outdated)
cls: type[T],
kwargs: MutableMapping[str, Any],
strict: bool = False,
context: CheckpointContext | None = None,
Contributor

if you take the TypedDict/self.data["context"] suggestion, you won't need to take this in from_kwargs. you will need to do a little extra work though

def from_kwargs(...):
    # Copy so we don't modify the passed-in kwargs
    data = kwargs.get(_kwargs_key(cls), {}).copy()

    deserialized_data = {}
    # Remove the "context" key. All remaining keys should be castable to checkpoints
    deserialized_data["context"] = data.pop("context", CheckpointContext())

    # for loop can remain the same
    for checkpoints, timestamp in data.items():
        ...
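
A runnable sketch of the copy-then-pop idea in the suggestion above. The `CheckpointContext` shape, the `UploadFlow` members, the helper name `load_checkpoints`, and the kwargs key are all assumptions made for illustration; the PR's real definitions are not shown in this thread.

```python
from enum import Enum
from typing import Any, Mapping, TypedDict


class CheckpointContext(TypedDict, total=False):
    # Assumed shape, for illustration only.
    repo_id: int


class UploadFlow(Enum):
    # Hypothetical two-member flow.
    BEGIN = "BEGIN"
    DONE = "DONE"


def load_checkpoints(kwargs: Mapping[str, Any], key: str) -> dict:
    # Shallow copy so popping "context" leaves the caller's kwargs untouched.
    data = dict(kwargs.get(key, {}))
    result: dict = {"context": data.pop("context", CheckpointContext())}
    # Every remaining key should be a checkpoint name castable to the enum.
    for checkpoint, timestamp in data.items():
        result[UploadFlow(checkpoint)] = timestamp
    return result


kwargs = {"checkpoints_UploadFlow": {"context": {"repo_id": 7}, "BEGIN": 100}}
loaded = load_checkpoints(kwargs, "checkpoints_UploadFlow")
```

After the call, `loaded` holds the context plus one enum-keyed checkpoint, and the nested dict inside `kwargs` still contains its "context" entry because only the copy was modified.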

Contributor Author

So I had to play with this a bit to make it work. The kwargs.get(_kwargs_key(cls), {}).copy() is an empty object the first time a flow is run, so instead I did

def from_kwargs(
    cls: type[T],
    kwargs: MutableMapping[str, Any],
    strict: bool = False,
) -> CheckpointLogger[T]:
    context = kwargs.pop("context", CheckpointContext())
    data = kwargs.get(_kwargs_key(cls), {})

    # kwargs has been deserialized into a Python dictionary, but our enum values
    # are deserialized as simple strings. We need to ensure the strings are all
    # proper enum values as best we can, and then downcast to enum instances.
    deserialized_data = {}
    deserialized_data["context"] = context

    for checkpoint, timestamp in data.items():

it could be prettier but that's what I could think of, wdyt?

Comment on lines 445 to 453
context = kwargs.pop("context", CheckpointContext())
data = kwargs.get(_kwargs_key(cls), {})

# kwargs has been deserialized into a Python dictionary, but our enum values
# are deserialized as simple strings. We need to ensure the strings are all
# proper enum values as best we can, and then downcast to enum instances.
deserialized_data = {}
deserialized_data["context"] = context

Contributor

there's a subtle bug here. you have to do data.pop("context") because otherwise the for loop later in this function that iterates over data.items() will explode when trying to cast "context" to an UploadFlow enum (or whichever flow). also i think kwargs.pop("context") will mutate the passed-in kwargs which might not be safe

taking a step back, i think we don't have a tidy way to manage this context. if we store it in self.data, then each flow's self.data has its own copy and that's redundant. if we put a single copy in various tasks' kwargs, we have to remember to manually pass it and make sure it's included in kwargs when we retry and yadda yadda. since we want to do similar things for log.info() calls (task_name, task_id) and SQL metrics (repo_id, owner_id, commit_sha), i wrote #810 which i think can neatly take care of it for all of them
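
The two problems raised in the first paragraph of this comment can be reproduced in a few lines. This is a sketch under assumptions: `UploadFlow` here is a hypothetical one-member enum, and `cast_all` stands in for the loop in from_kwargs by casting every key by value, which may differ from how the real code performs the cast.

```python
from enum import Enum


class UploadFlow(Enum):
    # Hypothetical flow for illustration.
    BEGIN = "BEGIN"


def cast_all(data: dict) -> dict:
    # Stand-in for the loop: every key is cast to the flow enum.
    return {UploadFlow(name): ts for name, ts in data.items()}


# 1) A leftover "context" key is not a valid enum value, so the cast raises.
data = {"context": {"repo_id": 7}, "BEGIN": 100}
try:
    cast_all(data)
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)  # True

# 2) pop() on the caller's dict removes the key for the caller as well.
task_kwargs = {"context": {"repo_id": 7}, "other": 1}
task_kwargs.pop("context")
print("context" in task_kwargs)  # False
```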

@adrian-codecov adrian-codecov added this pull request to the merge queue Oct 23, 2024
Merged via the queue into main with commit ecfe031 Oct 23, 2024
24 of 27 checks passed
@adrian-codecov adrian-codecov deleted the 2686-add-checkpoint-with-repoid branch October 23, 2024 21:36
3 participants