[GSoC] KEP for Project 6: Push-based Metrics Collection for Katib #2328

Merged
Binary file added docs/images/push-based-metrics-collection.png
77 changes: 77 additions & 0 deletions docs/proposals/push-based-metrics-collection.md
@@ -0,0 +1,77 @@
# Push-based Metrics Collection Proposal

## Links

- [katib/issues#577([Enhancement Request] Metrics Collector Push-based Implementation)](https://github.com/kubeflow/katib/issues/577)

## Motivation

[Katib](https://github.com/kubeflow/katib) is a Kubernetes-native project for automated machine learning (AutoML). It tunes hyperparameters of applications written in any language, natively supports many ML frameworks, and offers features such as early stopping and neural architecture search.

During hyperparameter tuning, the Metrics Collector, which is implemented as a sidecar container attached to each training container in the [current design](https://github.com/kubeflow/katib/blob/master/docs/proposals/metrics-collector.md), collects training logs from Trials once the training is complete. It then parses the training logs to extract metrics such as accuracy or loss and passes the evaluation results to the hyperparameter tuning algorithm.

However, the current implementation of the Metrics Collector is pull-based, which raises some [design problems](https://github.com/kubeflow/training-operator/issues/722#issuecomment-405669269): deciding how frequently to scrape the metrics, the performance overhead caused by too many sidecar containers, and the restriction that the development environment must support sidecar containers. Thus, we should implement a new API in the Katib Python SDK that offers users a push-based way to store metrics directly in the Katib DB and resolves the issues raised by pull-based metrics collection.

![](../images/push-based-metrics-collection.png)
Fig.1 Architecture of the new design

## Goal
1. **A new parameter in the Python SDK function `tune`**: allow users to specify the method of collecting metrics (push-based or pull-based).
2. **A code injection function in the mutating webhook**: recognize the metrics output lines and replace them with push-based metrics collection code.
3. The final metrics of worker pods should be **pushed to Katib DB directly** in the push mode of metrics collection.

## API

### New Parameter in Python SDK Function `tune`

We decided to add `metrics_collection_mechanism` to the `tune` function in the Python SDK.

Review comment: It should be `metrics_collector_config`, right?

Author reply: Yes, I forgot to change it. Thank you!


```Python
def tune(
    self,
    name: str,
    objective: Callable,
    parameters: Dict[str, Any],
    base_image: str = constants.BASE_IMAGE_TENSORFLOW,
    namespace: Optional[str] = None,
    env_per_trial: Optional[Union[Dict[str, str], List[Union[client.V1EnvVar, client.V1EnvFromSource]]]] = None,
    algorithm_name: str = "random",
    algorithm_settings: Union[dict, List[models.V1beta1AlgorithmSetting], None] = None,
    objective_metric_name: str = None,
    additional_metric_names: List[str] = [],
    objective_type: str = "maximize",
    objective_goal: float = None,
    max_trial_count: int = None,
    parallel_trial_count: int = None,
    max_failed_trial_count: int = None,
    resources_per_trial: Union[dict, client.V1ResourceRequirements, None] = None,
    retain_trials: bool = False,
    packages_to_install: List[str] = None,
    pip_index_url: str = "https://pypi.org/simple",
    metrics_collection_mechanism: str = "pull",  # The newly added parameter
):
    ...  # Body omitted; only the signature changes in this proposal.
```

## Implementation

### Add New Parameter in `tune`

As mentioned above, we decided to add `metrics_collection_mechanism` to the `tune` function in the Python SDK. This requires the following changes:

1. Disable injection: set `katib.kubeflow.org/metrics-collector-injection` to `disabled` when the push-based way of metrics collection is adopted so as to disable the injection of the metrics collection sidecar container.

2. Configure the way of metrics collection: set `spec.metricsCollectionSpec.collector.kind`, which specifies the way metrics are collected, to `NoneCollector`.
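
The two changes above could be derived from the new parameter roughly as follows. This is only a sketch: the helper name `metrics_collection_settings` and the shape of the returned dictionary are hypothetical, and only the injection key, the `disabled` value, the spec path, and `NoneCollector` come from this proposal.

```python
from typing import Any, Dict

# Key and values quoted from the proposal text.
INJECTION_KEY = "katib.kubeflow.org/metrics-collector-injection"


def metrics_collection_settings(mechanism: str = "pull") -> Dict[str, Any]:
    """Return the extra settings implied by the chosen mechanism.

    Hypothetical helper: the real `tune` implementation may structure
    this differently.
    """
    if mechanism not in ("pull", "push"):
        raise ValueError(f"unknown metrics collection mechanism: {mechanism!r}")

    if mechanism == "pull":
        # Pull mode keeps the current sidecar-based behavior unchanged.
        return {}

    return {
        # 1. Disable injection of the metrics collection sidecar container.
        "injection": {INJECTION_KEY: "disabled"},
        # 2. Configure the way of metrics collection.
        "experiment_spec": {
            "metricsCollectionSpec": {"collector": {"kind": "NoneCollector"}}
        },
    }
```

Returning an empty dictionary for `"pull"` keeps the default path identical to today's behavior, so existing callers of `tune` would be unaffected.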

### Code Injection in Webhook

We decided to implement a code-replacing function in the Experiment Mutating Webhook. When `spec.metricsCollectionSpec.collector.kind` is set to `NoneCollector`, the code-replacing function will recognize the metrics output lines (e.g. `print`, `log.Info`, etc.) and replace them with the push-based metrics collection code discussed in the next section. This is a better choice than offering users a `katib_client.push`-like interface, because users cannot define such an operation in a YAML file.
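
To make the idea concrete, here is a minimal sketch of such a replacement in Python. It is not the webhook implementation: it assumes metrics are printed one per line as `print(f"<name>={<expr>}")`, and the names `replace_metric_prints` and `push_metrics` are hypothetical placeholders.

```python
import re
from typing import Iterable

# Assumed output format: one metric per line, e.g. print(f"accuracy={acc}").
METRIC_PRINT = re.compile(
    r'^(?P<indent>\s*)print\(f"(?P<name>[\w-]+)=\{(?P<expr>[^{}]+)\}"\)\s*$'
)


def replace_metric_prints(source: str, metric_names: Iterable[str]) -> str:
    """Rewrite metric print lines into calls to a push function.

    Lines that do not match the assumed format, or whose metric name is
    not in `metric_names`, are left untouched.
    """
    wanted = set(metric_names)
    rewritten = []
    for line in source.splitlines():
        match = METRIC_PRINT.match(line)
        if match and match.group("name") in wanted:
            indent, name, expr = match.group("indent", "name", "expr")
            # `push_metrics` is a placeholder for the injected push code.
            rewritten.append(f"{indent}push_metrics({{{name!r}: {expr}}})")
        else:
            rewritten.append(line)
    return "\n".join(rewritten)
```

A line-based regular expression is the simplest possible recognizer; handling arbitrary logging calls would need a more robust approach, which this sketch does not attempt.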

### Push-based Metrics Collection Code

The push-based metrics collection code is a function that makes a gRPC call to the persistent API to store training metrics. It will be injected into the container args by the Experiment Mutating Webhook and then called inside the Trial Worker Pod to push metrics to the Katib DB.
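
A rough sketch of what such a push function might look like is given below. The request layout and field names are assumptions made for illustration, and the `send` callable stands in for the actual gRPC stub call, which this proposal does not yet specify.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Optional


def build_report_request(
    trial_name: str,
    metrics: Dict[str, Any],
    timestamp: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a request payload for one batch of metrics.

    The field names below are illustrative assumptions, not a confirmed
    message definition.
    """
    ts = timestamp or datetime.now(timezone.utc).isoformat()
    return {
        "trial_name": trial_name,
        "observation_log": {
            "metric_logs": [
                {"time_stamp": ts, "metric": {"name": name, "value": str(value)}}
                for name, value in metrics.items()
            ]
        },
    }


def push_metrics(
    trial_name: str,
    metrics: Dict[str, Any],
    send: Callable[[Dict[str, Any]], None],
    timestamp: Optional[str] = None,
) -> None:
    """Push metrics through `send`, a stand-in for the gRPC call."""
    send(build_report_request(trial_name, metrics, timestamp))
```

Passing the transport in as `send` keeps the sketch runnable without a cluster: in a Trial Worker Pod it would wrap the gRPC client, while in a test it can simply be `list.append`.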

### Collection of Final Metrics

The final metrics of worker pods should be pushed to Katib DB directly in the push mode of metrics collection.

\#WIP