
Add Metrics to base class #21437

Open · damccorm opened this issue Jun 4, 2022 · 19 comments

@damccorm
Contributor

damccorm commented Jun 4, 2022

Add a metric that will track prediction failures.

`num_failed_inferences`

Add metrics that track data on side-loaded models:

`num_model_loads`, `num_model_versions`, and `curr_model_version`

`curr_model_version` is less a metric and more of a static value; part of this FR will be to figure out how it fits.
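As a starting point, here is a minimal sketch of how these metrics could be declared with Beam's metrics API (the namespace and names are illustrative placeholders, not the final implementation):

```python
from apache_beam.metrics import Metrics

# Illustrative namespace; the real collector lives in apache_beam.ml.inference.base.
_NAMESPACE = 'RunInference'

# Proposed counters from this issue (names are placeholders).
num_failed_inferences = Metrics.counter(_NAMESPACE, 'num_failed_inferences')
num_model_loads = Metrics.counter(_NAMESPACE, 'num_model_loads')
num_model_versions = Metrics.counter(_NAMESPACE, 'num_model_versions')

# curr_model_version is closer to a gauge than a counter; Metrics.gauge is one
# possible fit, but how it should be reported is the open question above.
curr_model_version = Metrics.gauge(_NAMESPACE, 'curr_model_version')
```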

Imported from Jira BEAM-14043. Original Jira may contain additional context.
Reported by: Ryan.Thompson.
Subtask of issue #21435

@damccorm
Contributor Author

damccorm commented Sep 2, 2022

This probably comes after #22970

@damccorm
Contributor Author

Actually, I think there could be value in adding num_failed_inferences before worrying about the side input model loading stuff. That would probably involve putting some error handling around our run_inference call:

`result_generator = self._model_handler.run_inference(`

and have that update the metrics count. @BjornPrime, as you free up, would you mind working on this? We probably can't totally close out the issue, but we can make a dent in it.
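A rough sketch of what that error handling could look like (class and attribute names here are illustrative, not the actual code in base.py):

```python
from apache_beam.metrics import Metrics

class _RunInferenceDoFnSketch:
  """Illustrative only: counts failed inferences around run_inference."""

  def __init__(self, model_handler):
    self._model_handler = model_handler
    self._failed_inference_counter = Metrics.counter(
        'RunInference', 'num_failed_inferences')

  def _run_inference(self, batch, model, inference_args=None):
    try:
      return self._model_handler.run_inference(batch, model, inference_args)
    except Exception:
      # Whether to count the whole batch or individual elements is discussed
      # below; here the whole failed batch is counted, then the error re-raised.
      self._failed_inference_counter.inc(len(batch))
      raise
```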

cc/ @yeandy

@BjornPrime
Contributor

.take-issue

@yeandy
Contributor

yeandy commented Sep 29, 2022

Several questions around the error handling:

  1. We're working with a batch of data. If inference fails, do we consider that to be 1 num_failed_inferences? Or perhaps it should be renamed to num_failed_batch_inferences? It may depend on framework, but is there a way to figure out which specific data points in a batch fail? In which case we can also track num_failed_inferences.
  2. Do we want to do any retries for the inference call? Maybe we shouldn't, since the default behavior for a model inference failure is to throw an exception. But if we do want retries, how many times? If it succeeds after, say, 5 tries, do we swallow the error and skip incrementing the counter? Or do we still increment it?
  3. Or should the pipeline always fail?
  4. Should we give users a flag to choose between the behavior?
    @rezarokni Do you have any thoughts on points 2-4?

@rezarokni
Contributor

I think we should borrow from best-practice patterns in Beam I/Os and return errors, along with the error message, in the Prediction object.

Agreed, a flag would be a nice user experience:

Config 1 - Fail if any error

If Config 1 is false:
Config 2 - Retry the error n times (maybe n = 3?)
Config 3 - Return the error message along with the data point in the Prediction object
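A rough sketch of what Configs 2 and 3 could look like (the result type with an error field is hypothetical; Beam's PredictionResult does not carry one today):

```python
from typing import Any, NamedTuple, Optional

class PredictionResultWithError(NamedTuple):
  """Hypothetical result type carrying either a prediction or an error."""
  example: Any
  inference: Optional[Any] = None
  error: Optional[str] = None

def run_inference_with_error_handling(model_handler, batch, model,
                                      max_retries=3):
  """Illustrative only: retry n times (Config 2), then emit errors (Config 3)."""
  last_error = None
  for _ in range(max_retries):
    try:
      return model_handler.run_inference(batch, model)
    except Exception as e:
      last_error = e
  # Retries exhausted: return the error alongside each data point instead of
  # failing the pipeline.
  return [
      PredictionResultWithError(example=x, error=str(last_error))
      for x in batch
  ]
```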

@damccorm
Contributor Author

> I think we should borrow from best-practice patterns in Beam I/Os and return errors, along with the error message, in the Prediction object.

+1, I think this is a good experience. It's worth noting that we don't force ModelHandlers to always return a PredictionResult (maybe we should?), but we can at least do this for our handlers. This logic will probably need to live in the ModelHandler anyway.

> Config 1 - Fail if any error

I'm probably -1 on this; I'd prefer to avoid a proliferation of options and to be a little more opinionated here. Failing is generally a bad experience IMO; in streaming pipelines it will potentially cause the pipeline to get stuck, and in batch we fail the whole job. This is also a pretty trivial map operation in the next step if that's really what you want to do.

> Config 2 - Retry the error n times (maybe n = 3?)

Retries are more interesting to me; my question there is how often models actually fail inference in a retryable way. I would anticipate most failures to be non-retryable, and am inclined not to add this flag either.


> We're working with a batch of data. If inference fails, do we consider that to be 1 num_failed_inferences? Or perhaps it should be renamed to num_failed_batch_inferences? It may depend on framework, but is there a way to figure out which specific data points in a batch fail? In which case we can also track num_failed_inferences.

IMO handling this at the batch level is fine; anything more complex is probably more trouble than it's worth.

@yeandy
Contributor

yeandy commented Sep 30, 2022

> Retries are more interesting to me; my question there is how often models actually fail inference in a retryable way. I would anticipate most failures to be non-retryable, and am inclined not to add this flag either.

I'd say most inference failures are due to a mismatch in data shapes or data types. Retrying wouldn't help anyway, since the user would need to go back and fix their pre-processing logic.

> IMO handling this at the batch level is fine; anything more complex is probably more trouble than it's worth.

+1

@damccorm
Contributor Author

> I'd say most inference failures are due to a mismatch in data shapes or data types. Retrying wouldn't help anyway, since the user would need to go back and fix their pre-processing logic.

Right, that's what I was thinking as well. If we get into remote inference, we may want retries for network failures, but at a local level I don't think it makes much sense. ModelHandlers can also make that determination for themselves.

@rezarokni
Contributor

> Right, that's what I was thinking as well. If we get into remote inference, we may want retries for network failures, but at a local level I don't think it makes much sense. ModelHandlers can also make that determination for themselves.

Yes, future remote use cases are why I think retry is useful, as there can be many transient errors, from network issues to overloaded endpoints. As we are adding config now, perhaps we can have retry but set the default to 1.

@damccorm
Contributor Author

> Yes, future remote use cases are why I think retry is useful, as there can be many transient errors, from network issues to overloaded endpoints. As we are adding config now, perhaps we can have retry but set the default to 1.

My vote would be to defer it. I don't think it gets us anything right now, and we may decide to just bake in retries on network/server errors no matter what in the remote inference case (that's what I'd vote we do). If we do it now, we're stuck with that option regardless of whether it ends up being useful.

@rezarokni
Contributor

SG, let's defer.

damccorm added the ml label Oct 4, 2022
@BjornPrime
Contributor

Do we want the _inference_counter to increment even if the inference fails?

@yeandy
Contributor

yeandy commented Oct 5, 2022

My take is to not update self._inference_batch_latency_micro_secs, self._inference_counter, self._inference_request_batch_size, or self._inference_request_batch_byte_size inside of the update() function, since we wouldn't want to skew the statistics based on failed inferences.

@BjornPrime
Contributor

This may be beyond the scope of my assignment, but looking at the issue holistically, I think we're going to want a more generic update_metric() method that can pick out and update particular attributes. If we're going to be adding more attributes to _MetricsCollector, I don't think we'll necessarily want to be running the existing update() every time we change any given metric.

@yeandy
Contributor

yeandy commented Oct 6, 2022

Not necessarily out of scope to propose a modification to the current implementation. Can you clarify with an example of a more generic update?

@BjornPrime
Contributor

I'm just imagining a situation where we want to update one metric and not all of the others. For example, we could have a method update_metrics(metrics), where metrics is a dict with names of metrics as keys and numerical values that are interpreted according to the metric they're associated with. So if I run update_metrics({"_num_failed_inferences": 2}), it would increment _num_failed_inferences twice, but if the key was _inference_batch_latency_micro_secs, it would know to just set that equal to 2. And you could run both together with one call if you include both key-value pairs.

Maybe that's too generic for our purposes. It depends on how likely we are to want to update specific metrics but not others. Valentyn has also pointed out that none of this is directly exposed, so if we need to change it later, we shouldn't end up breaking anybody else.

I think we do need to consider how a failure affects the other metrics, though. For example, the inference counter basically just adds the number of items in the batch after the batch finishes, but if the batch is interrupted by failures, that number won't be accurate, will it? Or maybe I'm misunderstanding what's going on under the hood of run_inference.
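A rough sketch of that kind of method, assuming the counter/distribution split described below (the class and metric names are illustrative, not the actual _MetricsCollector in base.py):

```python
from apache_beam.metrics import Metrics

class _MetricsCollectorSketch:
  """Illustrative only; not the actual _MetricsCollector."""

  def __init__(self, namespace='RunInference'):
    # Counters get incremented; distributions get updated with a sample value.
    self._counters = {
        '_num_inferences': Metrics.counter(namespace, 'num_inferences'),
        '_num_failed_inferences': Metrics.counter(
            namespace, 'num_failed_inferences'),
    }
    self._distributions = {
        '_inference_batch_latency_micro_secs': Metrics.distribution(
            namespace, 'inference_batch_latency_micro_secs'),
        '_inference_request_batch_size': Metrics.distribution(
            namespace, 'inference_request_batch_size'),
    }

  def update_metrics(self, metrics):
    """Updates only the metrics named in `metrics` (a name -> value dict)."""
    for name, value in metrics.items():
      if name in self._counters:
        self._counters[name].inc(value)
      elif name in self._distributions:
        self._distributions[name].update(value)
      else:
        raise ValueError('Unknown metric: %s' % name)

# e.g. collector.update_metrics({'_num_failed_inferences': 2})
```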

@yeandy
Contributor

yeandy commented Oct 7, 2022

I like this proposal, as it makes it more flexible.

> it would increment _num_failed_inferences twice

To be clear, there are two types: Metrics.counter() and Metrics.distribution(), where we "increment" self._inference_counter and "update" the distributions self._inference_batch_latency_micro_secs, self._inference_request_batch_size, and self._inference_request_batch_byte_size.

So we should also have something like increment_metrics() for _inference_counter and _failed_inference_counter (for _num_failed_inferences).

> I think we do need to consider how a failure affects the other metrics, though. For example, the inference counter basically just adds the number of items in the batch after the batch finishes, but if the batch is interrupted by failures, that number won't be accurate, will it? Or maybe I'm misunderstanding what's going on under the hood of run_inference.

That's the right interpretation. See my comment #21437 (comment). I think the only thing we would want to update is _num_failed_inferences, with the number of elements of the batch that failed.
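For reference, a minimal illustration of the two metric primitives involved (the namespace and values are placeholders):

```python
from apache_beam.metrics import Metrics

namespace = 'RunInference'  # placeholder; real handlers use their own namespace

# A counter is "incremented": it only accumulates a running total.
inference_counter = Metrics.counter(namespace, 'num_inferences')
inference_counter.inc(32)  # e.g. one successfully inferred batch of 32 elements

# A distribution is "updated" with samples and reports min/max/sum/count.
batch_latency = Metrics.distribution(
    namespace, 'inference_batch_latency_micro_secs')
batch_latency.update(1500)  # one observed batch latency, in microseconds
```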

@BjornPrime
Contributor

So, just to make sure I'm understanding: one failed inference doesn't prevent the rest of the batch from completing?

@yeandy
Contributor

yeandy commented Oct 7, 2022

To my knowledge, particularly for scikit-learn and PyTorch, if the NumPy array or torch tensor that we pass to the model contains even one invalid/malformed instance, then prediction for the entire batch will fail. But we may need to do some testing, and more digging around to confirm, especially for TensorFlow.
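A quick way to see this behavior with scikit-learn (illustrative; the exact exception message depends on the library version):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(np.array([[0.0], [1.0]]), np.array([0.0, 1.0]))

# One malformed element (a NaN) fails predict() for the whole batch, so we
# cannot tell which individual elements would have succeeded.
bad_batch = np.array([[0.5], [np.nan]])
try:
  model.predict(bad_batch)
except ValueError as e:
  print('entire batch failed:', e)
```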
