
Add Metrics to base class #21437

Open · damccorm opened this issue Jun 4, 2022 · 19 comments

@damccorm
Contributor

damccorm commented Jun 4, 2022

Add a metric that will track prediction failures.

`num_failed_inferences`

Add metrics that track data on side-loaded models:

`num_model_loads`, `num_model_versions`, and `curr_model_version`

`curr_model_version` is less a metric and more of a static value; part of this FR will be to figure out how it fits.
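As a starting point, here is a minimal sketch of how these metrics could be declared with Beam's metrics API (the namespace and names are illustrative placeholders, not the final implementation):

```python
from apache_beam.metrics import Metrics

# Illustrative namespace; the real collector lives in apache_beam.ml.inference.base.
_NAMESPACE = 'RunInference'

# Proposed counters from this issue (names are placeholders).
num_failed_inferences = Metrics.counter(_NAMESPACE, 'num_failed_inferences')
num_model_loads = Metrics.counter(_NAMESPACE, 'num_model_loads')
num_model_versions = Metrics.counter(_NAMESPACE, 'num_model_versions')

# curr_model_version is closer to a gauge than a counter; Metrics.gauge is one
# possible fit, but how it should be reported is the open question above.
curr_model_version = Metrics.gauge(_NAMESPACE, 'curr_model_version')
```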

Imported from Jira BEAM-14043. Original Jira may contain additional context.
Reported by: Ryan.Thompson.
Subtask of issue #21435

@damccorm
Contributor Author

damccorm commented Sep 2, 2022

This probably comes after #22970

@damccorm
Contributor Author

Actually, I think there could be value in adding num_failed_inferences before worrying about the side input model loading stuff. That would probably involve putting some error handling around our run_inference call:

`result_generator = self._model_handler.run_inference(`

and have that update the metrics count. @BjornPrime, as you free up, would you mind working on this? We probably can't totally close out the issue, but we can make a dent in it.
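A rough sketch of what that error handling could look like (class and attribute names here are illustrative, not the actual code in base.py):

```python
from apache_beam.metrics import Metrics

class _RunInferenceDoFnSketch:
  """Illustrative only: counts failed inferences around run_inference."""

  def __init__(self, model_handler):
    self._model_handler = model_handler
    self._failed_inference_counter = Metrics.counter(
        'RunInference', 'num_failed_inferences')

  def _run_inference(self, batch, model, inference_args=None):
    try:
      return self._model_handler.run_inference(batch, model, inference_args)
    except Exception:
      # Whether to count the whole batch or individual elements is discussed
      # below; here the whole failed batch is counted, then the error re-raised.
      self._failed_inference_counter.inc(len(batch))
      raise
```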

cc/ @yeandy

@BjornPrime
Contributor

.take-issue

@yeandy
Contributor

yeandy commented Sep 29, 2022

Several questions around the error handling:

  1. We're working with a batch of data. If inference fails, do we consider that to be 1 num_failed_inferences? Or perhaps it should be renamed to num_failed_batch_inferences? It may depend on framework, but is there a way to figure out which specific data points in a batch fail? In which case we can also track num_failed_inferences.
  2. Do we want to do any retries for the inference call? Maybe we shouldn't, since the default behavior for a model inference failure is to throw an exception. But if we do want retries, how many times? If it succeeds after, say, 5 tries, do we swallow the error and skip incrementing the counter? Or do we still increment it?
  3. Or should the pipeline always fail?
  4. Should we give users a flag to choose between the behavior?
    @rezarokni Do you have any thoughts on points 2-4?

@rezarokni
Contributor

I think we should borrow from best-practice patterns in Beam I/Os and return errors, along with the error message, in the Prediction object.

Agreed, a flag would be a nice user experience:

Config 1 - Fail if any error

If Config 1 is false:
Config 2 - Retry the error n times (maybe n = 3?)
Config 3 - Return the error message along with the data point in the Prediction object
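A rough sketch of what Configs 2 and 3 could look like (the result type with an error field is hypothetical; Beam's PredictionResult does not carry one today):

```python
from typing import Any, NamedTuple, Optional

class PredictionResultWithError(NamedTuple):
  """Hypothetical result type carrying either a prediction or an error."""
  example: Any
  inference: Optional[Any] = None
  error: Optional[str] = None

def run_inference_with_error_handling(model_handler, batch, model,
                                      max_retries=3):
  """Illustrative only: retry n times (Config 2), then emit errors (Config 3)."""
  last_error = None
  for _ in range(max_retries):
    try:
      return model_handler.run_inference(batch, model)
    except Exception as e:
      last_error = e
  # Retries exhausted: return the error alongside each data point instead of
  # failing the pipeline.
  return [
      PredictionResultWithError(example=x, error=str(last_error))
      for x in batch
  ]
```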

@damccorm
Contributor Author

> I think we should borrow from best-practice patterns in Beam I/Os and return errors, along with the error message, in the Prediction object.

+1, I think this is a good experience. It's worth noting that we don't force ModelHandlers to always return a PredictionResult (maybe we should?), but we can at least do this for our handlers. This logic will probably need to live in the ModelHandler anyway.

> Config 1 - Fail if any error

I'm probably -1 on this; I'd prefer to avoid a proliferation of options and to be a little more opinionated here. Failing is generally a bad experience IMO; in streaming pipelines it will potentially cause the pipeline to get stuck, and in batch we fail the whole job. This is also a pretty trivial map operation in the next step if that's really what you want to do.

> Config 2 - Retry the error n times (maybe n = 3?)

Retries are more interesting to me; my question there is how often models actually fail inference in a retryable way. I would anticipate most failures to be non-retryable, and am inclined not to add this flag either.


> We're working with a batch of data. If inference fails, do we consider that to be 1 num_failed_inferences? Or perhaps it should be renamed to num_failed_batch_inferences? It may depend on framework, but is there a way to figure out which specific data points in a batch fail? In which case we can also track num_failed_inferences.

IMO handling this at the batch level is fine; anything more complex is probably more trouble than it's worth.

@yeandy
Contributor

yeandy commented Sep 30, 2022

> Retries are more interesting to me; my question there is how often models actually fail inference in a retryable way. I would anticipate most failures to be non-retryable, and am inclined not to add this flag either.

I'd say most inference failures are due to a mismatch in data shapes or data types. Retrying wouldn't help anyway, since the user would need to go back and fix their pre-processing logic.

> IMO handling this at the batch level is fine; anything more complex is probably more trouble than it's worth.

+1

@damccorm
Contributor Author

> I'd say most inference failures are due to a mismatch in data shapes or data types. Retrying wouldn't help anyway, since the user would need to go back and fix their pre-processing logic.

Right, that's what I was thinking as well. If we get into remote inference, we may want retries for network failures, but at a local level I don't think it makes much sense. ModelHandlers can also make that determination for themselves.

@rezarokni
Contributor

> Right, that's what I was thinking as well. If we get into remote inference, we may want retries for network failures, but at a local level I don't think it makes much sense. ModelHandlers can also make that determination for themselves.

Yes, future remote use cases are why I think retry is useful, as there can be many transient errors, from network issues to overloaded endpoints. As we are adding config now, perhaps we can have retry but set the default to 1.

@damccorm
Contributor Author

> Yes, future remote use cases are why I think retry is useful, as there can be many transient errors, from network issues to overloaded endpoints. As we are adding config now, perhaps we can have retry but set the default to 1.

My vote would be to defer it. I don't think it gets us anything right now, and we may decide to just bake in retries on network/server errors no matter what in the remote inference case (that's what I'd vote we do). If we do it now, we're stuck with that option regardless of whether it ends up being useful.

@rezarokni
Contributor

SG, let's defer.

damccorm added the ml label Oct 4, 2022
@BjornPrime
Contributor

Do we want the _inference_counter to increment even if the inference fails?

@yeandy
Contributor

yeandy commented Oct 5, 2022

My take is to not update self._inference_batch_latency_micro_secs, self._inference_counter, self._inference_request_batch_size, or self._inference_request_batch_byte_size inside of the update() function, since we wouldn't want to skew the statistics based on failed inferences.

@BjornPrime
Contributor

This may be beyond the scope of my assignment, but looking at the issue holistically, I think we're going to want a more generic update_metric() method that can pick out and update particular attributes. If we're going to be adding more attributes to _MetricsCollector, I don't think we'll necessarily want to be running the existing update() every time we change any given metric.

@yeandy
Contributor

yeandy commented Oct 6, 2022

Not necessarily out of scope to propose a modification to the current implementation. Can you clarify with an example of a more generic update?

@BjornPrime
Contributor

I'm just imagining a situation where we want to update one metric and not all of the others. For example, we could have a method update_metrics(metrics), where metrics is a dict with names of metrics as keys and numerical values that are interpreted according to the metric they're associated with. So if I run update_metrics({"_num_failed_inferences": 2}), it would increment _num_failed_inferences twice, but if the key was _inference_batch_latency_micro_secs, it would know to just set that equal to 2. And you could run both together with one call if you include both key-value pairs.

Maybe that's too generic for our purposes. It depends on how likely we are to want to update specific metrics but not others. Valentyn has also pointed out that none of this is directly exposed, so if we need to change it later, we shouldn't end up breaking anybody else.

I think we do need to consider how a failure affects the other metrics, though. For example, the inference counter basically just adds the number of items in the batch after the batch finishes, but if the batch is interrupted by failures, that number won't be accurate, will it? Or maybe I'm misunderstanding what's going on under the hood of run_inference.
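A rough sketch of that kind of method, assuming the counter/distribution split described below (the class and metric names are illustrative, not the actual _MetricsCollector in base.py):

```python
from apache_beam.metrics import Metrics

class _MetricsCollectorSketch:
  """Illustrative only; not the actual _MetricsCollector."""

  def __init__(self, namespace='RunInference'):
    # Counters get incremented; distributions get updated with a sample value.
    self._counters = {
        '_num_inferences': Metrics.counter(namespace, 'num_inferences'),
        '_num_failed_inferences': Metrics.counter(
            namespace, 'num_failed_inferences'),
    }
    self._distributions = {
        '_inference_batch_latency_micro_secs': Metrics.distribution(
            namespace, 'inference_batch_latency_micro_secs'),
        '_inference_request_batch_size': Metrics.distribution(
            namespace, 'inference_request_batch_size'),
    }

  def update_metrics(self, metrics):
    """Updates only the metrics named in `metrics` (a name -> value dict)."""
    for name, value in metrics.items():
      if name in self._counters:
        self._counters[name].inc(value)
      elif name in self._distributions:
        self._distributions[name].update(value)
      else:
        raise ValueError('Unknown metric: %s' % name)

# e.g. collector.update_metrics({'_num_failed_inferences': 2})
```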

@yeandy
Contributor

yeandy commented Oct 7, 2022

I like this proposal, as it makes it more flexible.

> it would increment _num_failed_inferences twice

To be clear, there are two types: Metrics.counter() and Metrics.distribution(), where we "increment" self._inference_counter and "update" the distributions self._inference_batch_latency_micro_secs, self._inference_request_batch_size, and self._inference_request_batch_byte_size.

So we should also have something like increment_metrics() for _inference_counter and _failed_inference_counter (for _num_failed_inferences).

> I think we do need to consider how a failure affects the other metrics, though. For example, the inference counter basically just adds the number of items in the batch after the batch finishes, but if the batch is interrupted by failures, that number won't be accurate, will it? Or maybe I'm misunderstanding what's going on under the hood of run_inference.

That's the right interpretation. See my comment #21437 (comment). I think the only thing we would want to update is _num_failed_inferences, with the number of elements of the batch that failed.
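For reference, a minimal illustration of the two metric primitives involved (the namespace and values are placeholders):

```python
from apache_beam.metrics import Metrics

namespace = 'RunInference'  # placeholder; real handlers use their own namespace

# A counter is "incremented": it only accumulates a running total.
inference_counter = Metrics.counter(namespace, 'num_inferences')
inference_counter.inc(32)  # e.g. one successfully inferred batch of 32 elements

# A distribution is "updated" with samples and reports min/max/sum/count.
batch_latency = Metrics.distribution(
    namespace, 'inference_batch_latency_micro_secs')
batch_latency.update(1500)  # one observed batch latency, in microseconds
```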

@BjornPrime
Contributor

So, just to make sure I'm understanding: one failed inference doesn't prevent the rest of the batch from completing?

@yeandy
Contributor

yeandy commented Oct 7, 2022

To my knowledge, particularly for scikit-learn and PyTorch, if the NumPy array or torch tensor that we pass to the model contains even one invalid/malformed instance, then prediction for the entire batch will fail. But we may need to do some testing, and more digging around to confirm, especially for TensorFlow.
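A quick way to see this behavior with scikit-learn (illustrative; the exact exception message depends on the library version):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(np.array([[0.0], [1.0]]), np.array([0.0, 1.0]))

# One malformed element (a NaN) fails predict() for the whole batch, so we
# cannot tell which individual elements would have succeeded.
bad_batch = np.array([[0.5], [np.nan]])
try:
  model.predict(bad_batch)
except ValueError as e:
  print('entire batch failed:', e)
```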
