
Proposal to mxnet.metric #18046

Closed
acphile opened this issue Apr 14, 2020 · 4 comments

Comments

@acphile
Contributor

acphile commented Apr 14, 2020

Motivation

mxnet.metric provides various metrics for users to evaluate the performance of models, but it currently has several shortcomings that need to be addressed. We propose to refactor the metrics interface to fix these issues and place the new interface under mx.gluon.metrics.

In general, we want to make the following improvements:

  1. Move the API to the gluon namespace
  2. Make the API more user-friendly and Pythonic
  3. Structure the API so that hybridizing the complete training loop becomes feasible in the future

1. Inconsistency in computational granularity of metrics

Currently there are two computational granularities in mxnet.metric:

  1. “macro” level: calculate average performance per batch, as in the implementation of MAE
  2. “micro” level: calculate average performance per sample, as in the implementations of Accuracy and CrossEntropy

Generally, the “micro” level is more useful because we usually care about the average performance over the samples in the test set rather than over the test batches, and the two disagree whenever batch sizes differ (see the illustration below). We therefore need to unify the computational granularity across these metrics.
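
A small numpy illustration of how the two granularities can differ when batch sizes differ (illustrative numbers):

```python
import numpy as np

# per-sample absolute errors from two batches of different sizes
batch1 = np.array([0.5, 0.5])                    # batch MAE = 0.5  (2 samples)
batch2 = np.array([0.1] * 8)                     # batch MAE = 0.1  (8 samples)

macro = np.mean([batch1.mean(), batch2.mean()])  # average per batch  -> 0.30
micro = np.concatenate([batch1, batch2]).mean()  # average per sample -> 0.18
print(macro, micro)
```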

2. For future hybridization of the complete training loop

Currently, metrics in mxnet.metric receive lists of NDArray and compute results with numpy. In fact, many metrics’ computations could be implemented as nn.HybridBlock. With HybridBlock.hybridize(), the computation could then run in the backend, which could be faster. By refactoring mxnet.metric, we could one day compile the model together with the metric, as TensorFlow does, and run the complete training loop, including evaluation, fully in the backend. Our new API design therefore takes the hybridization use case into account, so that hybridizing the complete training loop becomes easily possible once the backend support is there.
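
For illustration, a minimal sketch (not the proposed implementation) of how a metric such as MAE could be written as a HybridBlock with today's Gluon API:

```python
import mxnet as mx
from mxnet.gluon import nn

class HybridMAE(nn.HybridBlock):
    """Sketch only: mean absolute error expressed as a HybridBlock, so that
    after hybridize() the computation runs in the backend."""
    def hybrid_forward(self, F, pred, label):
        # F is mx.nd in imperative mode and mx.sym after hybridization
        return F.mean(F.abs(pred - label))

mae = HybridMAE()
mae.hybridize()
print(mae(mx.nd.array([0.1, 0.4, 0.9]), mx.nd.array([0.0, 0.5, 1.0])))  # ~0.1
```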

3. Lacking some useful metrics

Although many metrics are already included, some still need to be implemented.

Apart from the metrics already provided in mxnet.metric: http://mxnet.incubator.apache.org/api/python/docs/api/metric/index.html?highlight=metric#module-mxnet.metric , we plan to add the following metrics:

  1. F-beta score: F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall) (see the sketch after this list)
  2. binary accuracy with threshold: use a confidence threshold to decide whether an example is classified as positive or negative
  3. MeanCosineSimilarity: return the average cosine similarity between predictions and ground truth
  4. MeanPairwiseDistance: return the average pairwise distance between predictions and ground truth
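
A rough numpy sketch of the first two items (fbeta_score and binary_accuracy are illustrative helpers, not the proposed API):

```python
import numpy as np

def fbeta_score(precision, recall, beta=1.0, eps=1e-12):
    # F-beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall + eps)

def binary_accuracy(pred, label, threshold=0.5):
    # confidence scores -> hard labels via the threshold, then plain accuracy
    pred = np.asarray(pred).reshape(-1)        # accepts (N,) or (N, 1) confidences
    label = np.asarray(label).reshape(-1)
    return float(((pred > threshold) == (label > 0)).mean())

print(fbeta_score(0.75, 0.6, beta=2.0))              # recall-weighted F-score
print(binary_accuracy([0.1, 0.3, 0.7], [0, 1, 1]))   # 2 of 3 correct -> 0.666...
```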

4. Fixing issues in the existing metrics

Some special cases and input shapes need to be examined and fixed.
About EvalMetric (the base class in metric.py)

  1. distinction between local and global accumulators:
    a. Currently, for metrics in metric.py, when update() is called, both the local accumulator and the global accumulator are updated with the same value.
    b. The global accumulator is useful when the evaluation consists of several parts (for example, joint training on different datasets). You may want to get the evaluation result of one part and call reset_local() to continue the evaluation on the next part. In the end, you can call get_global() to obtain the overall evaluation performance (see the usage sketch after this list).
    c. You may also define how the local and global results are updated in your own metric (a subclass of EvalMetric).
  2. parameters “output_names”, “label_names” and method “update_dict”:
    a. I only find “update_dict” used in https://github.com/apache/incubator-mxnet/blob/48e9e2c6a1544843ba860124f4eaa8e7bac6100b/python/mxnet/module/executor_group.py, where I think using “update” would also be reasonable.
    b. I don’t know where the corresponding parameters “output_names” and “label_names” could be used, since there are no corresponding examples.
  3. get_name_value():
    a. Returns pairs of the metric’s name and its evaluation value.
    b. It is helpful when using CompositeEvalMetric.
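
A usage sketch of the local/global pattern described in item 1, using the existing Accuracy metric (update, get, reset_local and get_global are the methods mentioned above):

```python
import mxnet as mx

acc = mx.metric.Accuracy()

parts = [  # e.g. two evaluation datasets; (predictions, labels) per batch
    [(mx.nd.array([[0.2, 0.8], [0.9, 0.1]]), mx.nd.array([1, 0]))],
    [(mx.nd.array([[0.6, 0.4], [0.3, 0.7]]), mx.nd.array([1, 1]))],
]

for part in parts:
    for preds, labels in part:
        acc.update([labels], [preds])       # updates local and global accumulators
    print("part result:", acc.get())        # result for this part only
    acc.reset_local()                       # clear local state, keep global state

print("overall result:", acc.get_global()) # accumulated over all parts
```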

Here are the detailed changes to be made:

  1. improve Class MAE (and MSE, RMSE)
    a. including parameter “average”, default average=“macro”
    i. “macro” represents average per batch
    ii. “micro” represents average per example
    b. including micro-level calculation
  2. improve Class _BinaryClassification
    a. support the situation len(pred.shape)==1
    i. for binary classification, we only need to output a confidence score of being positive, like: pred=[0.1,0.3,0.7] or like pred=[[0.1],[0.3],[0.7]]
    b. including parameter “threshold”, default: threshold=0.5
    i. sometimes we may need to define a threshold such that when confidence(positive) > threshold, the example is classified as positive, otherwise negative
    c. including parameter “beta”, default: beta=1
    i. updating the “fscore” calculation with F-beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall), which is more general
    d. including method binary_accuracy:
    i. calculation: (true_positives+true_negatives)/total_examples
  3. improve Class TopKAccuracy
    a. Line 578-579: self.global_sum_metric should be accumulated
  4. add Class MeanCosineSimilarity(axis=-1, eps=1e-12)
  5. add Class MeanPairwiseDistance(p=2)
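
Rough numpy sketches of the two proposed metrics (signatures follow the parameters listed above; these are illustrations, not the final implementations):

```python
import numpy as np

def mean_cosine_similarity(pred, label, axis=-1, eps=1e-12):
    # average cosine similarity between matching prediction/label vectors
    pred, label = np.asarray(pred, float), np.asarray(label, float)
    num = (pred * label).sum(axis=axis)
    den = np.linalg.norm(pred, axis=axis) * np.linalg.norm(label, axis=axis)
    return float((num / np.maximum(den, eps)).mean())

def mean_pairwise_distance(pred, label, p=2):
    # average L_p distance between matching prediction/label vectors
    diff = np.abs(np.asarray(pred, float) - np.asarray(label, float))
    return float(np.power(np.power(diff, p).sum(axis=-1), 1.0 / p).mean())

pred  = np.array([[1.0, 0.0], [0.0, 1.0]])
label = np.array([[1.0, 0.0], [1.0, 0.0]])
print(mean_cosine_similarity(pred, label))   # (1 + 0) / 2 = 0.5
print(mean_pairwise_distance(pred, label))   # (0 + sqrt(2)) / 2 ≈ 0.707
```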

Comparisons with other frameworks

Compared with PyTorch Ignite

Reference: https://pytorch.org/ignite/metrics.html
The base class for metrics is implemented independently. Metrics in ignite.metrics use the .attach() method to consume the output of the engine’s process_function; this is done by having the engine call add_event_handler.
Metric arithmetic is supported, which is similar to mxnet.metric.CustomMetric.
Some metrics are currently not included in ours:

  1. ConfusionMatrix
  2. DiceCoefficient()
  3. IoU()
  4. mIoU()
  5. MeanPairwiseDistance

Compared with Tensorflow Keras

Reference: https://tensorflow.google.cn/api_docs/python/tf/keras/metrics?hl=en
The base class for metrics inherits from tf.keras.engine.base_layer.Layer, which is also the class from which all layers inherit. Metric functions in tf.keras.metrics can be supplied in the metrics parameter when a model is compiled.
Generally, metric functions in tf.keras.metrics take a sample_weight input that defines each sample’s contribution when updating the state.
tf.keras.metrics uses Accuracy and SparseCategoricalAccuracy to distinguish the case where y_pred is a predicted label from the case where y_pred is a probability distribution, which I think may be to avoid internal shape checking. Currently we could combine them in one metric.
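
For context, a minimal tf.keras example of that distinction:

```python
import tensorflow as tf

acc = tf.keras.metrics.Accuracy()                   # y_pred is the predicted label
acc.update_state(y_true=[1, 2, 2], y_pred=[1, 2, 0])
print(acc.result().numpy())                         # 0.666...

sca = tf.keras.metrics.SparseCategoricalAccuracy()  # y_pred is a probability distribution
sca.update_state(y_true=[1, 2], y_pred=[[0.1, 0.8, 0.1], [0.2, 0.2, 0.6]])
print(sca.result().numpy())                         # 1.0
```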
Some metrics are currently not included in ours:

  1. AUC
  2. BinaryAccuracy
  3. Hinge-related metrics, like SquaredHinge, Hinge, CategoricalHinge
  4. CosineSimilarity
  5. KLDivergence
  6. LogCoshError: logcosh = log((exp(x) + exp(-x)) / 2), where x is the error (y_pred - y_true)
  7. MeanIoU
  8. Poisson
  9. SensitivityAtSpecificity
@sxjscience
Member

I think we can also borrow ideas from the design in AllenNLP: https://github.com/allenai/allennlp/tree/master/allennlp/training/metrics

@acphile acphile mentioned this issue Apr 16, 2020
@sxjscience
Member

Also, I suggest removing the option of macro averaging. I don't think the current implementation is correct. In scikit-learn, there is no macro option for MAE (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error) or MSE (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error). And for the F1 score, the macro option is used for multi-label/multi-class prediction. See also: #9586 (comment)

@acphile
Contributor Author

acphile commented Apr 27, 2020

Here are the updated changes to be made:

1. improve Class MAE, MSE, RMSE

a. UPD: remove “macro” support, which represents average per batch
b. Rewrite RMSE to inherit from MSE

2. improve Class _BinaryClassification

a. UPD: including parameter “class_type” in [‘binary’, ‘multiclass’, ‘multilabel’]
b. support the situation len(pred.shape)==1 for class_type='binary'
     i. for binary classification, we only need to output a confidence score of being positive, like: pred=[0.1,0.3,0.7] or like pred=[[0.1],[0.3],[0.7]]
c. including parameter “threshold”, default: threshold=0.5
     i. sometimes we may need to define a threshold that when confidence(positive) > threshold, we classify it as positive, otherwise negative
     ii. used when class_type in [‘binary’, ‘multilabel’]
d. including parameter “beta” default: beta=1
     i. updating “fscore” calculation with F-beta= (1+beta^2)*precision*recall/(beta^2*precision+recall), which is more general
e. UPD: add cases for multilabel/multiclass
     i. including parameter ‘class_type’ in [‘binary’, ‘multilabel’, ‘multiclass’]
     ii. For ‘multilabel’, pred should be (N, ..., C) and label should be (N, ..., C)
     iii. For ‘multiclass’, pred should be (N, ..., C) and label should be (N, ...)
f. UPD: replace global_fscore with micro_fscore

3. add Class BinaryAccuracy(threshold=0.5)

4. add Class MeanCosineSimilarity(axis=-1, eps=1e-12)

5. add Class MeanPairwiseDistance(p=2)

6. improve Class F1:

a. F1(class_type="binary", threshold=0.5, average="micro")
b. average in [“binary”, “micro”, “macro”] (the micro/macro difference is illustrated in the sketch after this list):
     i. "macro": Calculate metrics for each label and return the unweighted mean of the per-class f1.
     ii. "micro": Calculate metrics globally by counting the total TP, FN and FP.
     iii. None: Return f1 scores for each class (numpy.ndarray).

7. add Class Fbeta(class_type="binary", beta=1, threshold=0.5, average="micro")

8. UPD: using mxnet.numpy instead of numpy
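
A small numpy sketch of the micro vs. macro averaging in item 6, assuming per-class TP/FP/FN counts (illustrative numbers, hypothetical helper name):

```python
import numpy as np

def f1_from_counts(tp, fp, fn, eps=1e-12):
    precision = tp / np.maximum(tp + fp, eps)
    recall = tp / np.maximum(tp + fn, eps)
    return 2 * precision * recall / np.maximum(precision + recall, eps)

# per-class counts for a 3-class problem
tp, fp, fn = np.array([8., 2., 5.]), np.array([1., 4., 2.]), np.array([2., 3., 1.])

macro_f1 = f1_from_counts(tp, fp, fn).mean()              # unweighted mean of per-class F1
micro_f1 = f1_from_counts(tp.sum(), fp.sum(), fn.sum())   # counts pooled over all classes
print(macro_f1, micro_f1)
```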

@leezu
Contributor

leezu commented May 27, 2020

Closed by #18083

@leezu leezu closed this as completed May 27, 2020