Proposal to mxnet.metric #18046
Comments
I think we can also borrow ideas from the design in AllenNLP: https://github.com/allenai/allennlp/tree/master/allennlp/training/metrics
Also, I suggest removing the option of
Here are the updated changes to be made:
1. Improve Class MAE, MSE, RMSE
   a. UPD: remove "macro" support, which represents average per batch
2. Improve Class _BinaryClassification
   a. UPD: include parameter "class_type" in ['binary', 'multiclass', 'multilabel']
3. Add Class BinaryAccuracy(threshold=0.5)
4. Add Class MeanCosineSimilarity(axis=-1, eps=1e-12)
5. Add Class MeanPairwiseDistance(p=2)
6. Improve Class F1:
   a. F1(class_type="binary", threshold=0.5, average="micro")
7. Add Class Fbeta(class_type="binary", beta=1, threshold=0.5, average="micro")
8. UPD: use mxnet.numpy instead of numpy
Closed by #18083
Motivation
mxnet.metric provides different methods for users to judge the performance of models, but it currently has several shortcomings that need to be addressed. We propose to refactor the metrics interface to fix these issues and place the new interface under mx.gluon.metrics.
In general, we want to make the following improvements:
1. Inconsistency in computational granularity of metrics
Currently there are two computational granularities in mxnet.metric: "macro" (average per batch) and "micro" (average per example).
Generally, the "micro" level is more useful because we usually care about the average performance over the samples in the test set rather than over the test batches. So we need to make the granularity consistent across these metrics; the sketch below illustrates the difference.
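As a small illustration (plain numpy, MAE; not part of the existing API), the two granularities diverge as soon as batch sizes differ:

```python
import numpy as np

# Two batches of different sizes, each a (label, pred) pair.
batches = [
    (np.array([0.0, 1.0]), np.array([0.5, 0.5])),                      # 2 examples, batch MAE = 0.5
    (np.array([1.0, 1.0, 1.0, 1.0]), np.array([0.0, 0.0, 0.0, 0.0])),  # 4 examples, batch MAE = 1.0
]

# "macro": average the per-batch MAE values
macro = np.mean([np.abs(label - pred).mean() for label, pred in batches])  # (0.5 + 1.0) / 2 = 0.75

# "micro": accumulate absolute error and example count, divide once at the end
total_err = sum(np.abs(label - pred).sum() for label, pred in batches)     # 1.0 + 4.0 = 5.0
total_cnt = sum(label.size for label, _ in batches)                        # 2 + 4 = 6
micro = total_err / total_cnt                                              # ~0.833

print(macro, micro)
```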
2. For future hybridization of the complete training loop
Currently, metrics in mxnet.metric receive lists of NDArray and compute results with numpy. In fact, many metrics' computations could be implemented in nn.HybridBlock. Using HybridBlock.hybridize(), the computation could then be done in the backend, which could be faster. By refactoring mxnet.metric, we could one day compile the model together with the metric, as TensorFlow does, and run the complete training loop including evaluation fully in the backend. Thus the new API design takes the hybridization use case into account, so that hybridizing the complete training loop becomes possible once the backend support is there.
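As a hedged sketch of this direction (the class name and accumulation strategy are illustrative, not the proposed API), the batch-level part of a metric can already be expressed as a HybridBlock today:

```python
import mxnet as mx
from mxnet.gluon import nn

class BatchSquaredError(nn.HybridBlock):
    """Batch-level part of an MSE-style metric expressed as a HybridBlock.

    A stateful wrapper (not shown) would accumulate the returned sum and the
    example count across update() calls to produce the final metric value."""
    def hybrid_forward(self, F, pred, label):
        # works with both mx.nd (imperative) and mx.sym (hybridized) backends
        return F.sum(F.square(pred - label))

block = BatchSquaredError()
block.hybridize()  # after hybridization the computation runs in the backend

pred = mx.nd.array([0.1, 0.3, 0.7])
label = mx.nd.array([0.0, 0.0, 1.0])
print(block(pred, label))  # sum of squared errors for this batch
```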
3. Lacking some useful metrics
Although many metrics are already included, some still need to be implemented.
Apart from the metrics already provided in mxnet.metric (http://mxnet.incubator.apache.org/api/python/docs/api/metric/index.html?highlight=metric#module-mxnet.metric), we plan to add the following metrics:
4. Fixing issues in the existing metrics
Some special cases and input shapes need to be examined and fixed.
About EvalMetric (base class in metric.py)
a. Currently for metrics in metric.py, when update() is called, both the local accumulator and the global accumulator are updated with the same value.
b. A global accumulator can be useful when evaluation consists of several parts (for example, joint training on different datasets). You can get the evaluation result of one part, call "reset_local()" to continue the evaluation on the next part, and in the end call "get_global()" to obtain the overall evaluation performance, as in the sketch below.
c. You may also define how local and global results are updated in your own metric (a subclass of EvalMetric).
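A minimal usage sketch of this pattern with the existing interface (the data for the two evaluation parts is placeholder data, shown only to make the snippet runnable):

```python
import mxnet as mx

acc = mx.metric.Accuracy()

# Placeholder data for two evaluation parts: lists of (labels, preds) batches.
part_one = [([mx.nd.array([0, 1])], [mx.nd.array([[0.9, 0.1], [0.3, 0.7]])])]
part_two = [([mx.nd.array([1, 1])], [mx.nd.array([[0.8, 0.2], [0.2, 0.8]])])]

for labels, preds in part_one:
    acc.update(labels, preds)
print("part one:", acc.get())        # local result for part one
acc.reset_local()                    # clear local state, keep the global accumulator

for labels, preds in part_two:
    acc.update(labels, preds)
print("part two:", acc.get())        # local result for part two
print("overall:", acc.get_global())  # accumulated over both parts
```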
a. It seems that "update_dict" is only used in https://github.com/apache/incubator-mxnet/blob/48e9e2c6a1544843ba860124f4eaa8e7bac6100b/python/mxnet/module/executor_group.py, where I think using "update" would also be reasonable.
b. I don't know where the corresponding parameters "output_names" and "label_names" could be used, since there are no corresponding examples.
a. Return pairs of the metric's name and its evaluation value.
b. This is helpful when using CompositeEvalMetric; see the sketch below.
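For instance, with the existing mxnet.metric interface:

```python
import mxnet as mx

composite = mx.metric.CompositeEvalMetric()
composite.add(mx.metric.Accuracy())
composite.add(mx.metric.CrossEntropy())

labels = [mx.nd.array([0, 1, 1])]
preds = [mx.nd.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])]
composite.update(labels, preds)

print(composite.get())             # (['accuracy', 'cross-entropy'], [values])
print(composite.get_name_value())  # [('accuracy', ...), ('cross-entropy', ...)]
```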
Here are the detailed changes to be made:
a. including parameter “average”, default average=“macro”
i. “macro” represents average per batch
ii. “micro” represents average per example
b. including micro level calculation:
a. support the situation len(pred.shape)==1
i. for binary classification, we only need to output a confidence score of being positive, e.g. pred=[0.1,0.3,0.7] or pred=[[0.1],[0.3],[0.7]]
b. including parameter “threshold”, default: threshold=0.5
i. sometimes we may need to define a threshold such that when confidence(positive) > threshold we classify the example as positive, and otherwise as negative
c. including parameter “beta” default: beta=1
i. updating the "fscore" calculation to the more general F-beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
d. including method binary_accuracy (see the sketch after this list):
i. calculation: (true_positives + true_negatives) / total_examples
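A hedged sketch of items b–d in plain numpy (helper names and signatures are illustrative, not the final API):

```python
import numpy as np

def binary_accuracy(labels, preds, threshold=0.5):
    labels = np.asarray(labels).reshape(-1)
    preds = np.asarray(preds).reshape(-1)          # accepts pred of shape (N,) or (N, 1)
    pred_labels = (preds > threshold).astype(labels.dtype)
    # (true_positives + true_negatives) / total_examples
    return (pred_labels == labels).mean()

def fbeta(precision, recall, beta=1.0):
    # F-beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

print(binary_accuracy([0, 0, 1], [[0.1], [0.3], [0.7]]))  # 1.0
print(fbeta(precision=0.5, recall=1.0, beta=1.0))         # 0.666..., i.e. the usual F1
```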
a. Line 578-579: self.global_sum_metric should be accumulated
Comparison with other frameworks
Compared with Pytorch Ignite
Reference: https://pytorch.org/ignite/metrics.html
The base class for metrics is implemented independently. Metrics in ignite.metrics use the .attach() method to consume the output of the engine's process_function; this is done by letting the engine add_event_handler.
Metric arithmetic is supported, which is similar to mxnet.metrics.CustomMetric
Some metrics currently are not included in ours:
Compared with Tensorflow Keras
Reference: https://tensorflow.google.cn/api_docs/python/tf/keras/metrics?hl=en
The base class for metrics inherits from tf.keras.engine.base_layer.Layer, which is also the class from which all layers inherit. Metric functions in tf.keras.metrics can be supplied via the metrics parameter when a model is compiled.
Generally, metric functions in tf.keras.metrics take a sample_weight input that defines the contribution weights when updating the states.
tf.keras.metrics uses Accuracy and SparseCategoricalAccuracy to distinguish the case where y_pred is a predicted label from the case where y_pred is a probability distribution, which I think may be to avoid internal shape checking. Currently we could combine them in one metric, as sketched below.
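A possible way to combine them, sketched under the assumption that the rank of y_pred is enough to tell the two cases apart:

```python
import numpy as np

def combined_accuracy(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_pred.ndim > y_true.ndim:      # trailing class axis: treat as probability distribution
        y_pred = y_pred.argmax(axis=-1)
    return (y_pred == y_true).mean()

print(combined_accuracy([1, 0], [[0.2, 0.8], [0.9, 0.1]]))  # probabilities -> 1.0
print(combined_accuracy([1, 0], [1, 1]))                    # predicted labels -> 0.5
```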
Some metrics currently are not included in ours: