Metric functions to be hooked and executed at the scoring stage
Metrics are applied at the end of the pipeline to determine the best-performing model for the input data.
In our case, each metric is a function of the full real (input) data and the synthetic (output) data. Several methods should be applied to assess:
- statistical similarity,
- ML performance, and
- anonymity
Since this list can grow, and applying every defined metric might not be adequate or desirable (run-time constraints, etc.), we define a hook variable that holds all metrics to be applied by the constructed pipeline. A hook is a variable holding a list of functions, combined with a function that executes every function in that list at a well-specified point in the program. The user has to specify the list of metric functions before running the program.
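The hook described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `metric_hook`, `add_metric`, and `run_metric_hook` are hypothetical, and the data is assumed to be a plain list of numeric rows.

```python
from typing import Callable, Dict, List, Sequence

# Assumption: a dataset is a list of numeric rows. The real pipeline may use
# a different container (for example a data frame).
Data = Sequence[Sequence[float]]
Metric = Callable[[Data, Data], float]

# The hook variable: a list of metric functions, filled by the user before the run.
metric_hook: List[Metric] = []


def add_metric(metric: Metric) -> None:
    """Register a metric, skipping functions that are already hooked."""
    if metric not in metric_hook:
        metric_hook.append(metric)


def run_metric_hook(real: Data, synthetic: Data) -> Dict[str, float]:
    """Call every hooked metric on (real, synthetic) and collect the values by name."""
    return {m.__name__: float(m(real, synthetic)) for m in metric_hook}


# Usage with a toy placeholder metric (hypothetical, for illustration only).
def row_count_gap(real: Data, synthetic: Data) -> float:
    return abs(len(real) - len(synthetic))


add_metric(row_count_gap)
print(run_metric_hook([[1.0], [2.0]], [[1.5]]))  # {'row_count_gap': 1.0}
```

Keeping the hook as a plain list means a user can shorten or extend the set of metrics without touching the pipeline itself, which is the point of the run-time argument above.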
The first list of metrics to be implemented is:
- Chi-squared test and Kolmogorov-Smirnov test (univariate similarity),
- PCA and Spearman correlation comparisons (correlations),
- TRTR/TSTR (train on real, test on real / train on synthetic, test on real) using logistic regression (ML performance), and
- mean pairwise distance and maximum cosine similarity.
All hooked metrics are to be applied to each of the model outputs. As mentioned above, metric functions take the input and output data of the models and return a single floating-point value, where lower values denote better results.
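As an illustration of this contract, here is a sketch of one metric built on the two-sample Kolmogorov-Smirnov statistic, which is 0.0 for identical samples and grows towards 1.0 as they diverge, so lower is better. The function names are hypothetical, all columns are assumed to be numeric, and averaging over columns is an assumption made here to obtain a single value.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(real: Sequence[float], synthetic: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic for one numeric column.

    Returns the largest absolute gap between the two empirical CDFs.
    """
    if len(real) == 0 or len(synthetic) == 0:
        raise ValueError("both samples must be non-empty")
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    gap = 0.0
    # Empirical CDFs only change at observed values, so checking those suffices.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def mean_ks(real_rows: Sequence[Sequence[float]],
            synthetic_rows: Sequence[Sequence[float]]) -> float:
    """Metric on full datasets: per-column KS statistic, averaged over columns."""
    real_cols = list(zip(*real_rows))        # transpose rows into columns
    synth_cols = list(zip(*synthetic_rows))
    scores = [ks_statistic(r, s) for r, s in zip(real_cols, synth_cols)]
    return sum(scores) / len(scores)


# First column matches exactly, second column is fully separated.
print(mean_ks([[1, 10], [2, 20]], [[1, 30], [2, 40]]))  # 0.5
```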
In the end, the metric values are summed for each synthesis model, and the totals are compared.
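The summation and comparison could look roughly like this. It is a sketch under assumptions: `score_models` and `best_model` are hypothetical names, model outputs are passed as a dictionary keyed by model name, and the two toy metrics exist only to make the example runnable.

```python
from typing import Callable, Dict, List, Sequence

Data = Sequence[Sequence[float]]
Metric = Callable[[Data, Data], float]


def score_models(real: Data, outputs: Dict[str, Data],
                 metrics: List[Metric]) -> Dict[str, float]:
    """Sum all metric values for each synthesis model's output."""
    return {
        name: sum(float(m(real, synthetic)) for m in metrics)
        for name, synthetic in outputs.items()
    }


def best_model(totals: Dict[str, float]) -> str:
    """Lower totals are better, so the best model has the smallest sum."""
    return min(totals, key=totals.get)


# Toy metrics (hypothetical placeholders, not the metrics listed above).
def first_column_mean_gap(real: Data, synthetic: Data) -> float:
    mean_real = sum(row[0] for row in real) / len(real)
    mean_synth = sum(row[0] for row in synthetic) / len(synthetic)
    return abs(mean_real - mean_synth)


def row_count_gap(real: Data, synthetic: Data) -> float:
    return abs(len(real) - len(synthetic))


real_data = [[1.0], [2.0], [3.0]]
model_outputs = {
    "model_a": [[1.0], [2.0], [4.5]],
    "model_b": [[2.0], [2.0]],
}
totals = score_models(real_data, model_outputs, [first_column_mean_gap, row_count_gap])
print(totals)              # {'model_a': 0.5, 'model_b': 1.0}
print(best_model(totals))  # model_a
```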
As a next step, we have to implement a normalization process, since the metrics are not on a common scale, to be able to compare the overall performance of the models and identify the best one.
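The normalization scheme is not fixed yet. One possible option, shown here purely as an assumption, is min-max scaling of each metric across models so that every metric contributes a value in [0, 1] to the sum. The function name and the nested-dictionary layout are hypothetical.

```python
from typing import Dict

Scores = Dict[str, Dict[str, float]]  # scores[model_name][metric_name] -> value


def min_max_normalize(raw: Scores) -> Scores:
    """Rescale each metric to [0, 1] across models (0 = best, 1 = worst).

    Assumes every model has a value for every metric.
    """
    normalized: Scores = {model: {} for model in raw}
    metric_names = next(iter(raw.values())).keys() if raw else []
    for metric in metric_names:
        values = [scores[metric] for scores in raw.values()]
        low, high = min(values), max(values)
        span = high - low
        for model, scores in raw.items():
            # If all models tie on a metric, it carries no information: use 0.0.
            normalized[model][metric] = (scores[metric] - low) / span if span else 0.0
    return normalized


raw_scores = {
    "model_a": {"ks": 0.25, "distance": 10.0},
    "model_b": {"ks": 0.75, "distance": 30.0},
    "model_c": {"ks": 0.5, "distance": 20.0},
}
normalized_scores = min_max_normalize(raw_scores)
print(normalized_scores["model_c"])  # {'ks': 0.5, 'distance': 0.5}
```

Min-max scaling only ranks models relative to each other within one run; a scheme with fixed reference bounds would be needed to compare values across runs.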