New performance threshold for nightly tests #330

djdameln · 2022-05-23T15:30:28Z

djdameln
May 23, 2022
Maintainer

Currently, the nightly tests use a model- and category-specific performance threshold, obtained by repeating the training run several times and taking the lowest observed score for each model and category. This is problematic, because a healthy model can always produce an even lower score purely by chance.

I propose the following alternative solution:

1. Run a number of repetitions of each model and category combination, and collect the observed performance scores.
2. Estimate the probability distribution of the observed scores for each model and category combination (I assume the observed scores are normally distributed).
3. Compute the lower bound of the one-tailed 95% interval of that distribution (its 5th percentile), and use this as the failure threshold in the tests.
4. A healthy model will then fall below the threshold in about 5% of runs, so we know the false-positive rate of a single failed test. We can use the flaky package to allow multiple attempts.
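The steps above could be sketched roughly as follows. This is only an illustration under the normality assumption stated in the proposal: the function names `failure_threshold` and `false_positive_rate` and the scores are hypothetical, and it uses the standard library's `statistics.NormalDist` rather than anything from the existing test suite.

```python
from statistics import NormalDist, mean, stdev


def failure_threshold(scores, alpha=0.05):
    """Return the score below which a healthy model falls with probability alpha.

    Assumes the scores are normally distributed, as the proposal does.
    """
    if len(scores) < 2:
        raise ValueError("need at least two repetitions to estimate a spread")
    # Fit a normal distribution using the sample mean and sample standard deviation.
    fitted = NormalDist(mu=mean(scores), sigma=stdev(scores))
    # The alpha-quantile is the lower bound of the one-tailed (1 - alpha) interval.
    return fitted.inv_cdf(alpha)


def false_positive_rate(alpha=0.05, max_runs=1):
    """Chance that a healthy model fails all max_runs attempts, assuming independence."""
    return alpha ** max_runs


# Hypothetical scores from ten repeated runs of one model/category pair.
observed = [0.962, 0.958, 0.971, 0.965, 0.955, 0.968, 0.960, 0.963, 0.957, 0.966]
threshold = failure_threshold(observed)
print(f"threshold: {threshold:.4f}")
print(f"false-positive rate with 3 attempts: {false_positive_rate(max_runs=3):.6f}")
```

With only a handful of repetitions the fitted mean and standard deviation are themselves uncertain, so the real false-positive rate will differ somewhat from the nominal 5%. If retries are independent, allowing several attempts per test shrinks that rate quickly, which is the reason for combining the threshold with a retry mechanism.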

ashwinvaidya17 · 2022-05-23T15:47:09Z

ashwinvaidya17
May 23, 2022
Maintainer

Another idea (possibly in addition to this) is to run the benchmarking script in the nightly job and collect the CSV file. We can then track these metrics from day to day to see whether a model's performance is steadily decreasing.
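One possible way to detect a steady decrease, once the daily scores have been read from the collected CSV files, is to fit a straight line through them and look at its slope. The sketch below assumes one score per day for a single model and category; the helper names, the tolerance value, and the data are all hypothetical, and reading the CSV is left out because its column layout is not specified here.

```python
def trend_slope(scores):
    """Least-squares slope of scores against their day index (0, 1, 2, ...)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two days of scores")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var


def is_degrading(scores, tolerance=0.001):
    """Flag a model whose score drops by more than `tolerance` per day on average."""
    return trend_slope(scores) < -tolerance


# Hypothetical nightly scores for one model/category pair, oldest first.
declining = [0.962, 0.960, 0.957, 0.955, 0.951, 0.949, 0.946]
stable = [0.962, 0.958, 0.964, 0.960, 0.963, 0.959, 0.962]
print(is_degrading(declining), is_degrading(stable))
```

A slope check like this complements a per-run threshold: it can catch a slow drift in which every individual night still passes.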
