The tests expect only 'mqm.merged', but the actual data contains individual rater scores. #19

Open
Smu-Tan opened this issue Oct 22, 2024 · 0 comments

Smu-Tan commented Oct 22, 2024

Hi,

When running python3 -m unittest discover mt_metrics_eval "*_test.py" (the command from the README; it takes ~70 seconds), I got the following errors.

For testWMT23EnDeRatings and testWMT23ZhEnRatings: these tests fail because of a mismatch in the expected human rating names. The tests expect only 'mqm.merged', but the actual data contains individual rater scores ('mqm.rater1' through 'mqm.rater10') and additional merged scores ('round2.mqm.merged', 'round3.mqm.merged').

(mtme) bash-4.4$ python3 -m unittest discover mt_metrics_eval "*_test.py"  # Takes ~70 seconds.
............F.F............................................./ivi/ilps/personal/stan1/reward/mt-metrics-eval/mt_metrics_eval/stats.py:923: RuntimeWarning: invalid value encountered in sqrt
  tden = np.sqrt(2 * (n - 1) / (n - 3) * k + rbar**2 * (1 - r23)**3)
F.......................
======================================================================
FAIL: testWMT23EnDeRatings (data_test.EvalSetTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/ivi/ilps/personal/stan1/reward/mt-metrics-eval/mt_metrics_eval/data_test.py", line 283, in testWMT23EnDeRatings
    self.assertEqual(evs.human_rating_names, {'mqm.merged'})
AssertionError: Items in the first set but not the second:
'mqm.rater8'
'mqm.rater10'
'mqm.rater9'
'mqm.rater5'
'mqm.rater6'
'mqm.rater3'
'mqm.rater2'
'mqm.rater4'
'round2.mqm.merged'
'mqm.rater1'
'mqm.rater7'
'round3.mqm.merged'

======================================================================
FAIL: testWMT23ZhEnRatings (data_test.EvalSetTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/ivi/ilps/personal/stan1/reward/mt-metrics-eval/mt_metrics_eval/data_test.py", line 319, in testWMT23ZhEnRatings
    self.assertEqual(
AssertionError: Items in the first set but not the second:
'mqm.merged'
'round3.mqm.merged'
'round2.mqm.merged'

======================================================================
FAIL: testSigDiffWithAvgAndNones (stats_test.WilliamsSigDiffTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/ivi/ilps/personal/stan1/reward/mt-metrics-eval/mt_metrics_eval/stats_test.py", line 527, in testSigDiffWithAvgAndNones
    self.assertAlmostEqual(p, 0.121, places=3)
AssertionError: nan != 0.121 within 3 places (nan difference)

----------------------------------------------------------------------
Ran 84 tests in 108.139s
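
To confirm what the downloaded data actually contains before patching the tests, a minimal sketch along these lines should work (the two-argument EvalSet constructor follows the README examples, and human_rating_names is the attribute shown in the traceback; whether the default constructor already loads stored ratings, or whether data_test.py passes extra flags, is an assumption here):

from mt_metrics_eval import data

# Load the locally downloaded WMT23 en-de EvalSet.
evs = data.EvalSet('wmt23', 'en-de')

# Print every stored human rating name. With the current data this includes
# 'mqm.merged' plus 'mqm.rater1'..'mqm.rater10', 'round2.mqm.merged' and
# 'round3.mqm.merged', whereas data_test.py asserts strict equality with
# {'mqm.merged'}.
print(sorted(evs.human_rating_names))

If the extra ratings are expected in the current data release, relaxing the assertion to a membership check (e.g. self.assertIn('mqm.merged', evs.human_rating_names)) or updating the expected set in data_test.py would make the tests pass again.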