The tests expect only 'mqm.merged', but the actual data contains individual rater scores. #19

Open
Smu-Tan opened this issue Oct 22, 2024 · 0 comments

Smu-Tan commented Oct 22, 2024

Hi,

When running python3 -m unittest discover mt_metrics_eval "*_test.py" (the command from the README; it takes ~70 seconds), I got the following errors.

For testWMT23EnDeRatings and testWMT23ZhEnRatings: these tests fail because of a mismatch in the expected human rating names. The tests expect only 'mqm.merged', but the actual data contains individual rater scores ('mqm.rater1' through 'mqm.rater10') and additional merged scores ('round2.mqm.merged', 'round3.mqm.merged').

(mtme) bash-4.4$ python3 -m unittest discover mt_metrics_eval "*_test.py"  # Takes ~70 seconds.
............F.F............................................./ivi/ilps/personal/stan1/reward/mt-metrics-eval/mt_metrics_eval/stats.py:923: RuntimeWarning: invalid value encountered in sqrt
  tden = np.sqrt(2 * (n - 1) / (n - 3) * k + rbar**2 * (1 - r23)**3)
F.......................
======================================================================
FAIL: testWMT23EnDeRatings (data_test.EvalSetTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/ivi/ilps/personal/stan1/reward/mt-metrics-eval/mt_metrics_eval/data_test.py", line 283, in testWMT23EnDeRatings
    self.assertEqual(evs.human_rating_names, {'mqm.merged'})
AssertionError: Items in the first set but not the second:
'mqm.rater8'
'mqm.rater10'
'mqm.rater9'
'mqm.rater5'
'mqm.rater6'
'mqm.rater3'
'mqm.rater2'
'mqm.rater4'
'round2.mqm.merged'
'mqm.rater1'
'mqm.rater7'
'round3.mqm.merged'

======================================================================
FAIL: testWMT23ZhEnRatings (data_test.EvalSetTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/ivi/ilps/personal/stan1/reward/mt-metrics-eval/mt_metrics_eval/data_test.py", line 319, in testWMT23ZhEnRatings
    self.assertEqual(
AssertionError: Items in the first set but not the second:
'mqm.merged'
'round3.mqm.merged'
'round2.mqm.merged'

======================================================================
FAIL: testSigDiffWithAvgAndNones (stats_test.WilliamsSigDiffTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/ivi/ilps/personal/stan1/reward/mt-metrics-eval/mt_metrics_eval/stats_test.py", line 527, in testSigDiffWithAvgAndNones
    self.assertAlmostEqual(p, 0.121, places=3)
AssertionError: nan != 0.121 within 3 places (nan difference)

----------------------------------------------------------------------
Ran 84 tests in 108.139s
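
To confirm what the downloaded data actually contains before patching the tests, a minimal sketch along these lines should work (the two-argument EvalSet constructor follows the README examples, and human_rating_names is the attribute shown in the traceback; whether the default constructor already loads stored ratings, or whether data_test.py passes extra flags, is an assumption here):

from mt_metrics_eval import data

# Load the locally downloaded WMT23 en-de EvalSet.
evs = data.EvalSet('wmt23', 'en-de')

# Print every stored human rating name. With the current data this includes
# 'mqm.merged' plus 'mqm.rater1'..'mqm.rater10', 'round2.mqm.merged' and
# 'round3.mqm.merged', whereas data_test.py asserts strict equality with
# {'mqm.merged'}.
print(sorted(evs.human_rating_names))

If the extra ratings are expected in the current data release, relaxing the assertion to a membership check (e.g. self.assertIn('mqm.merged', evs.human_rating_names)) or updating the expected set in data_test.py would make the tests pass again.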