Please note that it is important that you run the evaluation scripts exactly as described below. In particular, the dstc.py script internally uses a key file that restricts evaluation to the multi-reference subset of the test set, so you may get completely different results if you evaluate on the full test set (which contains both single- and multi-reference instances).
Note that this setup is borrowed from the DSTC7 Task 2 evaluation, and some of the instructions or scripts might refer to dates of that campaign, but the scripts in this folder are self-contained and should still work after DSTC7.
- Works fine for both Python 2.7 and 3.6
- Please download the following 3rd-party packages and save them in a new folder `3rdparty`:
  - `mteval-v14c.pl` to compute NIST. You may need to install the following Perl modules (e.g., with `cpan install`): XML::Twig, Sort::Naturally and String::Util.
  - `meteor-1.5` to compute METEOR. It requires Java.
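Before running the metrics, it can help to confirm that the tools are where you saved them. The short Python sketch below only checks that the files exist; the METEOR jar path is an assumption based on the default meteor-1.5 release layout, so adjust it to your setup.

```python
# Sanity check for the 3rd-party tools (sketch only; the METEOR jar path is an
# assumption based on the default meteor-1.5 layout -- adjust as needed).
import os

expected = [
    "3rdparty/mteval-v14c.pl",             # NIST scorer
    "3rdparty/meteor-1.5/meteor-1.5.jar",  # METEOR (assumed jar location)
]

for path in expected:
    status = "found" if os.path.exists(path) else "MISSING"
    print("%-40s %s" % (path, status))
```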
Please refer to the data extraction page to create the data. To create validation and test data, please run the following command:
`make -j4 valid test refs`
This will create the multi-reference file, along with the following four files:
- Validation data: `valid.convos.txt` and `valid.facts.txt`
- Test data: `test.convos.txt` and `test.facts.txt`
These files are in exactly the same format as `train.convos.txt` and `train.facts.txt`, already explained here. The only difference is that the `response` field of `test.convos.txt` has been replaced with the string `__UNDISCLOSED__`.
Notes:
- The two validation files are optional and you can skip them if you want (e.g., no need to send us system outputs for them). We provide them so that you can run your own automatic evaluation (BLEU, etc.) by comparing the `response` field with your own system outputs (see the sketch after this list).
- Data creation should take about 1-4 days (depending on your internet connection, etc.). If you run into trouble creating the data, please contact us.
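If you do want a rough self-check on the validation set, something like the sketch below works. It assumes the response is the last tab-separated field of `valid.convos.txt` (see the data extraction page for the exact format) and that your outputs sit one per line, in the same order, in a hypothetical file `my_valid_output.txt`; the official scores remain those produced by `dstc.py`.

```python
# Rough BLEU self-check on the validation set (sketch only).
# Assumptions: the response is the last tab-separated field of valid.convos.txt,
# and my_valid_output.txt (hypothetical) holds one system response per line,
# aligned with valid.convos.txt.
from nltk.translate.bleu_score import corpus_bleu

with open("valid.convos.txt") as f:
    references = [[line.rstrip("\n").split("\t")[-1].split()] for line in f]

with open("my_valid_output.txt") as f:
    hypotheses = [line.strip().split() for line in f]

assert len(references) == len(hypotheses), "outputs must be aligned with valid.convos.txt"
print("BLEU-4: %.4f" % corpus_bleu(references, hypotheses))
```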
Number of conversational responses:
- Validation (`valid.convos.txt`): 4542 lines
- Test (`test.convos.txt`): 13440 lines
Due to the way the data is created by querying Common Crawl, there may be small differences between your version of the data and our own. For pairwise comparisons between the systems of any two participants, we will rely on the largest subset of the test set that is common to both participants. However, if your file `test.convos.txt` contains fewer than 13,000 lines, this might be an indication of a problem, so please contact us immediately.
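As a quick sanity check, you can count the lines of the extracted files; the sketch below uses the file names and expected counts given above (your counts may differ slightly from ours).

```python
# Quick line-count sanity check for the extracted data (sketch only).
def count_lines(path):
    with open(path) as f:
        return sum(1 for _ in f)

n_valid = count_lines("valid.convos.txt")  # ~4542 expected
n_test = count_lines("test.convos.txt")    # ~13440 expected
print("valid: %d lines, test: %d lines" % (n_valid, n_test))
if n_test < 13000:
    print("WARNING: fewer than 13,000 test lines -- please contact us.")
```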
To create a system output for evaluation, keep the format of `test.convos.txt` and replace `__UNDISCLOSED__` with your own system output. Call that new file `system_output.txt`.
Note: The script (which is used in the paper) sub-samples the test data for evaluation.
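A minimal sketch of the replacement step is shown below; it assumes your generated responses are stored one per line, in the same order as `test.convos.txt`, in a hypothetical file `my_test_responses.txt`, and simply swaps the `__UNDISCLOSED__` field for the corresponding response.

```python
# Sketch: build system_output.txt from test.convos.txt.
# Assumes my_test_responses.txt (hypothetical) holds one generated response per
# line, aligned with test.convos.txt.
with open("test.convos.txt") as convos, \
     open("my_test_responses.txt") as responses, \
     open("system_output.txt", "w") as out:
    for convo_line, response in zip(convos, responses):
        fields = convo_line.rstrip("\n").split("\t")
        # Replace the placeholder response field with the system response.
        fields = [response.strip() if f == "__UNDISCLOSED__" else f for f in fields]
        out.write("\t".join(fields) + "\n")
```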
Steps:
- Make sure you `git pull` the latest changes, including changes in `../data`.
- `cd` to `../data` and type `make`. This will create the multi-reference file used by the metrics (`../data/test.refs`).
- Install the 3rd-party software as instructed above (METEOR and `mteval-v14c.pl`).
- Run the following command, where `system_output.txt` is the file you want to evaluate (i.e., created as instructed above):
  `python dstc.py -c system_output.txt --refs ../data/test.refs`