Please note that it is important that you run the evaluation scripts exactly as described below. In particular, the dstc.py script internally uses a key file that restricts evaluation to the multi-reference subset of the test set, so you may get completely different results if you evaluate on the full test set (which contains both single- and multi-reference instances).
Note that this setup is borrowed from the DSTC7 Task 2 evaluation, and some of the instructions or scripts might refer to dates of that campaign, but the scripts in this folder are self-contained and should still work after DSTC7.
- Works fine for both Python 2.7 and 3.6
- Please download the following 3rd-party packages and save them in a new folder `3rdparty`:
  - `mteval-v14c.pl` to compute NIST. You may need to install the following Perl modules (e.g., with `cpan install`): XML::Twig, Sort::Naturally and String::Util.
  - `meteor-1.5` to compute METEOR. It requires Java.
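Before running the metrics, it can help to confirm that the tools are where you saved them. The short Python sketch below only checks that the files exist; the METEOR jar path is an assumption based on the default meteor-1.5 release layout, so adjust it to your setup.

```python
# Sanity check for the 3rd-party tools (sketch only; the METEOR jar path is an
# assumption based on the default meteor-1.5 layout -- adjust as needed).
import os

expected = [
    "3rdparty/mteval-v14c.pl",             # NIST scorer
    "3rdparty/meteor-1.5/meteor-1.5.jar",  # METEOR (assumed jar location)
]

for path in expected:
    status = "found" if os.path.exists(path) else "MISSING"
    print("%-40s %s" % (path, status))
```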
Please refer to the data extraction page to create the data. To create validation and test data, please run the following command:
`make -j4 valid test refs`
This will create the multi-reference file, along with the following four files:
- Validation data: `valid.convos.txt` and `valid.facts.txt`
- Test data: `test.convos.txt` and `test.facts.txt`
These files are in exactly the same format as `train.convos.txt` and `train.facts.txt`, already explained here. The only difference is that the `response` field of `test.convos.txt` has been replaced with the string `__UNDISCLOSED__`.
Notes:
- The two validation files are optional and you can skip them if you want (e.g., no need to send us system outputs for them). We provide them so that you can run your own automatic evaluation (BLEU, etc.) by comparing the `response` field with your own system outputs (see the sketch after this list).
- Data creation should take about 1-4 days (depending on your internet connection, etc.). If you run into trouble creating the data, please contact us.
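If you do want a rough self-check on the validation set, something like the sketch below works. It assumes the response is the last tab-separated field of `valid.convos.txt` (see the data extraction page for the exact format) and that your outputs sit one per line, in the same order, in a hypothetical file `my_valid_output.txt`; the official scores remain those produced by `dstc.py`.

```python
# Rough BLEU self-check on the validation set (sketch only).
# Assumptions: the response is the last tab-separated field of valid.convos.txt,
# and my_valid_output.txt (hypothetical) holds one system response per line,
# aligned with valid.convos.txt.
from nltk.translate.bleu_score import corpus_bleu

with open("valid.convos.txt") as f:
    references = [[line.rstrip("\n").split("\t")[-1].split()] for line in f]

with open("my_valid_output.txt") as f:
    hypotheses = [line.strip().split() for line in f]

assert len(references) == len(hypotheses), "outputs must be aligned with valid.convos.txt"
print("BLEU-4: %.4f" % corpus_bleu(references, hypotheses))
```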
Number of conversational responses:
- Validation (`valid.convos.txt`): 4542 lines
- Test (`test.convos.txt`): 13440 lines
Due to the way the data is created by querying Common Crawl, there may be small differences between your version of the data and our own. For pairwise comparisons between the systems of any two participants, we will rely on the largest subset of the test set that is common to both participants. However, if your file `test.convos.txt` contains fewer than 13,000 lines, this might be an indication of a problem, so please contact us immediately.
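As a quick sanity check, you can count the lines of the extracted files; the sketch below uses the file names and expected counts given above (your counts may differ slightly from ours).

```python
# Quick line-count sanity check for the extracted data (sketch only).
def count_lines(path):
    with open(path) as f:
        return sum(1 for _ in f)

n_valid = count_lines("valid.convos.txt")  # ~4542 expected
n_test = count_lines("test.convos.txt")    # ~13440 expected
print("valid: %d lines, test: %d lines" % (n_valid, n_test))
if n_test < 13000:
    print("WARNING: fewer than 13,000 test lines -- please contact us.")
```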
To create a system output for evaluation, keep the format of `test.convos.txt` and replace `__UNDISCLOSED__` with your own system output. Call that new file `system_output.txt`.
Note: The script (which is used in the paper) sub-samples the test data for evaluation.
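A minimal sketch of the replacement step is shown below; it assumes your generated responses are stored one per line, in the same order as `test.convos.txt`, in a hypothetical file `my_test_responses.txt`, and simply swaps the `__UNDISCLOSED__` field for the corresponding response.

```python
# Sketch: build system_output.txt from test.convos.txt.
# Assumes my_test_responses.txt (hypothetical) holds one generated response per
# line, aligned with test.convos.txt.
with open("test.convos.txt") as convos, \
     open("my_test_responses.txt") as responses, \
     open("system_output.txt", "w") as out:
    for convo_line, response in zip(convos, responses):
        fields = convo_line.rstrip("\n").split("\t")
        # Replace the placeholder response field with the system response.
        fields = [response.strip() if f == "__UNDISCLOSED__" else f for f in fields]
        out.write("\t".join(fields) + "\n")
```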
Steps:
- Make sure you `git pull` the latest changes, including changes in `../data`.
- `cd` to `../data` and type `make`. This will create the multi-reference file used by the metrics (`../data/test.refs`).
- Install the 3rd-party software as instructed above (METEOR and `mteval-v14c.pl`).
- Run the following command, where `system_output.txt` is the file you want to evaluate (i.e., created as instructed above):
  `python dstc.py -c system_output.txt --refs ../data/test.refs`