# Evaluation

Please note that it is important that you run the evaluation scripts exactly as described below. In particular, the `dstc.py` script internally uses a key file that restricts evaluation to the multi-reference subset of the test set, so you may get completely different results if you evaluate on the full test set (which contains both single- and multi-reference instances).
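For intuition only, the restriction works roughly like the sketch below: a set of keys identifies the multi-reference instances, and only output lines whose key appears in that set are scored. The file name `keys.txt` and the assumption that the key is the first tab-separated column are hypothetical illustrations; `dstc.py` handles all of this internally, so you never need to run such a filter yourself.

```python
# Illustrative sketch only: dstc.py performs this kind of filtering internally.
# Assumptions (not taken from the repo): keys live one per line in "keys.txt",
# and the first tab-separated column of each output line is the instance key.
import io

def load_keys(path):
    """Read one key per line into a set."""
    with io.open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def filter_to_multiref_subset(output_path, key_path, filtered_path):
    """Keep only the lines whose key belongs to the multi-reference subset."""
    keys = load_keys(key_path)
    with io.open(output_path, encoding="utf-8") as fin, \
         io.open(filtered_path, "w", encoding="utf-8") as fout:
        for line in fin:
            key = line.split("\t", 1)[0]
            if key in keys:
                fout.write(line)
```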

Note that this setup is borrowed from the DSTC7 Task 2 evaluation, so some of the instructions or scripts may refer to dates of that campaign, but the scripts in this folder are self-contained and should still work after DSTC7.

## Requirements

- Works with both Python 2.7 and 3.6.
- Please download the required 3rd-party packages (METEOR and `mteval-v14c.pl`, as referenced in the evaluation steps below) and save them in a new folder named `3rdparty`.

## Create test data:

Please refer to the data extraction page to create the data. To create the validation and test data, please run the following command:

```
make -j4 valid test refs
```

This will create the multi-reference file, along with the following four files:

- Validation data: `valid.convos.txt` and `valid.facts.txt`
- Test data: `test.convos.txt` and `test.facts.txt`

These files are in exactly the same format as `train.convos.txt` and `train.facts.txt`, already explained here. The only difference is that the response field of `test.convos.txt` has been replaced with the string `__UNDISCLOSED__`.
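If you want to inspect these files programmatically, a minimal sketch is below. It assumes the fields are tab-separated and that the response is the last field; please verify the exact column layout against the data extraction page.

```python
# Minimal sketch for inspecting test.convos.txt. Assumption (verify against the
# data extraction page): fields are tab-separated and the response is the last
# field, which in the test file is the placeholder "__UNDISCLOSED__".
import io

def read_convos(path):
    """Yield (other_fields, response) pairs from a convos file."""
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            yield fields[:-1], fields[-1]

if __name__ == "__main__":
    for fields, response in read_convos("test.convos.txt"):
        assert response == "__UNDISCLOSED__"  # references are hidden in the test file
        break
```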

Notes:

- The two validation files are optional and you can skip them if you want (e.g., there is no need to send us system outputs for them). We provide them so that you can run your own automatic evaluation (BLEU, etc.) by comparing the response field with your own system outputs; a minimal sketch of such a check follows these notes.
- Data creation should take about 1-4 days, depending on your internet connection, etc. If you run into trouble creating the data, please contact us.
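For example, an informal corpus-level BLEU check on the validation set might look like the sketch below (this is not the official evaluation performed by `dstc.py`). It assumes, hypothetically, that your tokenized system responses live one per line in `my_valid_output.txt`, aligned line-by-line with `valid.convos.txt`, whose last tab-separated field is taken to be the reference response.

```python
# Hedged sketch of an informal validation-set BLEU check (not the official
# metrics computed by dstc.py). Assumes my_valid_output.txt holds one system
# response per line, aligned with valid.convos.txt, and that the last
# tab-separated field of valid.convos.txt is the reference response.
import io
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def last_field(line):
    return line.rstrip("\n").split("\t")[-1]

with io.open("valid.convos.txt", encoding="utf-8") as f:
    references = [[last_field(line).split()] for line in f]  # one reference per instance
with io.open("my_valid_output.txt", encoding="utf-8") as f:
    hypotheses = [line.strip().split() for line in f]

smooth = SmoothingFunction().method1
print("BLEU-4: %.4f" % corpus_bleu(references, hypotheses, smoothing_function=smooth))
```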

## Data statistics

Number of conversational responses:

- Validation (`valid.convos.txt`): 4,542 lines
- Test (`test.convos.txt`): 13,440 lines

Due to the way the data is created by querying Common Crawl, there may be small differences between your version of the data and ours. To make pairwise comparisons between the systems of each pair of participants, we will rely on the largest subset of the test set that is common to both participants. However, if your file `test.convos.txt` contains fewer than 13,000 lines, this might indicate a problem, so please contact us immediately.
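As a quick sanity check against the counts above, something like the following sketch flags a suspiciously small test file:

```python
# Sanity check against the expected counts above; contact the organizers if the
# test file falls below roughly 13,000 lines.
import io

def count_lines(path):
    with io.open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

n_test = count_lines("test.convos.txt")
print("test.convos.txt: %d lines (expected about 13,440)" % n_test)
if n_test < 13000:
    print("WARNING: fewer than 13,000 lines -- please contact the organizers.")
```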

## Prepare your system output for evaluation:

To create a system output for evaluation, keep the format of `test.convos.txt` and replace each `__UNDISCLOSED__` string with your own system output. Call that new file `system_output.txt`.
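A minimal sketch of this step is below, assuming (as above) that the response is the last tab-separated field of `test.convos.txt` and that your generated responses live one per line in a hypothetical file `my_test_output.txt`, aligned line-by-line with the test conversations:

```python
# Hedged sketch: build system_output.txt by substituting your responses for the
# __UNDISCLOSED__ placeholder. Assumes the response is the last tab-separated
# field and my_test_output.txt is aligned line-by-line with test.convos.txt.
import io

with io.open("test.convos.txt", encoding="utf-8") as convos, \
     io.open("my_test_output.txt", encoding="utf-8") as outputs, \
     io.open("system_output.txt", "w", encoding="utf-8") as out:
    for convo_line, response in zip(convos, outputs):
        fields = convo_line.rstrip("\n").split("\t")
        fields[-1] = response.strip()  # replace __UNDISCLOSED__
        out.write("\t".join(fields) + "\n")
```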

## Evaluation script:

Note: The script (which is used in the paper) sub-samples the test data for evaluation.

Steps:

1. Make sure you `git pull` the latest changes, including changes in `../data`.
2. `cd` to `../data` and type `make`. This will create the multi-reference file used by the metrics (`../data/test.refs`).
3. Install the 3rd-party software as instructed above (METEOR and `mteval-v14c.pl`).
4. Run the following command, where `system_output.txt` is the file you want to evaluate (i.e., created as instructed above):

   ```
   python dstc.py -c system_output.txt --refs ../data/test.refs
   ```