WMT14 Notes
Appraise can serve HITs itself or be integrated with Amazon’s Mechanical Turk (MTurk) in their External HIT mode.
The instructions below assume a particular data structure for the source segments, reference translations, and systems outputs.
$ROOT/
    sources/
        $ID-src.$SOURCE
    references/
        $ID-ref.$TARGET
    systems/
        $ID.$SOURCE-$TARGET.$SYSID
Here, $ROOT is the root path to this directory, $ID is some common identifier (e.g., “newstest2013”), $SOURCE and $TARGET are ISO 639-1 language identifiers, and $SYSID identifies each competing system. The files in each directory are all line-by-line parallel.
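For example, for the German-English newstest2013 data and a hypothetical system ID “uedin”, the layout would contain files such as:
$ROOT/
    sources/
        newstest2013-src.de
    references/
        newstest2013-ref.en
    systems/
        newstest2013.de-en.uedin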
When creating batches, there is an option to randomly insert controls in the form of high-consensus rankings that can be used to ferret out untrustworthy Turkers. This is not quite documented yet.
The first step is to generate the batch data. This is a set of XML files, which are then imported into Appraise (next step). You will invoke the following script:
perl $APPRAISE/scripts/make_mturk_batch.pl $BATCHNO $SOURCE $TARGET
where $BATCHNO is the batch number (a nonnegative number).
The script is just a wrapper around the code ($APPRAISE/scripts/wmt_ranking_task.py) that does the real work of batch generation. You will want to edit it to define the following things: the location of your input data, the location of your controls, and the probability of inserting a control at each step.
(Since controls are not yet documented, you should remove the arguments -controls and -control_prob from the internal invocation of the wmt_ranking_task.py script.)
This step produces XML files, one per batch. By default, each batch file will be found at
./$SOURCE-$TARGET/$SOURCE-$TARGET-batch$BATCHNO.txt
(Yes, an XML extension would have been more fitting).
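To make this concrete (with hypothetical values), generating batch 3 for German-English would be
perl $APPRAISE/scripts/make_mturk_batch.pl 3 de en
and the resulting file would be ./de-en/de-en-batch3.txt.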
We run two different evaluation “modes”: first, we collect annotations from our researchers, second, we farm out the evaluation task to MTurk. Batches containing HITs for both tasks will be stored inside Appraise’s database. The difference is that HITs for MTurk are “hidden” from local annotators (= researchers) in order to prevent confusing our computation of HIT completion status.
- Batch files need to be valid XML files which pass validation in appraise.wmt14.validators.validate_hits_xml_file()
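If you want to run the validator by hand before going through the import script, a rough Django shell sketch follows. The validator’s exact signature (path, file object, or raw XML string) is not documented here, so the call below is an assumption.
# Rough Django shell sketch (python manage.py shell); the argument type
# expected by validate_hits_xml_file() is an assumption, and it presumably
# raises a ValidationError on malformed batch files.
from django.core.exceptions import ValidationError
from appraise.wmt14.validators import validate_hits_xml_file

try:
    with open('de-en-batch3.txt') as batch_file:
        validate_hits_xml_file(batch_file)
except ValidationError as problem:
    print('Batch file failed validation: {0}'.format(problem))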
1a. Validate batch file before import using “dry run”
$ python import_wmt14_xml.py --dry-run FILE_TO_IMPORT.EXT > dry-run-import.log
`--dry-run`: Enable dry run to simulate import.
A dry run allows you to spot any validation errors without crashing the DB. For researcher batches, this is it. Wait for the dry run to complete and take note of any validation errors (and fix them ;))
The import script provides additional usage information when run without arguments or with -h or --help on the command line. Typically, --wait SLEEP_SECONDS is helpful…
It is possible to specify more than one file to be imported.
1b. Validate batch file before import using “dry run” and “MTurk mode”
$ python import_wmt14_xml.py --dry-run --mturk-only FILE.EXT > dry-run-import.log
`--mturk-only`: Enable MTurk-only flag for all HITs.
We want to activate MTurk mode as MTurk HITs should only be completed by MTurkers and, thus, need to be invisible to local Appraise users.
- After validation, run the same command without --dry-run to perform the actual import into Django’s database. It is highly recommended to create a log file for later reference!
- Export the MTurk HIT IDs to a CSV file so that they can be published. You can either run appraise.wmt14.admin.export_hit_ids_to_csv() in a Django shell (see the sketch below) or perform the corresponding “Export selected HIT ids to CSV” action inside the Django admin backend.
- Hand over the CSV file containing the MTurk IDs to Matt for publication.
(Note: export format can be changed if that eases MTurk import. Drop me a note.)
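For the Django shell route, a rough sketch follows. The exact signature of export_hit_ids_to_csv() is not documented here; if it is implemented as a standard Django admin action it will expect (modeladmin, request, queryset) and return a CSV response. The HIT model name and the mturk_only filter below are likewise assumptions.
# Rough Django shell sketch (python manage.py shell); names are assumptions.
from appraise.wmt14.admin import export_hit_ids_to_csv
from appraise.wmt14.models import HIT  # assumed model name

# Assuming the admin-action convention (modeladmin, request, queryset)
# and that the action returns an HttpResponse carrying the CSV data.
mturk_hits = HIT.objects.filter(mturk_only=True)
response = export_hit_ids_to_csv(None, None, mturk_hits)
with open('mturk-hit-ids.csv', 'wb') as csv_file:
    csv_file.write(response.content)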
It makes sense to check HIT status every now and then and to disable finished HITs as this gives a slight speedup for QuerySet lookups. This is easy in the Django shell, but I might add a script for that as well.
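Until such a script exists, something along these lines in a Django shell should do. This is only a sketch: the HIT model name, the active flag, and the completion check are assumptions about the wmt14 app and may not match the actual schema.
# Rough Django shell sketch; model and field names are assumptions.
from appraise.wmt14.models import HIT  # assumed model name

def hit_is_finished(hit):
    """Placeholder check -- replace with the app's real completion criterion."""
    return getattr(hit, 'completed', False)

# Deactivate finished HITs so they no longer slow down QuerySet lookups.
for hit in HIT.objects.filter(active=True):
    if hit_is_finished(hit):
        hit.active = False
        hit.save()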
(Note: wmt14 versions of scripts and Appraise app will appear on GitHub shortly)
This step is a bit unpolished and therefore somewhat painful. It involves taking the CSV files generated above and using the MTurk command-line tools to submit and retrieve the results. This is accomplished via a host of shell scripts that attempt to ease the uploading, retrieval, closing, and deleting of HITs. It is a file-based management system that keeps track of which HITs have been submitted, rejected, approved, and so on.
The CSV file you get from the step above contains a single column with a header, something like:
appraise_id
004e8dca
01c5fa0a
…
All the scripts for this section are in $APPRAISE/scripts/mturk. The MTurk metadata files (and external HIT and question files) are in $APPRAISE/examples/mturk. You will likely have to update data locations to make this work.
Probably all of this could be improved with new tools, and I bet the command-line interface is better.
Note: it appears the command-line tools may have been updated since last year. We used version 1.3.1, which I can’t seem to find on Amazon’s terrible MTurk documentation website any more. The closest I can find is this, which is a bit older and contains only a Windows executable installer. If you don’t already have these and can’t find them, I can give you mine.
They unpacked into a directory named aws-mturk-clt-1.3.1/. I will call this directory $MTURK.
The main wrapper is submit.pl, which takes a batch number and related information and submits the data to MTurk. You can see the pathnames it assumes internally.
Usage:
submit.pl -Rf $SOURCE $TARGET $BATCHNO
This submits the batch, creating a directory $SOURCE-$TARGET/batch$BATCHNO/ with lots of MTurk metadata that is used to track what’s going on.
NOTE. The scripts assume an off-by-1 error, where the XML file with batch number N used a CSV file with batch number N-1. This should be fixed according to Christian’s docs above, so edit submit.pl to use the actual $BATCHNO instead of subtracting 1.
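For example (hypothetical values again), submitting batch 3 for German-English:
submit.pl -Rf de en 3
creates de-en/batch3/ and fills it with the tracking metadata.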
See download.sh. It calls the MTurk CL tool script getResults.sh using the metadata created for each batch that has been submitted. It does this for all batches that have not been “closed” (see below).
It then combines the results and uses them to build stats, including author metadata, which can be used to decide whose work to reject. These tools are primitive, so here is where you might want to insert your own. For example, they don’t do things that Maise did, like compute the percentage of time each Turker chose the first option. But they do give some info.
It then builds a report and lists HITs that will be rejected.
You can then call $MTURK/bin/rejectWork.sh to reject the work of Turkers you don’t like; you should have built this list in the previous step.
Call $MTURK/bin/approveWork.sh on the “success file” generated when you uploaded the HITs. Only work that hasn’t been rejected will be approved, so I usually do the rejecting first, then batch-approve everything.
for file in $(find . -name "*.input.success"); do
$MTURK/bin/approveWork.sh -successfile $file
done
But note that this will approve things that have accumulated in the meantime, unfortunately. I believe the tools will let you approve based on assignment IDs, but never got around to this.
When a batch has been entirely approved, you can “close” it for local accounting purposes. This will make the approval step faster. Closing is just a matter of renaming the success file:
$APPRAISE/scripts/mturk/close_batches.pl