# NMT Program Mining

LODA Program Mining using Neural Machine Translation (NMT). Contributed by JUrban and barakeel.

## Create Training Data

The training data is generated from `loda-programs` and stored in the `TRDIR` folder.

```bash
export TRDIR=/traindata
cd loda-programs/oeis
for i in `ls -d */`; do mkdir -p $TRDIR/$i; cd $i; for j in *; do perl -pe 's/;.*//; s/^ *//; s/mov/M/; s/add/A/; s/sub/S/; s/trn/T/; s/mul/U/; s/div/D/; s/dif/F/; s/mod/O/; s/pow/P/; s/gcd/G/; s/bin/B/; s/cmp/C/; s/min/I/; s/max/X/; s/lpb/L/; s/lpe/E/; s/clr/R/; s/seq/Q/; s/,/# /g; s/[\$]/\$ /g; s/([0-9])/$1 /g; s/\-/~ /g;' $j | tr "\n" " " | perl -pe 's/^ *//; s/ +/ /g' > $TRDIR/$i/$j; done; cd ..; done
```
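
For illustration, a two-instruction program such as

```
mov $1,2
pow $1,$0
```

is encoded to the single line

```
M $ 1 # 2 P $ 1 # $ 0
```

Comments and indentation are stripped, each opcode becomes one capital letter, `,` becomes `#`, and `$`, digits, and `-` are split into separate tokens.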

Translate back to LODA:

```bash
perl -pe 's/M/; mov/g; s/A/; add/g; s/S/; sub/g; s/T/; trn/g; s/U/; mul/g; s/D/; div/g; s/F/; dif/g; s/O/; mod/g; s/P/; pow/g; s/G/; gcd/g; s/B/; bin/g; s/C/; cmp/g; s/I/; min/g; s/X/; max/g; s/L/; lpb/g; s/E/; lpe/g; s/R/; clr/g; s/Q/; seq/g; s/# */,/g; s/[\$] */\$/g; s/([0-9]) */$1/g; s/\~ */-/g; s/^; *//; s/^ *//; s/ *$//; s/ +/ /g'
```
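
Applied to the encoded example above, this inverse mapping reproduces the program, with its instructions joined by `;` on one line:

```
mov $1,2; pow $1,$0
```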

## Model Training

The folder `TRDIR` contains the training data, and `MODDIR` is where the model will be stored.

```bash
export CUDA_VISIBLE_DEVICES=0
time python -m nmt.nmt \
  --encoder_type=bi \
  --learning_rate=0.4 \
  --max_gradient_norm=3.0 \
  --attention=scaled_luong \
  --num_units=512 \
  --num_gpus=1 \
  --batch_size=512 \
  --src=miz --tgt=ferp \
  --vocab_prefix=$TRDIR/vocab \
  --train_prefix=$TRDIR/train1 \
  --dev_prefix=$TRDIR/dev \
  --test_prefix=$TRDIR/test \
  --out_dir=$MODDIR/model \
  --num_train_steps=20000 \
  --steps_per_stats=20000 \
  --steps_per_external_eval=20000 \
  --tgt_max_len=180 \
  --tgt_max_len_infer=280 \
  --src_max_len=100 \
  --num_layers=2 \
  --dropout=0.2 \
  --metrics=bleu \
  --beam_width=240 \
  --num_translations_per_input=240 > $TRDIR/train.log-tr-1
```
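
The `nmt` framework builds its data file names from the prefixes combined with the `--src`/`--tgt` values used as file extensions (the `miz`/`ferp` names themselves carry no meaning here). Assuming the standard tensorflow/nmt conventions, `TRDIR` is expected to contain:

```
vocab.miz    vocab.ferp
train1.miz   train1.ferp
dev.miz      dev.ferp
test.miz     test.ferp
```

where the `.miz` files hold the encoded input sequences and the `.ferp` files the corresponding encoded programs, one example per line.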

## Generate Programs (Inference)

```bash
time python -m nmt.nmt \
  --beam_width=240 \
  --num_translations_per_input=240 \
  --out_dir=$MODDIR/model \
  --inference_input_file=$inp \
  --inference_output_file=$out >> $TRDIR/train.log-inf
```

`$inp` is a file with the OEIS sequences for which we want program predictions, encoded like this:

```
_ # _ # _ # _ # _ # _ # 1 2 6 0 # 5 #
~ 1 2 8 # ~ 1 2 8 # 1 9 2 # ~ 1 2 8 # 3 2 # 3 2 # ~ 4 8 # 3 2 # ~ 8 # ~ 8 # 1 2 # ~ 8 # 2 # 2 # ~ 3 # 2 #
1 # 1 # 1 # 1 # 1 # 1 # 1 # 0 # 1 # 1 # 1 # 0 # 0 # 1 # 0 # 0 #
7 5 # 6 6 # 5 7 # 4 8 # 4 1 # 3 4 # 2 7 # 2 2 # 1 7 # 1 2 # 9 # 6 # 3 # 2 # 1 # 0 #
```

We use only the first 16 input values (this choice is arbitrary) and turn any value with more than 6 digits into `_`. A minus sign (`-`) becomes `~`. We also reverse the sequence (probably not needed with our current NMT parameters).
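
A minimal Python sketch of this input encoding, under our reading of the format above (`encode_terms` is a hypothetical helper, not part of this repository):

```python
def encode_terms(terms, max_terms=16, max_digits=6):
    """Encode a list of OEIS terms into one NMT input line:
    take the first max_terms values, reverse them, replace values
    with more than max_digits digits by "_", map "-" to "~",
    space out the digits, and terminate every value with "#"."""
    tokens = []
    for v in reversed(terms[:max_terms]):
        s = str(v).replace("-", "~")
        if len(s.lstrip("~")) > max_digits:  # assumption: the sign does not count as a digit
            tokens.append("_ #")
        else:
            tokens.append(" ".join(s) + " #")
    return " ".join(tokens)

# Reproduces the last example line above:
print(encode_terms([0, 1, 2, 3, 6, 9, 12, 17, 22, 27, 34, 41, 48, 57, 66, 75]))
# 7 5 # 6 6 # 5 7 # 4 8 # 4 1 # 3 4 # 2 7 # 2 2 # 1 7 # 1 2 # 9 # 6 # 3 # 2 # 1 # 0 #
```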

## Validating and Submitting Programs

Add a generator with `"version": 8` to the LODA `miners.json` config. Set `batchFile` to the file that contains your generated programs. Add a new miner config (`nmt` below) and reference the new generator by its name.

```json
{
  "miners": [
    {
      "name": "nmt",
      "overwrite": "none",
      "enabled": true,
      "backoff": true,
      "generators": [
        "v8"
      ],
      "matchers": [
        "direct",
        "linear1",
        "linear2",
        "binary",
        "decimal"
      ]
    }
  ],
  "generators": [
    {
      "name": "v8",
      "version": 8,
      "batchFile": "/path/to/oeis-loda-nmt-predictall"
    }
  ]
}
```

Run the validation (if you run in client mode, found programs will be submitted automatically):

```bash
loda mine -i nmt
```

To parallelize it, edit the `miners.json` file as follows (see the sketch below):

- create multiple `generators`, each pointing to a different chunk of the batch file
- create multiple `miners` (with different names), each referencing one of those generators
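
A sketch of such a configuration, assuming the batch file has been split into two chunks (all names and paths are illustrative):

```json
{
  "miners": [
    {
      "name": "nmt-1",
      "overwrite": "none",
      "enabled": true,
      "backoff": true,
      "generators": [ "v8-1" ],
      "matchers": [ "direct", "linear1", "linear2", "binary", "decimal" ]
    },
    {
      "name": "nmt-2",
      "overwrite": "none",
      "enabled": true,
      "backoff": true,
      "generators": [ "v8-2" ],
      "matchers": [ "direct", "linear1", "linear2", "binary", "decimal" ]
    }
  ],
  "generators": [
    { "name": "v8-1", "version": 8, "batchFile": "/path/to/oeis-loda-nmt-predictall.chunk1" },
    { "name": "v8-2", "version": 8, "batchFile": "/path/to/oeis-loda-nmt-predictall.chunk2" }
  ]
}
```

Each miner can then run in its own process, e.g. `loda mine -i nmt-1` and `loda mine -i nmt-2`.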

To limit the evaluation time per program, use the `-z` option. To allow updating existing programs (slower), set `overwrite` to `auto` or `all`.