By IBM Research in Cambridge and Harvard SEAS -- more info at seq2seq-vis.io
- Seq2Seq-Vis
- Cite us
- Contributors
- License
- V 0.9 beta -- end April 2018
- V 1.0 -- summer 2018
- V 2.0 -- summer 2019
We require miniconda to create a virtual environment and install all dependencies via scripts. Seq2Seq-Vis currently works with a special version of OpenNMT-py, modified by Sebastian Gehrmann. We provide a script to install this special branch.
git clone https://github.com/HendrikStrobelt/Seq2Seq-Vis.git
cd Seq2Seq-Vis
Then run the following in /Seq2Seq-Vis:
source setup_cpu.sh
cd ..
source Seq2Seq-Vis/setup_onmt_custom.sh
Here we provide example data for a character-based dataset that converts date strings (e.g. "March 03, 1999", "03/03/99") into the base form "mm-dd-yyyy". Download it (~177MB) and unzip it in /Seq2Seq-Vis:
unzip fakedates.zip
python3 server.py --dir 0316-fakedates/
Then open this URL in your browser: http://localhost:8080/client/index.html?in=M a r c h _ 0 3 , 1 9 9 9
Enjoy exploring!
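The example URL passes the input as space-separated characters, with `_` apparently standing in for a space inside the date string. The following is a minimal sketch of building such a URL for other inputs, assuming that convention holds. `char_tokenize` and `build_url` are hypothetical helpers for illustration and are not part of Seq2Seq-Vis.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8080/client/index.html"  # address from the example URL


def char_tokenize(text):
    """Split text into space-separated characters, writing spaces as '_'.

    Assumption: this mirrors the character-level format seen in the example URL.
    """
    return " ".join("_" if ch == " " else ch for ch in text)


def build_url(text, base_url=BASE_URL):
    """Return a client URL whose `in` parameter holds the tokenized input."""
    return base_url + "?in=" + quote(char_tokenize(text))


print(char_tokenize("March 03,1999"))
print(build_url("03/03/99"))
```

Percent-encoding the spaces with `quote` keeps the URL valid when it is pasted into tools that do not escape it automatically.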
Thanks to Samuel Gratzl for contributing a Docker configuration and image. Here are the steps:
- pull image:
docker pull sgratzl/seq2seq-vis
- download the data (~177MB) and unzip:
unzip fakedates.zip
- run container with bound data:
docker run --rm -it -v "${PWD}/0316-fakedates:/data" -p "8080:8080" sgratzl/seq2seq-vis
You can use any model trained with OpenNMT-py to extract your own data. To gain access to the extraction scripts, follow the instructions above to install the modified OpenNMT-py version.
First, create a folder s2s that will hold all the extractions by calling mkdir s2s.
Then, call
python extract_context.py -src $your_input_file \
-tgt $your_target_file \
-model $your_model.pt \
-gpu $your_GPU_id \
-batch_size $your_batch_size
The -gpu option can be omitted for CPU extraction.
You can customize the maximum sequence lengths by setting max_src_len and max_tgt_len in the script. If you want to restrict the number of examples in your state file, you can uncomment the following lines and set the limit to your desired size:
# if bcounter > 100:
# break
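How such a guard limits the output can be sketched in isolation. This assumes `bcounter` is a zero-based counter incremented once per batch, as its name suggests; that has not been checked against the script. `extract_limited` and its `batches` argument are hypothetical stand-ins for the script's extraction loop.

```python
def extract_limited(batches, max_batches=100):
    """Collect batches until the counter exceeds max_batches, then stop."""
    kept = []
    for bcounter, batch in enumerate(batches):  # assumed: counter starts at 0
        if bcounter > max_batches:
            break
        kept.append(batch)
    return kept


kept = extract_limited(range(1000), max_batches=100)
print(len(kept))  # 101: batches 0 through 100 pass the `>` check
```

Under this assumption the limit counts batches rather than individual examples, so the number of stored examples also depends on the batch size.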
The script creates a file at s2s/states.h5. This file is what you need to create the indices for searching.
The script for this is located in this directory at scripts/h5_to_faiss.py.
Call it three times (once for each type of state) with the parameters:
-states s2s/states.h5 # Your states file location
-data [decoder_out, encoder_out, cstar] # One of the three datasets within the states h5 file per call
-output $your_index_name # We recommend just naming them decoder.faiss, encoder.faiss, and context.faiss
-stepsize 100 # Number of batches added to the index at once; you can increase it, but it is bottlenecked by your memory
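The three calls can be sketched as a small Python helper that assembles one command line per state type. The flags and dataset names come from the parameter list above; writing the indices into the s2s folder and the helper name `faiss_commands` are assumptions made for illustration.

```python
# Dataset name inside states.h5 -> recommended index file name.
INDEX_NAMES = {
    "decoder_out": "decoder.faiss",
    "encoder_out": "encoder.faiss",
    "cstar": "context.faiss",
}


def faiss_commands(states="s2s/states.h5", out_dir="s2s", stepsize=100):
    """Build one h5_to_faiss.py command line per state type."""
    commands = []
    for dataset, index_name in INDEX_NAMES.items():
        commands.append([
            "python", "scripts/h5_to_faiss.py",
            "-states", states,
            "-data", dataset,
            "-output", f"{out_dir}/{index_name}",  # assumed output location
            "-stepsize", str(stepsize),
        ])
    return commands


for cmd in faiss_commands():
    print(" ".join(cmd))
    # To actually run a command: subprocess.run(cmd, check=True)
```

The sketch only prints the commands, so it can be inspected before anything touches the states file.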
To generate the dictionary and embedding files, modify this line with the location of your model and call
python VisServer.py
This will also test that your model works with our server as it calls the same API. The script will create three files:
- s2s/embs.h5
- s2s/src.dict
- s2s/tgt.dict
# -- minimal config
model: date_acc_100.00_ppl_1.00_e7.pt # model file
dicts:
  src: src.dict # source dictionary file
  tgt: tgt.dict # target dictionary file
embeddings: embs.h5 # word embeddings for src and tgt
train: train.h5 # training data
# -- OPTIONAL: FAISS indices for Neighborhoods
indexType: faiss # index type should be 'faiss' (or 'annoy')
indices:
  decoder: decoder.faiss # index for decoder states
  encoder: encoder.faiss # index for encoder states
# -- OPTIONAL: model for linear projection
project_model: linear_projection.pkl # pickled scikit-learn model
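Before starting the server, it can help to check that a configuration contains every key from the minimal config. The sketch below works on a plain dictionary so that it needs only the standard library; in practice the dictionary would come from parsing the YAML file, for example with PyYAML's `yaml.safe_load`. `missing_keys` is a hypothetical helper, not part of Seq2Seq-Vis.

```python
# Keys listed under "minimal config"; nested keys use dotted paths.
REQUIRED_KEYS = ["model", "dicts.src", "dicts.tgt", "embeddings", "train"]


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the dotted paths from `required` that are absent in `config`."""
    missing = []
    for path in required:
        node = config
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(path)
                break
            node = node[part]
    return missing


config = {
    "model": "date_acc_100.00_ppl_1.00_e7.pt",
    "dicts": {"src": "src.dict"},  # tgt deliberately left out
    "embeddings": "embs.h5",
    "train": "train.h5",
}
print(missing_keys(config))  # ['dicts.tgt']
```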
usage: server.py [-h] [--nodebug NODEBUG] [--port PORT]
[--dir DIR]
optional arguments:
--nodebug TRUE if not in debug mode
--port port to run system (default: 8080)
--dir directory with s2s.yaml file
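To see how these options behave together, here is a small `argparse` sketch that mirrors the usage text. It is not the server's own code; the integer type for `--port` and the `None` default for `--nodebug` are assumptions.

```python
import argparse


def build_parser():
    """Argument parser mirroring the usage text (not the server's own code)."""
    parser = argparse.ArgumentParser(prog="server.py")
    parser.add_argument("--nodebug", default=None, help="TRUE if not in debug mode")
    parser.add_argument("--port", type=int, default=8080, help="port to run system")
    parser.add_argument("--dir", help="directory with s2s.yaml file")
    return parser


args = build_parser().parse_args(["--dir", "0316-fakedates/"])
print(args.dir, args.port)  # 0316-fakedates/ 8080
```

Passing only `--dir`, as in the quick-start command, leaves the port at its documented default of 8080.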
@ARTICLE{seq2seqvisv1,
author = {{Strobelt}, H. and {Gehrmann}, S. and {Behrisch}, M. and {Perer}, A. and {Pfister}, H. and {Rush}, A.~M.},
title = "{Seq2Seq-Vis: A Visual Debugging Tool for Sequence-to-Sequence Models}",
journal = {ArXiv e-prints},
archivePrefix = "arXiv",
eprint = {1804.09299v1},
primaryClass = "cs.CL",
keywords = {Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Neural and Evolutionary Computing},
year = 2018,
month = April
}
- Hendrik Strobelt (IBM Research & MIT-IBM Watson AI Lab)
- Sebastian Gehrmann (Harvard NLP)
- Alexander M. Rush (Harvard NLP)
- Michael Behrisch (Harvard VCG), Adam Perer (IBM Research), Hanspeter Pfister (Harvard VCG)
- PR #16 signed-off-by: Samuel Gratzl
Seq2Seq-Vis is licensed under the Apache 2 license.