An extensible, clean implementation of DocumentQA, and a basis for developing RCQA models
- Prepare a harness for tokenization, batch building and evaluation
- Build a basic LSTM -> Dense model that outputs answer spans and a no-answer score, to get the whole training/testing pipeline running
- Think about data cleanup, tokenization and all the other shenanigans of working with SQuAD
  - Lowercasing
  - Dealing with abbreviations
  - Dealing with numbers, dates, etc.
- Add encoding of character-level info as well as word-level info
- Add unit testing for core components
- Make the code GPU-compatible
- Add option to read in a single answer span per question for training
- Make a distinction between train and non-train datasets for proper handling of char/word -> idx mappings
- Run validation on the dev set during training
- Implement BiDAF on top
- Implement self attention as described in DocQA
- Implement memory and runtime profiling
- Add max context size
- Do proper dropout
- Test implementation with self attention
- Use better-structured config objects to pass around, instead of the bajillion parameters used now
- Implement char CNN for char embeddings
- Reproduce DocQA performance
- Add an option to output no-answer probabilities alongside the predictions
- Add encoding of sentence-level info
- Integrate ELMo vectors
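
To illustrate the train vs. non-train distinction for char/word -> idx mappings mentioned above, here is a minimal sketch, not the repository's actual code. The `Vocab` class, its method names and the `<pad>`/`<unk>` tokens are all hypothetical. The idea it assumes is that the mapping grows only while reading the training set, is then frozen, and sends anything unseen in dev/test data to a single unknown id.

```python
from typing import Dict, Iterable, List

# Hypothetical special tokens; the real project may use different ones.
PAD, UNK = "<pad>", "<unk>"


class Vocab:
    """Maps tokens (words or characters) to integer ids."""

    def __init__(self) -> None:
        self.tok2idx: Dict[str, int] = {PAD: 0, UNK: 1}
        self.frozen = False

    def build(self, sequences: Iterable[List[str]]) -> None:
        """Grow the mapping. Intended for the training set only."""
        if self.frozen:
            raise RuntimeError("vocabulary is frozen; cannot add new tokens")
        for seq in sequences:
            for tok in seq:
                # Assign the next free id only if the token is new.
                self.tok2idx.setdefault(tok, len(self.tok2idx))

    def freeze(self) -> None:
        """Stop the mapping from growing (call after reading the train set)."""
        self.frozen = True

    def encode(self, seq: List[str]) -> List[int]:
        """Look up ids; unseen tokens map to the UNK id."""
        unk = self.tok2idx[UNK]
        return [self.tok2idx.get(tok, unk) for tok in seq]


# Toy data standing in for tokenized train and dev questions.
train = [["where", "is", "paris"], ["paris", "is", "in", "france"]]
dev = [["where", "is", "berlin"]]

word_vocab = Vocab()
word_vocab.build(train)   # train set: mapping grows
word_vocab.freeze()       # dev/test: mapping is read-only

char_vocab = Vocab()
char_vocab.build(list(tok) for seq in train for tok in seq)
char_vocab.freeze()

dev_word_ids = word_vocab.encode(dev[0])
dev_char_ids = [char_vocab.encode(list(tok)) for tok in dev[0]]
```

The same class serves both levels here: word sequences are passed as-is, and each word is split into a list of characters for the char-level mapping. "berlin" never appears in the toy training data, so it encodes to the unknown id at the word level, while most of its characters are still known at the char level.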