Proposer
22/11/2018
In Natural Language Processing (NLP), training robust neural models requires a significant amount of data, which can be created artificially. However, models that generate data artificially may end up reproducing text segments from the training text used to build them. This is a problem particularly when the training instances contain sensitive information, such as the names of patients. Increasing the lexical variety of artificial text is therefore an exciting research subject.

The objective of this work is to study two existing text generation approaches: a variational learning approach (http://aclweb.org/anthology/D18-1354) and an adversarial learning approach (http://aclweb.org/anthology/N18-1122). You will apply them to the generation of Amazon Product Reviews on Electronics (http://jmcauley.ucsd.edu/data/amazon). Since the first approach is specifically designed to generate diverse text, you will investigate how to increase lexical variety using the second approach. More precisely, you will change its reinforced training objective to take semantic similarity into account.

The main activities in the project will be:
- study the two approaches (http://aclweb.org/anthology/D18-1354 and http://aclweb.org/anthology/N18-1122);
- re-implement the first approach using a deep learning framework, for instance PyTorch;
- extend the second approach to enrich the variety of the generated text.
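To give an intuition for "addressing semantic similarity in the reinforced training objective", the sketch below shows one possible reward shape: score generated text highly when it preserves the meaning of a reference but uses different surface wording. This is a minimal illustration, not the method of either cited paper; the tiny synonym table, the bag-of-words "semantic" space, and the weight alpha are all illustrative assumptions (a real project would use a sentence encoder instead).

```python
# Hypothetical reward sketch: reward = alpha * semantic similarity
#                                      - (1 - alpha) * lexical overlap,
# so paraphrases score higher than verbatim copies of the reference.
# The synonym table below is a toy stand-in for a real sentence encoder.
from collections import Counter
from math import sqrt

SYNONYMS = {"great": "good", "excellent": "good", "item": "product"}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def lexical_bow(text: str) -> Counter:
    """Bag of surface tokens (measures verbatim overlap)."""
    return Counter(text.lower().split())


def semantic_bow(text: str) -> Counter:
    """Bag of canonical tokens: synonyms collapse to one id (toy semantics)."""
    return Counter(SYNONYMS.get(t, t) for t in text.lower().split())


def reward(generated: str, reference: str, alpha: float = 0.6) -> float:
    """Higher when meaning is preserved but the wording differs."""
    sem = cosine(semantic_bow(generated), semantic_bow(reference))
    lex = cosine(lexical_bow(generated), lexical_bow(reference))
    return alpha * sem - (1 - alpha) * lex
```

With this shaping, a paraphrase such as "excellent item works well" receives a higher reward against the reference "great product works well" than an exact copy does, which is the pressure toward lexical variety that the project aims to build into the adversarial approach.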
No special requirements.
Contact the supervisor directly.