This is our final project for the Pattern Recognition course. Our abstract is below; the full report is in pr-final-project.pdf.
In this project, we combine three neural network models to translate speech emotion into facial expression: a CNN for audio signal processing and emotion recognition, a GAN for generating the final image carrying the recognized emotion and facial expression, and a pre-trained CNN for enhancing the GAN's output. Our pilot approach uses a cycleGAN and feeds the CNN's emotion predictions into it as restricting attributes. These attributes constrain the GAN so that the generated result is shaped not only by the data distribution the generator has learned but also by additional attributes drawn from the multi-modal features, letting us steer it toward the expected output. We evaluate and test the model on several datasets to verify the robustness and accuracy of the combined network. The result is a form of "machine imagination", which can be interpreted as speech-to-image translation.
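The core idea above (an emotion classifier whose output conditions the generator) can be sketched as follows. This is a minimal NumPy illustration, not our actual model: the layer sizes, the 7-class emotion head, and the single-layer "generator" are all hypothetical stand-ins chosen to show how the classifier's softmax probabilities are concatenated with the latent code as restricting attributes.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical emotion-recognition head: maps 128-d audio features
# to scores over 7 emotion classes (sizes are illustrative only).
W_cls = rng.normal(size=(128, 7))

def emotion_attributes(audio_feat):
    """Softmax emotion probabilities used as conditioning attributes."""
    return softmax(audio_feat @ W_cls)

# Toy stand-in for the conditional generator: the latent code z is
# concatenated with the emotion attributes before being decoded,
# so the output depends on both the learned distribution and the attributes.
W_gen = rng.normal(size=(64 + 7, 32 * 32))

def generate_face(z, attrs):
    x = np.concatenate([z, attrs], axis=-1)  # condition on emotion
    img = np.tanh(x @ W_gen)                 # fake "image" in [-1, 1]
    return img.reshape(-1, 32, 32)

audio_feat = rng.normal(size=(1, 128))
z = rng.normal(size=(1, 64))
attrs = emotion_attributes(audio_feat)
face = generate_face(z, attrs)
print(face.shape)  # (1, 32, 32)
```

In the real pipeline the generator is a cycleGAN and the classifier is a trained CNN; the point here is only the conditioning mechanism, i.e. how the recognized emotion restricts what the generator produces.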