
Real-Time Speech to Image Generation

ASU HACKML 2022

This project was inspired by the OpenAI paper Zero-Shot Text-to-Image Generation, which introduces an autoregressive model called DALL·E, trained on images that are broken into discrete segments and paired with natural-language descriptions.

Using the AssemblyAI audio transcription API, we are able to reproduce elements of the zero-shot capabilities presented in the DALL·E paper, in real time, with a much less complex model trained on a summarized version of the training data, analogous to the meta-dataset described in the paper Less is More: Summary of Long Instructions is Better for Program Synthesis.

This is only possible because of the AssemblyAI API's robust feature set, which allows seamless integration with the machine learning model and the web-interface framework, together with the corrective language modeling it uses to repair malformed input and isolate sentences as they are spoken.
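For illustration, the real-time loop looks roughly like the following Python sketch (the repo itself implements this in the Node.js client under realtime-transcription). It assumes AssemblyAI's v2 real-time WebSocket endpoint and message format as documented at the time, a valid API key in API_KEY, and a hypothetical on_sentence callback that forwards each finalized sentence to the image model:

```python
# A minimal sketch, not the repo's implementation: stream microphone audio
# to AssemblyAI's real-time WebSocket API and hand every finalized sentence
# to a callback. Endpoint, message format, and field names follow
# AssemblyAI's v2 real-time docs; verify against the current API.
import asyncio
import base64
import json

import pyaudio     # microphone capture (pip install pyaudio)
import websockets  # WebSocket client (pip install websockets)

API_KEY = "YOUR_ASSEMBLYAI_KEY"  # assumption: supply your own key
URL = "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000"

async def transcribe(on_sentence):
    mic = pyaudio.PyAudio().open(format=pyaudio.paInt16, channels=1,
                                 rate=16000, input=True,
                                 frames_per_buffer=3200)
    async with websockets.connect(
            URL, extra_headers={"Authorization": API_KEY}) as ws:
        await ws.recv()  # wait for the SessionBegins message

        async def send_audio():
            while True:
                chunk = mic.read(3200, exception_on_overflow=False)
                payload = base64.b64encode(chunk).decode("utf-8")
                await ws.send(json.dumps({"audio_data": payload}))
                await asyncio.sleep(0.1)

        async def receive_text():
            async for message in ws:
                msg = json.loads(message)
                # FinalTranscript messages carry a complete, punctuated sentence
                if msg.get("message_type") == "FinalTranscript" and msg.get("text"):
                    on_sentence(msg["text"])

        await asyncio.gather(send_audio(), receive_text())

if __name__ == "__main__":
    # on_sentence would forward the text to the generator (model.py);
    # printing stands in for that here.
    asyncio.run(transcribe(on_sentence=print))
```

Splitting on FinalTranscript messages is what lets whole sentences, rather than partial fragments, reach the generator.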

Install and Build the Environment

You will need Node.js and Python 3.

Starting from the top-level directory of /MLHackathon2022/:

  1. python -m venv venv
  2. source venv/bin/activate
  3. pip install -r requirements.txt
  4. cd realtime-transcription
  5. npm install
  6. In a separate terminal, run npm run server
  7. In a separate terminal, run npm run client
  8. cd ../Lafite to switch to the Lafite directory
  9. Download the pretrained COCO 2014 CLIP text-feature file (a sanity-check sketch follows this list; sudo is not needed since the file is written to the current directory): wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1tMD6MWydRDMaaM7iTOKsUK-Wv2YNDRRt' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1tMD6MWydRDMaaM7iTOKsUK-Wv2YNDRRt" -O "COCO2014_CLIP_ViTB32_all_text.pkl" && rm -rf /tmp/cookies.txt
  10. In a separate terminal, run python model.py
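Going by its name, the file downloaded in step 9 contains precomputed CLIP ViT-B/32 text features for the COCO 2014 captions. As a quick sanity check that the download succeeded (Google Drive sometimes returns an HTML error page instead of the file), you can try unpickling it; the object's exact structure is an assumption here, so adapt the inspection to whatever it actually holds:

```python
# Sanity-check sketch for step 9: a successful download should unpickle
# cleanly; a failed Drive download (an HTML page) will raise instead.
# The printed structure is exploratory; the pickle's layout is an assumption.
import pickle

with open("COCO2014_CLIP_ViTB32_all_text.pkl", "rb") as f:
    data = pickle.load(f)

print(type(data))
if hasattr(data, "keys"):
    print(list(data.keys())[:5])  # peek at the first few entries
```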

References and Resources

AssemblyAI Real-Time Streaming Transcription

Zero-Shot Text-to-Image Generation

This OpenAI paper introduces DALL·E, an autoregressive transformer that generates images from text descriptions by modeling text and image tokens as a single stream of data, and evaluates its performance zero-shot.

@article{DBLP:journals/corr/abs-2102-12092,
  author    = {Aditya Ramesh and
               Mikhail Pavlov and
               Gabriel Goh and
               Scott Gray and
               Chelsea Voss and
               Alec Radford and
               Mark Chen and
               Ilya Sutskever},
  title     = {Zero-Shot Text-to-Image Generation},
  journal   = {CoRR},
  volume    = {abs/2102.12092},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.12092},
  eprinttype = {arXiv},
  eprint    = {2102.12092},
  timestamp = {Tue, 02 Mar 2021 12:11:01 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2102-12092.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

LAFITE: Towards Language-Free Training for Text-to-Image Generation

The model proposed in this paper achieves comparable performance on zero-shot benchmarks with roughly 1% of the model size and training data used by DALL·E.

@article{DBLP:journals/corr/abs-2111-13792,
  author    = {Yufan Zhou and
               Ruiyi Zhang and
               Changyou Chen and
               Chunyuan Li and
               Chris Tensmeyer and
               Tong Yu and
               Jiuxiang Gu and
               Jinhui Xu and
               Tong Sun},
  title     = {{LAFITE:} Towards Language-Free Training for Text-to-Image Generation},
  journal   = {CoRR},
  volume    = {abs/2111.13792},
  year      = {2021},
  url       = {https://arxiv.org/abs/2111.13792},
  eprinttype = {arXiv},
  eprint    = {2111.13792},
  timestamp = {Thu, 02 Dec 2021 09:37:06 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2111-13792.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Training Generative Adversarial Networks with Limited Data

This paper presents an adaptive discriminator augmentation mechanism that stabilizes GAN training when only limited training data is available.

@article{DBLP:journals/corr/abs-2006-06676,
  author    = {Tero Karras and
               Miika Aittala and
               Janne Hellsten and
               Samuli Laine and
               Jaakko Lehtinen and
               Timo Aila},
  title     = {Training Generative Adversarial Networks with Limited Data},
  journal   = {CoRR},
  volume    = {abs/2006.06676},
  year      = {2020},
  url       = {https://arxiv.org/abs/2006.06676},
  eprinttype = {arXiv},
  eprint    = {2006.06676},
  timestamp = {Wed, 17 Jun 2020 14:28:54 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2006-06676.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Learning Transferable Visual Models From Natural Language Supervision

CLIP is an OpenAI model that embeds images and text in a shared space, producing the image-text similarity scores the GAN-based approaches above need in order to condition generation and to judge how well a generated image matches its description.

@article{DBLP:journals/corr/abs-2103-00020,
  author    = {Alec Radford and
               Jong Wook Kim and
               Chris Hallacy and
               Aditya Ramesh and
               Gabriel Goh and
               Sandhini Agarwal and
               Girish Sastry and
               Amanda Askell and
               Pamela Mishkin and
               Jack Clark and
               Gretchen Krueger and
               Ilya Sutskever},
  title     = {Learning Transferable Visual Models From Natural Language Supervision},
  journal   = {CoRR},
  volume    = {abs/2103.00020},
  year      = {2021},
  url       = {https://arxiv.org/abs/2103.00020},
  eprinttype = {arXiv},
  eprint    = {2103.00020},
  timestamp = {Thu, 04 Mar 2021 17:00:40 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2103-00020.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
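
As a concrete illustration of how CLIP scores an image against a caption, here is a minimal sketch using OpenAI's clip package; generated.png and the caption are placeholder inputs:

```python
# Minimal sketch: score a generated image against a text prompt with CLIP.
# Requires OpenAI's clip package (pip install git+https://github.com/openai/CLIP.git).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: any image file and any candidate caption.
image = preprocess(Image.open("generated.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a red bird sitting on a branch"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity in the shared embedding space; higher means
    # the image matches the caption better.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    score = (image_features @ text_features.T).item()

print(f"CLIP image-text similarity: {score:.3f}")
```

This shared-space similarity is what the text-to-image pipelines above optimize and evaluate against.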