On the one hand, we have the DjangoCon videos on YouTube; on the other, the transcripts of the talks provided by the speech-to-text transcribers. We'd like to combine the two into subtitles.
This proof of concept:
- Downloads the automatic YouTube captions for a given video
- Matches each subtitle block to its equivalent in the transcript using difflib
- Produces an SRT subtitle file ready for re-uploading
Possible next steps:
- Take it a step further and upload the SRT file back to YouTube (easy).
- Create a proper command-line tool that, given a YouTube video ID (or link) and its .txt transcript, does it all.
- Create a Django app that does the same thing AND manages OAuth2 properly. For now, the token has to be copied and pasted by hand.
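The matching and SRT steps listed above can be sketched roughly as follows. This is not the notebook's code, only a minimal illustration of the idea. It assumes the automatic captions have already been parsed into `(start_seconds, end_seconds, text)` tuples and that the transcript is one plain string. The names `align_captions`, `srt_timestamp` and `to_srt` are made up for this example.

```python
import bisect
import difflib


def normalize(word):
    """Lowercase a word and drop punctuation so both texts compare fairly."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


def align_captions(captions, transcript):
    """Replace each caption's text with the matching slice of the transcript.

    captions: list of (start_seconds, end_seconds, text) tuples (assumed format).
    Returns a list of (start_seconds, end_seconds, transcript_text) tuples.
    """
    if not captions:
        return []
    t_words = transcript.split()
    c_words, bounds = [], []
    for _, _, text in captions:
        bounds.append(len(c_words))  # index of the first word of this caption
        c_words.extend(text.split())

    matcher = difflib.SequenceMatcher(
        None,
        [normalize(w) for w in c_words],
        [normalize(w) for w in t_words],
        autojunk=False,
    )
    # Map every matched caption word index to its transcript word index.
    mapping = {}
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size):
            mapping[a + k] = b + k
    matched = sorted(mapping)

    # A caption starts at the transcript position of its first matched word.
    cuts = [0]
    for bound in bounds[1:]:
        pos = bisect.bisect_left(matched, bound)
        cuts.append(mapping[matched[pos]] if pos < len(matched) else len(t_words))
    cuts.append(len(t_words))

    return [
        (start, end, " ".join(t_words[cuts[i]:cuts[i + 1]]))
        for i, (start, end, _) in enumerate(captions)
    ]


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(blocks):
    """Render (start, end, text) tuples as the contents of an SRT file."""
    return "\n".join(
        f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        for i, (start, end, text) in enumerate(blocks, 1)
    )
```

Calling `to_srt(align_captions(captions, transcript))` would then give the text of a subtitle file that keeps the caption timings but uses the transcript's wording. Transcript words that match nothing in the captions end up attached to the preceding block, which is a simplification.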
YouTube actually provides this functionality ("sync=true" in the insert API, and something similar on the website), but I'm an expert at discovering this sort of thing after writing the code.
Included: a proof-of-concept IPython notebook and the SRT files it produced.