This project has grown into a small monorepo, as I wanted to add some wiki maintenance tasks alongside the bot.
All tools are fan creations and are neither endorsed by nor associated with the creators of the podcast.
This tool creates transcripts for episodes of the podcast "The Skeptic's Guide to the Universe".
You can view the output of the bot here: https://www.sgutranscripts.org/w/index.php?title=SGU_Episode_1009&oldid=19910
All new episodes are automatically transcribed, and a human editor will typically clean up issues that the bot was not able to resolve itself. For that reason, the link above points to the specific revision of the page that the bot created.
This explanation is targeted toward those who have no experience writing or reading code.
We create a transcription of an episode. The transcription contains the words that were said, along with the time at which each word was spoken.
That looks a bit like this:
0:01.234 - And
0:01.450 - that's
0:01.750 - why
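For readers who are comfortable with code, here is a minimal sketch of this step. It assumes the faster-whisper library; this explanation doesn't name the bot's actual transcription backend, and the file name is a placeholder.

```python
# A minimal sketch of word-level transcription, assuming faster-whisper.
from faster_whisper import WhisperModel

model = WhisperModel("base")  # model size is a placeholder
segments, _info = model.transcribe("episode.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        # Each word carries its own start time in seconds
        print(f"{word.start:.3f} - {word.word.strip()}")
```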
For the sake of our demo, here are the words that we extracted without the timestamps.
And that's why taxes are so fascinating I'm not sure I agree alright, time for a quickie with Bob yes folks we're going to talk about lasers the pew pew kind?
The next layer is diarization. That tells us when a different person is speaking.
SPEAKER_01: 0:01-0:03
SPEAKER_04: 0:04-0:06
SPEAKER_03: 0:07-0:09
SPEAKER_02: 0:10-0:14
SPEAKER_05: 0:14-0:15
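As an illustrative sketch only, speaker turns like the above could be produced with the pyannote.audio library (the bot's actual choice isn't specified in this explanation):

```python
# A minimal diarization sketch, assuming pyannote.audio.
from pyannote.audio import Pipeline

# Pretrained pipelines require a Hugging Face access token
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = pipeline("episode.mp3")

for turn, _track, speaker in diarization.itertracks(yield_label=True):
    # Prints labels like "SPEAKER_01" with start/end times in seconds
    print(f"{speaker}: {turn.start:.1f}-{turn.end:.1f}")
```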
We merge the transcription and the diarization:
SPEAKER_01: And that's why taxes are so fascinating.
SPEAKER_04: I'm not sure I agree.
SPEAKER_03: Alright, time for a quickie with Bob.
SPEAKER_02: Yes folks, we're going to talk about lasers.
SPEAKER_05: The pew pew kind?
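The merge itself is plain bookkeeping: each timestamped word is assigned to whichever speaker turn contains it. A sketch, with all names and data made up for illustration:

```python
# A sketch of the merge step: assign each timestamped word to the
# diarization turn it falls inside. All names here are illustrative.

def merge(words, turns):
    """words: [(time, text), ...]; turns: [(start, end, speaker), ...]"""
    lines = []  # accumulates (speaker, [words]) in order
    for time, text in words:
        speaker = next(
            (s for start, end, s in turns if start <= time <= end),
            "UNKNOWN",
        )
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)  # same speaker keeps talking
        else:
            lines.append((speaker, [text]))  # new speaker turn begins
    return [f"{speaker}: {' '.join(ws)}" for speaker, ws in lines]

words = [(1.2, "And"), (1.5, "that's"), (1.8, "why"), (4.1, "I'm"), (4.3, "not"), (4.5, "sure")]
turns = [(1.0, 3.0, "SPEAKER_01"), (4.0, 6.0, "SPEAKER_04")]
print("\n".join(merge(words, turns)))
# SPEAKER_01: And that's why
# SPEAKER_04: I'm not sure
```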
And then we apply voiceprints we have on file to identify the speakers.
Evan: And that's why taxes are so fascinating.
Cara: I'm not sure I agree.
Steve: Alright, time for a quickie with Bob.
Bob: Yes folks, we're going to talk about lasers.
Jay: The pew pew kind?
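One common way to do this kind of matching is to compare a voice embedding for each unknown speaker against the stored voiceprint embeddings and pick the closest match. A sketch assuming cosine similarity over NumPy vectors; the vectors below are invented for illustration:

```python
# A sketch of voiceprint matching: compare each unknown speaker's voice
# embedding against embeddings we have on file. The embeddings would
# come from a speaker-embedding model; these vectors are made up.
import numpy as np

voiceprints = {  # known hosts -> stored embedding vectors
    "Steve": np.array([0.9, 0.1, 0.3]),
    "Bob": np.array([0.2, 0.8, 0.4]),
}

def identify(embedding, prints):
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Pick the known host whose voiceprint is most similar
    return max(prints, key=lambda name: cosine(embedding, prints[name]))

print(identify(np.array([0.88, 0.15, 0.28]), voiceprints))  # -> Steve
```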
At this point, the transcription is complete and we have what is internally called a "diarized transcript".
The bot has information about all the recurring segment types.
But it needs to know what segments a particular episode contains.
To figure this out, we need data.
The two sources we use for this data are the show notes web page and the lyrics embedded in the episode's MP3 file.
By combining the data from those two sources, we know what segments the episode contains and the order they are in.
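The show notes page is fetched like any other web page. The embedded lyrics are more obscure: they live in the MP3's ID3 tags. A sketch of reading them, assuming the mutagen library (a common choice, not necessarily the bot's):

```python
# A sketch of reading lyrics embedded in an MP3's ID3 tags, assuming
# mutagen. The USLT frame holds "unsynchronised lyrics" text.
from mutagen.id3 import ID3

tags = ID3("episode.mp3")
for frame in tags.getall("USLT"):
    print(frame.text)  # e.g. a plain-text list of the episode's segments
```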
To continue the example from above, let's say we know that this episode has a "Quickie" segment.
The bot is programmed to look for the words "quickie with" to find the transition point into the segment.
This enables us to break the full transcript into the episode segments.
Cara: I'm not sure I agree.
== Quickie with Bob: Lasers ==
Steve: Alright, time for a quickie with Bob.
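The transition search itself can be as simple as scanning the diarized lines for the trigger phrase and inserting a heading in front of the matching line. A sketch with illustrative names:

```python
# A sketch of finding a segment transition by key phrase and inserting
# a wiki-style heading before the line that contains it.

def split_on_phrase(lines, phrase, heading):
    out = []
    for line in lines:
        if phrase in line.lower():
            out.append(f"== {heading} ==")  # wiki-style section heading
        out.append(line)
    return out

lines = [
    "Cara: I'm not sure I agree.",
    "Steve: Alright, time for a quickie with Bob.",
]
print("\n".join(split_on_phrase(lines, "quickie with", "Quickie with Bob: Lasers")))
```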
We use templates to ensure that we match the desired formatting for the wiki.
It's tricky to identify transitions into news segments: there are no key words that reliably tell us when a transition is happening.
So for this case, and as a fallback for all segment types when heuristics don't work, we send a chunk of transcript to GPT and ask it to identify the transition.
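A sketch of what that fallback call might look like, assuming the official openai package; the model name and prompt wording are placeholders, not the bot's actual ones:

```python
# A sketch of the GPT fallback, assuming the openai package.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def find_transition(chunk: str, segment_title: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"In this transcript chunk, quote the line where "
                       f"the segment '{segment_title}' begins:\n\n{chunk}",
        }],
    )
    return response.choices[0].message.content
```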
We download the image from the show notes page and upload it to the wiki, adding a caption generated by GPT (the result is pretty bland and not specific to the episode).
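A sketch of the download-and-upload step, assuming requests for the download and mwclient for the wiki upload; the URL, file names, and credentials are placeholders:

```python
# A sketch of the image step, assuming requests and mwclient.
import io
import requests
import mwclient

image = requests.get("https://example.com/show-notes-image.jpg", timeout=30)

site = mwclient.Site("www.sgutranscripts.org", path="/w/")
site.login("BotUser", "bot-password")  # placeholder credentials
site.upload(
    file=io.BytesIO(image.content),
    filename="SGU_Episode_1009_image.jpg",  # placeholder file name
    description="Caption generated by GPT.",
)
```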
We load the links to extract the article titles which are used in the references at the bottom of the wiki pages.
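A sketch of the title extraction, assuming requests and BeautifulSoup; the URL is a placeholder:

```python
# A sketch of pulling an article title from a linked page's <title> tag.
import requests
from bs4 import BeautifulSoup

def article_title(url: str) -> str:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string.strip() if soup.title and soup.title.string else url

print(article_title("https://example.com/some-news-article"))
```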
This tool helps maintain the episode lists within https://www.sgutranscripts.org
Every 24 hours it fetches the most recent changes to the wiki. For each modified episode page it:
- ensures the episode is present in the appropriate year's page
- updates the status of the episode based on the templates used in the episode page
Specifically, for the status column it will overwrite the existing value. For all other columns it will only add information, never overwrite or delete it.
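A sketch of the recent-changes fetch, using the standard MediaWiki API that the wiki exposes; the exact parameters the tool passes are an assumption:

```python
# A sketch of fetching the last 24 hours of wiki changes via the
# MediaWiki recentchanges API.
from datetime import datetime, timedelta, timezone
import requests

# rcend is the *older* boundary because changes are listed newest-first
since = datetime.now(timezone.utc) - timedelta(hours=24)
response = requests.get(
    "https://www.sgutranscripts.org/w/api.php",
    params={
        "action": "query",
        "list": "recentchanges",
        "rcend": since.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "rclimit": "500",
        "format": "json",
    },
    timeout=30,
)
for change in response.json()["query"]["recentchanges"]:
    print(change["title"])
```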
uv is used to manage dependencies: https://docs.astral.sh/uv/getting-started/installation/
Once installed, `uv sync` is all you need to run.
There are a number of required env vars to run the tool. `dotenv` is set up, so we can place our variables into a `.env` file. See `.env.sample` for the required env vars.
Ruff and Pyright should be used for linting, formatting, and type checking:
`uv run ruff check`
`uv run ruff format`
`uv run pyright`