An app for creating transcript pages in the wiki for the Skeptic's Guide to the Universe podcast episodes.

mheguy/transcription-bot

transcription-bot

Transcript creation

Wiki maintenance

This repository has grown into a small monorepo because I wanted to add some wiki maintenance tasks to the bot.
All tools are a fan creation and are neither endorsed by nor associated with the creators of the podcast.

create_podcast_transcript

This tool creates transcripts for episodes of the podcast "The Skeptic's Guide to the Universe".

The output

You can view the output of the bot here: https://www.sgutranscripts.org/w/index.php?title=SGU_Episode_1009&oldid=19910

All new episodes are automatically transcribed. A human editor will typically clean up issues that the bot could not resolve itself. For that reason, the link above points to the specific revision of the page that the bot created, before any human edits.

How it works

This explanation is targeted toward those who have no experience writing or reading code.

Transcription

We create a transcription of an episode. The transcription contains the words that were said as well as the time at which they were spoken.
That looks a bit like this:

0:01.234 - And
0:01.450 - that's
0:01.750 - why

For the sake of our demo, here are the words that we extracted without the timestamps.

And that's why taxes are so fascinating I'm not sure I agree alright, time for a quickie with Bob yes folks we're going to talk about lasers the pew pew kind?

The next layer is diarization. That tells us when a different person is speaking.

SPEAKER_01: 0:01-0:03
SPEAKER_04: 0:04-0:06
SPEAKER_03: 0:07-0:09
SPEAKER_02: 0:10-0:14
SPEAKER_05: 0:14-0:15

We merge the transcription and the diarization:

SPEAKER_01: And that's why taxes are so fascinating.
SPEAKER_04: I'm not sure I agree.
SPEAKER_03: Alright, time for a quickie with Bob.
SPEAKER_02: Yes folks we're going to talk about lasers.
SPEAKER_05: The pew pew kind?
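The merge step can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: the `Word` and `SpeakerTurn` types, the `merge_words_with_turns` function, and the timestamps beyond the first three words are all made up for the example. It assumes each word is assigned to the speaker turn whose time range contains the word's start time.

```python
from dataclasses import dataclass


@dataclass
class Word:
    start: float  # seconds from the start of the episode
    text: str


@dataclass
class SpeakerTurn:
    speaker: str
    start: float  # seconds
    end: float  # seconds


def merge_words_with_turns(
    words: list[Word], turns: list[SpeakerTurn]
) -> list[tuple[str, str]]:
    """Attach each word to the speaker turn that contains its start time."""
    merged: list[tuple[str, list[str]]] = []
    for word in words:
        # Pick the first turn whose range covers the word; skip words in gaps.
        turn = next((t for t in turns if t.start <= word.start <= t.end), None)
        if turn is None:
            continue
        if merged and merged[-1][0] == turn.speaker:
            # Same speaker is still talking: extend the current line.
            merged[-1][1].append(word.text)
        else:
            # A new speaker starts: open a new line.
            merged.append((turn.speaker, [word.text]))
    return [(speaker, " ".join(texts)) for speaker, texts in merged]


# Hypothetical sample data (only the first three timestamps come from the demo).
words = [
    Word(1.234, "And"), Word(1.450, "that's"), Word(1.750, "why"),
    Word(4.100, "I'm"), Word(4.400, "not"), Word(4.700, "sure"),
]
turns = [
    SpeakerTurn("SPEAKER_01", 1.0, 3.0),
    SpeakerTurn("SPEAKER_04", 4.0, 6.0),
]

for speaker, text in merge_words_with_turns(words, turns):
    print(f"{speaker}: {text}")
```

Words that fall in a gap between turns are dropped here for simplicity; a real pipeline would more likely attach them to the nearest turn.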

And then we apply voiceprints we have on file to identify the speakers.

Evan: And that's why taxes are so fascinating.
Cara: I'm not sure I agree.
Steve: Alright, time for a quickie with Bob.
Bob: Yes folks we're going to talk about lasers.
Jay: The pew pew kind?
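For the curious, one common way to match a voice against voiceprints is to compare numeric "fingerprints" (embedding vectors) and pick the closest one. The sketch below assumes that approach; the README does not say how the bot does it, and the `identify_speaker` function, the threshold value, and the tiny three-number vectors are hypothetical.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(
    embedding: list[float],
    voiceprints: dict[str, list[float]],
    threshold: float = 0.7,  # assumed cut-off; below this we report "unknown"
) -> Optional[str]:
    """Return the name of the closest voiceprint, or None if nothing is close enough."""
    best_name: Optional[str] = None
    best_score = threshold
    for name, voiceprint in voiceprints.items():
        score = cosine_similarity(embedding, voiceprint)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical voiceprints; real embeddings have hundreds of dimensions.
voiceprints = {"Steve": [1.0, 0.0, 0.0], "Bob": [0.0, 1.0, 0.0]}

print(identify_speaker([0.9, 0.1, 0.0], voiceprints))  # close to Steve's voiceprint
print(identify_speaker([0.5, 0.5, 0.7], voiceprints))  # not close to anyone on file
```

Returning `None` for a weak match leaves the generic `SPEAKER_XX` label in place, which is useful for guests who have no voiceprint on file.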

At this point, the transcription is completed and we have what is internally called a "diarized transcript".

Segment Data Gathering

The bot has information about all the recurring segment types.
But it needs to know what segments a particular episode contains.

To figure this out, we need data.
The two sources that we use for this data are the show notes web page, and the embedded lyrics in the episode mp3 file.

By combining the data from those two sources, we know what segments the episode contains and the order they are in.

Segmenting the Transcript

To continue the example from above, let's say we know that this episode has a "Quickie" segment.
The bot is programmed to look for the words "quickie with" to find the transition point into the segment.
This enables us to break the full transcript into the episode segments.

Cara: I'm not sure I agree.
== Quickie with Bob: Lasers ==
Steve: Alright, time for a quickie with Bob.

We use templates to ensure that we match the desired formatting for the wiki.
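A simplified sketch of the key-phrase search described above is shown here. The `insert_segment_heading` function is hypothetical and the heading format is copied from the example; the bot's real templates are more involved.

```python
def insert_segment_heading(
    lines: list[str], key_phrase: str, heading: str
) -> list[str]:
    """Insert a wiki heading before the first line that contains the key phrase."""
    result: list[str] = []
    inserted = False
    for line in lines:
        # Case-insensitive match, and only the first occurrence starts the segment.
        if not inserted and key_phrase.lower() in line.lower():
            result.append(f"== {heading} ==")
            inserted = True
        result.append(line)
    return result


transcript = [
    "Cara: I'm not sure I agree.",
    "Steve: Alright, time for a quickie with Bob.",
    "Bob: Yes folks we're going to talk about lasers.",
]

for line in insert_segment_heading(transcript, "quickie with", "Quickie with Bob: Lasers"):
    print(line)
```

If the key phrase never appears, the transcript comes back unchanged, which is the signal to try a different approach.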

When Segmenting is Tricky

It's tricky to identify transitions into news segments. We have no "key words" that reliably tell us when a transition is happening.
So for this case, and as a fallback for all segment types when heuristics don't work, we send a chunk of transcript to GPT and ask it to identify the transition.
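The "heuristics first, GPT as a fallback" idea can be sketched like this. Everything here is illustrative: `find_transition` is a hypothetical name, and `ask_model` stands in for whatever call the bot makes to GPT, so no real API is shown.

```python
from typing import Callable, Optional


def find_transition(
    lines: list[str],
    key_phrases: list[str],
    ask_model: Callable[[list[str]], Optional[int]],
) -> Optional[int]:
    """Return the index of the line where a segment starts, or None if unknown."""
    # Step 1: cheap heuristic. Look for any known key phrase.
    for index, line in enumerate(lines):
        lowered = line.lower()
        if any(phrase.lower() in lowered for phrase in key_phrases):
            return index
    # Step 2: fallback. Hand the chunk to the language model and trust its answer.
    return ask_model(lines)


def fake_model(lines: list[str]) -> Optional[int]:
    """Stand-in for the GPT call; always points at the last line."""
    return len(lines) - 1


chunk = [
    "Cara: That was a great interview.",
    "Steve: Alright, time for a quickie with Bob.",
    "Steve: Jay, tell us about this new battery.",
]

print(find_transition(chunk, ["quickie with"], fake_model))  # found by the heuristic
print(find_transition(chunk, [], fake_model))  # no key phrases, so the model decides
```

News segments would be called with an empty `key_phrases` list, so they always go to the model.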

Other Odds and Ends

We download the image from the show notes page and upload it to the wiki. We add a caption that is generated by GPT (this results in something pretty bland and not specific to the episode).
We load the linked pages to extract the article titles, which are used in the references at the bottom of the wiki pages.

update_wiki_episode_lists

This tool helps maintain the episode lists within https://www.sgutranscripts.org

What It Does

Every 24 hours it gets the most recent changes to the wiki. For each modified episode page it:

  • ensures the episode is present in the appropriate year's page
  • updates the status of the episode based on the templates used in the episode page

Specifically, for the status column it will overwrite the existing value. For all other columns it will only add information, never overwrite or delete it.
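The overwrite rule above can be sketched as a small merge function. The column names and the `merge_episode_row` function are hypothetical; only the rule itself (status is overwritten, other columns are filled in only when empty) comes from the description.

```python
def merge_episode_row(
    existing: dict[str, str], updates: dict[str, str]
) -> dict[str, str]:
    """Merge new data into an episode list row without losing human edits."""
    merged = dict(existing)  # never mutate the caller's row
    for column, value in updates.items():
        if column == "status":
            # The status column always reflects the latest state of the page.
            merged[column] = value
        elif not merged.get(column) and value:
            # Other columns are only filled in when they are currently empty.
            merged[column] = value
    return merged


# Hypothetical row: an editor already filled in the guest column.
row = {"status": "incomplete", "guest": "Added by an editor", "date": ""}
new_data = {"status": "verified", "guest": "Bot guess", "date": "2024-11-09"}

print(merge_episode_row(row, new_data))
```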

Development

uv is used to manage dependencies: https://docs.astral.sh/uv/getting-started/installation/
Once installed, uv sync is all you need to run.

There are a number of required env vars to run the tool. dotenv is set up, so we can place our variables into a .env file.
See .env.sample for required env vars.

Ruff and Pyright should be used for linting, formatting, and type checking:

uv run ruff check
uv run ruff format
uv run pyright
