My workflow for linting YouTube transcripts #577

hippietrail · 2025-02-03T04:55:09Z


YouTube videos are full of real English and also full of common mistakes.

Last year I made a little tool to grab YouTube transcripts to run byte pair encoding on, as used by LLMs. And I just realized it'd be useful for testing Harper too. It's in a GitHub gist here.

Transcripts tend to lack punctuation, so Harper sees them as one giant sentence, and the "sentence too long" lint overrules all other lints. We can avoid that by turning every line into a sentence. It doesn't have to be perfect: we're not correcting the transcripts, just looking for things Harper misses and things it gets wrong. To change each line into a sentence I'm using the *nix tool awk:

bun run fetch-youtube-transcript.ts 9JoXbpKkxBM | awk '{ line = toupper(substr($0, 1, 1)) substr($0, 2); if (line ~ /\.$/) { print line } else { print line "." } }' > harper-test/youtube-transcript.md
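The same line-to-sentence step could also be done in TypeScript, next to the fetch script, instead of in awk. This is only a rough sketch: `lineToSentence` and `transcriptToSentences` are hypothetical names and are not part of the gist. One deliberate difference is flagged in the comments: blank lines are skipped, whereas the awk one-liner would print a lone "." for them.

```typescript
// Hypothetical helper, not part of the gist: mirrors the awk one-liner above.
function lineToSentence(line: string): string {
  // Uppercase the first character and keep the rest unchanged.
  const capitalized = line.charAt(0).toUpperCase() + line.slice(1);
  // Append a period unless the line already ends with one.
  return capitalized.endsWith(".") ? capitalized : capitalized + ".";
}

// Hypothetical helper: applies lineToSentence to every non-blank line.
// Assumption: skipping blank lines is a deviation from the awk version,
// which would emit a lone "." for an empty line.
function transcriptToSentences(transcript: string): string {
  return transcript
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map(lineToSentence)
    .join("\n");
}
```

As with the awk version, this doesn't try to find real sentence boundaries. It only keeps Harper from treating the whole transcript as a single sentence.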

As you can see, I redirect the output to a file with an .md extension, since Markdown is minimally enhanced plain text and plain text doesn't work with harper-cli yet. Then lint the file like this:

harper/target/release/harper-cli lint harper-test/youtube-transcript.md | less -r

There's quite a bit of stuff to ignore, but you'll also find ideas for new lints, and you'll find false positives in some of the current lints too.
