My workflow for linting YouTube transcripts #577
hippietrail started this conversation in Show and tell
YouTube videos are full of real English and also full of common mistakes.
Last year I made a little tool to grab YouTube transcripts to run byte pair encoding on, as used by LLMs. And I just realized it'd be useful for testing Harper too. It's in a GitHub gist here.
Transcripts tend to lack punctuation, which makes them look like one giant sentence to Harper, and the "sentence too long" lint overrules all other lints. We can avoid that by turning every line into a sentence. It doesn't have to be perfect, since we're not correcting the transcripts; we're just looking for things Harper misses and things it gets wrong.
To change each line into a sentence I'm using the *nix tool `awk`. I redirect its output to a file with an `.md` extension, since Markdown is minimally enhanced plain text and plain text doesn't work with `harper-cli` yet. So then do this:
`harper/target/release/harper-cli lint harper-test/youtube-transcript.md | less -r`
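For the line-to-sentence step, a minimal sketch might look like the following. It is not the exact command from this post: the function name `lines_to_sentences` is a hypothetical helper, and the choice to capitalize the first letter and append a period is an assumption about what "turning every line into a sentence" needs to do.

```shell
# Hypothetical helper: turn each non-empty transcript line into a rough sentence.
# Reads the files given as arguments, or stdin when there are none.
lines_to_sentences() {
  awk '
    NF {
      # Trim leading and trailing whitespace (including a stray carriage return).
      sub(/^[ \t]+/, "")
      sub(/[ \t\r]+$/, "")
      # Capitalize the first character of the line.
      line = toupper(substr($0, 1, 1)) substr($0, 2)
      # Add a period unless the line already ends in sentence punctuation.
      if (line !~ /[.!?]$/) line = line "."
      print line
    }
  ' "$@"
}
```

Assuming the raw transcript is in a file such as `transcript.txt` (a placeholder name), it could be used as `lines_to_sentences transcript.txt > harper-test/youtube-transcript.md` before running the `harper-cli lint` command. Blank lines are dropped, which is fine here because the goal is only to break up the text, not to preserve its layout.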
There's quite a bit of stuff to ignore, but you'll also find ideas for new lints, and you'll find false positives in some of the current lints too.