Subsearch is an API for indexing YouTube subtitles and searching them.
Given a YouTube video, its TTML subtitles are downloaded with yt-dlp.
They are parsed, converted to a JSON document, and fed into Elasticsearch.
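As a rough illustration of the parsing step, here is a minimal sketch that turns TTML text into JSON-ready dictionaries, one per subtitle line. The function name, the field names (video_id, start, end, text) and the sample data are assumptions made for this example; the real api.py may structure its documents differently.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by TTML documents for their elements.
TTML_NS = "{http://www.w3.org/ns/ttml}"

# Hypothetical sample, much shorter than a real subtitle file.
SAMPLE = """<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
  <body><div>
    <p begin="00:00:01.300" end="00:00:03.400">All right, so here we are</p>
    <p begin="00:00:03.400" end="00:00:07.000">in front of the   elephants</p>
  </div></body>
</tt>"""


def ttml_to_docs(ttml_text, video_id):
    """Return one dict per <p> element (field names are assumptions)."""
    root = ET.fromstring(ttml_text)
    docs = []
    for p in root.iter(TTML_NS + "p"):
        # Join nested text nodes and collapse runs of whitespace.
        text = " ".join(" ".join(p.itertext()).split())
        if not text:
            continue
        docs.append({
            "video_id": video_id,
            "start": p.get("begin"),
            "end": p.get("end"),
            "text": text,
        })
    return docs


if __name__ == "__main__":
    print(json.dumps(ttml_to_docs(SAMPLE, "jNQXAC9IVRw"), indent=2))
```

Each dictionary could then be indexed as its own document, or the list could be stored as a field of a single per-video document; which of the two Subsearch does is not stated here.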
When you search for a phrase, the Elasticsearch (ES) index is queried. The ID, highlights, and timestamps of each hit identify the relevant video and timestamp, from which a link is built.
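The link-building step can be sketched as follows. This assumes the timestamp is stored in the TTML "HH:MM:SS.mmm" form and that the link uses YouTube's t parameter in whole seconds; both helper names are hypothetical and not taken from api.py.

```python
def timestamp_to_seconds(timestamp):
    """Convert "HH:MM:SS.mmm" to whole seconds, dropping the fraction."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + int(float(seconds))


def build_link(video_id, start):
    """Build a watch URL that starts playback at the given timestamp."""
    return (
        "https://www.youtube.com/watch"
        f"?v={video_id}&t={timestamp_to_seconds(start)}s"
    )


if __name__ == "__main__":
    print(build_link("jNQXAC9IVRw", "00:00:03.400"))
```

Truncating to whole seconds means playback starts slightly before the matched line rather than after it.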
There's also an external crawler that uses Google Trends to find hot topics and keywords to feed to the API.
First, start the API:
./api.py
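Search the index for a word or phrase:
curl "localhost:2000/search/emancipate"
Request a download for a single video by its ID:
curl -XPOST "localhost:2000/request_download/jNQXAC9IVRw"
Request a download by channel handle:
curl -XPOST "localhost:2000/request_download/@TheOffice"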
The following command downloads the subtitles from the first 10 search results for the term "query".
You can change the 10 to download a different number of videos.
curl -XPOST "localhost:2000/request_download/ytsearch10:query"
To keep growing the index, you need to constantly supply the API with new videos.
I made a crawler that gets hot topics and keywords from Google Trends and feeds them to the API.
To run it:
./crawler.py
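To give an idea of how keywords could be fed to the API, here is a minimal sketch that turns a keyword into a request_download call using the ytsearchN: form shown earlier. It is not the actual crawler.py: the function names are hypothetical, the keyword list is made up, and it assumes the API accepts percent-encoded keywords in the URL path.

```python
import urllib.parse
import urllib.request

API_BASE = "http://localhost:2000"  # same host and port as the curl examples


def build_request_url(keyword, count=10, base=API_BASE):
    """Build the request_download URL for the first `count` search results."""
    # Assumption: the API decodes percent-encoded path segments.
    encoded = urllib.parse.quote(keyword, safe="")
    return f"{base}/request_download/ytsearch{count}:{encoded}"


def request_download(keyword, count=10):
    """POST the request to a running API and return the HTTP status code."""
    request = urllib.request.Request(
        build_request_url(keyword, count), method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical keywords; the real crawler takes them from Google Trends.
    for keyword in ["solar eclipse", "world cup"]:
        print(build_request_url(keyword))
```

Only build_request_url runs without a server; request_download needs the API to be listening on port 2000.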