HLS WebVTT subtitle downloading support #6144

Closed
wants to merge 2 commits into from

fstirlitz
Contributor

This should resolve #6106.

The patches are fairly invasive. Since downloading HLS WebVTT subtitles seems to be relatively heavyweight bandwidth-wise, downloading them right away in the extractor is out of the question; but there was no other way to download these subtitles in an offline-viewable form. Hence the subtitle downloading code was hacked to make use of the downloader infrastructure and a custom HLS WebVTT downloader was written.

The code was tested with Python 3.4.3 and 2.7.10 on a few ComCarCoff videos.

@remitamine
Collaborator

I want you to know that WebVTT support in HLS has been added to ffmpeg in this commit: avformat/hlsenc: Add WebVtt support in hls.
Also, isn't it better to change `_extract_m3u8_formats` instead of creating two versions of the function? As I said in this comment, #6106 (comment), I found this type of subtitles on abc7news. You don't always know when this type is used on a particular site, so a generic solution is better.

@fstirlitz
Contributor Author

libavcodec still doesn't have any support for X-TIMESTAMP-MAP in WebVTT, so it's useless here. Extractors can be systematically modified to use the new method; I retained the old API only to avoid having to modify everything at once and make the patches easier to review. It can be done either way, I don't care.
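
For readers unfamiliar with the header in question, here is a minimal sketch of what handling `X-TIMESTAMP-MAP` involves, assuming the usual HLS form `X-TIMESTAMP-MAP=MPEGTS:<90 kHz ticks>,LOCAL:<cue time>`. The function names are hypothetical and are not taken from this patch; the sketch ignores timestamp rollover and does not subtract the video stream's starting timestamp, which a real downloader would probably have to do.

```python
# Illustrative only: helper names are hypothetical, not part of the patch.
MPEGTS_CLOCK_HZ = 90000.0  # MPEG-2 transport stream timestamps tick at 90 kHz


def parse_vtt_time(text):
    """Convert a WebVTT timestamp ('HH:MM:SS.mmm' or 'MM:SS.mmm') to seconds."""
    parts = text.strip().split(':')
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    hours = int(parts[-3]) if len(parts) == 3 else 0
    return hours * 3600 + minutes * 60 + seconds


def timestamp_map_offset(header_line):
    """Return the seconds to add to each cue time in this segment.

    Returns None when the line is not an X-TIMESTAMP-MAP header.
    """
    prefix = 'X-TIMESTAMP-MAP='
    if not header_line.startswith(prefix):
        return None
    # The two fields may appear in either order, so collect them by name.
    fields = dict(
        item.split(':', 1) for item in header_line[len(prefix):].split(','))
    mpegts_seconds = int(fields['MPEGTS']) / MPEGTS_CLOCK_HZ
    return mpegts_seconds - parse_vtt_time(fields['LOCAL'])
```

Each segment carries its own header, so the offset has to be computed per segment before the cues are merged into one file.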

Anyway, developers: is this any good? Should I adapt this to take advantage of #6392? @dstftw, @jaimeMF?

self.to_screen(
    '[hlswebvtt] %s: Downloading manifest' %
    (info_dict['id']))
data = self.ydl.urlopen(url).read()
Contributor

This won't work in case of live HLS WebVTT streams because you constantly get new subtitle segments at the same playlist URL.

It's a shame the X-TIMESTAMP-MAP support patch hasn't been merged into ffmpeg yet after 3 years, but in my use case (vlive.tv) it's not required, so dumping HLS WebVTT via ffmpeg works quite well.
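
To make the live-stream concern concrete, here is a rough sketch of how a downloader might re-read the same playlist URL and pick out only the segments it has not fetched yet. It assumes a media playlist is complete only once it contains `#EXT-X-ENDLIST`; the function names and the `seen` set are hypothetical, and fetching, retry timing and sliding-window expiry are left out.

```python
# Illustrative only: these helpers are hypothetical, not part of the patch.
def parse_media_playlist(text):
    """Return (segment_uris, may_still_grow) for an HLS media playlist."""
    lines = [line.strip() for line in text.splitlines()]
    # Every non-empty line that is not a tag or comment is a segment URI.
    uris = [line for line in lines if line and not line.startswith('#')]
    # Without EXT-X-ENDLIST the server may still append segments.
    may_still_grow = '#EXT-X-ENDLIST' not in lines
    return uris, may_still_grow


def new_segments(text, seen):
    """Return URIs not fetched before, recording them in the `seen` set."""
    uris, may_still_grow = parse_media_playlist(text)
    fresh = []
    for uri in uris:
        if uri not in seen:
            seen.add(uri)
            fresh.append(uri)
    return fresh, may_still_grow
```

A single `urlopen(url).read()` corresponds to calling `new_segments` once; a live-capable downloader would presumably call it in a loop until `may_still_grow` is false.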

_extract_m3u8_formats is renamed to _extract_m3u8_formats_and_subtitles
and extended to properly handle subtitle references; a wrapper with the
old name is provided for compatibility.

_parse_m3u8_formats is likewise renamed and extended, but without adding
the compatibility wrapper; the test suite is adjusted to test the enhanced
method instead.
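
The rename-plus-wrapper pattern described in this commit message can be sketched roughly as follows. This is a simplified stand-in, not the patch's code: the function names, the regular expression, the `'und'` fallback language and the hard-coded `'ext': 'vtt'` are all assumptions made for illustration, and the real methods take more parameters and return richer format dictionaries.

```python
import re

# Matches KEY=value or KEY="quoted value" pairs in an HLS attribute list.
ATTRIBUTE_RE = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^",]*)')


def extract_formats_and_subtitles(master_text):
    """Hypothetical stand-in for the extended method.

    Returns (formats, subtitles) parsed from a master playlist.
    """
    formats, subtitles = [], {}
    lines = master_text.splitlines()
    for index, line in enumerate(lines):
        line = line.strip()
        if line.startswith('#EXT-X-MEDIA:'):
            attrs = dict(
                (key, value.strip('"'))
                for key, value in ATTRIBUTE_RE.findall(line))
            if attrs.get('TYPE') == 'SUBTITLES' and 'URI' in attrs:
                lang = attrs.get('LANGUAGE', 'und')  # assumed fallback
                subtitles.setdefault(lang, []).append(
                    {'url': attrs['URI'], 'ext': 'vtt'})  # ext is assumed
        elif line.startswith('#EXT-X-STREAM-INF:') and index + 1 < len(lines):
            # The variant stream URI is on the line after the tag.
            formats.append({'url': lines[index + 1].strip()})
    return formats, subtitles


def extract_formats(master_text):
    """Compatibility wrapper: old name, old return value (formats only)."""
    formats, _subtitles = extract_formats_and_subtitles(master_text)
    return formats
```

Callers that only need formats keep working unchanged, while updated extractors can unpack both values and merge the subtitles into their result.
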
@jsundram

This code is awesome -- I had just started writing something just like this -- is there a reason it's not in master?

@fstirlitz
Contributor Author

You tell me.

@gschaffner gschaffner mentioned this pull request Jul 27, 2020
9 tasks
@fstirlitz fstirlitz closed this Sep 12, 2020
pukkandan added a commit to yt-dlp/yt-dlp that referenced this pull request Apr 28, 2021
Authored by fstirlitz
Modified from: ytdl-org/youtube-dl#6144

Closes: #73
Fixes:
ytdl-org/youtube-dl#6106
ytdl-org/youtube-dl#14977
ytdl-org/youtube-dl#21438
ytdl-org/youtube-dl#23609
ytdl-org/youtube-dl#28132

Might also fix (untested):
ytdl-org/youtube-dl#15424
ytdl-org/youtube-dl#18267
ytdl-org/youtube-dl#23899
ytdl-org/youtube-dl#24375
ytdl-org/youtube-dl#24595
ytdl-org/youtube-dl#27899

Related:
ytdl-org/youtube-dl#22379
ytdl-org/youtube-dl#24517
ytdl-org/youtube-dl#24886
ytdl-org/youtube-dl#27215

Notes:
* The functions `extractor.common._extract_..._formats` are still kept for compatibility
* Only some extractors have currently been moved to using `_extract_..._formats_and_subtitles`
* Direct subtitle manifests (without a master) are not supported and are wrongly identified as containing video formats
* AES support is untested
* The fragmented TTML subtitles extracted from DASH/ISM are valid, but are unsupported by `ffmpeg` and most video players
    * Their XML fragments can be dumped using `ffmpeg -i in.mp4 -f data -map 0 -c copy out.ttml`.
        Once the unnecessary headers are stripped out of this, it becomes a valid self-contained ttml file
    * The ttml subs downloaded from DASH manifests can also be directly opened with <https://github.com/SubtitleEdit>
* Fragmented WebVTT files extracted from DASH/ISM are also unsupported by most tools
    * Unlike the ttml files, the XML fragments of these cannot be dumped using `ffmpeg`
    * The WebVTT subs extracted from DASH can be parsed by <https://github.com/gpac/gpac>
    * But the validity of those extracted from ISM is untested
nixxo pushed a commit to nixxo/yt-dlp that referenced this pull request Nov 22, 2021
Successfully merging this pull request may close these issues.

Add support for HLS WebVTT subtitles
4 participants