[YouTube] Fix hashtags links extraction and escape HTML links #1032

Merged


@AudricV AudricV commented Feb 24, 2023

The webCommandMetadata object is nested inside a commandMetadata object, so it is not accessible from the root of the navigationEndpoint object.

The corresponding statement has been moved to the bottom of the endpoint-specific parsing because the webCommandMetadata object is present almost everywhere; checking it first would have changed the URLs of some endpoints, such as uploader URLs (from channel IDs to handles).
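To illustrate the ordering described above, here is a minimal sketch, not the extractor's actual code. It uses plain `java.util.Map` instead of the project's JSON types; the class and method names are hypothetical, the `browseEndpoint`/`browseId` handling is a simplified assumption about the payload, and returning `null` when nothing matches is also an assumption.

```java
import java.util.Map;

// Hypothetical sketch: specific endpoints are checked first, and the nested
// commandMetadata.webCommandMetadata.url is only used as a last resort.
final class NavigationEndpointSketch {
    private static final String BASE = "https://www.youtube.com";

    // Returns the nested object stored under key, or null if absent or not an object.
    @SuppressWarnings("unchecked")
    private static Map<String, Object> child(Map<String, Object> parent, String key) {
        Object value = parent == null ? null : parent.get(key);
        return value instanceof Map ? (Map<String, Object>) value : null;
    }

    static String urlFrom(Map<String, Object> navigationEndpoint) {
        // 1. Specific endpoint (assumed shape): keeps channel URLs based on channel IDs.
        Map<String, Object> browse = child(navigationEndpoint, "browseEndpoint");
        if (browse != null && browse.get("browseId") instanceof String) {
            String id = (String) browse.get("browseId");
            if (id.startsWith("UC")) {
                return BASE + "/channel/" + id;
            }
        }
        // 2. Fallback: webCommandMetadata lives inside commandMetadata, not at the root.
        Map<String, Object> web =
                child(child(navigationEndpoint, "commandMetadata"), "webCommandMetadata");
        if (web != null && web.get("url") instanceof String) {
            return BASE + web.get("url");
        }
        // 3. Nothing matched: no exception is thrown (null is this sketch's choice).
        return null;
    }

    public static void main(String[] args) {
        Map<String, Object> hashtag = Map.of("commandMetadata",
                Map.of("webCommandMetadata", Map.of("url", "/hashtag/newpipe")));
        System.out.println(urlFrom(hashtag));
    }
}
```

With this ordering, an endpoint that carries both a channel ID and a handle-based web URL still resolves to the channel-ID URL, while a hashtag endpoint that only has the nested web URL is resolved through the fallback.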

getUrlFromNavigationEndpoint no longer throws a ParsingException, and neither do getTextFromObject, getUrlFromObject and getTextAtKey, so the methods that were catching ParsingExceptions thrown by them had to be updated.

URLs returned in the HTML version of getTextFromObject are now escaped properly to provide valid HTML to clients. This has also been done for attribute descriptions, including the description text of this type of description.
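As a rough illustration of why the escaping matters, the following sketch builds an HTML link from a text run and an optional URL. It is not the extractor's implementation: the class and method names are hypothetical, the escaping is hand-written here only to keep the example self-contained (the real code may rely on a library), and escaping the visible text as well is a choice made for this sketch.

```java
// Hypothetical sketch of producing valid HTML for a linked text run.
final class HtmlLinkSketch {

    // Escapes the characters that are significant in HTML text and attribute values.
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Wraps the text in an anchor when a URL is available; otherwise returns escaped text.
    static String toHtml(String text, String url) {
        String safeText = escapeHtml(text);
        if (url == null) {
            return safeText;
        }
        return "<a href=\"" + escapeHtml(url) + "\">" + safeText + "</a>";
    }

    public static void main(String[] args) {
        System.out.println(toHtml("video", "https://www.youtube.com/watch?v=abc&t=10"));
    }
}
```

A URL with several query parameters contains a raw `&`, which must appear as `&amp;` inside the `href` attribute for the output to be valid HTML; clients that parse the HTML get the original URL back.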

As YouTube descriptions are in HTML format (except for the fallback to the JSON player response, which is plain text and is only used when there is no visual metadata or after a breaking change), the URLs returned are escaped, so tests that check for the presence of URLs containing characters that are now escaped had to be updated (this was only the case for YoutubeStreamExtractorDefaultTest.DescriptionTestUnboxing).

I've also updated the mocks of two test classes of YoutubeCommentsExtractorTest, which were missing, in order to fully test my changes: RepliesTest and FormattingTest.

Fixes #1019 (for real this time)
Related issue: TeamNewPipe/NewPipe#9774

@AudricV AudricV added the bug (Issue is related to a bug) and youtube (service, https://www.youtube.com/) labels Feb 24, 2023
@AudricV AudricV marked this pull request as draft February 24, 2023 21:35
…descriptions + HTML links

@AudricV AudricV force-pushed the yt_fix-comments-hashtags-links-extraction branch from 801f767 to a63f289 Compare February 26, 2023 17:51
@AudricV AudricV marked this pull request as ready for review February 26, 2023 17:57

@Stypox Stypox left a comment

Thank you for the thorough PR description. Code looks good to me. Thanks!

@Stypox Stypox merged commit 19e4b21 into TeamNewPipe:dev Mar 1, 2023
@AudricV AudricV deleted the yt_fix-comments-hashtags-links-extraction branch March 1, 2023 10:09
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[YouTube] ParsingException for hashtags links in descriptions and comments
2 participants