add video platform #36

Open: wants to merge 1 commit into base: main

@rom1504 rom1504 commented Jan 6, 2023

does not actually work well

rom1504 commented Jan 6, 2023

ideas to fix:

  • whitelist known video platforms
  • use a classifier to exclude non-video links

@rom1504 rom1504 mentioned this pull request Apr 2, 2023

def valid_video_platform_link(link):

We can try something like this

import yt_dlp

# NOTE: generic_extractors must be defined before this runs
# (its definition is given in the follow-up comment).

# Keep every extractor except the generic catch-all ones and those whose
# name contains one of the adult-content keywords.
FILTERED_EXTRACTORS = {ie.IE_NAME: ie for ie in yt_dlp.list_extractor_classes()
                       if ie not in generic_extractors
                       and "porn" not in ie.IE_NAME.lower()
                       and "adult" not in ie.IE_NAME.lower()
                       and "xxx" not in ie.IE_NAME.lower()
                       and "xvideos" not in ie.IE_NAME.lower()
                       and "xhamster" not in ie.IE_NAME.lower()
                       and "redtube" not in ie.IE_NAME.lower()
                       and "xtube" not in ie.IE_NAME.lower()
                       and "xstream" not in ie.IE_NAME.lower()
                       and "xfileshare" not in ie.IE_NAME.lower()
                       and "sex" not in ie.IE_NAME.lower()
                       }


# print(FILTERED_EXTRACTORS.keys())
# print(len(FILTERED_EXTRACTORS.keys()))

def is_link_valid(link, extractors):
    """Check if link matches at least one of the given extractors."""
    # A generator lets any() stop at the first matching extractor.
    return any(ie.suitable(link) for ie in extractors)

def valid_video_platform_link(link):
    """Check if link is a valid video platform link."""
    # bool(link) keeps the return value a real boolean for empty/None links.
    return bool(link) and is_link_valid(link, FILTERED_EXTRACTORS.values())


YT_URL = "https://www.youtube.com/watch?v=jLX0D8qQUBM"
DM_URL = "https://www.dailymotion.com/video/x29ryo7"
print(valid_video_platform_link(YT_URL))
print(valid_video_platform_link(DM_URL))

generic_extractors = [yt_dlp.extractor.generic.GenericIE,
yt_dlp.extractor.lazy_extractors.GenericIE]

Owner Author

tried at #49

one first problem: running these thousands of regexes on every link is actually quite slow. I guess we need to either limit the list or find a way to merge them into one regex to speed things up (I think that should help?)
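A minimal sketch of the "merge them into one" idea, using only the standard `re` module. The patterns in `PATTERNS` are hypothetical stand-ins written for this example, not yt-dlp's real `_VALID_URL` patterns, and `build_merged_regex` / `matching_extractor` are illustrative names. Whether a single alternation is actually faster would need to be measured.

```python
import re

# Hypothetical stand-ins for per-extractor URL patterns (assumption:
# these are NOT the real yt-dlp _VALID_URL patterns).
PATTERNS = {
    "youtube": r"https?://(?:www\.)?youtube\.com/watch\?v=[\w-]{11}",
    "dailymotion": r"https?://(?:www\.)?dailymotion\.com/video/\w+",
    "vimeo": r"https?://(?:www\.)?vimeo\.com/\d+",
}

def build_merged_regex(patterns):
    """Combine all patterns into one alternation, one named group each."""
    parts = [f"(?P<{name}>{pattern})" for name, pattern in patterns.items()]
    return re.compile("|".join(parts))

MERGED = build_merged_regex(PATTERNS)

def matching_extractor(link):
    """Return the name of the pattern that matches link, or None."""
    match = MERGED.match(link)
    return match.lastgroup if match else None

print(matching_extractor("https://www.youtube.com/watch?v=jLX0D8qQUBM"))
print(matching_extractor("https://example.com/"))
```

One caveat: real extractor patterns often contain their own named groups (such as `id`), which would collide if combined this naively, so they would need renaming or stripping first.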

Owner Author

also let's try the age_limit property in yt-dlp
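A rough sketch of what an `age_limit`-based filter could look like, as an alternative to keyword matching on extractor names. It assumes each extractor class exposes a numeric `age_limit` attribute; that should be checked against the installed yt-dlp version. `FakeFamilyIE`, `FakeAdultIE` and `drop_age_restricted` are hypothetical names used only for illustration.

```python
def drop_age_restricted(extractors, max_age=0):
    """Keep extractors whose age_limit (treated as 0 if missing) is <= max_age."""
    return [ie for ie in extractors
            if (getattr(ie, "age_limit", 0) or 0) <= max_age]

# Hypothetical stand-in classes; real code would pass extractor classes.
class FakeFamilyIE:
    IE_NAME = "family"
    age_limit = 0

class FakeAdultIE:
    IE_NAME = "adult-site"
    age_limit = 18

kept = drop_age_restricted([FakeFamilyIE, FakeAdultIE])
print([ie.IE_NAME for ie in kept])
```

This only filters at the platform level; a per-video age limit would require actually extracting metadata for each link, which is much slower.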

Thanks! Another option is to do it in 2 steps:

  • first check that the domain is among ~2000 selected domains (taken from the yt_dlp extractors' _TESTS)
  • then check whether the url is a valid video url (with the regexes from yt_dlp)

This version is more than 100x faster (but is less exhaustive)

@vinyesm vinyesm Nov 16, 2023

import yt_dlp
from urllib.parse import urlparse

# Generic catch-all extractors are excluded from the filtered set.
generic_extractors = [yt_dlp.extractor.generic.GenericIE,
                      yt_dlp.extractor.lazy_extractors.GenericIE]

FILTERED_EXTRACTORS = {ie.IE_NAME: ie for ie in yt_dlp.list_extractor_classes()
                       if ie not in generic_extractors
                       and "porn" not in ie.IE_NAME.lower()
                       and "adult" not in ie.IE_NAME.lower()
                       and "xxx" not in ie.IE_NAME.lower()
                       and "xvideos" not in ie.IE_NAME.lower()
                       and "xhamster" not in ie.IE_NAME.lower()
                       and "redtube" not in ie.IE_NAME.lower()
                       and "xtube" not in ie.IE_NAME.lower()
                       and "xstream" not in ie.IE_NAME.lower()
                       and "xfileshare" not in ie.IE_NAME.lower()
                       and "sex" not in ie.IE_NAME.lower()
                       }

def extract_test(extractor):
    """Return the test URLs declared on an extractor (_TEST or _TESTS)."""
    tests = []
    if hasattr(extractor, "_TEST"):
        tests = [extractor._TEST["url"]]
    elif hasattr(extractor, "_TESTS"):
        tests = [x["url"] for x in extractor._TESTS]
    return tests

def normalize_domain(domain):
    """Lowercase the domain and strip a leading "www."."""
    domain = domain.lower()
    if domain.startswith("www."):
        domain = domain[4:]
    return domain

def extract_domain(url):
    """Return the normalized network location of a url."""
    parsed_url = urlparse(url)
    domain = parsed_url.netloc
    return normalize_domain(domain)

# Maps each domain seen in the test URLs to the extractors that use it.
DOMAIN_DICT = {}

for extractor in FILTERED_EXTRACTORS.values():
    for url in extract_test(extractor):
        domain = extract_domain(url)
        if domain in DOMAIN_DICT:
            DOMAIN_DICT[domain] = DOMAIN_DICT[domain] + [extractor]
        else:
            DOMAIN_DICT[domain] = [extractor]

def is_link_suitable(link, extractors):
    """Check if link matches at least one of the given extractors."""
    return any(ie.suitable(link) for ie in extractors)

def is_link_valid(link, domain_dict):
    """Check if link's domain is known, then test only that domain's extractors."""
    is_valid = False
    domain = extract_domain(link)
    if domain in domain_dict:
        is_valid = is_link_suitable(link, domain_dict[domain])
    return is_valid

def valid_video_platform_link(link):
    """Check if link is a valid video platform link."""
    # bool(link) keeps the return value a real boolean for empty/None links.
    return bool(link) and is_link_valid(link, DOMAIN_DICT)

YT_URL = "https://www.youtube.com/watch?v=jLX0D8qQUBM"
DM_URL = "https://www.dailymotion.com/video/x29ryo7"
DM_URL2 = "https://geo.dailymotion.com/player.html?video=x89eyek&mute=true"
print(valid_video_platform_link(YT_URL))
print(valid_video_platform_link(DM_URL))
print(valid_video_platform_link(DM_URL2))

import time

# Rough timing over 1000 lookups of the same URL.
start_time = time.time()
for x in [DM_URL] * 1000:
    valid_video_platform_link(x)
print("--- %s seconds ---" % (time.time() - start_time))
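One possible refinement, sketched under assumptions: the exact-match domain lookup above might miss subdomains such as `geo.dailymotion.com` when only the parent domain appears among the test URLs. The sketch below walks up the parent domains before giving up. `candidate_domains` and `lookup_extractors` are hypothetical helper names, and the toy `domain_dict` stands in for `DOMAIN_DICT`.

```python
from urllib.parse import urlparse

def candidate_domains(url):
    """Yield the host of url, then each parent domain, most specific first."""
    # hostname (unlike netloc) drops any port and user info.
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Stop before the bare top-level label (e.g. "com").
    for i in range(len(labels) - 1):
        yield ".".join(labels[i:])

def lookup_extractors(url, domain_dict):
    """Return the extractors for the most specific known domain, or []."""
    for domain in candidate_domains(url):
        if domain in domain_dict:
            return domain_dict[domain]
    return []

# Toy mapping standing in for DOMAIN_DICT (assumption for the example).
toy_dict = {"dailymotion.com": ["DailymotionIE placeholder"]}
url = "https://geo.dailymotion.com/player.html?video=x89eyek&mute=true"
print(list(candidate_domains(url)))
print(lookup_extractors(url, toy_dict))
```

The extractor's own regex still has to accept the URL afterwards, so this only widens the first step of the two-step check.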

Owner Author

ok I put the code there #49

it's faster indeed!

Labels: None yet
Projects: Status: Waiting for user input
2 participants