Add function to detect non-news URLs? #91

philbudne · 2024-10-04T18:25:33Z

Both rss-fetcher and story-indexer contain tests for non-news URLs based on the NON_NEWS_DOMAINS list from mcmetadata's urls.py.

rss-fetcher uses:

tasks.py:            if s.domain in mcmetadata.urls.NON_NEWS_DOMAINS:

which only catches cases where the fully qualified domain name (FQDN) is EXACTLY what appears in NON_NEWS_DOMAINS, while story-indexer has a utility function that also matches anything INSIDE the embargoed domains:

def non_news_fqdn(fqdn: str) -> bool:
    """
    check if a FQDN (fully qualified domain name, i.e. DNS name)
    is (in) a domain embargoed as "non-news"

    maybe belongs in mcmetadata??
    """
    # could be written as "any" on a comprehension:
    # looks like that's 15% slower in Python 3.10,
    # and harder for me to... comprehend!
    fqdn = fqdn.lower()
    for nnd in NON_NEWS_DOMAINS:
        if fqdn == nnd or fqdn.endswith("." + nnd):
            return True
    return False

I'd like to be able to use this function in rss-fetcher!
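To make the difference between the two checks concrete, here is a minimal sketch. The domain list is a made-up stand-in (not the real contents of NON_NEWS_DOMAINS), and `exact_match` is only a hypothetical name for what the rss-fetcher membership test effectively does:

```python
# Stand-in list for illustration only; NOT the real mcmetadata list.
NON_NEWS_DOMAINS = ["example.com", "example.org"]


def exact_match(fqdn: str) -> bool:
    """Hypothetical wrapper around the rss-fetcher style membership test."""
    return fqdn in NON_NEWS_DOMAINS


def non_news_fqdn(fqdn: str) -> bool:
    """The story-indexer style check: exact match or any name inside the domain."""
    fqdn = fqdn.lower()
    for nnd in NON_NEWS_DOMAINS:
        if fqdn == nnd or fqdn.endswith("." + nnd):
            return True
    return False


# Print both results side by side for a few sample names.
for name in ["example.com", "blog.example.com", "EXAMPLE.com", "notexample.com"]:
    print(name, exact_match(name), non_news_fqdn(name))
```

With this stand-in list, a subdomain such as blog.example.com slips past the exact test but is caught by the suffix test, while notexample.com is rejected by both because the suffix test requires a leading dot.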

NOTE: this code assumes NON_NEWS_DOMAINS is all lower case, which is currently... the case, but that is not enforced/guaranteed, so maybe that could be added as well?!
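One way the lower-case assumption might be enforced, as a rough sketch only: either normalize the list once before matching, or add a check (for instance in a unit test) that reports any entry that is not already lower case. The names `_NON_NEWS_LOWER`, `is_non_news_sketch` and `mixed_case_entries` are hypothetical, and the list is a stand-in with a deliberately mixed-case entry:

```python
from typing import Iterable, List

# Stand-in list with a deliberately mixed-case entry; NOT the real list.
NON_NEWS_DOMAINS = ["Example.COM", "example.org"]

# Option 1: normalize once, so matching never depends on how entries were typed.
_NON_NEWS_LOWER = tuple(d.lower() for d in NON_NEWS_DOMAINS)


def is_non_news_sketch(fqdn: str) -> bool:
    """Suffix-aware check against the normalized copy of the list."""
    fqdn = fqdn.lower()
    for nnd in _NON_NEWS_LOWER:
        if fqdn == nnd or fqdn.endswith("." + nnd):
            return True
    return False


# Option 2: a helper a unit test could use to insist the list stays lower case.
def mixed_case_entries(domains: Iterable[str]) -> List[str]:
    """Return the entries that are not already all lower case."""
    return [d for d in domains if d != d.lower()]
```

A test could then assert that `mixed_case_entries(NON_NEWS_DOMAINS)` is empty, which would fail loudly if someone adds a mixed-case entry later.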


philbudne mentioned this issue Oct 4, 2024

Investigate rss-fetcher returning non-news URLs mediacloud/rss-fetcher#44

Open

philbudne mentioned this issue Oct 5, 2024

Add function urls.is_non_news_domain #93

Merged

pgulley closed this as completed in #93 Oct 8, 2024

pgulley closed this as completed in cd548eb Oct 8, 2024
