Deduplicate Feature #797

Open
dslovin opened this issue Sep 14, 2020 · 12 comments

@dslovin

dslovin commented Sep 14, 2020

Occasionally, I get duplicate entries for the same article because I read a feed at the source as well as through an aggregator like Hacker News. I would love to be able to dedupe based on the following fields:

  1. Link
  2. Title
  3. (bonus) Similar titles

(edit for spelling)

@moonheart

Some sites publish both original news posts and copied ones. When I subscribe to these feeds, I always see duplicate articles across several feeds. I would like to deduplicate similar entries across multiple or all feeds.

When adding a new entry, Miniflux would check recent entries and calculate their similarity to it. If an existing entry reaches the configured threshold, the new entry is marked as removed or read.

For the similarity calculation, maybe we can first split the text into words and use cosine similarity, or simply test for equality. Users could configure how similarity is calculated, and whether it is based on the title or the content.
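
The word-splitting and cosine similarity idea could be sketched roughly as follows. This is only an illustration, not anything Miniflux implements: the function names and the 0.8 threshold are made up for the example, and titles are compared as plain bags of words.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"\w+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty title is never considered similar
    return dot / (norm_a * norm_b)


def is_duplicate(new_title, recent_titles, threshold=0.8):
    """True if any recent title reaches the (hypothetical) threshold."""
    return any(cosine_similarity(new_title, old) >= threshold
               for old in recent_titles)
```

For example, is_duplicate("Big news today", ["big news today!"]) is true because both titles reduce to the same three words, while titles with no words in common score 0.0.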

@somini
Contributor

somini commented Feb 13, 2022

Hopefully the "Mark as Read" option is available. That's what I manually do anyway.

@nblock

nblock commented Jun 16, 2022

Since Miniflux relies on PostgreSQL, maybe something like the pg_trgm extension would be useful: https://www.postgresql.org/docs/current/pgtrgm.html
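
To give a feel for what trigram matching measures, here is a small pure-Python sketch loosely modelled on how pg_trgm is documented to work (lowercase words padded with spaces, then shared trigrams divided by all distinct trigrams). It is an approximation for illustration only; the function names are invented here, and a real implementation would call the extension's similarity() function in SQL instead.

```python
import re


def trigrams(text):
    """Set of 3-character substrings, padding each word roughly as pg_trgm does."""
    grams = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (Jaccard index)."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Unlike exact title matching, this tolerates small spelling or punctuation changes, since most trigrams of the two titles still overlap.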

@ajtatum

ajtatum commented Aug 2, 2022

This would be an awesome feature. A lot of the time a writer publishes on their own blog and then reposts somewhere else, so both copies end up in the same Miniflux category with the same title. If Miniflux removed one of the entries (preferably keeping the first), that would be awesome.

@Sieboldianus

Sieboldianus commented Feb 11, 2023

Came here with a slightly different (but related) problem: some of my feeds, largely big newspapers, re-publish the same articles over time. This particularly applies to essays; I think they want to push them a number of times so their website appears "more active" without adding any new information. It is frustrating to see the same posts popping up again and again, and it wastes my time.

I was wondering whether a deduplication feature could also have some temporal comparison check such as "The same article heading was published 1 month ago, 2 years ago etc." to then get hidden from standard view.

Functionally, it would be pretty similar: one needs a persistent table of headings (and timestamps) in Postgres to check against.
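
A rough sketch of that temporal check, using an in-memory dict in place of the persistent Postgres table. The entry shape (dicts with "id", "title" and a datetime "published_at"), the function name and the 30-day default are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def find_republished(entries, min_gap=timedelta(days=30)):
    """Return ids of entries whose title was first seen at least min_gap earlier."""
    first_seen = {}   # normalized title -> datetime of its first appearance
    republished = []
    for entry in sorted(entries, key=lambda e: e["published_at"]):
        # Normalize case and whitespace so trivial differences do not matter.
        key = " ".join(entry["title"].lower().split())
        if key not in first_seen:
            first_seen[key] = entry["published_at"]
        elif entry["published_at"] - first_seen[key] >= min_gap:
            republished.append(entry["id"])
    return republished


entries = [
    {"id": 1, "title": "An Essay", "published_at": datetime(2023, 1, 1)},
    {"id": 2, "title": "an  essay", "published_at": datetime(2023, 3, 1)},
    {"id": 3, "title": "Other", "published_at": datetime(2023, 3, 2)},
    {"id": 4, "title": "Other", "published_at": datetime(2023, 3, 3)},
]
republished_ids = find_republished(entries)
```

Here only entry 2 is flagged: it repeats a heading from two months earlier, whereas entry 4 repeats one from the previous day and falls under the gap.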

@sonor3000

I don't know how it is implemented, but ttrss has such a deduplication feature. Maybe it can help with developing such a feature for Miniflux too!

@tagd

tagd commented Jun 20, 2024

As a workaround, I created a python script using the API to check for repeated URLs and remove duplicates, borrowing from a similar solution, that can be run as a cronjob on your miniflux server.

I might look into triggering it with webhooks in the future, so if anyone works that out, please comment with the details. And if you tidy up the script, I'd be interested to see it, as I don't use Python often.

Note: if you use the get_feeds_w_dupes function, adding RemoveDuplicates to a feed's blocklist rules will include that feed in future scans. If you don't want to use this function, remove_duplicates can be run with a list of feed IDs, like remove_duplicates([1, 2]).

If you also want to check titles, just create another set and aggregate them from the entry["title"] attribute.

# Licensed under MIT license
# Link to discussion: https://github.com/miniflux/v2/issues/797

# Steps to use:
# - Install the Miniflux python client: https://miniflux.app/docs/api.html#python-client
# - Replace rss.example.com with your Miniflux instance url
# - Replace API_KEY with your Miniflux api key which you can get here https://rss.example.com/keys

import re
import miniflux

# Behaviour: Keep first instance based on url, mark subsequent as removed
# Drawbacks: 
#    - Newer versions of article may have updates, 
#       but keeping these would lose read status and other attributes
#    - Repeat article could be so old as to have already been removed,
#       if it's been so long I assume changes might be worth reading
#    - The oldest one might not've been the one read
# Removal process: https://miniflux.app/faq.html#entries-suppression
def remove_duplicates(feed_ids):
    dupe_ids = []
    for feed_id in feed_ids:
        entries = client.get_feed_entries(feed_id=feed_id, order="id", 
            direction="asc", status=["read","unread"])
        seen_urls = set()
        for entry in entries["entries"]:
            if (entry["url"] in seen_urls):
                dupe_ids.append(entry["id"])
                #print("Duplicate found " + entry["title"])
            else:
                seen_urls.add(entry["url"])
    if dupe_ids: # Repeats found
        client.update_entries(dupe_ids, status="removed")

# Get a list of feeds to check for duplicates based on blocklist text.
# To check a feed add "RemoveDuplicates" to the Blocklist_rules box
def get_feeds_w_dupes(all_feeds):
    feeds_ids = []
    for feed in all_feeds:
        if ("RemoveDuplicates" in feed["blocklist_rules"]):
            feeds_ids.append(feed["id"])
            #print("Feed " + str(feed["id"])+ " has rule RemoveDuplicates")
    return feeds_ids

client = miniflux.Client("https://rss.example.com", api_key="API_KEY")

all_feeds = client.get_feeds()

feeds_w_dupes = get_feeds_w_dupes(all_feeds)
remove_duplicates(feeds_w_dupes)

@Sieboldianus

Sieboldianus commented Jun 21, 2024

Great! But this only works on a single-URL basis. If the URL changed but the text remained the same (with, e.g., a slightly changed title), this would not be detected (I am not complaining; this is much better than nothing. Thank you so much!). There are a lot of newspaper feeds that re-publish entries under a slightly changed title every x days. Some kind of semantic similarity would be needed to catch these.

@tagd

tagd commented Jun 25, 2024

Great! But this only works on a single-URL basis. If the URL changed but the text remained the same (with, e.g., a slightly changed title), this would not be detected (I am not complaining; this is much better than nothing. Thank you so much!). There are a lot of newspaper feeds that re-publish entries under a slightly changed title every x days. Some kind of semantic similarity would be needed to catch these.

@Sieboldianus Happy to help! Here's an edit of the main function to pick up on identical titles. I'd also suggest checking whether the articles keep some other attribute that can be detected, like "published_at", which will be a string like "2024-06-10T15:53:17+01:00" and so should be fairly unique.

def remove_duplicates(feed_ids):
    dupe_ids = []
    for feed_id in feed_ids:
        entries = client.get_feed_entries(feed_id=feed_id, order="id", 
            direction="asc", status=["read","unread"])
        seen_urls = set()
        seen_titles = set()
        for entry in entries["entries"]:
            if ((entry["url"] in seen_urls) or (entry["title"] in seen_titles)):
                dupe_ids.append(entry["id"])
                #print("Duplicate found " + entry["title"])
            else:
                seen_urls.add(entry["url"])
                seen_titles.add(entry["title"])
    if dupe_ids: # Repeats found
        client.update_entries(dupe_ids, status="removed")
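
The "published_at" suggestion can be folded into the same pattern by making the compared attributes configurable. The following is a minimal sketch, not part of the script above: find_duplicate_ids is a hypothetical helper that only computes the ids, leaving the update_entries call to the caller, and it assumes entries arrive oldest first as dicts shaped like the items in entries["entries"].

```python
def find_duplicate_ids(entries, keys=("url", "title", "published_at")):
    """Return ids of entries that repeat an earlier entry on any of `keys`."""
    seen = {key: set() for key in keys}
    dupe_ids = []
    for entry in entries:
        values = {key: entry.get(key) for key in keys}
        # Missing attributes (None) are never treated as a match.
        if any(v is not None and v in seen[k] for k, v in values.items()):
            dupe_ids.append(entry["id"])
        else:
            for k, v in values.items():
                if v is not None:
                    seen[k].add(v)
    return dupe_ids


sample = [
    {"id": 1, "url": "a", "title": "T1", "published_at": "2024-06-10T15:53:17+01:00"},
    {"id": 2, "url": "b", "title": "T2", "published_at": "2024-06-10T15:53:17+01:00"},
    {"id": 3, "url": "a", "title": "T3", "published_at": "2024-06-11T00:00:00+01:00"},
    {"id": 4, "url": "c", "title": "T4", "published_at": "2024-06-12T00:00:00+01:00"},
]
sample_dupes = find_duplicate_ids(sample)
```

In this sample, entry 2 is caught by its timestamp and entry 3 by its URL. Passing keys=("url",) reproduces the URL-only behaviour of the first script.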

For fuzzy matching I wrote this function. I've checked that the matching works, but I didn't have any feeds that rename titles, so I haven't validated it against real data. If you could post example feeds, that would be helpful, thanks.

import re
import miniflux
from fuzzywuzzy import fuzz # pip install fuzzywuzzy
                            # Compare strings with https://en.wikipedia.org/wiki/Levenshtein_distance 
import nltk                 # pip install nltk
from nltk.corpus import stopwords # list of words to ignore
nltk.download('stopwords')  # Download stopwords (only needed on first run)

def read_duplicates(feed_ids, sensitivity = 85):
    """
    This function marks articles as read if they are a duplicate of a previously
    seen article, based on similarity of the title.

    Similarity computed by:
    Take the title, remove all punctuation, stop words (words like 'a', 'and', 
        'the', etc) and set to lower case.
    This produces a string of the important words in the title.  
    If the article is read, store this processed title in a set for comparison.
    If unread, check whether it is already in the set; if so, mark it read.
        If not check similar based on Levenshtein distance(LD)
            words are sorted to alphabetical order before computing LD

    Args:
      feed_ids: The ids of feeds to check, like ["1", "2"]
      sensitivity: How similar two strings must be to match (0-100, higher is stricter)

    """
    dupe_ids = []
    stop_words = set(stopwords.words('english')) # words to ignore
    for feed_id in feed_ids:
        seen_titles = set()
        entries = client.get_feed_entries(feed_id=feed_id, order="id", 
            direction="asc", status=["read","unread"])
        """ initially assumed we could see all read then check unread, 
            but then if we see titles like 
                '1.bad thing happens, 2.things bad happens'
            when 2. is marked read for being similar to 1. next run of 
            program first sees 2. in read, then sees 1. which is similar
            so marks 1. as read, even though neither was ever read. 
            So we have to go through by id then check read status
        """
        for entry in entries["entries"]:
            # processed title is title without joining words and lowercase
            processed_title = re.sub(r'\W+', ' ', entry["title"]) # replace non alphanumeric chars with space
            processed_title = ' '.join([word for word in processed_title.lower().split() 
                                   if word not in stop_words])
            if (entry["status"]=="read"):
                seen_titles.add(processed_title)
            else: # only check if unread mark
                if (processed_title in seen_titles):
                    print("Duplicate found "+ processed_title)
                    dupe_ids.append(entry["id"])
                else:
                    for title in seen_titles:
                        print("checking: '"+ processed_title + "' against '" + title + "'")
                        # 1-100, higher = closer matching
                        # token sort will sort words into order before comparing titles
                        if (fuzz.token_sort_ratio(processed_title, title) > sensitivity):
                            dupe_ids.append(entry["id"])
                            seen_titles.add(processed_title) 
                            ''' See matches too; so one article may have multiple seen titles 
                                and we can match between them eg
                                    The white whale > Big white whale > Big whale
                                where the initial title may not match the later title but 
                                by keeping the intermediate we recognise the final version
                            '''
                            break # match found stop checking
                    seen_titles.add(processed_title) 
    if dupe_ids: # Repeats found
        client.update_entries(dupe_ids, status="read")
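
For anyone who wants to try the title normalization without installing fuzzywuzzy or nltk, here is a dependency-free sketch of the same idea. Note the substitutions: it uses difflib's SequenceMatcher ratio from the standard library, which is not Levenshtein distance and will give somewhat different scores, and STOP_WORDS is a tiny illustrative list rather than the nltk one. The function names are made up for the example.

```python
import re
from difflib import SequenceMatcher

# Tiny illustrative stop-word list; the nltk list is much longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def normalize_title(title):
    """Lowercase, strip punctuation and stop words, and sort the remaining words."""
    words = re.sub(r"\W+", " ", title).lower().split()
    return " ".join(sorted(w for w in words if w not in STOP_WORDS))


def title_similarity(a, b):
    """Score from 0 to 100; higher means the normalized titles are closer."""
    matcher = SequenceMatcher(None, normalize_title(a), normalize_title(b))
    return round(100 * matcher.ratio())
```

Because the words are sorted before comparison, "Bad thing happens" and "Thing bad happens" normalize to the same string and score 100.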

@Sieboldianus

Nice, thank you! I will definitely test it.

@didn0t

didn0t commented Jul 17, 2024

As a workaround, I created a python script using the API to check for repeated URLs and remove duplicates, borrowing from a similar solution, that can be run as a cronjob on your miniflux server.

Thanks for your script. I did notice that I had to add a limit parameter to client.get_feed_entries as it defaults to 100.
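
Rather than picking one large limit, the entries could also be fetched page by page. This is a sketch under the assumption that the client forwards keyword arguments such as limit and offset to the API as query parameters; get_all_feed_entries and page_size are names invented here, not part of the Miniflux client.

```python
def get_all_feed_entries(client, feed_id, page_size=100, **filters):
    """Collect every entry of a feed by paging with limit and offset."""
    entries = []
    while True:
        page = client.get_feed_entries(
            feed_id=feed_id, limit=page_size, offset=len(entries), **filters
        )
        batch = page.get("entries") or []
        entries.extend(batch)
        # A short (or empty) page means there is nothing left to fetch.
        if len(batch) < page_size:
            return entries
```

In the script above, the loop body could then iterate over get_all_feed_entries(client, feed_id, order="id", direction="asc", status=["read", "unread"]) instead of entries["entries"].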

@darkdragon-001
Contributor

darkdragon-001 commented Sep 22, 2024

One big change would be to change the 1-to-1 relation from entry to feed to a 1-to-n relation such that the same entry can be part of multiple feeds (deduplication). As Miniflux is not using an ORM, this requires changes around the whole code base.

Update: Miniflux already supports storing an entry's hash based on the GUID or otherwise uses the entry URL:

// Generate the entry hash.
if item.GUID.Data != "" {
	entry.Hash = crypto.Hash(item.GUID.Data)
} else if entryURL != "" {
	entry.Hash = crypto.Hash(entryURL)
}

So it should be possible to implement deduplication on this criterion efficiently.
The entry's content should still be updated when it changes.

There is already deduplication within a single feed based on this hash:

err := tx.QueryRow(`SELECT true FROM entries WHERE feed_id=$1 AND hash=$2`, entry.FeedID, entry.Hash).Scan(&result)

So one probably "just" has to remove the feed filter (and add a user filter instead). The most work would be around changing the schema relation and all usages.
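
To illustrate what dropping the feed filter in favour of a user filter would mean, here is a small Python sketch that operates on plain dicts rather than the database. Everything in it is an assumption for illustration: the entry shape ("id", "user_id", "guid", "url") is not the Miniflux schema, and SHA-256 is assumed as the hash function without checking what crypto.Hash actually uses.

```python
import hashlib


def entry_hash(guid, url):
    """Hash the GUID if present, otherwise the URL; empty string if neither."""
    source = guid or url
    if not source:
        return ""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def find_cross_feed_duplicates(entries):
    """Return ids of entries whose hash was already seen for the same user."""
    seen = set()  # (user_id, hash) pairs, regardless of feed
    dupes = []
    for entry in entries:
        digest = entry_hash(entry.get("guid", ""), entry.get("url", ""))
        if not digest:
            continue  # nothing to compare on
        key = (entry["user_id"], digest)
        if key in seen:
            dupes.append(entry["id"])
        else:
            seen.add(key)
    return dupes
```

The key point is that the lookup is scoped by user instead of by feed, so the same GUID arriving through two different feeds of one user is detected, while other users' entries are unaffected.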
