alternative scheduler #643

Closed
shizunge opened this issue Apr 21, 2020 · 3 comments

Comments

@shizunge
Contributor

shizunge commented Apr 21, 2020

RSS feeds are a good example of a long-tail distribution. A few feeds update many times an hour, while many personal blogs update only a few times per month.

Miniflux today uses a round-robin scheduler: all feeds are fetched at the same frequency. This does not fit the long tail of RSS feeds.

I propose the following flow, which lets Miniflux fetch feeds based on how often they update.

  1. Introduce three config flags / environment variables:
    POLLING_SCHEDULER ->
    "ROUND_ROBIN": The default scheduler.
    "INVERSE_COUNT": This scheduler sets the polling frequency based on the number of articles published in the previous week. It polls more active feeds more often and less active feeds less often. The maximum number of polls is still subject to "POLLING_FREQUENCY" and "BATCH_SIZE". If you have many feeds that do not update often, this scheduler reduces the total number of polls, at the cost of higher latency for less active feeds.
    If no valid value is provided, the default scheduler "ROUND_ROBIN" is used.
    SCHEDULER_INVERSE_COUNT_MIN_INTERVAL -> default 5 minutes
    SCHEDULER_INVERSE_COUNT_MAX_INTERVAL -> default 24 hours

  2. Update the database schema: add a new column "next_check_at" to the feed table, defaulting to now().

  3. In "func (s *Storage) NewBatch(batchSize int) (jobs model.JobList, err error)":

Order the query by "next_check_at" instead of "last_checked_at", and require "next_check_at" to be earlier than now(), i.e. the feed must be due, not scheduled in the future.

The total number of feeds fetched is still subject to "POLLING_FREQUENCY" and "BATCH_SIZE".

  4. In "func (h *Handler) RefreshFeed(userID, feedID int64)":

4.1. h.store.FeedByID(userID, feedID) should also return the entry count for the past 7 days, including "removed" items.

4.2. If using "ROUND_ROBIN", the interval is always POLLING_FREQUENCY.

If using "INVERSE_COUNT", calculate the average interval between two updates, based on the entry count and the flags in step 1.

Set "next_check_at" to now()+interval.

4.3. When calling h.store.UpdateFeed(originalFeed), update "next_check_at".

I can work on this, but I would like to hear your opinion first.

shizunge changed the title from "heuristic scheduler" to "alternative scheduler" on Apr 22, 2020
@fguillot
Member

That sounds good. You can submit a PR if you have time.

@pdewacht
Contributor

pdewacht commented May 7, 2020

For my own use, I added a hack to enforce a minimum duration between checks for rarely updated feeds: pdewacht@1d8ed6d
I didn't submit a PR because it's too ugly and ad hoc, but it works well for my purposes.

@fguillot
Member

PR #646 has been merged. I renamed this alternative scheduler to entry_frequency.


3 participants