Poll jobs by comparing both priority and insert_time #344

my8100 · 2019-06-25T10:20:28Z

This PR use project_priority_map to store (priority, -timestamp) as value,
in order to find out the queue to pop.

Fix #187 (the updated test_poll_next() demonstrates the effect).
Also, provide backward compatibility for custom SqliteSpiderQueue
and JsonSqlitePriorityQueue.

codecov · 2019-06-25T12:35:39Z

Codecov Report

Merging #344 into master will decrease coverage by 0.02%.
The diff coverage is 67.64%.

@@            Coverage Diff             @@
##           master     #344      +/-   ##
==========================================
- Coverage   68.37%   68.35%   -0.03%     
==========================================
  Files          17       17              
  Lines         860      891      +31     
  Branches      104      112       +8     
==========================================
+ Hits          588      609      +21     
- Misses        242      250       +8     
- Partials       30       32       +2

Impacted Files	Coverage Δ
scrapyd/sqlite.py	`89.28% <100%> (+0.39%)`	⬆️
scrapyd/spiderqueue.py	`95.65% <100%> (+0.41%)`	⬆️
scrapyd/poller.py	`74.07% <60.71%> (-12.14%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ff7c1c...f017268. Read the comment docs.

codecov · 2019-06-25T12:35:39Z

Codecov Report

Merging #344 into master will increase coverage by 0.26%.
The diff coverage is 73.33%.

@@            Coverage Diff             @@
##           master     #344      +/-   ##
==========================================
+ Coverage   68.37%   68.63%   +0.26%     
==========================================
  Files          17       17              
  Lines         860      915      +55     
  Branches      104      117      +13     
==========================================
+ Hits          588      628      +40     
- Misses        242      252      +10     
- Partials       30       35       +5

Impacted Files	Coverage Δ
scrapyd/spiderqueue.py	`96% <100%> (+0.76%)`	⬆️
scrapyd/poller.py	`75% <62.96%> (-11.21%)`	⬇️
scrapyd/sqlite.py	`86.76% <79.31%> (-2.13%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4520c45...1a0cb2b. Read the comment docs.

my8100 · 2019-06-26T05:12:38Z

@Digenis
How do you think about this?

Digenis · 2019-06-26T05:43:59Z

Your solution implements the second idea in #187.

Your code reveals that the second idea has another drawback.
A for-loop in the poller, querying all queues,
if more processes write/read queues,
will cause race conditions to occur more often.

Digenis · 2019-06-26T05:44:51Z

The compatibility attribute trick was a nice idea by the way

my8100 · 2019-06-26T05:49:26Z

But the original code also iterates the queues,
and if msg is not None: can handle concurrency problem,
so can the new code.

scrapyd/scrapyd/poller.py

Lines 20 to 26 in 4520c45

    
           for p, q in iteritems(self.queues): 
        
               c = yield maybeDeferred(q.count) 
        
               if c: 
        
                   msg = yield maybeDeferred(q.pop) 
        
                   if msg is not None:  # In case of a concurrently accessed queue 
        
                       returnValue(self.dq.put(self._message(msg, p)))

my8100 · 2019-06-26T05:59:29Z

FYI: #198

Digenis · 2019-06-26T07:03:22Z

But the original code also iterates the queues,

It was stopping on the first non-empty queue.

Unless I misunderstand returnValue(), maybe I do.
Here's what I think it does.
msg = yield maybeDeferred(q.pop) gives control to twisted to call q.pop.
Then twisted sends the result back to the function, which stores it to msg resuming execution.
yield/send, although used here only for giving/receiving control of the execution,
actually make the function a generator, so we can't use a return statement for our result.
As a workaround, twisted has returnValue() which raises an exception to stop there.
Does execution continue after returnValue() is called?

my8100 · 2019-06-26T08:28:30Z

poll() ends when returnValue() is called.

Can you explain ‘race conditions‘ in details, maybe with an example?

my8100 · 2019-06-26T15:53:48Z

How about the third commit?
I save the highest priority of each queue in the queue_priority_map,
and get it updated whenever needed,
so that poll() can figure out the highest priority of all queues instantly.

Digenis · 2019-06-26T18:20:02Z

A multiple queue consumer scenario.
Suppose a slow consumer iterates all the queues
and collects the top job of each queue.
Meanwhile, a fast consumer does the same
and also pops the job with the highest priority among all queues.
The slow consumer will then get an empty result when trying to pop the same job
and will start over.
This things are more likely to happen
if the process of finding the top job of all queues takes more time / cpu circles.

Digenis · 2019-06-26T20:25:07Z

Also, I just realized that the FIFO principal is still violated.

A limit of 2 parallel processes.
2 empty fresh queues.

On hour 13:00 projectA gets 5 jobs scheduled, with queueA ids (in sqlite) 1,2,3,4,5

By hour 13:05, projectA's job with queueA id 1 has finished and 2, 3 are running
while projectB gets 5 jobs scheduled, with queueB id 1,2,3,4,5

By hour 13:10, projectA's job with queueA id 2 finishes
and projectB job with queueB id 1 is starting.

projectA job with queue id 3, although scheduled at 13:00,
will have to wait for projectB queue id 1 and 2,
which were scheduled 10 minutes later

Of course this is not exactly the current behaviour of your branch
but may have been your next commit — selecting & sorting ids too.
Currently, the priority scope is global
but same priorities revert back to the old behaviour.

The in-queue id defines a priority too — the insertion order.
However, we can't compare insertion orders of different queues.
If we did, we would also end up with "arbitrary" order.
(Not exactly arbitrary but I won't open a parentheses
because it's not a useful scheduling policy in this context)

Digenis · 2019-06-26T21:02:51Z

So, how do we solve this while keeping the multiple queue approach?
To make things easy, suppose all queues were different tables in the same db.

SELECT 'projectA', id FROM queueA
UNION ALL
SELECT 'projectB', id FROM queueB
ORDER BY priority DESC -- then what column?
-- the information about the global insertion order was never stored
LIMIT 1;

At this point, unless you have a business-critical non-upgradeable legacy system
you probably give up any attempt to do a backwards compatible fix
that'd go in 1.3.
It's dirty fixes all the way from here:

In the poller, an equivalent of ORDER BY RANDOM(), priority DESC
(which I just decided to name random-robin priority queues).

In all queues, a new datatime column, saving the insertion time,
then ORDER BY priority DESC, insert_time ASC

A master queue saving a global insertion order

CREATE TABLE queue (id INTEGER PRIMARY KEY,
                    project TEXT NOT NULL,
                    foreign_id INTEGER NOT NULL);

which is close to the unified queue idea,
only more complicated implementing all the rest simultaneously.

Of all the desperate attempts to keep backwards compatibility,
the one with the insertion time is the most acceptable,
although of debatable compatibility.
It even qualifies as something to test a unified queue against.
Multiple tables + datetime definitely add a probability of error
but FIFO will be violated only in sub-second resolutions
and bothering about this is like bothering about the time it takes
for a scheduling request to cross the network.
I'd still like to see the queue unified,
even in a single-db multiple tables schema.
Let's discuss and test the datetime trick against the single table approach.

my8100 · 2019-06-27T10:24:41Z

A multiple queue consumer scenario.
Suppose a slow consumer iterates all the queues

So, if cancel.json is called while poll() is iterating all the queues with yield maybeDeferred,
chances are that msg = yield maybeDeferred(q.pop) may return None?

            for p, q in iteritems(self.queues):
                c = yield maybeDeferred(q.count)
                if c:
                    msg = yield maybeDeferred(q.pop)
                    if msg is not None:  # In case of a concurrently accessed queue
                        returnValue(self.dq.put(self._message(msg, p)))

my8100 · 2019-06-27T10:40:58Z

Also, I just realized that the FIFO principal is still violated.

A limit of 2 parallel processes.
2 empty fresh queues.

On hour 13:00 projectA gets 5 jobs scheduled, with queueA ids (in sqlite) 1,2,3,4,5

By hour 13:05, projectA's job with queueA id 1 has finished and 2, 3 are running
while projectB gets 5 jobs scheduled, with queueB id 1,2,3,4,5

By hour 13:10, projectA's job with queueA id 2 finishes
and projectB job with queueB id 1 is starting.

projectA job with queue id 3, although scheduled at 13:00,
will have to wait for projectB queue id 1 and 2,
which were scheduled 10 minutes later

What are the names of these projects and the priorities of each job?

my8100 · 2019-06-27T11:12:23Z

Even though for p, q in iteritems(self.queues): would violate FIFO when polling multiple projects,
we can still use the priority parameter to adjust the polling order
as long as we figure out the project with the highest priority of all queues.

Maybe we can/should make it in v1.3.0, which provides backward compatibility.

Check out my fourth commit, which saves the highest priority of each project in project_priority_map,
so that there's no need to iterate all the queues.

my8100 · 2019-06-27T15:26:53Z

Changes in the fifth commit:

A new column named 'insert_time' is add when creating tables in JsonSqlitePriorityQueue:
insert_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP
The table name in SqliteSpiderQueue now defaults to 'spider_queue_insert_time',
in order to avoid such error when selecting the non-existing 'insert_time' column in the existing db file:
sqlite3.OperationalError: no such column: insert_time
project_priority_map now stores (priority, -timestamp) as value
so that the priority is taken first and the FIFO principle is also respected. (see test_poll_next())
{'project1': (0.0, -1561646348), 'project2': (1.0, -1561646349)}
Backward compatibility is still provided.

Digenis · 2019-06-27T19:26:48Z

Even if it's backwards compatible with user code, it's not with user data in the queue.

If we break compatibility, it better be once and better be worthy.

But I also don't want to let users fix it on their own.
It's annoying because most of their work would be boilerplate stuff.
Let's do something similar to what scrapy did with scrapy.contrib.
We can add a sub-package with these sample fixes
and users would only need to change configuration.
Any idea for a name?

my8100 · 2019-06-27T20:22:27Z

Even if it's backwards compatible with user code,
it's not with user data in the queue.

But it only ignores the pending jobs on the first startup,
which may not exist at all for most cases.

Any idea for a name?

Simply use the same name ‘contrib’?
(I still don’t know how it works yet)
Or are you asking for a name of the new configuration option?

my8100 · 2019-06-27T20:32:27Z

What about inserting a new column into the existing table via ALTER TABLE?

my8100 · 2019-06-28T07:15:53Z

The 6th commit introduces the ensure_insert_time_column method
to add a new column named 'insert_time'
into the table of the existing db files if needed.
Also, fill in the 'insert_time' column for any pending jobs inside the table.

my8100 · 2019-06-30T11:22:18Z

@Digenis
How about the 6th commit?

Digenis · 2019-07-01T05:49:34Z

I'm a bit busy right now

I think I'll be able to review it in the next 2 days

Digenis · 2019-07-02T20:00:42Z

scrapyd/sqlite.py

@@ -144,6 +147,24 @@ def clear(self):
        self.conn.commit()
        self.update_project_priority_map()

+    def ensure_insert_time_column(self):
+        q = "SELECT insert_time FROM %s LIMIT 1" % self.table


There's also:

SELECT sql FROM sqlite_master WHERE type='table' AND name='spider_queue'; -- ⇒ CREATE TABLE spider_queue (id integer primary key, priority real key, message blob)

but is it any better?

I didn't know about 'sqlite_master' before.
The ensure_insert_time_column method is updated in the 7th commit.

Digenis · 2019-07-02T20:20:28Z

scrapyd/sqlite.py

@@ -82,7 +83,7 @@ class JsonSqlitePriorityQueue(object):
    """SQLite priority queue. It relies on SQLite concurrency support for
    providing atomic inter-process operations.
    """
-    queue_priority_map = {}
+    project_priority_map = {}


This is like "master table" solution I was talking about
actually implemented as a singleton to share state between all instances of the queue class
and save us lot of io and cpu cycles,
right?

It's fast to find out the queue to pop in poll()
since project_priority_map is a dict like:
{'project1': (0.0, -1561646348), 'project2': (1.0, -1561646349)}

So, there's no need to introduce an actual "master table".

Digenis · 2019-07-02T20:25:05Z

scrapyd/sqlite.py

@@ -131,6 +131,12 @@ def clear(self):
        self.conn.execute("delete from %s" % self.table)
        self.conn.commit()

+    def get_highest_priority(self):
+        q = "select priority from %s order by priority desc limit 1" \


but since additional selects are ran for almost all queue method calls
we are not saving cpu cycles or io

I introduce SQLite triggers in the 8th commit,
in which project_priority_map would be updated
whenever an INSERT/UPDATE/DELETE occurs.

This is intended to speed up poll() and avoid race conditions,
rather than to save CPU cycles or io.
I think the cost is acceptable.

Digenis · 2019-07-02T20:33:23Z

Comments are to be red in chronological order (not commit order).

I feel reluctant to even make comments.
You do understand that this will most probably only be a workaround, right?
In a contrib/ subpackage, I wouldn't mind merging it even in 1.2
but in 1.3 I'd rather see FIFO queues taking turns in round robin
and fixing the strategy in 1.4
instead of making code with so many workarounds the default in 1.3.

Remove unnecessary getattr statement Add queue_priority_map to save highest priority Save the highest priority of each project in project_priority_map project_priority_map stores (priority, -timestamp) as value Add ensure_insert_time_column() Query sqlite_master in ensure_insert_time_column() Introduce create_triggers()

codecov-io · 2019-07-03T15:03:09Z

Codecov Report

Merging #344 into master will increase coverage by 0.61%.
The diff coverage is 80.76%.

@@            Coverage Diff             @@
##           master     #344      +/-   ##
==========================================
+ Coverage   68.37%   68.98%   +0.61%     
==========================================
  Files          17       17              
  Lines         860      906      +46     
  Branches      104      116      +12     
==========================================
+ Hits          588      625      +37     
- Misses        242      247       +5     
- Partials       30       34       +4

Impacted Files	Coverage Δ
scrapyd/sqlite.py	`90.55% <100%> (+1.66%)`	⬆️
scrapyd/spiderqueue.py	`96% <100%> (+0.76%)`	⬆️
scrapyd/poller.py	`75% <62.96%> (-11.21%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ff7c1c...53a619b. Read the comment docs.

my8100 · 2019-07-03T15:31:21Z

I feel reluctant to even make comments.

If so, how can we go on discussing?

You do understand that this will most probably only be a workaround, right?

No, this PR aims to fix #187 in a creative and effective way:

The priority of a job is taken first and the FIFO principle is also respected for multiple projects,
and the effect is verified in the new test_poll_next().
There's no delay in poll() as project_priority_map is updated by SQLite triggers.
No change to the current 'multiple queues' implementation.
Backward compatibility for custom modules is provided.
Backward compatibility for user data is provided.
It's easy to remove the workarounds in v1.4.

In a contrib/ subpackage, I wouldn't mind merging it even in 1.2
but in 1.3 I'd rather see FIFO queues taking turns in round robin
and fixing the strategy in 1.4
instead of making code with so many workarounds the default in 1.3.

I think we should fix #187 in v1.3 as the 'priority' parameter is exposed in PR #161,
and merge PR #343 in v1.4 while removing backward compatibility.

my8100 · 2019-07-04T04:24:44Z

The queue table name is changed from 'spider_queue' to 'spider_queue_with_triggers'.

This PR introduces SQLite triggers which are stored in the database,
Users would encounter a problem if they downgrade Scrapyd from v1.3 to v1.2:
sqlite3.OperationalError: no such function: update_project_priority_map

Renaming the table name ensures upgrade and downgrade work well anytime,
and the pending jobs could be retrieved if needed.

jpmckinney · 2024-07-18T22:47:56Z

As Digenis wrote, if we try for backwards compatibility, then "It's dirty fixes all the way from here"

This solution is very clever, but probably too clever for a project that only gets maintainer attention every few years. It takes too long to understand how it works and how to modify it – compared to a simpler solution (that breaks compatibility).

So, I'll close this PR, though the test updates might be useful in future work.

Edit: Unrelated to closure, but I think the last commit breaks backwards compatibility for user data (code starts using a new table without migrating any of the data from the old table).

Digenis added type: bug type: discussion type: enhancement and removed type: enhancement labels Jun 26, 2019

my8100 changed the title ~~Fix polling order by comparing priority of all queues~~ Poll jobs by comparing both priority and insert_time Jun 28, 2019

my8100 requested a review from Digenis June 29, 2019 07:55

Digenis reviewed Jul 2, 2019

View reviewed changes

my8100 force-pushed the poll_with_priority branch from 200066b to 1a0cb2b Compare July 3, 2019 14:35

Rename queue table for upgrade and downgrade need

53a619b

my8100 mentioned this pull request Jul 18, 2019

Fix polling order in contrib modules #349

Closed

Digenis added the topic: scheduling label Apr 13, 2021

jpmckinney mentioned this pull request Jul 15, 2024

Use a single spider queue, for deterministic priority across all projects #187

Open

jpmckinney modified the milestone: Priority Jul 17, 2024

jpmckinney closed this Jul 18, 2024

jpmckinney added pr: abandoned for unmerged PRs that were abandoned status: invalid not a Scrapyd issue, an accidental PR or an improper approach and removed type: bug type: discussion topic: scheduling pr: abandoned for unmerged PRs that were abandoned labels Jul 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Poll jobs by comparing both priority and insert_time #344

Poll jobs by comparing both priority and insert_time #344

my8100 commented Jun 25, 2019 •

edited

Loading

codecov bot commented Jun 25, 2019

codecov bot commented Jun 25, 2019 •

edited

Loading

my8100 commented Jun 26, 2019

Digenis commented Jun 26, 2019

Digenis commented Jun 26, 2019

my8100 commented Jun 26, 2019 •

edited

Loading

my8100 commented Jun 26, 2019

Digenis commented Jun 26, 2019 •

edited

Loading

my8100 commented Jun 26, 2019

my8100 commented Jun 26, 2019

Digenis commented Jun 26, 2019 •

edited

Loading

Digenis commented Jun 26, 2019 •

edited

Loading

Digenis commented Jun 26, 2019 •

edited

Loading

my8100 commented Jun 27, 2019

my8100 commented Jun 27, 2019

my8100 commented Jun 27, 2019 •

edited

Loading

my8100 commented Jun 27, 2019 •

edited

Loading

Digenis commented Jun 27, 2019 •

edited

Loading

my8100 commented Jun 27, 2019 •

edited

Loading

my8100 commented Jun 27, 2019

my8100 commented Jun 28, 2019

my8100 commented Jun 30, 2019

Digenis commented Jul 1, 2019

Digenis Jul 2, 2019

my8100 Jul 3, 2019

Digenis Jul 2, 2019

my8100 Jul 3, 2019 •

edited

Loading

Digenis Jul 2, 2019

my8100 Jul 3, 2019

Digenis commented Jul 2, 2019

codecov-io commented Jul 3, 2019 •

edited by codecov bot

Loading

my8100 commented Jul 3, 2019 •

edited

Loading

my8100 commented Jul 4, 2019

jpmckinney commented Jul 18, 2024 •

edited

Loading

Poll jobs by comparing both priority and insert_time #344

Poll jobs by comparing both priority and insert_time #344

Conversation

my8100 commented Jun 25, 2019 • edited Loading

codecov bot commented Jun 25, 2019

Codecov Report

codecov bot commented Jun 25, 2019 • edited Loading

Codecov Report

my8100 commented Jun 26, 2019

Digenis commented Jun 26, 2019

Digenis commented Jun 26, 2019

my8100 commented Jun 26, 2019 • edited Loading

my8100 commented Jun 26, 2019

Digenis commented Jun 26, 2019 • edited Loading

my8100 commented Jun 26, 2019

my8100 commented Jun 26, 2019

Digenis commented Jun 26, 2019 • edited Loading

Digenis commented Jun 26, 2019 • edited Loading

Digenis commented Jun 26, 2019 • edited Loading

my8100 commented Jun 27, 2019

my8100 commented Jun 27, 2019

my8100 commented Jun 27, 2019 • edited Loading

my8100 commented Jun 27, 2019 • edited Loading

Digenis commented Jun 27, 2019 • edited Loading

my8100 commented Jun 27, 2019 • edited Loading

my8100 commented Jun 27, 2019

my8100 commented Jun 28, 2019

my8100 commented Jun 30, 2019

Digenis commented Jul 1, 2019

Digenis Jul 2, 2019

Choose a reason for hiding this comment

my8100 Jul 3, 2019

Choose a reason for hiding this comment

Digenis Jul 2, 2019

Choose a reason for hiding this comment

my8100 Jul 3, 2019 • edited Loading

Choose a reason for hiding this comment

Digenis Jul 2, 2019

Choose a reason for hiding this comment

my8100 Jul 3, 2019

Choose a reason for hiding this comment

Digenis commented Jul 2, 2019

codecov-io commented Jul 3, 2019 • edited by codecov bot Loading

Codecov Report

my8100 commented Jul 3, 2019 • edited Loading

my8100 commented Jul 4, 2019

jpmckinney commented Jul 18, 2024 • edited Loading

my8100 commented Jun 25, 2019 •

edited

Loading

codecov bot commented Jun 25, 2019 •

edited

Loading

my8100 commented Jun 26, 2019 •

edited

Loading

Digenis commented Jun 26, 2019 •

edited

Loading

Digenis commented Jun 26, 2019 •

edited

Loading

Digenis commented Jun 26, 2019 •

edited

Loading

Digenis commented Jun 26, 2019 •

edited

Loading

my8100 commented Jun 27, 2019 •

edited

Loading

my8100 commented Jun 27, 2019 •

edited

Loading

Digenis commented Jun 27, 2019 •

edited

Loading

my8100 commented Jun 27, 2019 •

edited

Loading

my8100 Jul 3, 2019 •

edited

Loading

codecov-io commented Jul 3, 2019 •

edited by codecov bot

Loading

my8100 commented Jul 3, 2019 •

edited

Loading

jpmckinney commented Jul 18, 2024 •

edited

Loading