feat: creating multiple unnamed queues #2752

barjin · 2024-11-22T14:05:02Z

There are often reasons to use multiple separate request queues (RQs) in one Crawlee project (e.g., having a CheerioCrawler for processing most of the pages and a separate keep-alive PlaywrightCrawler instance for processing some specific pages the first crawler finds).

Supporting this use case without both crawlers reading the same queue is currently only possible with named queues (e.g., RequestQueue.open('playwrightQueue')).

The named queues, however, don't get purged with a new script run, so in any subsequent run, the PlaywrightCrawler might skip some requests (due to the implicit RQ request deduplication). This forces users to run the script with rm -rf ./storage && npm start, or similar "hacks".

import { CheerioCrawler, RequestQueue } from 'crawlee';

// open a secondary (named) queue
const secondaryRQ = await RequestQueue.open('Bqueue');

const crawlerA = new CheerioCrawler({
  // use the default queue with crawlerA, and add requests to the secondary queue
  requestHandler: async ({ request }) => {
    console.log(`[A] ${request.url}`);

    await secondaryRQ.addRequest({ url: request.url });
  }
});

const crawlerB = new CheerioCrawler({
  // consume the secondary queue
  requestQueue: secondaryRQ,
  requestHandler: ({ request }) => {
    console.log(`[B] ${request.url}`);
  },
});

await crawlerA.run(['http://example.com']);
await crawlerB.run();

Repeated runs yield different results:

$ npx tsx ./a.ts 

[A] http://example.com
INFO  CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
[B] http://example.com
INFO  CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}

$ npx tsx ./a.ts 

[A] http://example.com
INFO  CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
INFO  CheerioCrawler: Finished! Total 0 requests: 0 succeeded, 0 failed. {"terminal":true}
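
To make the cause of the second run's empty result explicit, here is a minimal in-memory sketch of the behavior described above. It is a simplified model, not Crawlee code: `ModelQueue` and `runScript` are hypothetical names, the URL stands in for the request's unique key, and "purged on start" is modeled by an explicit `purge()` call on the default queue only.

```typescript
// Simplified model (not Crawlee code): a queue that deduplicates by URL.
class ModelQueue {
    private seen = new Set<string>();
    private pending: string[] = [];

    // Returns true if the request was enqueued, false if it was deduplicated.
    addRequest(url: string): boolean {
        if (this.seen.has(url)) return false;
        this.seen.add(url);
        this.pending.push(url);
        return true;
    }

    // Returns everything still pending and empties the pending list.
    drain(): string[] {
        const out = this.pending;
        this.pending = [];
        return out;
    }

    // Forgets all state, like the default queue at the start of a run.
    purge(): void {
        this.seen.clear();
        this.pending = [];
    }
}

const defaultQueue = new ModelQueue();
const namedQueue = new ModelQueue(); // stands in for RequestQueue.open('Bqueue')

function runScript(): { a: string[]; b: string[] } {
    defaultQueue.purge(); // only the default queue is purged on start
    defaultQueue.addRequest('http://example.com');
    const a = defaultQueue.drain(); // what crawler A processes
    for (const url of a) namedQueue.addRequest(url);
    const b = namedQueue.drain(); // what crawler B processes
    return { a, b };
}

const firstRun = runScript();
const secondRun = runScript();
console.log(firstRun.b.length, secondRun.b.length); // prints "1 0"
```

In this model, crawler B sees the request in the first run only, because the named queue still remembers it in the second run.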

Moreover, the Apify API supports creating multiple unnamed queues. The named queue solution is even more problematic on the Apify Platform, since named queues created by Apify Actors are stored indefinitely on the user's account, causing users to spend credits on storage, often unknowingly.


janbuchar · 2024-11-25T08:41:44Z

Thanks for opening this! We also talked about how we need to remember IDs of non-default unnamed queues between migrations.

My first idea would be an API like await RequestQueue.openTemporary("some-name"). On memory storage, we'd simply map this to storage/request_queues/__tmp_some-name, for example, and we'd remove this on start just like we do with default. On Apify, we'd have to keep a mapping of storage name => storage id in the key-value store to preserve the storages. Apart from that, it should make no difference.

The same could apply to all three storage types, not just request queues.
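
A rough sketch of the "name => storage id" mapping idea on the platform side, to show why it would survive a migration. Everything here is hypothetical: `TemporaryQueueRegistry`, `FakePlatform`, and the `TEMPORARY_QUEUE_IDS` key are made-up names for illustration, and a plain `Map` stands in for the default key-value store. None of it is an existing Crawlee or Apify API.

```typescript
// Hypothetical sketch: none of these names are Crawlee or Apify APIs.
type KeyValueStoreLike = Map<string, Record<string, string>>;

const MAPPING_KEY = 'TEMPORARY_QUEUE_IDS'; // assumed key name

// Stand-in for the platform call that creates an unnamed queue and returns its id.
class FakePlatform {
    private counter = 0;

    createUnnamedQueue(): string {
        this.counter += 1;
        return `queue-${this.counter}`;
    }
}

class TemporaryQueueRegistry {
    constructor(
        private readonly kvs: KeyValueStoreLike,
        private readonly platform: FakePlatform,
    ) {}

    // Returns the id of the unnamed queue registered under `name`,
    // creating the queue and persisting the mapping on first use.
    openTemporary(name: string): string {
        const mapping: Record<string, string> = { ...(this.kvs.get(MAPPING_KEY) ?? {}) };
        const existing: string | undefined = mapping[name];
        if (existing !== undefined) return existing;

        const id = this.platform.createUnnamedQueue();
        mapping[name] = id;
        this.kvs.set(MAPPING_KEY, mapping);
        return id;
    }
}

const kvs: KeyValueStoreLike = new Map();
const platform = new FakePlatform();

const beforeMigration = new TemporaryQueueRegistry(kvs, platform);
const idA = beforeMigration.openTemporary('some-name');

// Simulated migration: a fresh registry instance, but the same key-value store.
const afterMigration = new TemporaryQueueRegistry(kvs, platform);
const idB = afterMigration.openTemporary('some-name');
const idOther = afterMigration.openTemporary('other-name');

console.log(idA === idB, idA === idOther); // prints "true false"
```

The point of the sketch is only that the mapping lives in the store rather than in process memory, so a restarted process resolves the same temporary name to the same unnamed queue.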

barjin · 2024-11-25T09:33:56Z

Thanks for the points!

> await RequestQueue.openTemporary("some-name")

I like it, apart from the fact that this would make RequestQueue.openTemporary("default") and RequestQueue.open() (and RequestQueue.open(null)) equal... I'm not sure if it's a bad thing; right now, I think it might be confusing.

> On Apify, we'd have to keep a mapping of storage name => storage id in the KVS

Yeah, that's the one part I don't really like (to open an unnamed KVS, you'd first need to open the default unnamed KVS), but I'm afraid there is no way around it.

janbuchar · 2024-11-25T10:30:25Z

> await RequestQueue.openTemporary("some-name")
>
> I like it, apart from the fact that this would make RequestQueue.openTemporary("default") and RequestQueue.open() (and RequestQueue.open(null)) equal... I'm not sure if it's a bad thing; right now, I think it might be confusing.

Yeah, the storage thing is pretty confusing as a whole. To boot, the name "default" has special meaning in memory storage (which is filesystem-backed, obviously), but not on Apify. So, maybe we could just disable RequestQueue.openTemporary("default") with throw new Error("You don't want this, trust me bro").

Also, I'm not married to the name openTemporary, I'm sure we could come up with something better.
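
For the memory-storage side, the naming and the "default" guard could look roughly like the sketch below. The function names and the `__tmp_` prefix constant are assumptions for illustration (the prefix follows the path suggested earlier in this thread); this is not how Crawlee's memory storage is actually implemented.

```typescript
// Hypothetical sketch of the memory-storage naming; not actual Crawlee internals.
const TEMP_PREFIX = '__tmp_';

// Maps a temporary queue name to its directory, rejecting the reserved name.
function temporaryQueueDir(name: string): string {
    if (name === 'default') {
        throw new Error('"default" is reserved for the default queue; pick another temporary name.');
    }
    return `storage/request_queues/${TEMP_PREFIX}${name}`;
}

// Decides whether a queue directory should be removed when a new run starts:
// the default queue and temporary queues are purged, named queues are kept.
function shouldPurgeOnStart(dirName: string): boolean {
    return dirName === 'default' || dirName.startsWith(TEMP_PREFIX);
}

console.log(temporaryQueueDir('some-name')); // prints "storage/request_queues/__tmp_some-name"
console.log(shouldPurgeOnStart('__tmp_some-name'), shouldPurgeOnStart('Bqueue')); // prints "true false"
```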

> On Apify, we'd have to keep a mapping of storage name => storage id in the KVS
>
> Yeah, that's the one part I don't really like (to open an unnamed KVS, you'd first need to open the default unnamed KVS), but I'm afraid there is no way around it.

I mean, we already persist the state of multiple random components into the default key-value store, so I have no issue with that.

barjin added the feature label Nov 22, 2024

github-actions bot added the t-tooling label Nov 22, 2024

B4nan added this to the 4.0 milestone Nov 25, 2024
