feat: creating multiple unnamed queues #2752

Open
barjin opened this issue Nov 22, 2024 · 3 comments
Labels
feature (issues that represent new features or improvements to existing features), t-tooling (issues owned by the tooling team)
barjin commented Nov 22, 2024

There are often reasons to use multiple separate request queues (RQs) in one Crawlee project (e.g., having a CheerioCrawler for processing most of the pages and a separate keep-alive PlaywrightCrawler instance for processing some specific pages the first crawler finds).

Supporting this use case without both crawlers reading the same queue is currently possible only with named queues (e.g., RequestQueue.open('playwrightQueue')).

Named queues, however, are not purged at the start of a new script run, so in any subsequent run, the PlaywrightCrawler might skip some requests (due to the implicit RQ request deduplication). This forces users to run the script with rm -rf ./storage && npm start, or similar "hacks".

import { CheerioCrawler, RequestQueue } from 'crawlee';

// open a secondary (named) queue
const secondaryRQ = await RequestQueue.open('Bqueue');

const crawlerA = new CheerioCrawler({
  // use the default queue with crawlerA, and add requests to the secondary queue
  requestHandler: async ({ request }) => {
    console.log(`[A] ${request.url}`);

    await secondaryRQ.addRequest({ url: request.url });
  }
});

const crawlerB = new CheerioCrawler({
  // consume the secondary queue
  requestQueue: secondaryRQ,
  requestHandler: ({ request }) => {
    console.log(`[B] ${request.url}`);
  },
});

await crawlerA.run(['http://example.com']);
await crawlerB.run();

Repeated runs yield different results:

$ npx tsx ./a.ts 

[A] http://example.com
INFO  CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
[B] http://example.com
INFO  CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}

$ npx tsx ./a.ts 

[A] http://example.com
INFO  CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
INFO  CheerioCrawler: Finished! Total 0 requests: 0 succeeded, 0 failed. {"terminal":true}
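The difference between the two runs can be reproduced with a toy model of a queue whose deduplication state survives between runs. This is only a sketch of the mechanism, not Crawlee's implementation: `ToyPersistentQueue`, `simulateRun`, and the `disk` map are hypothetical names, and the unique key is simplified to the raw URL.

```typescript
// Toy model of a named request queue whose state is NOT purged between runs.
// Hypothetical names throughout; this is not how Crawlee stores requests.
type AddResult = { wasAlreadyPresent: boolean };

class ToyPersistentQueue {
  // uniqueKey -> handled flag; stands in for the on-disk state of a named queue
  private readonly store: Map<string, boolean>;

  constructor(store: Map<string, boolean>) {
    this.store = store;
  }

  addRequest(url: string): AddResult {
    const uniqueKey = url; // simplification: the unique key is the raw URL
    if (this.store.has(uniqueKey)) {
      // Deduplication: a key seen in an earlier run is silently skipped.
      return { wasAlreadyPresent: true };
    }
    this.store.set(uniqueKey, false);
    return { wasAlreadyPresent: false };
  }

  // Handle every pending request once and return how many were handled.
  drain(): number {
    let handled = 0;
    for (const [key, done] of this.store) {
      if (!done) {
        this.store.set(key, true);
        handled += 1;
      }
    }
    return handled;
  }
}

// One "script run": crawler A enqueues a URL, crawler B drains the queue.
function simulateRun(persisted: Map<string, boolean>): number {
  const queue = new ToyPersistentQueue(persisted);
  queue.addRequest('http://example.com');
  return queue.drain();
}

const disk = new Map<string, boolean>(); // survives across simulated runs
const firstRun = simulateRun(disk);
const secondRun = simulateRun(disk);
console.log(firstRun, secondRun); // 1 0
```

Passing a fresh `Map` to each `simulateRun` call models what purging does for the default queue: every run handles one request.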

Moreover, the Apify API supports creating multiple unnamed queues. The named-queue solution is even more problematic on the Apify Platform, since named queues created by Apify Actors are stored indefinitely on the user's account, so users end up spending credits on storage, often unknowingly.

janbuchar commented

Thanks for opening this! We also talked about how we need to remember IDs of non-default unnamed queues between migrations.

My first idea would be an API like await RequestQueue.openTemporary("some-name"). On memory storage, we'd simply map this to storage/request_queues/__tmp_some-name, for example, and we'd remove this on start just like we do with default. On Apify, we'd have to keep a mapping of storage name => storage id in the key-value store to preserve the storages. Apart from that, it should make no difference.

The same could apply to all three storage types, not just request queues.
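The name-to-ID mapping for the Apify case could be sketched roughly as follows. Everything here is an assumption made for illustration: `KeyValueStoreLike`, `QueueApiLike`, `resolveTemporaryQueueId`, and the `TEMPORARY_QUEUE_IDS` key are hypothetical and are not part of Crawlee or the Apify SDK. In-memory fakes stand in for the default key-value store and the platform API.

```typescript
// Hypothetical sketch: none of these types or functions exist in Crawlee or the Apify SDK.
interface KeyValueStoreLike {
  getValue(key: string): Promise<unknown>;
  setValue(key: string, value: unknown): Promise<void>;
}

interface QueueApiLike {
  // Creates a new unnamed queue and resolves to its ID.
  createUnnamedQueue(): Promise<string>;
}

const MAPPING_KEY = 'TEMPORARY_QUEUE_IDS'; // assumed key name in the default KVS

// Returns the ID stored for `name`, creating and recording a new unnamed queue if none exists.
async function resolveTemporaryQueueId(
  name: string,
  kvs: KeyValueStoreLike,
  api: QueueApiLike,
): Promise<string> {
  const stored = (await kvs.getValue(MAPPING_KEY)) as Record<string, string> | null | undefined;
  const mapping: Record<string, string> = { ...(stored ?? {}) };

  const existing = mapping[name];
  if (existing !== undefined) return existing;

  const id = await api.createUnnamedQueue();
  mapping[name] = id;
  await kvs.setValue(MAPPING_KEY, mapping);
  return id;
}

// In-memory fakes, used only to make the sketch runnable.
function createFakeKvs(): KeyValueStoreLike {
  const data = new Map<string, unknown>();
  return {
    async getValue(key) { return data.get(key) ?? null; },
    async setValue(key, value) { data.set(key, value); },
  };
}

function createFakeApi(): QueueApiLike {
  let counter = 0;
  return {
    async createUnnamedQueue() {
      counter += 1;
      return `queue-${counter}`;
    },
  };
}

async function demo(): Promise<string[]> {
  const kvs = createFakeKvs();
  const api = createFakeApi();
  const beforeMigration = await resolveTemporaryQueueId('some-name', kvs, api);
  // After a migration the process restarts, but the default KVS content persists.
  const afterMigration = await resolveTemporaryQueueId('some-name', kvs, api);
  const other = await resolveTemporaryQueueId('other-name', kvs, api);
  return [beforeMigration, afterMigration, other];
}

demo().then((ids) => console.log(ids));
```

The sketch ignores concurrent writers to the mapping; a real implementation would likely need to handle two components resolving the same name at once.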

barjin commented Nov 25, 2024

Thanks for the points!

> await RequestQueue.openTemporary("some-name")

I like it, apart from the fact that this would make RequestQueue.openTemporary("default") and RequestQueue.open() (and RequestQueue.open(null)) equal... I'm not sure if it's a bad thing; right now, I think it might be confusing.

> On Apify, we'd have to keep a mapping of storage name => storage id in the KVS

Yeah, that's the one part I don't really like (to open an unnamed KVS, you'd first need to open the default unnamed KVS), but I'm afraid there is no way around it.

janbuchar commented Nov 25, 2024

> > await RequestQueue.openTemporary("some-name")
>
> I like it, apart from the fact that this would make RequestQueue.openTemporary("default") and RequestQueue.open() (and RequestQueue.open(null)) equal... I'm not sure if it's a bad thing; right now, I think it might be confusing.

Yeah, the storage thing is pretty confusing as a whole. To boot, the name "default" has special meaning in memory storage (which is filesystem-backed, obviously), but not on Apify. So, maybe we could just disable RequestQueue.openTemporary("default") with throw new Error("You don't want this, trust me bro").

Also, I'm not married to the name openTemporary, I'm sure we could come up with something better.
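A guard against the reserved name, combined with the `__tmp_` directory convention from the earlier comment, might look like the sketch below. `temporaryQueueDir`, `RESERVED_NAMES`, and the error message are hypothetical; `openTemporary` is only a proposal in this thread and this is not an existing Crawlee API.

```typescript
// Hypothetical helper for the memory-storage case; not an existing Crawlee API.
const RESERVED_NAMES = new Set(['default']);
const TMP_PREFIX = '__tmp_'; // prefix taken from the example path in this thread

// Maps a temporary queue name to its directory, rejecting reserved names.
function temporaryQueueDir(name: string, storageRoot = 'storage'): string {
  if (RESERVED_NAMES.has(name)) {
    throw new Error(`"${name}" is reserved; use RequestQueue.open() for the default queue.`);
  }
  return `${storageRoot}/request_queues/${TMP_PREFIX}${name}`;
}

console.log(temporaryQueueDir('some-name')); // storage/request_queues/__tmp_some-name
```

Rejecting the name up front avoids the ambiguity between a temporary queue called "default" and the unnamed default queue.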

> > On Apify, we'd have to keep a mapping of storage name => storage id in the KVS
>
> Yeah, that's the one part I don't really like (to open an unnamed KVS, you'd first need to open the default unnamed KVS), but I'm afraid there is no way around it.

I mean, we already persist the state of multiple random components into the default key-value store, so I have no issue with that.

B4nan added this to the 4.0 milestone Nov 25, 2024