barjin opened this issue Nov 22, 2024 · 3 comments
Labels
feature: Issues that represent new features or improvements to existing features.
t-tooling: Issues with this label are in the ownership of the tooling team.
There are often reasons to use several separate request queues (RQs) in one Crawlee project (e.g., having CheerioCrawler for processing most of the pages and a separate keep-alive PlaywrightCrawler instance for processing some specific pages the first crawler finds).
Supporting this use case without both crawlers reading the same queue is now possible only with named queues (e.g., RequestQueue.open('playwrightQueue')).
The named queues, however, don't get purged with a new script run, so in any subsequent run, the PlaywrightCrawler might skip some requests (due to the implicit RQ request deduplication). This forces users to run the script with rm -rf ./storage && npm start, or similar "hacks".
import { CheerioCrawler, RequestQueue } from 'crawlee';

// open a secondary queue
const secondaryRQ = await RequestQueue.open('Bqueue');

const crawlerA = new CheerioCrawler({
    // use the default queue with crawlerA, and add requests to the secondary queue
    requestHandler: async ({ request }) => {
        console.log(`[A] ${request.url}`);
        await secondaryRQ.addRequest({ url: request.url });
    },
});

const crawlerB = new CheerioCrawler({
    // consume the secondary queue
    requestQueue: secondaryRQ,
    requestHandler: ({ request }) => {
        console.log(`[B] ${request.url}`);
    },
});

await crawlerA.run(['http://example.com']);
await crawlerB.run();
Repeated runs yield different results:
$ npx tsx ./a.ts
[A] http://example.com
INFO CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
[B] http://example.com
INFO CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
$ npx tsx ./a.ts
[A] http://example.com
INFO CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
INFO CheerioCrawler: Finished! Total 0 requests: 0 succeeded, 0 failed. {"terminal":true}
Moreover, the Apify API supports creating multiple unnamed queues. The named queue solution is even more problematic on the Apify Platform since the named queues created by Apify Actors are stored indefinitely on the user's account, causing the users to spend credits on storage, often unknowingly.
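To make the skipped request in the second run easier to see, here is a minimal in-memory model of the behaviour. It is only a sketch and not Crawlee's implementation: `ModelQueue`, `persistedQueues`, `openNamed`, and `runOnce` are hypothetical names, the `Map` stands in for the `./storage` directory that survives between runs, and the URL itself is used as the unique key as a simplification.

```typescript
// Simplified in-memory model of a named request queue; NOT Crawlee's implementation.
class ModelQueue {
    // Unique keys of every request ever added, whether already handled or not.
    private readonly seen = new Set<string>();
    private readonly pending: string[] = [];

    // Returns true if the request was enqueued, false if it was deduplicated.
    addRequest(url: string): boolean {
        if (this.seen.has(url)) return false;
        this.seen.add(url);
        this.pending.push(url);
        return true;
    }

    // Hands out the next pending request, or undefined once the queue is drained.
    fetchNextRequest(): string | undefined {
        return this.pending.shift();
    }
}

// Stand-in for ./storage: named queues are never purged between runs.
const persistedQueues = new Map<string, ModelQueue>();

function openNamed(name: string): ModelQueue {
    let queue = persistedQueues.get(name);
    if (!queue) {
        queue = new ModelQueue();
        persistedQueues.set(name, queue);
    }
    return queue;
}

// One "script run": crawler A enqueues the URL, crawler B drains the named queue.
function runOnce(url: string): string[] {
    const queue = openNamed('Bqueue');
    queue.addRequest(url);
    const handledByB: string[] = [];
    let next: string | undefined;
    while ((next = queue.fetchNextRequest()) !== undefined) {
        handledByB.push(next);
    }
    return handledByB;
}

const firstRun = runOnce('http://example.com');
const secondRun = runOnce('http://example.com');
console.log(firstRun.length, secondRun.length); // prints "1 0"
```

In this model the second run handles nothing because the key from the first run is still recorded, which mirrors the "Total 0 requests" line in the log above.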
barjin added the feature label Nov 22, 2024
Thanks for opening this! We also talked about how we need to remember IDs of non-default unnamed queues between migrations.
My first idea would be an API like await RequestQueue.openTemporary("some-name"). On memory storage, we'd simply map this to storage/request_queues/__tmp_some-name, for example, and we'd remove this on start just like we do with default. On Apify, we'd have to keep a mapping of storage name => storage id in the key-value store to preserve the storages. Apart from that, it should make no difference.
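A rough sketch of the two halves of this idea, under stated assumptions: `openTemporary` does not exist in Crawlee today, and every name below (`TMP_PREFIX`, `temporaryQueueDir`, `dirsToPurge`, `resolveTemporaryQueueId`) is hypothetical. The `Map` stands in for a record kept in the default key-value store, and `createUnnamedQueue` stands in for whatever call would create an unnamed queue on the platform.

```typescript
// Hypothetical prefix, taken from the example path in the comment above.
const TMP_PREFIX = '__tmp_';

// Memory storage case: derive the directory for a temporary queue.
function temporaryQueueDir(name: string, storageRoot = 'storage'): string {
    return `${storageRoot}/request_queues/${TMP_PREFIX}${name}`;
}

// Pick the queue directories a purge-on-start step would remove:
// the default one, plus anything carrying the temporary prefix.
function dirsToPurge(existingDirs: string[]): string[] {
    return existingDirs.filter((dir) => dir === 'default' || dir.startsWith(TMP_PREFIX));
}

// Platform case: remember name => storage id so that a migrated run
// reopens the same unnamed queue instead of creating a fresh one.
function resolveTemporaryQueueId(
    name: string,
    mapping: Map<string, string>,
    createUnnamedQueue: () => string,
): string {
    const known = mapping.get(name);
    if (known !== undefined) return known;
    const id = createUnnamedQueue();
    mapping.set(name, id);
    return id;
}

// Usage: the second lookup reuses the stored id and creates nothing new.
const mapping = new Map<string, string>();
let created = 0;
const createUnnamedQueue = () => `queue-${++created}`;
const idBeforeMigration = resolveTemporaryQueueId('some-name', mapping, createUnnamedQueue);
const idAfterMigration = resolveTemporaryQueueId('some-name', mapping, createUnnamedQueue);
console.log(temporaryQueueDir('some-name'), idBeforeMigration === idAfterMigration);
```

Named queues such as `Bqueue` are left alone by `dirsToPurge`, so existing behaviour for named storages would not change under this sketch.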
The same could apply to all three storage types, not just request queues.
I like it, apart from the fact that this would make RequestQueue.openTemporary("default") and RequestQueue.open() (and RequestQueue.open(null)) equal... I'm not sure if it's a bad thing; right now, I think it might be confusing.
> On Apify, we'd have to keep a mapping of storage name => storage id in the KVS
Yeah, that's the one part I don't really like (to open an unnamed KVS, you'd first need to open the default unnamed KVS), but I'm afraid there is no way around it.
> I like it, apart from the fact that this would make RequestQueue.openTemporary("default") and RequestQueue.open() (and RequestQueue.open(null)) equal... I'm not sure if it's a bad thing; right now, I think it might be confusing.
Yeah, the storage thing is pretty confusing as a whole. To boot, the name "default" has special meaning in memory storage (which is filesystem-backed, obviously), but not on Apify. So, maybe we could just disable RequestQueue.openTemporary("default") with throw new Error("You don't want this, trust me bro").
Also, I'm not married to the name openTemporary, I'm sure we could come up with something better.
>> On Apify, we'd have to keep a mapping of storage name => storage id in the KVS
> Yeah, that's the one part I don't really like (to open an unnamed KVS, you'd first need to open the default unnamed KVS), but I'm afraid there is no way around it.
I mean, we already persist the state of multiple random components into the default key-value store, so I have no issue with that.
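The guard against openTemporary("default") mentioned earlier in this comment could be sketched as below. This is only an illustration: `RESERVED_TEMPORARY_NAMES` and `assertTemporaryNameAllowed` are hypothetical names, the error message is a placeholder, and nothing here is an existing Crawlee API.

```typescript
// Names that a hypothetical openTemporary() would refuse, because they
// already carry a special meaning in memory storage.
const RESERVED_TEMPORARY_NAMES = new Set<string>(['default']);

// Throws early instead of silently aliasing the default storage.
function assertTemporaryNameAllowed(name: string): void {
    if (RESERVED_TEMPORARY_NAMES.has(name)) {
        throw new Error(
            `"${name}" is reserved for the default storage and cannot be opened as a temporary storage.`,
        );
    }
}

// Usage: a regular name passes, the reserved one is rejected.
assertTemporaryNameAllowed('some-name');
let rejected = false;
try {
    assertTemporaryNameAllowed('default');
} catch {
    rejected = true;
}
console.log(rejected); // prints "true"
```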