Crawler instances are not disposed #1670
Just for confirmation: below is an updated example that leaks enough memory to make it quickly observable. This will trigger a built-in memory leak warning at around 50 cycles.

WARNING: This will eat up your memory if you let it run too long!

```js
import os from "os"
import Crawlee from "crawlee"

// Simulate a long-running task, like a worker process
let i = 0
while (true) {
    await crawlerDisposalTest(i)
    i++
}

async function crawlerDisposalTest(i) {
    const navigationQueue = await Crawlee.RequestQueue.open()
    await navigationQueue.addRequest({ url: "https://google.de" })

    // For illustrative purposes only: give each instance a unique name
    // so it can be told apart in heap snapshots
    Object.defineProperty(Crawlee.PlaywrightCrawler, "name", {
        writable: true,
        value: `PlaywrightCrawler#${i}`
    })

    let crawler = new Crawlee.PlaywrightCrawler({
        requestQueue: navigationQueue,
        requestHandler: async ctx => { }
    })

    await crawler.run()
    await crawler.teardown()
    crawler = null

    console.log(`Available system memory: ${os.freemem()} bytes.`)
    await navigationQueue.drop()
}
```
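For what it's worth, the "built-in memory leak warning" mentioned above sounds like Node's `MaxListenersExceededWarning`, which fires once the number of listeners on a single emitter passes the configured limit. Here is a minimal sketch of that mechanism (plain Node, not Crawlee code; the event name and the raised limit are made up to mirror the ~50-cycle observation):

```js
import { EventEmitter } from "node:events";

// One never-removed listener per "cycle": once the count passes the
// limit, Node prints "MaxListenersExceededWarning: Possible
// EventEmitter memory leak detected".
const emitter = new EventEmitter();
emitter.setMaxListeners(50); // hypothetical limit, matching the ~50-cycle observation

for (let cycle = 0; cycle <= 50; cycle++) {
    emitter.on("persistState", () => {});
}
```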
So the code that was intended to store images never runs, as Chrom(ium)e does not assign the resource type I expected it would, and we probably shouldn't crawl unsplash.com without good reason anyway, but none of that matters here. The relevant takeaway from my flawed attempt to increase memory consumption is that crawler instances have quite a sizeable memory footprint even if you don't store any data on them. The above (updated!) example leaks 1.117 GB on my machine in 100 cycles.
It seems to me, after some profiling with …
So it seems that removing the listeners fixes the issue:

```diff
diff --git a/packages/basic-crawler/src/internals/basic-crawler.ts b/packages/basic-crawler/src/internals/basic-crawler.ts
index 0bd87c44..592f15bf 100644
--- a/packages/basic-crawler/src/internals/basic-crawler.ts
+++ b/packages/basic-crawler/src/internals/basic-crawler.ts
@@ -1174,6 +1174,8 @@ export class BasicCrawler<Context extends CrawlingContext = BasicCrawlingContext
         }
 
         await this.autoscaledPool?.abort();
+        this.events.removeAllListeners();
+        process.removeAllListeners('SIGINT');
     }
 
     protected _handlePropertyNameChange<New, Old>({
diff --git a/packages/core/src/events/event_manager.ts b/packages/core/src/events/event_manager.ts
index 184abf1a..28027e95 100644
--- a/packages/core/src/events/event_manager.ts
+++ b/packages/core/src/events/event_manager.ts
@@ -105,4 +105,8 @@ export abstract class EventManager {
     waitForAllListenersToComplete() {
         return this.events.waitForAllListenersToComplete();
     }
+
+    removeAllListeners() {
+        return this.events.removeAllListeners();
+    }
 }
```

Here are the heap profiles for the example code below:

```js
import os from "os"
import Crawlee from "crawlee"

// Wait before starting, e.g. to attach a profiler
await new Promise(resolve => setTimeout(resolve, 5000))

for (let i = 0; i < 3; i++)
    await crawlerDisposalTest(i)

async function crawlerDisposalTest(i) {
    const navigationQueue = await Crawlee.RequestQueue.open()
    await navigationQueue.addRequest({ url: "https://google.de" })

    // For illustrative purposes only: give each instance a unique name
    // so it can be told apart in heap snapshots
    Object.defineProperty(Crawlee.PlaywrightCrawler, "name", {
        writable: true,
        value: `PlaywrightCrawler#${i}`
    })

    let crawler = new Crawlee.PlaywrightCrawler({
        requestQueue: navigationQueue,
        requestHandler: async ctx => { }
    })

    await crawler.run()
    await crawler.teardown()
    crawler = null

    console.log(`Available system memory: ${os.freemem()} bytes.`)
    await navigationQueue.drop()
}
```

Note the blue bars: they represent the retained memory of the crawler instances, which cannot be garbage collected due to the references in the two event emitters' events maps. I'm happy to send a pull request, but I don't know whether this conflicts with anything else, so feedback is welcome. There still seem to be "smaller" leaks, but those are not connected to this issue.
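To make the retention mechanism concrete, here is a minimal standalone sketch (plain Node, no Crawlee; all names are invented) of how a listener left behind on a long-lived emitter keeps an otherwise dropped instance reachable:

```js
import { EventEmitter } from "node:events";

// A long-lived shared emitter, standing in for Crawlee's event manager
// and for `process` (which holds the SIGINT listener).
const sharedEvents = new EventEmitter();

class LeakyWorker {
    constructor(id) {
        this.id = id;
        this.payload = Buffer.alloc(10 * 1024 * 1024); // simulate a sizeable footprint
        // The listener closes over `this`, so the emitter's internal
        // listeners map now holds a strong reference to this instance.
        this.onTick = () => console.log(`tick in worker ${this.id}`);
        sharedEvents.on("tick", this.onTick);
    }
    teardown() {
        // Without this line the instance can never be garbage collected,
        // which is what happened to the crawler instances above.
        sharedEvents.off("tick", this.onTick);
    }
}

let worker = new LeakyWorker(1);
worker.teardown(); // comment this out to reproduce the retention
worker = null;     // only now is the 10 MB payload collectable
```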
Thanks for the very detailed issue @matjaeck! Indeed the listeners aren't removed, keeping the reference to the instance. I'll send a patch ASAP with proper attribution.
@matjaeck your analysis is very interesting. |
@LeMoussel I don't have much experience in using it, so I can only give some basic hints:
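One generic starting point for capturing heap profiles like the ones above (a suggestion of mine, not necessarily the workflow used in this thread): Node can write DevTools-compatible heap snapshots directly, which can then be opened in the Chrome DevTools Memory tab.

```js
import v8 from "node:v8";

// Writes a .heapsnapshot file into the working directory; open it in
// the Chrome DevTools "Memory" tab to inspect retained objects such as
// the PlaywrightCrawler#<i> instances from the examples above.
const file = v8.writeHeapSnapshot();
console.log(`Heap snapshot written to ${file}`);
```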
@matjaeck Thanks |
Co-authored-by: matjaeck <80065804+matjaeck@users.noreply.github.com> Closes #1670
I came across a similar thing in #2147. It counts the number of requests until …
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/core
Issue description
With the addition of the "Pausing" feature and its corresponding new warning messages, it occurred to me, after noticing that I now have to press CTRL+C twice each time I rebuild my crawler during development, that crawler instances don't seem to be disposed unless you stop the Node process. In the example below, the old crawler instances should, by my understanding of garbage collection in JS, have been garbage collected, unless the lib somehow keeps references across instantiations. But I'm happy to learn if my understanding or the example is flawed.
Usage:
Install and run the example, then after some cycles stop the process in the terminal with CTRL+C.
You will notice that all crawler instances with their dynamically assigned names are still there. The logging uses an instance method, not a static method (see `crawlee/packages/basic-crawler/src/internals/basic-crawler.ts`, line 603 at `49e270c`), so each of those messages is emitted by a crawler instance that is still alive.
This is a (potential) memory leak; the impact of course depends on how much data you store on crawler instances and how long your process runs.
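A quick way to confirm the retention without a full profiler, as a rough sketch (this assumes the process is started with the `--expose-gc` flag so a garbage collection can be forced between cycles):

```js
// Run with: node --expose-gc example.mjs  (script name is arbitrary)
// Call once per cycle: if heapUsed keeps growing even after a forced
// GC, something is still holding references to the old instances.
function reportHeap(cycle) {
    globalThis.gc(); // only available with --expose-gc
    const mb = process.memoryUsage().heapUsed / 1024 / 1024;
    console.log(`Cycle ${cycle}: heap after forced GC = ${mb.toFixed(1)} MB`);
}
```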
Ideas?
Code sample
Package version
3.1.1
Node.js version
v16.13.1
Operating system
Ubuntu 18.04.6 LTS
Priority this issue should have
High