Description
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/core
Issue description
I have no code to reproduce, but I have some suspicion about how this might happen. We crawl a static list of unique URLs with
return crawler
.run(URLs)
and the following handlers:
async requestHandler({ pushData, request, response }) {
let statusCode = response.statusCode;
if (
response.statusCode === 200 &&
decodeURI(request.url) !== decodeURI(request.loadedUrl)
) {
log.debug(
`Request URL: ${request.url} doesn't match loaded URL: ${request.loadedUrl}, treating as 302...`,
);
statusCode = 302;
}
await pushData({
url: request.url,
statusCode,
});
},
async failedRequestHandler({ pushData, request, response }) {
log.error(`Request for URL "${request.url}" failed.`);
await pushData({
url: request.url,
statusCode: response?.statusCode ?? 0,
});
},
async errorHandler(_req, { message }) {
log.error(`Request failed with ${message}`);
}
After calling Dataset.exportToJSON(fileName)
we sometimes end up with the same URL twice. Once with a statusCode: 200
and once with statusCode: 0
. I assume that this happens when a request takes very long, almost until the timeout, say 29.5 seconds in the default case. Now the framework calls requestHandler
if the pushData
takes longer than 0.5 seconds, the framework will call failedRequestHandler
.
I think this is the corresponding code in the timeout package where you can see that the requestHandler invocation is part of the 30 seconds timeout:
async function addTimeoutToPromise(handler, timeoutMillis, errorMessage) {
const context = storage.getStore() ?? {
cancelTask: new AbortController()
};
let returnValue;
const wrap = /* @__PURE__ */ __name(async () => {
try {
returnValue = await handler();
} catch (e) {
if (!(e instanceof InternalTimeoutError)) {
throw e;
}
}
}, "wrap");
return new Promise((resolve, reject) => {
const timeout = setTimeout(() => {
context.cancelTask.abort();
const error = errorMessage instanceof Error ? errorMessage : new TimeoutError(errorMessage);
reject(error);
}, timeoutMillis);
storage.run(context, () => {
wrap().then(() => resolve(returnValue)).catch(reject).finally(() => clearTimeout(timeout));
});
});
}
Code sample
Package version
3.11.5
Node.js version
The one from apify/actor-node-playwright-chrome:20
Operating system
apify/actor-node-playwright-chrome:20 Docker
Apify platform
- Tick me if you encountered this issue on the Apify platform
I have tested this on the next
release
No response
Other context
No response