---
id: error-handling
title: Error handling
description: How to handle errors that occur during web crawling.
---
import ApiLink from '@site/src/components/ApiLink';
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import HandleProxyError from '!!raw-loader!roa-loader!./code_examples/error_handling/handle_proxy_error.py';
import ChangeHandleErrorStatus from '!!raw-loader!roa-loader!./code_examples/error_handling/change_handle_error_status.py';
import DisableRetry from '!!raw-loader!roa-loader!./code_examples/error_handling/disable_retry.py';
This guide demonstrates techniques for handling common errors encountered during web crawling operations.
Low-quality proxies can cause problems even with high settings for `max_request_retries` and `max_session_rotations` in `BasicCrawlerOptions`. If you can't get data because of proxy errors, you might want to try again. You can do this using `failed_request_handler`:

<RunnableCodeBlock className="language-python" language="python">
    {HandleProxyError}
</RunnableCodeBlock>
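The imported example above is kept in a separate file, so as a rough illustration only, a handler along the following lines could re-enqueue requests that failed because of proxy issues. This is a sketch, not the official example: it assumes `ProxyError` is exported from `crawlee.errors`, that `HttpCrawler` lives in `crawlee.crawlers`, and that re-enqueuing with a fresh `unique_key` is enough to bypass request deduplication; check the API reference for your Crawlee version before relying on these names.

```python
import asyncio

from crawlee import Request
from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.errors import ProxyError  # assumed location of the proxy error type


async def main() -> None:
    # Keep the built-in retries low so failed requests reach the handler quickly.
    crawler = HttpCrawler(max_request_retries=2)

    @crawler.router.default_handler
    async def default_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    # Called once all retries for a request are exhausted.
    @crawler.failed_request_handler
    async def failed_handler(context: HttpCrawlingContext, error: Exception) -> None:
        if isinstance(error, ProxyError):
            context.log.warning(f'Proxy error for {context.request.url}, retrying...')
            # Use a fresh unique_key so the request queue does not deduplicate
            # the retry against the already-handled request.
            retry = Request.from_url(
                context.request.url,
                unique_key=f'{context.request.unique_key}-retry',
            )
            await crawler.add_requests([retry])

    await crawler.run(['https://crawlee.dev/'])  # placeholder start URL


if __name__ == '__main__':
    asyncio.run(main())
```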
You can use this same approach when testing different proxy providers. To better manage this process, you can count proxy errors and stop the crawler if you get too many.
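To sketch that idea under the same assumptions as above, you could keep a counter in the failed request handler and stop the crawl once it crosses a threshold. This additionally assumes your Crawlee version exposes a `stop()` method on the crawler; if it doesn't, you could raise an error or simply log and let the run finish.

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.errors import ProxyError  # assumed location of the proxy error type

MAX_PROXY_ERRORS = 10  # arbitrary threshold for this sketch


async def main() -> None:
    crawler = HttpCrawler(max_request_retries=2)
    proxy_error_count = 0

    @crawler.router.default_handler
    async def default_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    @crawler.failed_request_handler
    async def failed_handler(context: HttpCrawlingContext, error: Exception) -> None:
        nonlocal proxy_error_count
        if isinstance(error, ProxyError):
            proxy_error_count += 1
            if proxy_error_count >= MAX_PROXY_ERRORS:
                context.log.error('Too many proxy errors, stopping the crawler.')
                # Assumes the crawler exposes a stop() method.
                crawler.stop()

    await crawler.run(['https://crawlee.dev/'])  # placeholder start URL


if __name__ == '__main__':
    asyncio.run(main())
```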
By default, when a `Session` receives a status code like 401, 403, or 429, Crawlee marks the `Session` as retired and switches to a new one. This might not be what you want, especially when working with authentication. You can learn more in the Session management guide.

Here's an example of how to change this behavior:

<RunnableCodeBlock className="language-python" language="python">
    {ChangeHandleErrorStatus}
</RunnableCodeBlock>
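As one possible sketch of the idea: if your Crawlee version exposes an `ignore_http_error_status_codes` option on `HttpCrawler` (the exact option name is an assumption here, so check the API reference), you can ask the crawler to hand such responses to your request handler instead of retiring the session. The snippet also assumes the response is available as `context.http_response`.

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler(
        # Assumed option name: treat 401 as an ordinary response instead of
        # retiring the session, so the handler can deal with re-authentication.
        ignore_http_error_status_codes=[401],
        max_request_retries=1,
    )

    @crawler.router.default_handler
    async def default_handler(context: HttpCrawlingContext) -> None:
        if context.http_response.status_code == 401:
            context.log.info(f'Got 401 for {context.request.url}, re-authenticating...')
            return
        context.log.info(f'Processing {context.request.url} ...')

    await crawler.run(['https://crawlee.dev/'])  # placeholder start URL


if __name__ == '__main__':
    asyncio.run(main())
```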
Sometimes you might get unexpected errors when parsing data, for example when a website has an unusual structure. Crawlee normally retries the request based on your `max_request_retries` setting, but sometimes you don't want that.

Here's how to turn off retries for non-network errors using `error_handler`, which runs before Crawlee retries the request:

<RunnableCodeBlock className="language-python" language="python">
    {DisableRetry}
</RunnableCodeBlock>
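Again, the imported example lives in a separate file. A rough sketch of such an `error_handler`, assuming `SessionError` is the network-related error type exported from `crawlee.errors` and that requests expose a `no_retry` flag, could look like this:

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.errors import SessionError  # assumed network-related error type


async def main() -> None:
    crawler = HttpCrawler(max_request_retries=3)

    @crawler.router.default_handler
    async def default_handler(context: HttpCrawlingContext) -> None:
        # Imagine parsing logic here that can raise on an unexpected page structure.
        raise ValueError('Unexpected page structure')

    # Runs after a handler error, before Crawlee schedules a retry.
    @crawler.error_handler
    async def retry_filter(context: HttpCrawlingContext, error: Exception) -> None:
        if isinstance(error, SessionError):
            # Network or session problems: let Crawlee retry as usual.
            return
        # Anything else (for example parsing errors): skip further retries.
        context.request.no_retry = True

    await crawler.run(['https://crawlee.dev/'])  # placeholder start URL


if __name__ == '__main__':
    asyncio.run(main())
```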