---
id: error-handling
title: Error handling
description: How to handle errors that occur during web crawling.
---

import ApiLink from '@site/src/components/ApiLink';
import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import HandleProxyError from '!!raw-loader!roa-loader!./code_examples/error_handling/handle_proxy_error.py';
import ChangeHandleErrorStatus from '!!raw-loader!roa-loader!./code_examples/error_handling/change_handle_error_status.py';
import DisableRetry from '!!raw-loader!roa-loader!./code_examples/error_handling/disable_retry.py';

This guide demonstrates techniques for handling common errors encountered during web crawling operations.

## Handling proxy errors

Low-quality proxies can cause problems even with high settings for `max_request_retries` and `max_session_rotations` in `BasicCrawlerOptions`. If you can't get your data because of proxy errors, you might want to try again. You can do this using `failed_request_handler`:

<RunnableCodeBlock className="language-python" language="python">
    {HandleProxyError}
</RunnableCodeBlock>

You can use this same approach when testing different proxy providers. To keep the process under control, you can count proxy errors and stop the crawler once there are too many, as shown in the sketch below.
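For illustration, here is a minimal sketch of that idea. It assumes `ProxyError` is available in `crawlee.errors`, that the `@crawler.failed_request_handler` decorator and `crawler.stop()` exist in your Crawlee version, and it uses a made-up error threshold and start URL, so verify the details against your installed version:

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.errors import ProxyError


async def main() -> None:
    # Made-up threshold: stop the whole crawl after this many proxy failures.
    max_proxy_errors = 10
    proxy_errors = 0

    crawler = HttpCrawler(
        max_request_retries=3,
        max_session_rotations=5,
    )

    @crawler.router.default_handler
    async def default_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    @crawler.failed_request_handler
    async def failed_handler(context: HttpCrawlingContext, error: Exception) -> None:
        nonlocal proxy_errors

        # Count only failures caused by the proxy, not e.g. parsing errors.
        if isinstance(error, ProxyError):
            proxy_errors += 1
            context.log.warning(f'Proxy error #{proxy_errors} for {context.request.url}')

            # Assumption: crawler.stop() gracefully ends the crawl.
            if proxy_errors >= max_proxy_errors:
                crawler.stop()

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```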

## Changing how error status codes are handled

By default, when a `Session` receives a status code like 401, 403, or 429, Crawlee treats it as blocked, retires the `Session`, and switches to a new one. This might not be what you want, especially when working with authentication. You can learn more in the Session management guide.

Here's an example of how to change this behavior:

<RunnableCodeBlock className="language-python" language="python">
    {ChangeHandleErrorStatus}
</RunnableCodeBlock>
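As a rough sketch of what such a change can look like: the parameter names below, `ignore_http_error_status_codes` on the crawler and `blocked_status_codes` in the session settings, are assumptions about the current Crawlee for Python API, so check them against your installed version before relying on them:

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.sessions import SessionPool


async def main() -> None:
    crawler = HttpCrawler(
        # Assumption: 403 responses are not raised as errors, so they reach
        # the request handler instead of triggering a retry.
        ignore_http_error_status_codes=[403],
        # Assumption: new sessions treat no status code as "blocked", so a
        # 401/403/429 response does not retire the session.
        session_pool=SessionPool(
            create_session_settings={'blocked_status_codes': []},
        ),
    )

    @crawler.router.default_handler
    async def default_handler(context: HttpCrawlingContext) -> None:
        # With the settings above, a 403 ends up here and the session is kept,
        # which is useful when you want to re-authenticate and try again.
        status = context.http_response.status_code
        context.log.info(f'Status {status} for {context.request.url}')

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```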

## Turning off retries for non-network errors

Sometimes you might get unexpected errors when parsing data, for example when a website has an unusual structure. Crawlee normally retries the request according to your `max_request_retries` setting, but sometimes you don't want that.

Here's how to turn off retries for non-network errors using `error_handler`, which runs before Crawlee tries again:

<RunnableCodeBlock className="language-python" language="python">
    {DisableRetry}
</RunnableCodeBlock>
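A minimal sketch of that pattern might look like the following. It assumes `SessionError` in `crawlee.errors` represents network/session-related failures and that setting `no_retry` on the request prevents further retries; treat both as assumptions to verify against your Crawlee version:

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.errors import SessionError


async def main() -> None:
    crawler = HttpCrawler(max_request_retries=3)

    @crawler.router.default_handler
    async def default_handler(context: HttpCrawlingContext) -> None:
        # Simulate a parsing error caused by an unexpected page structure.
        raise ValueError('Unexpected page structure')

    @crawler.error_handler
    async def retry_decider(context: HttpCrawlingContext, error: Exception) -> None:
        # Keep retrying network/session problems, but give up immediately
        # on anything else, such as the parsing error above.
        if not isinstance(error, SessionError):
            context.request.no_retry = True

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```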