Improve logging by suppressing irrelevant parts of stack trace #1158

@Pijukatel

Description

Exceptions logged by crawlers should not contain irrelevant stack traces.

Context: Crawlers can log exceptions and continue running, for example when a TimeoutError happens in a request handler function. Many of these exceptions contain framework-related stack frames that are completely irrelevant to the end user. This clutters the exception info and makes logs less readable.

Example of cluttered log:

[crawlee.crawlers._basic._basic_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File ".../repos/crawlee-python/src/crawlee/crawlers/_basic/_context_pipeline.py", line 82, in __call__
          await final_context_consumer(cast('TCrawlingContext', crawling_context))
        File ".../repos/crawlee-python/src/crawlee/router.py", line 98, in __call__
          return await self._default_handler(context)
        File ".../repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1301, in default_handler
          await asyncio.sleep(5)
        File "/usr/lib/python3.10/asyncio/tasks.py", line 605, in sleep
          return await future
      asyncio.exceptions.CancelledError

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):
        File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
          return fut.result()
      asyncio.exceptions.CancelledError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File ".../repos/crawlee-python/src/crawlee/_utils/wait.py", line 37, in wait_for
          return await asyncio.wait_for(operation(), timeout.total_seconds())
        File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
          raise exceptions.TimeoutError() from exc
      asyncio.exceptions.TimeoutError

      The above exception was the direct cause of the following exception:

      Traceback (most recent call last):
        File ".../repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 1112, in __run_task_function
          await self._run_request_handler(context=context)
        File ".../repos/crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py", line 1209, in _run_request_handler
          await wait_for(
        File ".../repos/crawlee-python/src/crawlee/_utils/wait.py", line 39, in wait_for
          raise asyncio.TimeoutError(timeout_message) from ex
      asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds
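
For reference, a chain like this arises whenever `asyncio.wait_for` cancels a slow coroutine and re-raises the resulting `CancelledError` as a `TimeoutError`; Crawlee's `wait_for` wrapper in `src/crawlee/_utils/wait.py` then adds one more link by re-raising with a timeout message. A minimal standalone repro of the same pattern, using only the standard library:

```python
import asyncio

async def default_handler() -> None:
    # Stands in for a user's request handler that runs too long.
    await asyncio.sleep(5)

async def main() -> None:
    # wait_for cancels the handler after 1 second and (on Python 3.10,
    # as in the log above) re-raises the resulting CancelledError as
    # asyncio.TimeoutError, producing the chained traceback shown above.
    await asyncio.wait_for(default_handler(), timeout=1.0)

asyncio.run(main())
```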

Example of focused log with only the relevant information:

[crawlee.crawlers._basic._basic_crawler] ERROR Request failed and reached maximum retries
      asyncio.exceptions.TimeoutError: Request handler timed out after 1.0 seconds

      Request handler was interrupted at:
        File ".../repos/crawlee-python/tests/unit/crawlers/_basic/test_basic_crawler.py", line 1301, in default_handler
          await asyncio.sleep(5)
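
One way to produce this focused form is to drop framework frames and keep only the user-code frames found along the exception chain. Below is a minimal sketch of the idea, not Crawlee's actual implementation: `_FRAMEWORK_MARKERS` and `format_focused_traceback` are hypothetical names, and the substring-based heuristic is only an assumption about how framework frames could be detected.

```python
from __future__ import annotations

import traceback

# Hypothetical heuristic: frames whose file paths contain one of these
# markers are treated as framework internals and dropped. A real
# implementation would need a more reliable test than substring matching.
_FRAMEWORK_MARKERS = ('/crawlee/', '/asyncio/', '/site-packages/')


def _user_frames(exc: BaseException) -> list[traceback.FrameSummary]:
    """Return only the stack frames of `exc` that come from user code."""
    return [
        frame
        for frame in traceback.extract_tb(exc.__traceback__)
        if not any(marker in frame.filename for marker in _FRAMEWORK_MARKERS)
    ]


def format_focused_traceback(exc: BaseException) -> str:
    """Format `exc` as its message plus only the user-code frames,
    instead of the full chained traceback."""
    lines = [f'{type(exc).__module__}.{type(exc).__qualname__}: {exc}']

    # Walk the `raise ... from ...` chain back towards the original
    # CancelledError, collecting user-code frames along the way.
    interrupted_at: list[traceback.FrameSummary] = []
    seen: set[int] = set()
    current: BaseException | None = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        interrupted_at.extend(_user_frames(current))
        current = current.__cause__ or current.__context__

    if interrupted_at:
        lines += ['', 'Request handler was interrupted at:']
        summary = traceback.StackSummary.from_list(interrupted_at)
        lines += [rendered.rstrip('\n') for rendered in summary.format()]
    return '\n'.join(lines)
```

The crawler's error path could then log `format_focused_traceback(exc)` instead of passing the full `exc_info` to the logger.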

Labels

enhancement (New feature or request), t-tooling (Issues with this label are in the ownership of the tooling team.)

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions