@Xyerophyte

Summary

This PR adds configurable timeout handling to FetchNode to prevent indefinite blocking on long-running HTTP requests and PDF parsing operations.

Problem

  • Line 263 in fetch_node.py: requests.get(source) had no timeout parameter
  • PDF parsing via PyPDFLoader.load() had no timeout mechanism
  • Either call could hang indefinitely on slow or unresponsive servers, or on large documents

Solution

1. Added Timeout Configuration

  • New timeout attribute (default: 30 seconds)
  • Configurable via node_config['timeout']
  • Can be disabled by setting to None
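A minimal sketch of how the three cases above (missing key, custom value, explicit None) can be told apart. The helper name `resolve_timeout` and the constant `DEFAULT_TIMEOUT` are illustrative assumptions, not names taken from the PR's diff.

```python
DEFAULT_TIMEOUT = 30  # seconds; the default stated in this PR


def resolve_timeout(node_config):
    """Return the timeout in seconds, or None when timeouts are disabled.

    Hypothetical helper: FetchNode may read its config differently.
    """
    if node_config is None:
        return DEFAULT_TIMEOUT
    # dict.get with a default separates "key missing" (use 30 seconds)
    # from an explicit None (timeout disabled).
    return node_config.get("timeout", DEFAULT_TIMEOUT)
```

Using `dict.get` with a default rather than `node_config.get("timeout") or 30` matters here: the latter would silently turn an explicit None back into 30.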

2. Applied Timeout to HTTP Requests

  • requests.get() now uses timeout parameter when configured
  • The timeout argument is passed only when it is not None, so timeout=None reproduces the previous unbounded call
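One way the conditional application could look, as a sketch. The names `build_request_kwargs` and `fetch` are hypothetical, and `http_get` stands in for `requests.get` so the snippet runs without a network connection.

```python
def build_request_kwargs(timeout):
    """Keyword arguments for the HTTP call; 'timeout' is omitted when disabled."""
    return {} if timeout is None else {"timeout": timeout}


def fetch(source, timeout, http_get):
    """Call http_get (expected to behave like requests.get) with an optional timeout."""
    return http_get(source, **build_request_kwargs(timeout))


# Assumed usage with the real library:
#   import requests
#   response = fetch("https://example.com", 30, requests.get)
```

Note that the `timeout` argument of `requests.get` bounds the connection attempt and each wait for data from the server, not the total download time, so a server that keeps trickling bytes can still exceed it.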

3. Applied Timeout to PDF Parsing

  • Wrapped PyPDFLoader.load() in ThreadPoolExecutor
  • Enforces timeout and raises descriptive TimeoutError on timeout
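The ThreadPoolExecutor approach can be sketched as below. This is an illustration under assumptions, not the PR's actual code: `run_with_timeout` is a hypothetical name, and the error message wording is invented.

```python
import concurrent.futures


def run_with_timeout(func, timeout, *args, **kwargs):
    """Run func in a worker thread and give up waiting after `timeout` seconds.

    With timeout=None the function is called directly, with no limit.
    """
    if timeout is None:
        return func(*args, **kwargs)
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError as exc:
        raise TimeoutError(
            f"Operation did not finish within {timeout} seconds"
        ) from exc
    finally:
        # wait=False returns immediately instead of joining the worker.
        executor.shutdown(wait=False)


# Assumed usage with the loader named in this PR:
#   documents = run_with_timeout(PyPDFLoader(source).load, 30)
```

Two caveats apply to this technique in general. A `with ThreadPoolExecutor(...)` block would wait for the worker on exit and defeat the timeout, which is why the sketch calls `shutdown(wait=False)`. Also, Python cannot kill a running thread, so the timeout only unblocks the caller; the parse itself keeps running in the background until it finishes.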

4. Propagated Timeout to ChromiumLoader

  • Automatically passes timeout to ChromiumLoader via loader_kwargs
  • Respects explicit loader_kwargs['timeout'] if already set
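The precedence rule described above (an explicit `loader_kwargs['timeout']` wins over the node-level value) could be expressed as follows. The function name `with_timeout` is a hypothetical choice for illustration.

```python
def with_timeout(loader_kwargs, timeout):
    """Return a copy of loader_kwargs with the node timeout filled in if absent."""
    merged = dict(loader_kwargs or {})
    if timeout is not None:
        # setdefault leaves an existing 'timeout' entry untouched.
        merged.setdefault("timeout", timeout)
    return merged
```

Copying the dictionary before modifying it avoids mutating configuration the caller may reuse elsewhere.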

Usage Examples

# Default 30s timeout
node = FetchNode(input="url", output=["doc"], node_config={})

# Custom timeout
node = FetchNode(input="url", output=["doc"], node_config={"timeout": 15})

# Disable timeout (legacy behavior)
node = FetchNode(input="url", output=["doc"], node_config={"timeout": None})

Testing

  • Added comprehensive unit tests in tests/test_fetch_node_timeout.py
  • Tests cover: default/custom/disabled timeout, HTTP requests, PDF parsing, ChromiumLoader propagation

Backward Compatibility

✅ Fully backward compatible:

  • Existing code works without changes, but now gets the 30s default timeout
  • The previous unbounded behavior is available by setting node_config['timeout'] to None
  • No breaking changes

Related Issue

Addresses the issue mentioned in line 263 of fetch_node.py where requests could block indefinitely.

- Add timeout parameter to FetchNode (default: 30 seconds)
- Apply timeout to requests.get() calls to prevent indefinite hangs
- Implement timeout for PDF parsing using ThreadPoolExecutor
- Propagate timeout to ChromiumLoader via loader_kwargs
- Add comprehensive unit tests for timeout functionality
- Fully backward compatible (timeout can be disabled with None)

Fixes issue with requests.get() and PDF parsing blocking indefinitely
on slow/unresponsive servers or large documents.

Usage:
  node_config={'timeout': 30}  # Custom timeout
  node_config={'timeout': None}  # Disable timeout
  node_config={}  # Use default 30s timeout
@dosubot added the labels size:L (this PR changes 100-499 lines, ignoring generated files), enhancement (new feature or request), and tests (improvements or additions to tests) on Nov 1, 2025.