feat: Add configurable timeout to FetchNode #1020

Xyerophyte · 2025-11-01T09:09:12Z

Summary

This PR adds configurable timeout handling to FetchNode to prevent indefinite blocking on long-running HTTP requests and PDF parsing operations.

Problem

Line 263 in fetch_node.py: requests.get(source) had no timeout parameter
PDF parsing via PyPDFLoader.load() had no timeout mechanism
Either call could hang indefinitely on slow or unresponsive servers or on large documents

Solution

1. Added Timeout Configuration

New timeout attribute (default: 30 seconds)
Configurable via node_config['timeout']
Can be disabled by setting to None
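
As a rough illustration of the three cases above (key absent, key set to a number, key set to None), here is a minimal sketch. The helper name resolve_timeout and the constant DEFAULT_TIMEOUT are hypothetical and are not taken from the PR diff; only the 30-second default and the node_config['timeout'] key come from this description.

```python
DEFAULT_TIMEOUT = 30  # seconds; the default stated in this PR


def resolve_timeout(node_config):
    """Return the timeout in seconds, or None when timeouts are disabled.

    Hypothetical helper: the actual FetchNode may read the value differently.
    """
    config = node_config or {}
    # dict.get falls back to the default only when the key is absent,
    # so an explicit {"timeout": None} is preserved as None (disabled).
    return config.get("timeout", DEFAULT_TIMEOUT)
```

The point of the sketch is the distinction between a missing key (default applies) and an explicit None (timeout disabled), which a pattern like `config.get("timeout") or 30` would not preserve.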

2. Applied Timeout to HTTP Requests

requests.get() now receives the timeout argument whenever a timeout is configured
The argument is omitted when timeout is None, which preserves the legacy no-timeout call
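
The conditional application could look roughly like the sketch below. The function name fetch_with_timeout and the injected http_get callable are assumptions made so the example runs without a network; in FetchNode the callable would presumably be requests.get.

```python
def fetch_with_timeout(http_get, source, timeout):
    """Call http_get(source), adding timeout= only when one is configured.

    http_get is a stand-in for requests.get (an assumption for this sketch).
    """
    kwargs = {}
    if timeout is not None:
        # Only pass the keyword when a timeout is set, so the disabled case
        # makes exactly the same call as the pre-PR code.
        kwargs["timeout"] = timeout
    return http_get(source, **kwargs)


# Example (requires the requests package and network access):
#   import requests
#   response = fetch_with_timeout(requests.get, "https://example.com", 30)
```

Note that requests itself treats timeout=None as "wait indefinitely", so omitting the keyword and passing None should behave the same; omitting it just keeps the legacy call unchanged.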

3. Applied Timeout to PDF Parsing

Wrapped PyPDFLoader.load() in ThreadPoolExecutor
Enforces the timeout and raises a descriptive TimeoutError when it expires
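
A minimal sketch of the ThreadPoolExecutor approach follows. The helper name run_with_timeout and the error message wording are hypothetical; in the PR the wrapped callable would be PyPDFLoader.load(), but any zero-argument callable works here.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


def run_with_timeout(func, timeout, description="operation"):
    """Run func() in a worker thread and stop waiting after `timeout` seconds."""
    if timeout is None:
        # Timeout disabled: run inline, as before.
        return func()

    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout)
    except FutureTimeoutError:
        raise TimeoutError(
            f"{description} did not finish within {timeout} seconds"
        ) from None
    finally:
        # wait=False so the caller is not blocked by the still-running worker;
        # a `with ThreadPoolExecutor()` block would wait for it on exit.
        executor.shutdown(wait=False)
```

One caveat of this technique: Python cannot forcibly stop the worker thread, so a timed-out parse keeps running in the background until it finishes on its own. The timeout bounds how long the caller waits, not how long the work takes.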

4. Propagated Timeout to ChromiumLoader

Automatically passes timeout to ChromiumLoader via loader_kwargs
Respects explicit loader_kwargs['timeout'] if already set
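
The "propagate unless already set" rule can be sketched with dict.setdefault. The function name propagate_timeout is hypothetical, and how ChromiumLoader interprets a timeout entry in its keyword arguments is not shown in this description, so treat this only as an illustration of the merge logic.

```python
def propagate_timeout(loader_kwargs, timeout):
    """Return a copy of loader_kwargs with the node timeout filled in.

    An explicit loader_kwargs["timeout"] always wins over the node-level value.
    """
    merged = dict(loader_kwargs or {})  # copy so the caller's dict is untouched
    if timeout is not None:
        # setdefault only writes the key when it is not already present.
        merged.setdefault("timeout", timeout)
    return merged
```

Copying the dict first avoids mutating the user's configuration as a side effect of running the node.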

Usage Examples

# Default 30s timeout
node = FetchNode(input="url", output=["doc"], node_config={})

# Custom timeout
node = FetchNode(input="url", output=["doc"], node_config={"timeout": 15})

# Disable timeout (legacy behavior)
node = FetchNode(input="url", output=["doc"], node_config={"timeout": None})

Testing

Added comprehensive unit tests in tests/test_fetch_node_timeout.py
Tests cover: default/custom/disabled timeout, HTTP requests, PDF parsing, ChromiumLoader propagation

Backward Compatibility

✅ Fully backward compatible:

Existing code works without changes, but now gets the 30s default timeout
The previous no-timeout behavior is available by setting timeout=None
No breaking API changes

Related Issue

Addresses the issue mentioned in line 263 of fetch_node.py where requests could block indefinitely.

- Add timeout parameter to FetchNode (default: 30 seconds)
- Apply timeout to requests.get() calls to prevent indefinite hangs
- Implement timeout for PDF parsing using ThreadPoolExecutor
- Propagate timeout to ChromiumLoader via loader_kwargs
- Add comprehensive unit tests for timeout functionality
- Fully backward compatible (timeout can be disabled with None)

Fixes issue with requests.get() and PDF parsing blocking indefinitely on slow/unresponsive servers or large documents.

Usage:
node_config={'timeout': 30}    # Custom timeout
node_config={'timeout': None}  # Disable timeout
node_config={}                 # Use default 30s timeout

dosubot (bot) added labels on Nov 1, 2025: size:L (this PR changes 100-499 lines, ignoring generated files), enhancement (new feature or request), tests (improvements or additions to tests)
