Commit dde70e7
[Data] Fix HTTP streaming file download by using `open_input_stream` (#58542)
## What does this PR do?
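Fixes HTTP streaming file downloads in Ray Data's download operation.
Some URIs (especially HTTP streams) are not seekable, so reading them
through `open_input_file` fails with `ValueError: Cannot seek streaming
HTTP file`. These URIs must be opened with `open_input_stream` instead.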
## Changes
- Modified `download_bytes_threaded` in `plan_download_op.py` to try
both `open_input_file` and `open_input_stream` for each URI
- Improved error handling to distinguish between different error types
- Failed downloads now return `None` gracefully instead of crashing
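
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in `plan_download_op.py`: the helper `read_uri_bytes`, the fake filesystem classes, the order in which the two methods are tried, and the caught exception types are all assumptions made for the example. It relies only on the filesystem object exposing `open_input_file` and `open_input_stream`.

```python
import io
from typing import Optional


def read_uri_bytes(fs, path: str) -> Optional[bytes]:
    """Return the bytes at `path`, or None if both open methods fail."""
    for method_name in ("open_input_file", "open_input_stream"):
        try:
            with getattr(fs, method_name)(path) as f:
                return f.read()
        except (OSError, ValueError):
            # Assumed error types; the traceback in this PR shows a
            # ValueError ("Cannot seek streaming HTTP file") raised from read().
            continue
    return None


class _UnseekableFile(io.BytesIO):
    """Hypothetical file object whose read() fails like a streaming HTTP file."""

    def read(self, *args):
        raise ValueError("Cannot seek streaming HTTP file")


class FakeStreamingFS:
    """Hypothetical filesystem: random access fails, streaming works."""

    def open_input_file(self, path):
        return _UnseekableFile()

    def open_input_stream(self, path):
        return io.BytesIO(b"payload")


class FakeBrokenFS:
    """Hypothetical filesystem where every open attempt fails (e.g. a 404)."""

    def open_input_file(self, path):
        raise OSError("not found")

    open_input_stream = open_input_file


print(read_uri_bytes(FakeStreamingFS(), "http://example.invalid/stream"))   # b'payload'
print(read_uri_bytes(FakeBrokenFS(), "http://example.invalid/missing"))     # None
```

Keeping `f.read()` inside the `try` matters here, because the failure in the traceback is raised while reading, not while opening the file.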
## Testing
```
import pyarrow as pa
from ray.data.context import DataContext
from ray.data._internal.planner.plan_download_op import download_bytes_threaded

# Test URL: an HTTP streaming endpoint (Tesla configurator image)
urls = [
    "https://static-assets.tesla.com/configurator/compositor?context=design_studio_2?&bkba_opt=1&view=STUD_3QTR&size=600&model=my&options=$APBS,$IPB7,$PPSW,$SC04,$MDLY,$WY19P,$MTY46,$STY5S,$CPF0,$DRRH&crop=1150,647,390,180&",
]

# Create PyArrow table and call download function
table = pa.table({"url": urls})
ctx = DataContext.get_current()
results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))

# Check results: a failed download shows up as None in the "bytes" column
result_table = results[0]
for i in range(result_table.num_rows):
    url = result_table['url'][i].as_py()
    bytes_data = result_table['bytes'][i].as_py()
    if bytes_data is None:
        print(f"Row {i}: FAILED (None) - try-catch worked ✓")
    else:
        print(f"Row {i}: SUCCESS ({len(bytes_data)} bytes)")
    print(f" URL: {url[:60]}...")
print("\n✅ Test passed: Failed downloads return None instead of crashing.")
```
Before the fix:
```
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ray/default/test_streaming_fallback.py", line 110, in <module>
test_download_expression_with_streaming_fallback()
File "/home/ray/default/test_streaming_fallback.py", line 67, in test_download_expression_with_streaming_fallback
with patch.object(pafs.FileSystem, "open_input_file", mock_open_input_file):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1594, in __enter__
if not self.__exit__(*sys.exc_info()):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/unittest/mock.py", line 1603, in __exit__
setattr(self.target, self.attribute, self.temp_original)
TypeError: cannot set 'open_input_file' attribute of immutable type 'pyarrow._fs.FileSystem'
(base) ray@ip-10-0-39-21:~/default$ python test.py
2025-11-11 18:32:23,510 WARNING util.py:1059 -- Caught exception in transforming worker!
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
for result in fn(input_queue_iter):
^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
yield f.read()
^^^^^^^^
File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
Traceback (most recent call last):
File "/home/ray/default/test.py", line 16, in <module>
results = list(download_bytes_threaded(table, ["url"], ["bytes"], ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 207, in download_bytes_threaded
uri_bytes = list(
^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1113, in make_async_gen
raise item
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/util.py", line 1048, in _run_transforming_worker
for result in fn(input_queue_iter):
^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/data/_internal/planner/plan_download_op.py", line 197, in load_uri_bytes
yield f.read()
^^^^^^^^
File "pyarrow/io.pxi", line 411, in pyarrow.lib.NativeFile.read
File "pyarrow/io.pxi", line 263, in pyarrow.lib.NativeFile.size
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
File "/home/ray/anaconda3/lib/python3.12/site-packages/fsspec/implementations/http.py", line 743, in seek
raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file
```
After the fix:
```
Row 0: SUCCESS (189370 bytes)
URL: https://static-assets.tesla.com/configurator/compositor?cont...
```
Tested with HTTP streaming URLs (e.g., Tesla configurator images) that
previously failed:
- ✅ Successfully downloads HTTP stream files
- ✅ Gracefully handles failed downloads (returns None)
- ✅ Maintains backward compatibility with existing file downloads
---------
Signed-off-by: xyuzh <xinyzng@gmail.com>
Signed-off-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
PR #58542, 1 parent 438d6dc, commit dde70e7
File tree: 2 files changed (+36, -35 lines) under python/ray/data, in _internal/planner and tests