I searched the issues and found no similar issues.
Component
Tools/ingest2parquet
What happened + What you expected to happen
This happens when running the Ray version with NUM_WORKERS > 1.
It is reliably reproducible in Google Colab.
Running the cell again works, but the first-run failure makes for a negative user experience.
(orchestrate pid=1575) 05:41:45 ERROR - Failed to process request worker exception The actor died because of an error raised in its creation task, ray::RayTransformFileProcessor.__init__() (pid=1784, ip=172.28.0.12, actor_id=09c62ae6504057816b30599401000000, repr=<data_processing_ray.runtime.ray.transform_file_processor.RayTransformFileProcessor object at 0x7ee7e55fbc40>)
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/data_processing_ray/runtime/ray/transform_file_processor.py", line 46, in __init__
(orchestrate pid=1575) self.transform = params.get("transform_class", None)(self.transform_params)
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/pdf2parquet_transform_ray.py", line 40, in __init__
(orchestrate pid=1575) super().__init__(config)
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/pdf2parquet_transform.py", line 105, in __init__
(orchestrate pid=1575) self._converter = DocumentConverter(
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/docling/document_converter.py", line 54, in __init__
(orchestrate pid=1575) self.model_pipeline = pipeline_cls(
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/docling/pipeline/standard_model_pipeline.py", line 18, in __init__
(orchestrate pid=1575) EasyOcrModel(
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/docling/models/easyocr_model.py", line 21, in __init__
(orchestrate pid=1575) self.reader = easyocr.Reader(config["lang"])
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/easyocr/easyocr.py", line 92, in __init__
(orchestrate pid=1575) detector_path = self.getDetectorPath(detect_network)
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/easyocr/easyocr.py", line 253, in getDetectorPath
(orchestrate pid=1575) download_and_unzip(self.detection_models[self.detect_network]['url'], self.detection_models[self.detect_network]['filename'], self.model_storage_directory, self.verbose)
(orchestrate pid=1575) File "/usr/local/lib/python3.10/dist-packages/easyocr/utils.py", line 631, in download_and_unzip
(orchestrate pid=1575) os.remove(zip_path)
(orchestrate pid=1575) FileNotFoundError: [Errno 2] No such file or directory: '/root/.EasyOCR//model/temp.zip'
Yes, the error is quite obvious 🤣
My suspicion is that it's caused by a race condition between workers trying to clean up downloaded artifacts.
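To make that suspicion concrete, here is a minimal sketch of one interleaving that would produce the same FileNotFoundError. It is only a simulation using the standard library, not EasyOCR's actual code: `cleanup_strict` mirrors the bare `os.remove(zip_path)` call from the traceback, while `cleanup_tolerant` and `simulate_two_workers` are hypothetical names made up for illustration.

```python
import contextlib
import os
import tempfile


def cleanup_strict(zip_path):
    """Mirror the failing call in the traceback: a bare os.remove()."""
    os.remove(zip_path)


def cleanup_tolerant(zip_path):
    """Hypothetical alternative: ignore a file another worker already removed."""
    with contextlib.suppress(FileNotFoundError):
        os.remove(zip_path)


def simulate_two_workers(cleanup):
    """Replay one possible interleaving on a shared temp.zip.

    Both "workers" are done with the same archive and each tries to delete it.
    Returns the name of the exception hit by the second worker, or None.
    """
    with tempfile.TemporaryDirectory() as model_dir:
        zip_path = os.path.join(model_dir, "temp.zip")
        with open(zip_path, "wb") as fh:
            fh.write(b"placeholder archive bytes")
        cleanup(zip_path)      # worker A removes the shared file
        try:
            cleanup(zip_path)  # worker B finds it already gone
        except FileNotFoundError as exc:
            return type(exc).__name__
    return None
```

Calling `simulate_two_workers(cleanup_strict)` hits the same exception type as the log above, while `simulate_two_workers(cleanup_tolerant)` completes without error. If the race is real, a tolerant cleanup would only hide the symptom; the workers would still be downloading into the same path at the same time.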
Adding:
I see this consistently on Google Colab, because each notebook gets its own sandbox.
To reproduce it locally, please delete the cache directory of downloaded artifacts (I am not sure where this is -- probably done by docling?)
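For the local reproduction step, a small helper along these lines could clear the cache before a run. This is a sketch under assumptions: `clear_model_cache` is a hypothetical name, and the only location evidence is the path in the traceback ('/root/.EasyOCR//model/temp.zip'), which suggests the models land under `.EasyOCR/model` in the home directory of the user running the job.

```python
import shutil
from pathlib import Path


def clear_model_cache(cache_dir):
    """Delete cache_dir and everything under it.

    Returns True if the directory existed and was removed, False otherwise.
    """
    cache_dir = Path(cache_dir).expanduser()
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True
```

For example, `clear_model_cache("~/.EasyOCR/model")` would be the call to try first, assuming the path from the traceback also applies on your machine. Check the path before running it, since the helper deletes the whole directory.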
Reproduction script
https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_ray.ipynb
Use open-in-colab link : https://colab.research.google.com/github/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_ray.ipynb
Anything else
No response
OS
Other
Python
3.11.x
Are you willing to submit a PR?