[Bug] pdf2parquet ray version erroring out when downloading models for the very first time #667

Open
1 of 2 tasks
sujee opened this issue Oct 4, 2024 · 4 comments

@sujee
Contributor

sujee commented Oct 4, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Tools/ingest2parquet

What happened + What you expected to happen

This happens when running the Ray version with NUM_WORKERS > 1.
It is reliably reproducible in Google Colab.
Running the cell again works.

Still, it makes for a negative user experience.

(orchestrate pid=1575) 05:41:45 ERROR - Failed to process request worker exception The actor died because of an error raised in its creation task, ray::RayTransformFileProcessor.__init__() (pid=1784, ip=172.28.0.12, actor_id=09c62ae6504057816b30599401000000, repr=<data_processing_ray.runtime.ray.transform_file_processor.RayTransformFileProcessor object at 0x7ee7e55fbc40>)
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/data_processing_ray/runtime/ray/transform_file_processor.py", line 46, in __init__
(orchestrate pid=1575)     self.transform = params.get("transform_class", None)(self.transform_params)
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/pdf2parquet_transform_ray.py", line 40, in __init__
(orchestrate pid=1575)     super().__init__(config)
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/pdf2parquet_transform.py", line 105, in __init__
(orchestrate pid=1575)     self._converter = DocumentConverter(
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/docling/document_converter.py", line 54, in __init__
(orchestrate pid=1575)     self.model_pipeline = pipeline_cls(
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/docling/pipeline/standard_model_pipeline.py", line 18, in __init__
(orchestrate pid=1575)     EasyOcrModel(
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/docling/models/easyocr_model.py", line 21, in __init__
(orchestrate pid=1575)     self.reader = easyocr.Reader(config["lang"])
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/easyocr/easyocr.py", line 92, in __init__
(orchestrate pid=1575)     detector_path = self.getDetectorPath(detect_network)
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/easyocr/easyocr.py", line 253, in getDetectorPath
(orchestrate pid=1575)     download_and_unzip(self.detection_models[self.detect_network]['url'], self.detection_models[self.detect_network]['filename'], self.model_storage_directory, self.verbose)
(orchestrate pid=1575)   File "/usr/local/lib/python3.10/dist-packages/easyocr/utils.py", line 631, in download_and_unzip
(orchestrate pid=1575)     os.remove(zip_path)
(orchestrate pid=1575) FileNotFoundError: [Errno 2] No such file or directory: '/root/.EasyOCR//model/temp.zip'

Reproduction script

https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_ray.ipynb

Use the open-in-Colab link: https://colab.research.google.com/github/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_ray.ipynb

Anything else

No response

OS

Other

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@sujee sujee added the bug Something isn't working label Oct 4, 2024
@blublinsky
Collaborator

The error is quite obvious:

FileNotFoundError: [Errno 2] No such file or directory: '/root/.EasyOCR//model/temp.zip'

Either the file does not exist or the location is wrong.

@sujee
Contributor Author

sujee commented Oct 4, 2024

Yes, the error is quite obvious 🤣
My suspicion is that it is caused by a race condition between workers trying to clean up downloaded artifacts.

Adding:
I see this consistently on Google Colab, because each notebook gets its own sandbox.
To reproduce it locally, please delete the cache directory of downloaded artifacts (I am not sure where this is; it is probably created by docling?).
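
The suspected race can be illustrated without Ray or EasyOCR. The sketch below is an assumption about the mechanism, not confirmed EasyOCR behavior: it replays, in a fixed order, two workers that share one `temp.zip` path and each call `os.remove` on it, as in the last frame of the traceback. The function name and the interleaving are hypothetical.

```python
import os
import tempfile


def replay_shared_temp_cleanup():
    """Replay two workers cleaning up the same temporary archive.

    Returns the name of the exception raised by the second cleanup,
    or None if both cleanups succeed.
    """
    with tempfile.TemporaryDirectory() as model_dir:
        # Both workers use the same fixed path, like '.../model/temp.zip'.
        zip_path = os.path.join(model_dir, "temp.zip")

        # Worker A and worker B each "download" to the shared path.
        for payload in (b"worker-a", b"worker-b"):
            with open(zip_path, "wb") as f:
                f.write(payload)

        # Worker A finishes first and removes the archive.
        os.remove(zip_path)

        # Worker B then tries to remove the same, now missing, file.
        try:
            os.remove(zip_path)
        except FileNotFoundError as exc:
            return type(exc).__name__
    return None


if __name__ == "__main__":
    print(replay_shared_temp_cleanup())
```

Run as a script, this prints `FileNotFoundError`, the same exception type as in the report. It would also explain why a second run works: by then the models are already cached, so no worker downloads or cleans up anything.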

@sujee
Contributor Author

sujee commented Oct 4, 2024

Related: #583

@blublinsky
Collaborator

Yeah, we know exactly why. It's up to the guys to decide what to do.
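
One possible mitigation, offered only as a sketch and not as the fix chosen by the maintainers: let a single process perform the first-time setup while the others wait on a file lock. The helper name `run_once_across_processes`, the lock and marker file names, and the `setup` callback are all hypothetical. `fcntl.flock` is Unix-only, which matches the Colab environment in the report.

```python
import fcntl
import os
import tempfile


def run_once_across_processes(cache_dir, setup):
    """Run setup() at most once per cache_dir, even with concurrent callers.

    Hypothetical helper: callers serialize on an exclusive file lock, and a
    marker file records that setup has already completed.
    Returns True if this call ran setup(), False if it was already done.
    """
    os.makedirs(cache_dir, exist_ok=True)
    lock_path = os.path.join(cache_dir, ".setup.lock")
    marker_path = os.path.join(cache_dir, ".setup.done")

    with open(lock_path, "w") as lock_file:
        # Blocks until no other caller holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(marker_path):
                return False  # another caller already ran setup
            setup()
            # Write the marker only after setup succeeded.
            with open(marker_path, "w") as marker:
                marker.write("ok")
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


if __name__ == "__main__":
    calls = []
    with tempfile.TemporaryDirectory() as cache_dir:
        first = run_once_across_processes(cache_dir, lambda: calls.append(1))
        second = run_once_across_processes(cache_dir, lambda: calls.append(1))
    print(first, second, len(calls))
```

In the scenario from this issue, `setup` would stand for whatever triggers the first model download. A simpler alternative under the same assumption is to trigger that download once in the driver before any Ray workers are created.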
