How to choose the best timeout value in extractors? #326

Open
jordane95 opened this issue Jan 22, 2025 · 2 comments

Comments

@jordane95
Contributor

Hi,

I'm not sure how to choose the best timeout threshold when running an extractor. Shouldn't this threshold be hardware-aware?

@jordane95
Contributor Author

By the way, my task failed with a timeout error. I don't know why the TimeoutError is not caught by the except clause:

2025-01-22 04:01:23.916 | WARNING | datatrove.pipeline.extractors.base:timeout_extract:58 - ⏰ Timeout while cleaning record text. Skipping record.
Traceback (most recent call last):
File "/opt/conda/envs/datatrove/lib/python3.10/site-packages/regex/_regex_core.py", line 3032, in __del__
raise TimeoutError
File "/opt/conda/envs/datatrove/lib/python3.10/weakref.py", line 106, in remove
raise TimeoutError
TimeoutError:
2025-01-22 04:01:26.748 | WARNING | datatrove.pipeline.extractors.base:timeout_extract:58 - ⏰ Timeout while cleaning record text. Skipping record.
2025-01-22 04:01:27.343 | WARNING | datatrove.pipeline.extractors.base:timeout_extract:58 - ⏰ Timeout while cleaning record text. Skipping record.
2025-01-22 04:01:29.492 | WARNING | datatrove.pipeline.extractors.base:timeout_extract:58 - ⏰ Timeout while cleaning record text. Skipping record.
2025-01-22 04:01:30.577 | WARNING | datatrove.pipeline.extractors.base:timeout_extract:58 - ⏰ Timeout while cleaning record text. Skipping record.
Traceback (most recent call last):
File "/opt/conda/envs/datatrove/lib/python3.10/site-packages/regex/_regex_core.py", line 3032, in __del__
def __del__(self):
File "/output/datatrove/src/datatrove/pipeline/extractors/base.py", line 50, in signal_handler
TimeoutError:
2025-01-22 04:01:35.718 | WARNING | datatrove.pipeline.extractors.base:timeout_extract:58 - ⏰ Timeout while cleaning record text. Skipping record.
2025-01-22 04:01:36.681 | WARNING | datatrove.pipeline.extractors.base:timeout_extract:58 - ⏰ Timeout while cleaning record text. Skipping record.
Exception ignored in: <function Group.__del__ at 0x7f6f4fa77880>
Traceback (most recent call last):
File "/opt/conda/envs/datatrove/lib/python3.10/site-packages/regex/_regex_core.py", line 3032, in __del__
def __del__(self):
File "/output/datatrove/src/datatrove/pipeline/extractors/base.py", line 50, in signal_handler
raise TimeoutError
TimeoutError:
2025-01-22 04:01:39.571 | WARNING | datatrove.pipeline.extractors.base:timeout_extract:58 - ⏰ Timeout while cleaning record text. Skipping record.
2025-01-22 04:01:41.709 | WARNING | datatrove.pipeline.extractors.base:timeout_extract:58 - ⏰ Timeout while cleaning record text. Skipping record.
2025-01-22 04:01:43.079 | WARNING | datatrove.pipeline.extractors.base:timeout_extract:58 - ⏰ Timeout while cleaning record text. Skipping record.
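
The `Exception ignored in: <function Group.__del__ ...>` line in the log hints at why the `except` clause may never see some of these errors. When an exception is raised inside a `__del__` method, Python reports it through `sys.unraisablehook` and then discards it instead of propagating it to the caller. If the alarm handler happens to fire while a finalizer is running, the `TimeoutError` would be lost in the same way. Below is a minimal sketch of that behavior on CPython. The `Group` class here is a hypothetical stand-in, not the real `regex` class, and `run_demo` is an illustrative name.

```python
import sys


class Group:
    """Hypothetical stand-in for an object that has a finalizer."""

    def __del__(self):
        # Simulates the timeout handler raising while the finalizer runs.
        raise TimeoutError


def run_demo():
    """Return (caught, unraisable_types) after deleting a Group inside try/except."""
    unraisable = []
    old_hook = sys.unraisablehook
    # Record what Python would otherwise print as "Exception ignored in: ...".
    sys.unraisablehook = lambda info: unraisable.append(info.exc_type)
    caught = False
    try:
        obj = Group()
        del obj  # on CPython the refcount hits zero here, so __del__ runs now
    except TimeoutError:
        caught = True  # never reached: the error does not propagate
    finally:
        sys.unraisablehook = old_hook
    return caught, unraisable


if __name__ == "__main__":
    print(run_demo())
```

The `except TimeoutError` branch is never taken, and the exception only shows up in the hook. This is only an illustration of the mechanism; it does not reproduce the datatrove code path itself.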

@guipenedo
Collaborator

Are you using the latest version of the master branch or the pip version?
I recently rewrote the timeout checking part and it should be much more robust: 2548cdf
A timeout of 1 seems to work well for me: few docs are skipped and the runtime isn't significantly increased.
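
On the hardware-aware question: one option is to measure extraction time on a sample of documents on the target machine and derive the threshold from that distribution. The sketch below is an assumption-laden illustration, not datatrove functionality. `measure_durations`, `suggest_timeout`, the `extract` callable, and the default percentile, margin, and floor are all hypothetical choices.

```python
import math
import time


def measure_durations(extract, samples):
    """Time a (hypothetical) extract callable on each sample; returns seconds."""
    durations = []
    for sample in samples:
        start = time.perf_counter()
        extract(sample)
        durations.append(time.perf_counter() - start)
    return durations


def suggest_timeout(durations, percentile=99, margin=2.0, floor=0.1):
    """Suggest a timeout: the nearest-rank percentile of durations times a margin.

    percentile is an integer in 1..100; floor is a lower bound in seconds.
    """
    if not durations:
        raise ValueError("need at least one measured duration")
    ordered = sorted(durations)
    # Nearest-rank percentile: the smallest value with at least
    # percentile% of the samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return max(floor, ordered[rank - 1] * margin)


if __name__ == "__main__":
    # Stand-in workload; replace with a real extractor call on real documents.
    timings = measure_durations(lambda text: text.upper(), ["a" * 1000] * 50)
    print(suggest_timeout(timings))
```

Using a high percentile rather than the maximum keeps a few pathological documents from inflating the threshold, which matches the goal of skipping few documents without noticeably increasing runtime.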
