Use huggingface modified pickler to fix path-dependent caching #230

Merged: 1 commit into dev from trung/path-caching, Dec 9, 2024

@vutrung96 vutrung96 commented Dec 9, 2024

Similar to #215, path-dependent caching is causing a ton of cache misses in DCFT. When a job is submitted to the remote Ray cluster, the runtime path depends on the Ray package name, e.g. the path is /tmp/ray/ray_pkg_xzy

The Ray package name changes whenever there is a change in the local GitHub repository, so the same function gets a different path (and a different hash) on each submission.

The Hugging Face pickler overcomes this problem by modifying the filename before the function is pickled: https://github.com/huggingface/datasets/blob/2049c00921c59cdeb835137a1c49639cf175af07/src/datasets/utils/_dill.py#L252
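To make the failure mode concrete, here is a minimal sketch using only the standard library. It is not the curator or Hugging Face implementation: the two fingerprint functions, the `prompts.py` filename, and the `ray_pkg_aaa` / `ray_pkg_bbb` directories are all hypothetical, chosen only to illustrate why a hash that includes the full `co_filename` changes with the package directory, while one that keeps only the basename does not.

```python
import hashlib
import os

# Same function source, "defined" in two different (hypothetical) Ray package dirs.
SOURCE = "def f(x):\n    return x + 1\n"


def make_func(source: str, filename: str):
    """Compile `source` as if it lived at `filename` and return the function `f`."""
    namespace = {}
    exec(compile(source, filename, "exec"), namespace)
    return namespace["f"]


def naive_fingerprint(func) -> str:
    """Toy fingerprint that includes the full path of the defining file."""
    code = func.__code__
    payload = repr((code.co_filename, code.co_code, code.co_consts, code.co_names))
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


def path_independent_fingerprint(func) -> str:
    """Toy fingerprint that keeps only the basename of the defining file."""
    code = func.__code__
    # Assumption: filenames such as "<stdin>" carry no useful information, so drop them.
    filename = "" if code.co_filename.startswith("<") else os.path.basename(code.co_filename)
    payload = repr((filename, code.co_code, code.co_consts, code.co_names))
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


f1 = make_func(SOURCE, "/tmp/ray/ray_pkg_aaa/prompts.py")
f2 = make_func(SOURCE, "/tmp/ray/ray_pkg_bbb/prompts.py")

# The full-path fingerprints differ; the basename-only fingerprints agree.
print(naive_fingerprint(f1) == naive_fingerprint(f2))
print(path_independent_fingerprint(f1) == path_independent_fingerprint(f2))
```

The toy fingerprints only look at a flat function's bytecode, constants, and names; a real pickler also has to handle closures, globals, and nested code objects, which is why this PR reuses an existing pickler rather than hand-rolling one.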


vutrung96 commented Dec 9, 2024

Tested that this fixes the behavior.

BEFORE:

Two jobs with the same config but different packages write to different hashes in the curator cache:

(_Completions pid=3532927, ip=10.120.7.8) INFO:bespokelabs.curator.request_processor.base_request_processor:Wrote 110 requests to /home/ray/.cache/curator/evol_instruct_gpt-4o-mini_test_batch__evolve_instruction/8c2067c93ddf83b5/requests_0.jsonl.
(_Completions pid=3534054, ip=10.120.7.8) 2024-12-09 08:43:14,924 - bespokelabs.curator.request_processor.base_request_processor - INFO - Wrote 110 requests to /home/ray/.cache/curator/evol_instruct_gpt-4o-mini_test_batch__evolve_instruction/d66bad863d1b8bbd/requests_0.jsonl.

AFTER:

Two jobs with the same config but different packages now write to the same hash in the curator cache, so the second job reuses the cached requests:

(_Completions pid=3539787, ip=10.120.7.8) 2024-12-09 08:50:37,417 - bespokelabs.curator.request_processor.base_request_processor - INFO - Wrote 110 requests to /home/ray/.cache/curator/evol_instruct_gpt-4o-mini_test_batch__evolve_instruction/f44a6e60a24c7619/requests_0.jsonl.
(_Completions pid=3540802, ip=10.120.7.8) 2024-12-09 08:51:25,717 - bespokelabs.curator.request_processor.base_request_processor - INFO - Using cached requests. If you want to regenerate the dataset, disable or delete the cache.
(_Completions pid=3540802, ip=10.120.7.8) INFO:bespokelabs.curator.request_processor.base_request_processor:Using cached requests. If you want to regenerate the dataset, disable or delete the cache.
(_Completions pid=3540802, ip=10.120.7.8) INFO:httpx:HTTP Request: GET https://api.openai.com/v1/batches/batch_67571fdf8e908191a03131c7deb1e769 "HTTP/1.1 200 OK"
(_Completions pid=3540802, ip=10.120.7.8) 2024-12-09 08:51:25,910 - bespokelabs.curator.request_processor.openai_batch_request_processor - INFO - 1 out of 1 remaining batches are already submitted.


@RyanMarten RyanMarten left a comment


LGTM! thanks

@RyanMarten RyanMarten merged commit bacc2be into dev Dec 9, 2024
2 checks passed
@RyanMarten RyanMarten deleted the trung/path-caching branch December 9, 2024 17:37
Comment on lines +190 to +191
func1 = create_function("module1", Path(tmp_dir))
func2 = create_function("module1", Path(tmp_dir))

Looks like they both map to the same file in the same directory?
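One way to address this concern would be to create the same module in two different directories, so that the test actually exercises path independence. The sketch below is an assumption, not the PR's test code: `load_function` is a hypothetical stand-in for `create_function` (whose body is not shown in this thread), and the hashing function under test is deliberately left out.

```python
import importlib.util
import os
import tempfile
from pathlib import Path

SOURCE = "def func(row):\n    return row\n"


def load_function(module_name: str, directory: Path):
    """Hypothetical helper: write SOURCE to <directory>/<module_name>.py and import `func`."""
    path = directory / f"{module_name}.py"
    path.write_text(SOURCE)
    spec = importlib.util.spec_from_file_location(module_name, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.func


# Two separate temporary directories, same module name and same source.
with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
    func1 = load_function("module1", Path(dir_a))
    func2 = load_function("module1", Path(dir_b))

# The functions now come from different paths but share a basename.
paths_differ = func1.__code__.co_filename != func2.__code__.co_filename
basenames_match = os.path.basename(func1.__code__.co_filename) == os.path.basename(
    func2.__code__.co_filename
)
print(paths_differ, basenames_match)
```

With this setup, the test would then assert that the hash of `func1` equals the hash of `func2`, which only holds if the hashing ignores the directory part of the path.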
