-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #229: Add cloudpickle support for type-annotated parse_func #305
Conversation
- Replace datasets.utils._dill with cloudpickle for better type annotation support - Update function serialization to handle type annotations properly - Add cloudpickle dependency to pyproject.toml - Update documentation and examples with type annotation best practices - Maintain backward compatibility with existing code patterns Co-Authored-By: ryan@bespokelabs.ai <ryan@bespokelabs.ai>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
⚙️ Control Options:
|
Co-Authored-By: ryan@bespokelabs.ai <ryan@bespokelabs.ai>
Co-Authored-By: ryan@bespokelabs.ai <ryan@bespokelabs.ai>
Looks good to me |
|
src/bespokelabs/curator/llm/llm.py
Outdated
if func is None: | ||
return xxh64("").hexdigest() | ||
|
||
file = BytesIO() | ||
Pickler(file, recurse=True).dump(func) | ||
file.write(cloudpickle.dumps(func)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will break other logic supported by huggingface pickler, see #230. is there a better solution than just replacing the pickler? also, i still don't understand the root cause here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I also don't like replacing the pickler. Btw, I was not able to replicate the error on my machine exactly, but I got some warnings using the _dill pickler, which vanished when cloudpickle was used. I'll try to replicate the error on main now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this snippet,
from typing import List
from datasets import Dataset
from pydantic import BaseModel, Field
from bespokelabs import curator
# Create a dataset object for the topics you want to create the poems.
topics = Dataset.from_dict(
{"topic": ["Urban loneliness in a bustling city", "Beauty of Bespoke Labs's Curator library"]}
)
# Define a class to encapsulate a list of poems.
class Poem(BaseModel):
poem: str = Field(description="A poem.")
class Poems(BaseModel):
poems_list: List[Poem] = Field(description="A list of poems.")
# removing the type annotation for `poems` works fine.
def parse_func(row, poems: Poems):
return [{"topic": row["topic"], "poem": p.poem} for p in poems.poems_list]
# We define a Prompter that generates poems which gets applied to the topics dataset.
poet = curator.LLM(
prompt_func=lambda row: f"Write two poems about {row['topic']}.",
model_name="gpt-4o-mini",
response_format=Poems,
parse_func=parse_func,
)
poem = poet(topics)
print(poem.to_pandas())
I get the following error:
/Users/shreyaspimpalgaonkar/Library/Caches/pypoetry/virtualenvs/bespokelabs-curator-Ky8xEGex-py3.10/lib/python3.10/site-packages/dill/_dill.py:414: PicklingWarning: Cannot locate reference to <class '__main__.Poem'>.
StockPickler.save(self, obj, save_persistent_id)
/Users/shreyaspimpalgaonkar/Library/Caches/pypoetry/virtualenvs/bespokelabs-curator-Ky8xEGex-py3.10/lib/python3.10/site-packages/dill/_dill.py:414: PicklingWarning: Cannot pickle <class '__main__.Poem'>: __main__.Poem has recursive self-references that trigger a RecursionError.
StockPickler.save(self, obj, save_persistent_id)
/Users/shreyaspimpalgaonkar/Library/Caches/pypoetry/virtualenvs/bespokelabs-curator-Ky8xEGex-py3.10/lib/python3.10/site-packages/dill/_dill.py:414: PicklingWarning: Cannot locate reference to <class '__main__.Poems'>.
StockPickler.save(self, obj, save_persistent_id)
/Users/shreyaspimpalgaonkar/Library/Caches/pypoetry/virtualenvs/bespokelabs-curator-Ky8xEGex-py3.10/lib/python3.10/site-packages/dill/_dill.py:414: PicklingWarning: Cannot pickle <class '__main__.Poems'>: __main__.Poems has recursive self-references that trigger a RecursionError.
StockPickler.save(self, obj, save_persistent_id)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This warning is due to the class definitions being defined in the same module, and doesn't show up if they're imported from a different module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a low prio issue since it only throws a warning in a very restricted scenario. I'll close this PR and deprioritize this bug for the launch.
…zation Co-Authored-By: ryan@bespokelabs.ai <ryan@bespokelabs.ai>
- Add dedicated test suite for CustomPickler functionality - Test path normalization and type annotation support - Fix return type annotations in test_prompt.py prompt functions - Add proper type hints for test functions Part of #229: Implement CustomPickler for function serialization Co-Authored-By: ryan@bespokelabs.ai <ryan@bespokelabs.ai>
Co-Authored-By: ryan@bespokelabs.ai <ryan@bespokelabs.ai>
Co-Authored-By: ryan@bespokelabs.ai <ryan@bespokelabs.ai>
Fix #229: Add cloudpickle support for type-annotated parse_func
Changes:
Testing:
Link to Devin run: https://app.devin.ai/sessions/a1c6d0d5a504429aa767cd230d4a2a42