Fix #229: Add cloudpickle support for type-annotated parse_func #305

Closed
7 changes: 5 additions & 2 deletions README.md
@@ -127,7 +127,10 @@ poet = curator.LLM(
Here:
* `prompt_func` takes a row of the dataset as input and returns the prompt for the LLM.
* `response_format` is the structured output class we defined above.
* `parse_func` takes the input (`row`) and the structured output (`poems`) and converts it to a list of dictionaries. This is so that we can easily convert the output to a HuggingFace Dataset object. For best practices with type annotations:
* Define `parse_func` as a module-level function rather than a lambda to ensure proper serialization
* Use the `_DictOrBaseModel` type alias for input/output types: `def parse_func(row: _DictOrBaseModel, response: _DictOrBaseModel) -> _DictOrBaseModel`
* Type annotations are now fully supported thanks to cloudpickle serialization
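
The module-level requirement can be illustrated with the standard `pickle` module (a minimal sketch, not curator code): plain pickle stores a module-level function by qualified name and round-trips it cleanly, while a lambda has no importable name and is rejected. cloudpickle serializes the function body instead, which is why curator adopted it for `parse_func`.

```python
import pickle

# Sketch (not curator code): a module-level function pickles by reference
# to its qualified name, so it survives a dump/load round trip.
def parse_func(row, response):
    return [{"topic": row["topic"], "poem": p} for p in response]

restored = pickle.loads(pickle.dumps(parse_func))
rows = restored({"topic": "sea"}, ["Wave on wave."])

# A lambda has no importable qualified name, so plain pickle raises;
# cloudpickle would serialize the function body instead.
lambda_error = None
try:
    pickle.dumps(lambda row, response: [])
except Exception as exc:  # pickle.PicklingError on CPython
    lambda_error = exc
```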

Now we can apply the `LLM` object to the dataset, which reads very Pythonically.
```python
@@ -201,4 +204,4 @@ npm -v # should print `10.9.0`
```

## Contributing
Contributions are welcome!
25 changes: 23 additions & 2 deletions examples/poem-generation/poem.py
@@ -35,15 +35,36 @@ class Poems(BaseModel):
poems_list: List[str] = Field(description="A list of poems.")


from typing import Any, Dict, List, Union

# Type alias for input/output types
_DictOrBaseModel = Union[Dict[str, Any], BaseModel]


def parse_poems(row: _DictOrBaseModel, poems: _DictOrBaseModel) -> _DictOrBaseModel:
"""Parse the poems from the LLM response.

Args:
row: The input row containing the topic
poems: The structured output from the LLM (Poems model)

Returns:
A list of dictionaries containing the topic and poem
"""
if isinstance(poems, Poems):
return [{"topic": row["topic"], "poem": p} for p in poems.poems_list]
return [] # Handle edge case where poems is not a Poems instance
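
For reference, the same parsing shape can be exercised in isolation with a stand-in class (a sketch only — `MiniPoems` and `parse_mini` are hypothetical names, not the Pydantic `Poems` model or curator API):

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class MiniPoems:  # hypothetical stand-in for the Pydantic Poems model
    poems_list: List[str]

# Same alias shape as _DictOrBaseModel, narrowed to the stand-in type.
_RowOrModel = Union[Dict[str, Any], MiniPoems]

def parse_mini(row: _RowOrModel, poems: _RowOrModel) -> List[Dict[str, str]]:
    # Flatten each poem into its own output row, as parse_poems does.
    if isinstance(poems, MiniPoems):
        return [{"topic": row["topic"], "poem": p} for p in poems.poems_list]
    return []  # non-structured responses fall through to an empty list

parsed = parse_mini({"topic": "flowers"}, MiniPoems(["Roses are red."]))
unparsed = parse_mini({"topic": "flowers"}, {"raw": "text"})
```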


# We define an `LLM` object that generates poems; it is applied to the topics dataset.
poet = curator.LLM(
# The prompt_func takes a row of the dataset as input.
# The row is a dictionary with a single key 'topic' in this case.
prompt_func=lambda row: f"Write two poems about {row['topic']}.",
model_name="gpt-4o-mini",
response_format=Poems,
# Use the module-level parse function which supports type annotations
parse_func=parse_poems,
)

# We apply the prompter to the topics dataset.