feat: add data collector for dataset generation #1193
Conversation
Based on the comment below, I added one more commit: 09b9d89.
Feel free to leave your comments.
I like how this provides an obvious way to handle multiple agents, though IMO the injection method adds unnecessary coupling to the update-memory function. If it's just the messages that are gathered, why not take a list of memories instead? We could add timestamps to memories to ensure their ordering (that would be helpful for other purposes too). If the memories are created through deserialization, then I think the injection approach would mean they can't be converted, and it makes things awkward in the logic flow and when copying memories.
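The alternative suggested above can be sketched as follows. This is a minimal illustration, not the collector's actual API: the `MemoryRecord` fields and `merge_memories` helper are assumptions, showing how timestamped memories from multiple agents could be passed as plain lists and interleaved by time.

```python
import time
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch: memories carry an explicit timestamp so that
# records gathered from several agents can be ordered deterministically.
@dataclass
class MemoryRecord:
    agent_id: str
    content: str
    timestamp: float = field(default_factory=time.time)

def merge_memories(*per_agent: List[MemoryRecord]) -> List[MemoryRecord]:
    """Interleave memory lists from several agents in chronological order."""
    merged = [m for records in per_agent for m in records]
    return sorted(merged, key=lambda m: m.timestamp)
```

Passing plain lists this way avoids hooking into the update-memory call, and deserialized memories merge the same as live ones.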
Thanks @liuxukun2000, left some comments below. The docstrings could be further enhanced to improve code understanding and maintainability. Additionally, we could leverage memory from the agent by using methods like `agent.memory.get_context()`.
```python
class ConversationItem(BaseModel):
    from_: Literal["human", "gpt", "function_call", "observation"]
    value: str

    class Config:
        fields: ClassVar[Dict[str, str]] = {"from_": "from"}
        extra = "forbid"


class ShareGPTData(BaseModel):
    system: str
    tools: str
    conversations: List[ConversationItem]

    class Config:
        extra = "forbid"
```
Could we use the BaseModel defined in `camel/messages/conversion/models.py`?
Why are the system string and tools here? They might vary between conversations. It could be a nice convenience to have a function that retrieves these from the existing ShareGPTConversation (which might require some modification to support different role names) if you want to add that, though the system message should still be in the list of messages.
Hi Caelum,
Thank you for your thoughtful feedback! In this design, I referred to the format used in LLaMA-Factory (https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README.md), where tools and the system message are placed in the same position. I was thinking that keeping it consistent with their approach might be a good idea.
What do you think? I’d be happy to hear your thoughts! 😊
Hey @liuxukun2000, could we unify the model here with `ShareGPTMessage` under `camel/messages/conversion/conversation_models.py`? It's OK to add `system` and `tools` under it, but we need to make them optional.
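A sketch of the optional-fields shape being asked for, written with plain dataclasses so it stands alone; the class and field names here are illustrative assumptions, not the actual `ShareGPTMessage` definition.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical unified sample: system/tools are optional so that plain
# conversations (without them) still validate.
@dataclass
class ShareGPTEntry:
    from_: str   # serialized under the key "from"
    value: str

@dataclass
class ShareGPTSample:
    conversations: List[ShareGPTEntry]
    system: Optional[str] = None
    tools: Optional[str] = None
```

Defaulting both fields to `None` keeps backward compatibility with data that never carried them.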
```python
elif role == OpenAIBackendRole.FUNCTION:
    conversations.append(
        {
            "from": "observation",
```
Would be good to make the role here configurable. Also, I think checking whether the message has a result or calls is a more robust way of differentiating between a function call and a function result (and a tool call and tool result in the future), until we have better type safety in this area.
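The payload-based check suggested above might look like the sketch below. The field names (`result`, `func_name`, `tool_calls`) are assumptions for illustration; the real message objects may use different attributes.

```python
def classify_tool_message(msg: dict) -> str:
    """Classify a message by its payload rather than its backend role.

    A message carrying a result is treated as a function/tool result
    ("observation"); one carrying a call payload as a "function_call";
    anything else falls back to a plain assistant turn.
    """
    if msg.get("result") is not None:
        return "observation"
    if msg.get("func_name") or msg.get("tool_calls"):
        return "function_call"
    return "gpt"
```

This keeps the mapping robust even if the same backend role is reused for both calls and results.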
Hi Caelum,
I've already switched to using memory to retrieve history and roles. Regarding "Would be good to make the role here configurable," I'm not entirely sure I fully understand what you mean. Could you clarify? Are you suggesting making the roles customizable in some way?
Thanks @liuxukun2000! Added one commit here: 31c25fa. I think after #1193 (comment) is resolved and a unit test is added, we are ready to merge this PR.
#1316
Description
Add a data collector for dataset generation.
Issue #1210
This is only a prototype!
Motivation and Context
Close #1184
Types of changes
What types of changes does your code introduce? Put an `x` in all the boxes that apply:
Implemented Tasks
Checklist
Go over all the following points, and put an `x` in all the boxes that apply. If you are unsure about any of these, don't hesitate to ask. We are here to help!