Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add data collector for dataset generation #1193

Merged
merged 25 commits into from
Dec 13, 2024
Merged

Conversation

liuxukun2000
Copy link
Collaborator

@liuxukun2000 liuxukun2000 commented Nov 19, 2024

Description

add data collector for dataset generation
Issue #1210

This is only a prototype!

Motivation and Context

close #1184

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the folder of example)

Implemented Tasks

  • Subtask 1
  • Subtask 2
  • Subtask 3

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide. (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

@liuxukun2000 liuxukun2000 self-assigned this Nov 19, 2024
@Wendong-Fan Wendong-Fan added the Data Related to camel data processing label Nov 19, 2024
@Wendong-Fan Wendong-Fan added this to the Sprint 17 milestone Nov 19, 2024
Copy link
Member

@Wendong-Fan Wendong-Fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on comment below add one more commit: 09b9d89

free feel to leave your comments

camel/data_collector/base.py Outdated Show resolved Hide resolved
camel/data_collector/base.py Outdated Show resolved Hide resolved
camel/data_collector/base.py Outdated Show resolved Hide resolved
camel/data_collector/base.py Outdated Show resolved Hide resolved
camel/data_collector/base.py Outdated Show resolved Hide resolved
camel/data_collector/base.py Outdated Show resolved Hide resolved
camel/data_collector/base.py Outdated Show resolved Hide resolved
@Wendong-Fan Wendong-Fan linked an issue Nov 19, 2024 that may be closed by this pull request
2 tasks
@CaelumF
Copy link
Collaborator

CaelumF commented Nov 26, 2024

I like how this provides an obvious way to handle multiple agents, though the injection method IMO adds unnecessary coupling to the update memory function, if it's just the messages that are gathered then why not take a list of memories instead ? we could add timestamps to memories to ensure their ordering (that would be helpful for other purposes). If the memories are created through deserialization then the injection approach I think would mean they can't be converted, and it makes things awkward in the logic flow and with copying memories

@liuxukun2000 liuxukun2000 marked this pull request as ready for review November 29, 2024 02:13
Copy link
Member

@Wendong-Fan Wendong-Fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @liuxukun2000 , Left some comments below, the docstring could be further enhanced to improve code understanding and maintainability. Additionally, we could leverage memory from the agent by using methods like agent.memory.get_context().

camel/data_collector/base.py Outdated Show resolved Hide resolved
camel/data_collector/base.py Outdated Show resolved Hide resolved
camel/data_collector/alpaca_collector.py Outdated Show resolved Hide resolved
camel/data_collector/base.py Outdated Show resolved Hide resolved
camel/data_collector/alpaca_collector.py Outdated Show resolved Hide resolved
camel/data_collector/alpaca_collector.py Outdated Show resolved Hide resolved
Comment on lines +50 to +65
class ConversationItem(BaseModel):
from_: Literal["human", "gpt", "function_call", "observation"]
value: str

class Config:
fields: ClassVar[Dict[str, str]] = {"from_": "from"}
extra = "forbid"


class ShareGPTData(BaseModel):
system: str
tools: str
conversations: List[ConversationItem]

class Config:
extra = "forbid"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we use the BaseModel defined in camel/messages/conversion/models.py?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are the system string and tools here? They might vary between conversations. It can be a nice convenience to have a function to get these in the existing ShareGPTConversation (might require some modification to support different role names) if you want to add that (though the system message should still be in the list of messages).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Caelum,

Thank you for your thoughtful feedback! In this design, I referred to the format used in LLaMA-Factory (https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README.md), where tools and the system message are placed in the same position. I was thinking that keeping it consistent with their approach might be a good idea.

What do you think? I’d be happy to hear your thoughts! 😊

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @liuxukun2000 , could we unify the model here with ShareGPTMessage under camel/messages/conversion/conversation_models.py? it's ok to add system and tools under this but we need to make it optional

elif role == OpenAIBackendRole.FUNCTION:
conversations.append(
{
"from": "observation",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to make the role here configurable. Also I think checking if the message has a result or calls is a more robust way of differentiating between function call and function result (and tool call and tool result in the future), until we have some better type safety in this area

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Caelum,

I've already switched to using memory to retrieve history and roles. Regarding "Would be good to make the role here configurable," I'm not entirely sure I fully understand what you mean. Could you clarify? Are you suggesting making the roles customizable in some way?

camel/data_collector/base.py Outdated Show resolved Hide resolved
camel/data_collector/alpaca_collector.py Outdated Show resolved Hide resolved
Comment on lines +50 to +65
class ConversationItem(BaseModel):
from_: Literal["human", "gpt", "function_call", "observation"]
value: str

class Config:
fields: ClassVar[Dict[str, str]] = {"from_": "from"}
extra = "forbid"


class ShareGPTData(BaseModel):
system: str
tools: str
conversations: List[ConversationItem]

class Config:
extra = "forbid"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are the system string and tools here? They might vary between conversations. It can be a nice convenience to have a function to get these in the existing ShareGPTConversation (might require some modification to support different role names) if you want to add that (though the system message should still be in the list of messages).

@Wendong-Fan Wendong-Fan requested a review from CaelumF December 7, 2024 14:16
Copy link
Member

@Wendong-Fan Wendong-Fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @liuxukun2000 ! Added one commit here: 31c25fa, I think after #1193 (comment) is resolved and unit test added we are ready to merge this PR

@Wendong-Fan Wendong-Fan modified the milestones: Sprint 17, Sprint 18 Dec 9, 2024
@Wendong-Fan Wendong-Fan linked an issue Dec 9, 2024 that may be closed by this pull request
2 tasks
@Wendong-Fan Wendong-Fan merged commit 33c2787 into master Dec 13, 2024
4 of 6 checks passed
@Wendong-Fan Wendong-Fan deleted the feat/data_collector branch December 13, 2024 09:36
@Wendong-Fan
Copy link
Member

#1316
further enhancement to do

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Data Related to camel data processing
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

[Feature Request] Message-to-Data Format Converter [Feature Request] Convert BaseMessage to alpaca format
3 participants