feat: add data collector for dataset generation #1193

liuxukun2000 · 2024-11-19T04:21:44Z

Description

add data collector for dataset generation
Issue #1210

This is only a prototype!

Motivation and Context

close #1184

I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds core functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (update in the documentation)
Example (update in the folder of example)

Implemented Tasks

Subtask 1
Subtask 2
Subtask 3

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

I have read the CONTRIBUTION guide. (required)
My change requires a change to the documentation.
I have updated the tests accordingly. (required for a bug fix or a new feature)
I have updated the documentation accordingly.

Wendong-Fan

based on comment below add one more commit: 09b9d89

free feel to leave your comments

camel/data_collector/base.py

camel/data_collector/alpaca_collector.py

CaelumF · 2024-11-26T13:57:07Z

I like how this provides an obvious way to handle multiple agents, though the injection method IMO adds unnecessary coupling to the update memory function, if it's just the messages that are gathered then why not take a list of memories instead ? we could add timestamps to memories to ensure their ordering (that would be helpful for other purposes). If the memories are created through deserialization then the injection approach I think would mean they can't be converted, and it makes things awkward in the logic flow and with copying memories

Wendong-Fan

Thanks @liuxukun2000 , Left some comments below, the docstring could be further enhanced to improve code understanding and maintainability. Additionally, we could leverage memory from the agent by using methods like agent.memory.get_context().

camel/data_collector/base.py

camel/data_collector/alpaca_collector.py

camel/data_collector/base.py

camel/data_collector/alpaca_collector.py

Wendong-Fan · 2024-12-03T17:30:13Z

camel/data_collector/sharegpt_collector.py

+class ConversationItem(BaseModel):
+    from_: Literal["human", "gpt", "function_call", "observation"]
+    value: str
+
+    class Config:
+        fields: ClassVar[Dict[str, str]] = {"from_": "from"}
+        extra = "forbid"
+
+
+class ShareGPTData(BaseModel):
+    system: str
+    tools: str
+    conversations: List[ConversationItem]
+
+    class Config:
+        extra = "forbid"


could we use the BaseModel defined in camel/messages/conversion/models.py?

Why are the system string and tools here? They might vary between conversations. It can be a nice convenience to have a function to get these in the existing ShareGPTConversation (might require some modification to support different role names) if you want to add that (though the system message should still be in the list of messages).

Hi Caelum,

Thank you for your thoughtful feedback! In this design, I referred to the format used in LLaMA-Factory (https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README.md), where tools and the system message are placed in the same position. I was thinking that keeping it consistent with their approach might be a good idea.

What do you think? I’d be happy to hear your thoughts! 😊

Hey @liuxukun2000 , could we unify the model here with ShareGPTMessage under camel/messages/conversion/conversation_models.py? it's ok to add system and tools under this but we need to make it optional

CaelumF · 2024-12-05T12:10:12Z

camel/data_collector/sharegpt_collector.py

+                elif role == OpenAIBackendRole.FUNCTION:
+                    conversations.append(
+                        {
+                            "from": "observation",


Would be good to make the role here configurable. Also I think checking if the message has a result or calls is a more robust way of differentiating between function call and function result (and tool call and tool result in the future), until we have some better type safety in this area

Hi Caelum,

I've already switched to using memory to retrieve history and roles. Regarding "Would be good to make the role here configurable," I'm not entirely sure I fully understand what you mean. Could you clarify? Are you suggesting making the roles customizable in some way?

camel/data_collector/base.py

camel/data_collector/alpaca_collector.py

CaelumF · 2024-12-05T12:33:56Z

camel/data_collector/sharegpt_collector.py

+class ConversationItem(BaseModel):
+    from_: Literal["human", "gpt", "function_call", "observation"]
+    value: str
+
+    class Config:
+        fields: ClassVar[Dict[str, str]] = {"from_": "from"}
+        extra = "forbid"
+
+
+class ShareGPTData(BaseModel):
+    system: str
+    tools: str
+    conversations: List[ConversationItem]
+
+    class Config:
+        extra = "forbid"


Why are the system string and tools here? They might vary between conversations. It can be a nice convenience to have a function to get these in the existing ShareGPTConversation (might require some modification to support different role names) if you want to add that (though the system message should still be in the list of messages).

Wendong-Fan

Thanks @liuxukun2000 ! Added one commit here: 31c25fa, I think after #1193 (comment) is resolved and unit test added we are ready to merge this PR

Wendong-Fan · 2024-12-13T09:38:46Z

#1316
further enhancement to do

liuxukun2000 added 3 commits November 18, 2024 13:56

init

15f0990

Merge branch 'master' into feat/data_collector

24f6714

add base data collectors

5068396

liuxukun2000 requested a review from Wendong-Fan November 19, 2024 04:21

liuxukun2000 self-assigned this Nov 19, 2024

liuxukun2000 added 2 commits November 18, 2024 22:32

reformat code

09bed51

pass precheck

252d70e

Wendong-Fan added the Data Related to camel data processing label Nov 19, 2024

Wendong-Fan added this to the Sprint 17 milestone Nov 19, 2024

Wendong-Fan reviewed Nov 19, 2024

View reviewed changes

update 1 based on review comment

09b9d89

Wendong-Fan reviewed Nov 19, 2024

View reviewed changes

camel/data_collector/alpaca_collector.py Outdated Show resolved Hide resolved

Wendong-Fan linked an issue Nov 19, 2024 that may be closed by this pull request

[Feature Request] Convert BaseMessage to alpaca format #1184

Closed

2 tasks

liuxukun2000 and others added 7 commits November 26, 2024 23:00

Merge branch 'master' into feat/data_collector

9d8e84b

Refine the code according to the comments

44aaac5

Merge branch 'master' into feat/data_collector

2bb9e90

add llm_converter

ed208b8

update license

078537e

pass precommit

75687a5

pass precommit

099a160

liuxukun2000 marked this pull request as ready for review November 29, 2024 02:13

Merge branch 'master' into feat/data_collector

f8acfd1

liuxukun2000 requested a review from Wendong-Fan November 29, 2024 07:04

Wendong-Fan requested review from CaelumF and AveryYay November 29, 2024 08:30

Wendong-Fan reviewed Dec 3, 2024

View reviewed changes

CaelumF requested changes Dec 5, 2024

View reviewed changes

liuxukun2000 and others added 2 commits December 6, 2024 14:58

Merge branch 'master' into feat/data_collector

255a1e4

get messages from memory

9ce5a8a

pass precommit

bae8295

Wendong-Fan requested a review from CaelumF December 7, 2024 14:16

Wendong-Fan and others added 2 commits December 7, 2024 22:37

small format fix

31c25fa

Merge branch 'master' into feat/data_collector

4d09e94

Wendong-Fan approved these changes Dec 7, 2024

View reviewed changes

Wendong-Fan modified the milestones: Sprint 17, Sprint 18 Dec 9, 2024

Wendong-Fan linked an issue Dec 9, 2024 that may be closed by this pull request

[Feature Request] Message-to-Data Format Converter #1210

Closed

2 tasks

liuxukun2000 and others added 5 commits December 9, 2024 15:22

Merge branch 'master' into feat/data_collector

6ba6708

add pytest

e6842fc

reformat code

5803941

Merge branch 'master' into feat/data_collector

1e0d8f8

use class from messages/conversation

443fbfd

liuxukun2000 requested a review from Wendong-Fan December 13, 2024 06:06

Merge branch 'master' into feat/data_collector

08e15a8

Wendong-Fan approved these changes Dec 13, 2024

View reviewed changes

Wendong-Fan merged commit 33c2787 into master Dec 13, 2024
4 of 6 checks passed

Wendong-Fan deleted the feat/data_collector branch December 13, 2024 09:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add data collector for dataset generation #1193

feat: add data collector for dataset generation #1193

liuxukun2000 commented Nov 19, 2024 •

edited by Wendong-Fan

Loading

Wendong-Fan left a comment •

edited

Loading

CaelumF commented Nov 26, 2024

Wendong-Fan left a comment

Wendong-Fan Dec 3, 2024

CaelumF Dec 5, 2024

liuxukun2000 Dec 6, 2024

Wendong-Fan Dec 11, 2024

CaelumF Dec 5, 2024

liuxukun2000 Dec 6, 2024

CaelumF Dec 5, 2024

Wendong-Fan left a comment •

edited

Loading

Wendong-Fan commented Dec 13, 2024

feat: add data collector for dataset generation #1193

feat: add data collector for dataset generation #1193

Conversation

liuxukun2000 commented Nov 19, 2024 • edited by Wendong-Fan Loading

Description

This is only a prototype!

Motivation and Context

Types of changes

Implemented Tasks

Checklist

Wendong-Fan left a comment • edited Loading

Choose a reason for hiding this comment

CaelumF commented Nov 26, 2024

Wendong-Fan left a comment

Choose a reason for hiding this comment

Wendong-Fan Dec 3, 2024

Choose a reason for hiding this comment

CaelumF Dec 5, 2024

Choose a reason for hiding this comment

liuxukun2000 Dec 6, 2024

Choose a reason for hiding this comment

Wendong-Fan Dec 11, 2024

Choose a reason for hiding this comment

CaelumF Dec 5, 2024

Choose a reason for hiding this comment

liuxukun2000 Dec 6, 2024

Choose a reason for hiding this comment

CaelumF Dec 5, 2024

Choose a reason for hiding this comment

Wendong-Fan left a comment • edited Loading

Choose a reason for hiding this comment

Wendong-Fan commented Dec 13, 2024

liuxukun2000 commented Nov 19, 2024 •

edited by Wendong-Fan

Loading

Wendong-Fan left a comment •

edited

Loading

Wendong-Fan left a comment •

edited

Loading