[Feature Request] Convert `BaseMessage` to alpaca format #1184

Wendong-Fan · 2024-11-15T17:39:16Z

Required prerequisites

I have searched the Issue Tracker and Discussions and confirmed that this hasn't already been reported. (+1 or comment there if it has.)
Consider asking first in a Discussion.

Motivation

alpaca format:
{"instruction": "...", "input": "...", "output": "..."}

Solution

No response

Alternatives

No response

Additional context

No response

CaelumF · 2024-11-19T17:57:42Z

It may not be appropriate to convert BaseMessages directly to Alpaca format, since BaseMessages are part of a conversation structure, with roles and other information that is expected to be irrelevant to the Alpaca format. (How would we convert an alpaca message to a BaseMessage?) Typing it this way probably adds unhelpful coupling.

I think converting to/from strings is more appropriate. See master...alpaca_conversion_temp

(cc @lightaime)

Wendong-Fan · 2024-11-19T19:59:06Z

Thanks @CaelumF, the scope of this issue is just to convert BaseMessage to alpaca format. I didn't consider converting alpaca content back to BaseMessage, so additional information like the role name could simply be ignored. Do we have a requirement to convert alpaca back to BaseMessage?

Converting from a string using regex has 2 limitations:

1. We need to produce a string in the defined format. CAMEL doesn't support this natively, so we would still need some further implementation.
2. Using regex instead of extracting from a structured object is riskier and less reliable.

CaelumF · 2024-11-20T14:52:42Z

Yeah, when it comes to generating alpaca items, it makes way more sense to do things in JSON, especially when structured output and JSON proficiency are available in the inference model. (I assume JSON is the textual representation you had in mind.) The linked class can be converted to/from JSON, as it's a pydantic class.
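To illustrate the JSON round trip described here, a minimal sketch follows. It is not the linked class: `AlpacaItem`, `to_json`, and `from_json` are hypothetical names, and a standard-library dataclass stands in for the pydantic class so the example has no dependencies.

```python
import json
from dataclasses import asdict, dataclass

FIELDS = ("instruction", "input", "output")


@dataclass(frozen=True)
class AlpacaItem:
    """Hypothetical stand-in for the pydantic class mentioned in the thread."""

    instruction: str
    input: str
    output: str

    def to_json(self) -> str:
        # Serialize to the {"instruction", "input", "output"} layout.
        return json.dumps(asdict(self), ensure_ascii=False)

    @classmethod
    def from_json(cls, text: str) -> "AlpacaItem":
        data = json.loads(text)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        # Validate by hand; pydantic would do this step automatically.
        bad = [k for k in FIELDS if not isinstance(data.get(k), str)]
        if bad:
            raise ValueError(f"missing or non-string alpaca fields: {bad}")
        return cls(**{k: data[k] for k in FIELDS})


item = AlpacaItem(instruction="Add the numbers", input="1 2", output="3")
restored = AlpacaItem.from_json(item.to_json())
```

Going through a structured object in both directions means a malformed item fails loudly at `from_json` rather than producing a partially filled record.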

But I also assume the plan/expectation is to have the alpaca entries just inside of the text portion of the messages as textual representations, rather than adding any specific fields to the BaseMessage? So the source information is always in one place in the form of text, and no type or contextual information is constraining the content of those messages (like we won't have an AlpacaBaseMessage or something)

The other textual representation, the one that starts with ### Instruction ..., is used for inference and training on base models. It's not found in datasets because it's awkward for other purposes (though data may sometimes be saved in that representation). I added it to the pydantic class because pydantic already makes JSON easy, and it is convenient for training and inference to have the ### representation as well.
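For readers unfamiliar with the ### representation, here is a rough sketch of how such a string could be rendered. The function name `to_alpaca_text` is hypothetical, and the exact layout (section headers `### Instruction:`, `### Input:`, `### Response:` separated by blank lines, with the input section omitted when empty) is an assumption modeled on the commonly used Alpaca prompt template, not taken from the linked class.

```python
def to_alpaca_text(instruction: str, input_text: str = "", output: str = "") -> str:
    """Render an alpaca item in the '### Instruction' style (assumed layout)."""
    sections = [f"### Instruction:\n{instruction}"]
    if input_text:
        # The input section is optional and skipped when empty.
        sections.append(f"### Input:\n{input_text}")
    sections.append(f"### Response:\n{output}")
    return "\n\n".join(sections)


prompt = to_alpaca_text("Add the numbers", "1 2", "3")
```

For inference, the same function could be called with an empty `output` so that the string ends right after the `### Response:` header.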

Since in this conversion all of the information comes from one field of BaseMessage (content), which is always a string, and it will sometimes be useful to convert strings from other sources, it feels more versatile and less confusing to define the conversion purely in terms of strings.

I can imagine some scenarios with multiple stages of data generation where it would be useful to go back from a textual representation to a validated object form too. In general, I like what is communicated by the directions in which things can be converted. It would also help if we want to parse Alpaca items generated by a base model trained on that format, which it seems Alpaca was. (I'm not sure exactly why JSON wasn't just always used; maybe it's because of newline handling or something.)
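Going back from the ### text form to a structured item could look roughly like the sketch below. `parse_alpaca_text` is a hypothetical helper, and the header names and layout are the same assumption as the common Alpaca prompt template; it is not the parser from the linked branch.

```python
import re

# Section headers at the start of a line, e.g. "### Instruction:".
_SECTION = re.compile(
    r"^### (Instruction|Input|Response):[ \t]*(?:\n|\Z)", re.MULTILINE
)


def parse_alpaca_text(text: str) -> dict:
    """Parse '### Instruction' style text into an alpaca dict (assumed layout)."""
    # re.split with one capturing group yields
    # [preamble, name1, body1, name2, body2, ...].
    parts = _SECTION.split(text)
    fields = {
        name.lower(): body.strip()
        for name, body in zip(parts[1::2], parts[2::2])
    }
    if "instruction" not in fields or "response" not in fields:
        raise ValueError("text lacks an Instruction or Response section")
    return {
        "instruction": fields["instruction"],
        "input": fields.get("input", ""),  # the input section is optional
        "output": fields["response"],
    }


parsed = parse_alpaca_text(
    "### Instruction:\nAdd the numbers\n\n### Input:\n1 2\n\n### Response:\n3"
)
```

This also shows the fragility raised earlier in the thread: a body that itself contains a line starting with `### Input:` would be split in the wrong place, which JSON parsing avoids.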

If we want to make the conversion easily discoverable, we can add a to_alpaca function inside BaseMessage: a single line that calls the publicly available string-based conversion function with the message's content, making it clear that only the content comes from the BaseMessage.
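The delegation idea can be sketched as follows. Everything here is illustrative: `Message` is a minimal stand-in rather than CAMEL's actual `BaseMessage`, and `alpaca_from_string` is a hypothetical name for the public string-based conversion function, assumed here to accept the JSON representation.

```python
import json
from dataclasses import dataclass

ALPACA_KEYS = ("instruction", "input", "output")


def alpaca_from_string(text: str) -> dict:
    """Hypothetical public conversion: JSON string to a validated alpaca dict."""
    data = json.loads(text)
    if not isinstance(data, dict) or any(
        not isinstance(data.get(key), str) for key in ALPACA_KEYS
    ):
        raise ValueError("expected a JSON object with string instruction/input/output")
    return {key: data[key] for key in ALPACA_KEYS}


@dataclass
class Message:
    """Minimal stand-in for BaseMessage; only `content` matters here."""

    role_name: str
    content: str

    def to_alpaca(self) -> dict:
        # Single-line delegation: only the content is used, the role is ignored.
        return alpaca_from_string(self.content)


msg = Message(
    role_name="assistant",
    content='{"instruction": "Add the numbers", "input": "1 2", "output": "3"}',
)
result = msg.to_alpaca()
```

Keeping the method a one-liner leaves the conversion usable on strings from any other source, while the message class gains only a discoverable entry point.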

Wendong-Fan added the New Feature and P0 (Task with high level priority) labels Nov 15, 2024

Wendong-Fan added this to Project Camel Nov 15, 2024

Wendong-Fan changed the title ~~[Feature Request] Covert BaseMessage to~~ [Feature Request] Covert BaseMessage to alpaca format Nov 15, 2024

Wendong-Fan added this to the Sprint 17 milestone Nov 15, 2024

Wendong-Fan assigned Wendong-Fan and liuxukun2000 and unassigned Wendong-Fan Nov 18, 2024

Wendong-Fan linked a pull request Nov 19, 2024 that will close this issue

feat: add data collector for dataset generation #1193

Merged

13 tasks

CaelumF mentioned this issue Nov 21, 2024

feat: Alpaca pydantic class for easy conversion, validation, and structured output generation #1202

Merged

9 tasks

CaelumF changed the title ~~[Feature Request] Covert BaseMessage to alpaca format~~ [Feature Request] Convert BaseMessage to alpaca format Nov 25, 2024

Wendong-Fan closed this as completed in #1202 Dec 2, 2024
