[Feature Request] Convert BaseMessage to alpaca format #1184

Closed
1 of 2 tasks
Wendong-Fan opened this issue Nov 15, 2024 · 3 comments · Fixed by #1193 or #1202

Comments

@Wendong-Fan
Member

Required prerequisites

Motivation

alpaca format:
{"instruction": "...", "input": "...", "output": "..."}

Solution

No response

Alternatives

No response

Additional context

No response

@Wendong-Fan Wendong-Fan added New Feature P0 Task with high level priority labels Nov 15, 2024
@Wendong-Fan Wendong-Fan changed the title [Feature Request] Covert BaseMessage to [Feature Request] Covert BaseMessage to alpaca format Nov 15, 2024
@Wendong-Fan Wendong-Fan added this to the Sprint 17 milestone Nov 15, 2024
@CaelumF
Collaborator

CaelumF commented Nov 19, 2024

It may not be appropriate to convert BaseMessages directly to Alpaca format, since BaseMessages are part of a conversation structure, with roles and other information that is expected to be irrelevant to Alpaca format. (How would we convert an Alpaca message to a BaseMessage?) Typing it this way probably adds unhelpful coupling.

I think converting to/from strings is more appropriate. See master...alpaca_conversion_temp

(cc @lightaime)

@Wendong-Fan Wendong-Fan linked a pull request Nov 19, 2024 that will close this issue
13 tasks
@Wendong-Fan
Member Author

Thanks @CaelumF. The scope of this issue is just to convert BaseMessage to Alpaca format; I didn't consider converting Alpaca content back to BaseMessage, so additional information like the role name could just be ignored. Do we have a requirement to convert Alpaca to BaseMessage?

Converting from a string using regex has two limitations:

  1. We need to produce a string in the defined format. This is not natively supported by CAMEL, so we would still need to add further implementation.
  2. Using regex instead of extracting from a structured object is riskier and less reliable.
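
To illustrate the second point, here is a minimal sketch (not CAMEL code) comparing structured JSON parsing with a deliberately naive regex. The function name `alpaca_from_json` and the pattern are hypothetical, chosen only for this example.

```python
import json
import re

REQUIRED_KEYS = ("instruction", "input", "output")


def alpaca_from_json(text: str) -> dict:
    """Parse an Alpaca record from JSON text and check that all keys exist."""
    record = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in record]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return {key: record[key] for key in REQUIRED_KEYS}


# Naive pattern, used only to show how regex extraction can go wrong.
NAIVE_OUTPUT_PATTERN = re.compile(r'"output":\s*"(.*?)"')

text = json.dumps({
    "instruction": "Quote the word.",
    "input": "hello",
    "output": 'She said "hello".',
})

structured = alpaca_from_json(text)
naive = NAIVE_OUTPUT_PATTERN.search(text).group(1)

print(structured["output"])  # the full output value
print(naive)                 # cut off at the first escaped quote
```

The JSON parser handles the escaped quotes inside `output`, while the naive pattern stops at the first one and returns a truncated value.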

@CaelumF
Collaborator

CaelumF commented Nov 20, 2024

Yeah, when it comes to generating Alpaca items, it makes much more sense to work in JSON, especially when structured output and JSON proficiency are available in the inference model. (I assume JSON is the textual representation you had in mind.) The linked class can be converted to/from JSON, as it's a pydantic class.
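
As a rough sketch of that JSON round trip: the linked class is a pydantic class, but the example below uses a standard-library dataclass as a dependency-free stand-in. The name `AlpacaItem` and its methods are assumptions for illustration, not the linked implementation.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AlpacaItem:
    """Hypothetical stand-in for the pydantic class mentioned above."""

    instruction: str
    input: str
    output: str

    def to_json(self) -> str:
        # Serialize the three Alpaca fields as a JSON object.
        return json.dumps(asdict(self), ensure_ascii=False)

    @classmethod
    def from_json(cls, text: str) -> "AlpacaItem":
        data = json.loads(text)
        # Missing or unexpected keys raise TypeError here, which acts as
        # a very light form of validation.
        return cls(**data)


item = AlpacaItem(
    instruction="Translate to French.",
    input="Good morning",
    output="Bonjour",
)
restored = AlpacaItem.from_json(item.to_json())
```

With pydantic, the same round trip would also validate field types; the dataclass version only checks that the expected keys are present.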

But I also assume the plan/expectation is to have the alpaca entries just inside of the text portion of the messages as textual representations, rather than adding any specific fields to the BaseMessage? So the source information is always in one place in the form of text, and no type or contextual information is constraining the content of those messages (like we won't have an AlpacaBaseMessage or something)

The other textual representation, which starts with ### Instruction ..., is used for inference and training on base models. It's not usually found in datasets because it's awkward for other purposes (though data may sometimes be saved in that representation). I added it to the pydantic class because pydantic already makes JSON easy, and it is convenient for training and inference to have the ### representation available.
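
For reference, a minimal sketch of rendering and parsing that ### representation. The exact section headers (`### Instruction:`, `### Input:`, `### Response:`) are assumed from the commonly used Alpaca prompt layout, and the function names are hypothetical; the linked class may differ.

```python
import re

# Assumed layout; the real template may include a preamble or other headers.
TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

SECTION_PATTERN = re.compile(
    r"### Instruction:\n(?P<instruction>.*?)\n\n"
    r"### Input:\n(?P<input>.*?)\n\n"
    r"### Response:\n(?P<output>.*)",
    re.DOTALL,
)


def to_prompt_text(instruction: str, input_text: str, output: str = "") -> str:
    """Render the three fields in the '###' layout."""
    return TEMPLATE.format(instruction=instruction, input=input_text, output=output)


def from_prompt_text(text: str) -> dict:
    """Parse the '###' layout back into a dict, or raise ValueError."""
    match = SECTION_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError("text is not in the expected '###' layout")
    return match.groupdict()


rendered = to_prompt_text(
    "Summarize the text.",
    "CAMEL is a multi-agent framework.",
    "A framework for agents.",
)
parsed = from_prompt_text(rendered)
```

Leaving `output` empty gives a prompt that ends at `### Response:`, which is the form typically fed to a model for completion.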

Since in this conversion all of the information comes from one field of BaseMessage (content), which is always a string, and since it will sometimes be useful to convert strings from other sources, it feels more versatile and less confusing to make the conversion work purely in terms of strings.

I can imagine some scenarios with multiple stages of data generation where it would be useful to go back from a textual representation to a validated object form too. In general, I like what is communicated by the directions in which things can be converted. It would also help if we want to parse Alpaca items generated by a base model trained on that format, which it seems Alpaca was. (I'm not sure exactly why JSON wasn't just always used; maybe it's because of newline handling or something.)

If we want to make the conversion easily discoverable, we can add a to_alpaca function to BaseMessage: a single line that calls the publicly available conversion function, which takes a string, using the message property. That makes it clear that only the content is coming from the BaseMessage.
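
A minimal sketch of that shape, using a hypothetical `Message` dataclass in place of CAMEL's BaseMessage and a hypothetical public function `alpaca_from_string`; neither name comes from the CAMEL codebase.

```python
import json
from dataclasses import dataclass


def alpaca_from_string(text: str) -> dict:
    """Public conversion function: JSON text to an Alpaca-style dict."""
    record = json.loads(text)
    return {key: str(record[key]) for key in ("instruction", "input", "output")}


@dataclass
class Message:
    """Hypothetical stand-in for BaseMessage; only the fields used here."""

    role_name: str
    content: str

    def to_alpaca(self) -> dict:
        # Single line: only `content` is passed on, so role information
        # never reaches the conversion.
        return alpaca_from_string(self.content)


message = Message(
    role_name="assistant",
    content='{"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}',
)
alpaca = message.to_alpaca()
```

Because the conversion function only takes a string, it works equally well on text from sources other than a message.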

@CaelumF CaelumF changed the title [Feature Request] Covert BaseMessage to alpaca format [Feature Request] Convert BaseMessage to alpaca format Nov 25, 2024