feat: Preliminary implementation of self-instruct pipeline #1276

Merged
merged 20 commits into from
Dec 22, 2024

Conversation

AveryYay
Collaborator

@AveryYay AveryYay commented Dec 4, 2024

Description

A basic implementation of the self-instruct pipeline.

Motivation and Context

Why is this change required? What problem does it solve?
Closes #1214

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds core functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)
  • Example (update in the folder of example)

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide. (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly. (required for a bug fix or a new feature)
  • I have updated the documentation accordingly.

@AveryYay AveryYay linked an issue Dec 4, 2024 that may be closed by this pull request
2 tasks
@AveryYay AveryYay self-assigned this Dec 4, 2024
@Wendong-Fan Wendong-Fan marked this pull request as ready for review December 7, 2024 13:36
@Wendong-Fan Wendong-Fan added this to the Sprint 18 milestone Dec 7, 2024
Member

@Wendong-Fan Wendong-Fan left a comment

Thanks @AveryYay! Overall this looks great; I left some comments below.

Comment on lines 190 to 193
clf_prompt += " Only answer yes or no"
response = self.agent.step(clf_prompt)
result = response.msgs[0].content.strip().lower()
return result in ["yes", "true"]
Member

We can make the agent's response more robust by using structured output, defining a BaseModel like the one below:

from pydantic import BaseModel, Field

class AgentResponse(BaseModel):
    answer: bool = Field(..., description="Indicates whether the task is classification (True/False).")

Collaborator Author

Sure, I will look into this!

Collaborator

@CaelumF CaelumF Dec 17, 2024

Yeah, ideally we replace all string splitting on LLM-produced text with structured output. Potentially relevant is my approach in #1289, though it's still under review.
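
To illustrate the point of this thread: the membership test in the quoted lines treats a reply such as "Yes." as a "no", because "yes." is not in ["yes", "true"]. The sketch below shows the general idea of validating a structured reply instead. It is not the code from this PR or from #1289. It uses only the standard library as a stand-in for the pydantic model suggested above, and the helper name `parse_agent_response` and the JSON reply format `{"answer": true}` are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentResponse:
    """Standard-library stand-in for the pydantic model suggested above."""

    answer: bool  # True if the task is a classification task


def parse_agent_response(raw: str) -> AgentResponse:
    """Parse a JSON reply such as '{"answer": true}' (assumed format).

    Raises ValueError instead of silently treating an odd reply as "no".
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Reply is not valid JSON: {raw!r}") from exc
    # Require a JSON object with a genuine boolean "answer" field.
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), bool):
        raise ValueError(f"Reply lacks a boolean 'answer' field: {raw!r}")
    return AgentResponse(answer=payload["answer"])
```

With this shape, a malformed reply surfaces as an error that the caller can retry or log, rather than being counted as a non-classification task.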

camel/synthetic_datagen/self_instruct/self_instruct.py: 3 outdated review threads (resolved)
AveryYay and others added 2 commits December 16, 2024 17:22
Co-authored-by: Wendong-Fan <133094783+Wendong-Fan@users.noreply.github.com>
@AveryYay AveryYay changed the title Preliminary implementation of self-instruct pipeline feat: Preliminary implementation of self-instruct pipeline Dec 17, 2024
Collaborator

Hi Avery, great job! Just a small question: why are you using data where is_classification is set to false?

Collaborator Author

The seed data contains classification instructions as well, but most of the data points have is_classification set to false. The code you're looking at is an example in which I sampled only 6 data points, so it's likely that all 6 happened to have is_classification set to false. If we want to cover the classification instructions, we could add more data points to the seed.
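
One way to make a small example sample include classification instructions, sketched under assumptions: the seed records are taken to be dicts with an `is_classification` key (the field name comes from this thread; the dict layout and the function `sample_seed_tasks` are hypothetical and not part of the PR).

```python
import random
from typing import Dict, List


def sample_seed_tasks(
    tasks: List[Dict],
    k: int,
    min_classification: int,
    seed: int = 0,
) -> List[Dict]:
    """Sample k seed tasks, at least `min_classification` of them
    with is_classification set to True (hypothetical helper)."""
    if min_classification > k:
        raise ValueError("min_classification cannot exceed k")
    rng = random.Random(seed)
    clf = [t for t in tasks if t["is_classification"]]
    if len(clf) < min_classification:
        raise ValueError("Not enough classification tasks in the seed pool")
    # First reserve the required number of classification tasks ...
    picked = rng.sample(clf, min_classification)
    picked_ids = {id(t) for t in picked}
    # ... then fill the remaining slots from everything not yet picked.
    rest = [t for t in tasks if id(t) not in picked_ids]
    picked += rng.sample(rest, k - min_classification)
    rng.shuffle(picked)
    return picked
```

A plain `random.sample(tasks, 6)` over a pool that is mostly non-classification can easily return no classification tasks at all, which matches the behaviour described above; reserving slots first avoids that.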

Collaborator

@CaelumF CaelumF left a comment

Some thoughts left. I really like the typed composable filter approach btw!

Comment on lines 42 to 53
r"""Filters instructions based on their word count.

Args:
r"""Filters instructions based on their word count.

Args:
min_len (int): The minimum word count required for an instruction.
(default::obj:`5`)
max_len (int): The maximum word count allowed for an instruction.
(default::obj:`200`)
"""
"""
Member

The docstring here needs to be fixed.

Collaborator Author

I have fixed it locally and will push later.
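
For reference, a sketch of what the de-duplicated docstring might look like on a word-count filter. The docstring text is taken from the quoted lines; the class name `LengthFilter`, the `apply` method, and the inclusive bounds are assumptions for illustration and may differ from what was actually pushed.

```python
class LengthFilter:
    r"""Filters instructions based on their word count.

    Args:
        min_len (int): The minimum word count required for an instruction.
            (default: :obj:`5`)
        max_len (int): The maximum word count allowed for an instruction.
            (default: :obj:`200`)
    """

    def __init__(self, min_len: int = 5, max_len: int = 200) -> None:
        self.min_len = min_len
        self.max_len = max_len

    def apply(self, instruction: str) -> bool:
        """Return True if the word count lies within [min_len, max_len].

        Words are counted by whitespace splitting; treating both bounds
        as inclusive is an assumption of this sketch.
        """
        word_count = len(instruction.split())
        return self.min_len <= word_count <= self.max_len
```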

Member

@Wendong-Fan Wendong-Fan left a comment

Thanks @AveryYay !

@Wendong-Fan Wendong-Fan merged commit e7a86dc into master Dec 22, 2024
6 checks passed
@Wendong-Fan Wendong-Fan deleted the feat/self-instruct-pipeline branch December 22, 2024 19:49
Development

Successfully merging this pull request may close these issues.

[Feature Request] Self-instruct pipeline refine
4 participants