Evaluation of user data using Unitxt #176

Open
Roni-Friedman opened this issue Nov 7, 2024 · 3 comments · May be fixed by #156

@Roni-Friedman

Here is the suggested flow. Let's discuss in a meeting to see if it makes sense and modify as needed:

The evaluation command [ilab model evaluate new_data] will have the following parameters (an example invocation follows the list):

1 - csv_path for user data

  • A csv file with two required columns ('instruction', 'input') and two optional columns ('answer', 'context'); see the example below
  • 'context' column is for the RAG task only
  • 'answer' is the golden truth, if available
  • 'instruction' explains the task ("Summarize this text", "Complete this sentence", "Classify this input to one of the following: ...")
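
For illustration, a minimal CSV for a summarization task might look like this (the rows are invented; only the column names come from the proposal above):

```csv
instruction,input,answer
"Summarize this text.","The committee met on Tuesday and approved the new budget after a short debate.","The committee approved the new budget."
"Summarize this text.","Heavy rain flooded several streets downtown, delaying the morning commute.","Downtown flooding delayed the morning commute."
```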

2 - task_type, one of the following options (see the enum sketch below):

  • Classification
  • Question Answering [Let's discuss if we want to explicitly offer multiple-choice QA and simple QA as two separate options]
  • Summarization
  • Generation
  • RAG
  • Other [Let's discuss if this can actually be removed, as it will get the same treatment as QA behind the scenes]
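
As a rough sketch, the task types could be encoded as a simple enum (names and string values here are assumptions, not a committed API):

```python
from enum import Enum

class TaskType(str, Enum):
    CLASSIFICATION = "classification"
    QA = "qa"                        # may later split into simple vs. multiple-choice QA
    SUMMARIZATION = "summarization"
    GENERATION = "generation"
    RAG = "rag"                      # the only task that uses the 'context' column
    OTHER = "other"                  # candidate for removal; treated like QA internally
```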

3 - use_llmaaj (LLM-as-a-judge; False by default)

  • False - use the default standard metric for the task
    • for some tasks the default is llmaaj to begin with: QA, Generation, Other
    • llmaaj will be used if golden answers are not available
  • True - use the judge (with templates predefined for the task type); see the sketch below
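
A minimal sketch of how the metric could be resolved under these rules (the function and names are hypothetical, not an existing ilab or unitxt API):

```python
# Tasks whose default metric is already LLM-as-a-judge, per the proposal.
LLMAAJ_BY_DEFAULT = {"qa", "generation", "other"}

def resolve_metric(task_type: str, use_llmaaj: bool, has_answers: bool) -> str:
    # The judge is used when explicitly requested, when the task defaults
    # to it, or when there are no golden answers to score against.
    if use_llmaaj or task_type in LLMAAJ_BY_DEFAULT or not has_answers:
        return "llmaaj"
    return "standard"
```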

4 - num_shots (0 by default)

  • Do we want to allow the user to select the number of shots?
  • Do we want to drop this option, run a few configurations (0, 2, 5 shots), and inform the user which is the best setting?
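
Putting the four parameters together, an invocation might look like this (the flag spellings are placeholders, not final; only the command itself comes from the proposal):

```shell
ilab model evaluate new_data \
    --csv-path user_data.csv \
    --task-type summarization \
    --use-llmaaj \
    --num-shots 2
```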

Following the command, unitxt will run the provided data with the chosen task, replacing the metric if llmaaj is selected.
The data will be run in multiple configurations (fitted into different templates that match the task).
Results will include a recommendation for the best of the templates used, as sketched below.
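
A rough sketch of that sweep (templates_for and run_unitxt_eval are hypothetical helpers standing in for the unitxt plumbing, not existing APIs):

```python
def evaluate_user_data(csv_path: str, task_type: str,
                       use_llmaaj: bool = False, num_shots: int = 0):
    """Run the user's data through every matching template; report the best."""
    results = {}
    for template in templates_for(task_type):      # hypothetical: templates matching the task
        results[template] = run_unitxt_eval(       # hypothetical wrapper around unitxt
            csv_path=csv_path,
            template=template,
            num_demos=num_shots,
            metric="llmaaj" if use_llmaaj else None,  # None -> task's default metric
        )
    best_template = max(results, key=results.get)
    return results, best_template
```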

@nathan-weinberg linked a pull request Nov 14, 2024 that will close this issue
@Roni-Friedman
Author

@alimaredia Following yesterday's meeting, could you please share the evaluation notebook you prepared? This will help us understand the plan better and identify what contribution can be offered with unitxt.

@Roni-Friedman
Author

@danmcp @alimaredia regarding the linked PR - I have addressed all issues, except the parameterization of the unitxt recipe, which I believe is no longer relevant to our current discussion. Perhaps it is better to close it and open a new one once we've defined the features it will contain?

@danmcp
Member

danmcp commented Nov 20, 2024

> @danmcp @alimaredia regarding the linked PR - I have addressed all issues, except the parameterization of the unitxt recipe, which I believe is no longer relevant to our current discussion. Perhaps it is better to close it and open a new one once we've defined the features it will contain?

I don't have a strong preference between closing it and leaving it hanging out for a bit. Agreed that if we do settle on a different design it should be a new PR.
