Add Tasks to replicate `APIGen` #925

plaguss · 2024-08-23T10:47:54Z

Description

This PR adds Tasks to replicate APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets, which yielded the following dataset: Salesforce/xlam-function-calling-60k.

The following is a draft pipeline replicating the paper. It requires access to a Python library whose functions can be transformed into tools. Building that library will be done in a separate step/pipeline, as the paper gives no reference for it.

from pathlib import Path

from datasets import load_dataset

from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import (
    APIGenGenerator,
    APIGenSemanticChecker,
    APIGenExecutionChecker
)
from distilabel.steps.tasks.apigen.utils import PrepareExamples
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps import (
    LoadDataFromDicts,
    DataSampler,
    CombineOutputs
)


libpath = Path(__file__).parent / "lib_apigen.py"

data = [
    {
        "func_name": "final_velocity",
        "func_desc": "Calculates the final velocity of an object given its initial velocity, acceleration, and time.",
    },
    {
        "func_name": "permutation_count",
        "func_desc": "Calculates the number of permutations of k elements from a set of n elements.",
    },
    {
        "func_name": "getdivision",
        "func_desc": "Divides two numbers by making an API call to a division service.",
    },
    {
        "func_name": "binary_addition",
        "func_desc": "Adds two binary numbers and returns the result as a binary string.",
    },
    {
        "func_name": "swapi_planet_resource",
        "func_desc": "get a specific planets resource",
    },
    {
        "func_name": "disney_character",
        "func_desc": "Find a specific character using this endpoint",
    }
]

# This stage should be done dynamically (a dedicated step would help with this),
# but it is done inline here for simplicity.
from distilabel.steps.tasks.apigen.utils import load_module_from_path

libpath_module = load_module_from_path(libpath)
tools = libpath_module.get_tools()  # indexed by function name below

# TODO: Add between 0 and 2 extra tools per row to make the task more challenging.
for row in data:
    # Ideally each row mixes the correct tool with irrelevant ones;
    # for now only the correct tool is included.
    row.update({"tools": [tools[row["func_name"]]]})


ds_og = (
    load_dataset("Salesforce/xlam-function-calling-60k", split="train")
    .shuffle(seed=42)
    .select(range(500))
    .to_list()
)


with Pipeline(name="APIGenPipeline") as pipeline:
    # Provides the 'func_name', 'func_desc' and 'tools' columns
    loader_seeds = LoadDataFromDicts(data=data)

    # The original dataset is used to sample few-shot examples
    sampler = DataSampler(
        data=ds_og,
        size=2,
        samples=len(data),
        batch_size=8,
    )

    prep_examples = PrepareExamples()
    model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"
    llm = InferenceEndpointsLLM(
        model_id=model_id,
        tokenizer_id=model_id,
        generation_kwargs={
            "temperature": 0.7,
            "max_new_tokens": 2048,
        },
    )
    apigen = APIGenGenerator(
        llm=llm,
        use_default_structured_output=True,
    )
    combine_steps = CombineOutputs()

    execution_checker = APIGenExecutionChecker(
        libpath=str(libpath)
    )

    semantic_checker = APIGenSemanticChecker(llm=llm)

    sampler >> prep_examples
    (
        [loader_seeds, prep_examples]
        >> combine_steps
        >> apigen
        >> execution_checker
        >> semantic_checker
    )
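The pipeline above relies on `lib_apigen.py` exposing a `get_tools()` function whose result is indexed by `func_name`, and its TODO mentions mixing 0 to 2 extra tools into each row. A minimal sketch of what such a module could look like is shown here. The tool schema, the `describe` helper and the `with_distractors` helper are hypothetical names introduced for illustration; they are not taken from distilabel or from the paper.

```python
import inspect
import random
from typing import Any, Callable, Dict, List, Optional


def final_velocity(initial_velocity: float, acceleration: float, time: float) -> float:
    """Calculates the final velocity of an object given its initial velocity, acceleration, and time."""
    return initial_velocity + acceleration * time


def binary_addition(a: str, b: str) -> str:
    """Adds two binary numbers and returns the result as a binary string."""
    return bin(int(a, 2) + int(b, 2))[2:]


def describe(func: Callable[..., Any]) -> Dict[str, Any]:
    """Build a tool description from a function's signature and docstring.

    The schema is an assumption for illustration, not distilabel's format.
    """
    signature = inspect.signature(func)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            name: {"type": getattr(param.annotation, "__name__", str(param.annotation))}
            for name, param in signature.parameters.items()
        },
    }


def get_tools() -> Dict[str, Dict[str, Any]]:
    """Return the available tools keyed by function name."""
    return {func.__name__: describe(func) for func in (final_velocity, binary_addition)}


def with_distractors(
    tools: Dict[str, Dict[str, Any]],
    func_name: str,
    max_extra: int = 2,
    rng: Optional[random.Random] = None,
) -> List[Dict[str, Any]]:
    """Return the correct tool plus 0..max_extra irrelevant ones, shuffled."""
    rng = rng or random.Random()
    others = [tool for name, tool in tools.items() if name != func_name]
    n_extra = rng.randint(0, min(max_extra, len(others)))
    mixed = [tools[func_name]] + rng.sample(others, n_extra)
    rng.shuffle(mixed)
    return mixed
```

With a module like this, the loop that fills `row["tools"]` could call `with_distractors(tools, row["func_name"])` instead of wrapping the single correct tool in a list. Passing a seeded `random.Random` keeps the mix reproducible across runs.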

github-actions · 2024-08-23T10:49:25Z

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-925/

codspeed-hq · 2024-08-23T10:54:06Z

CodSpeed Performance Report

Merging #925 will not alter performance

Comparing apigen (776b36c) with develop (4b8903b)

Summary

✅ 1 untouched benchmarks


plaguss added 6 commits August 21, 2024 16:05

Add apigen task module

edc1e06

Add tests for apigen

4b372fb

Fix default name for dataset info when requesting the number of examples

3a0da42

checkpoint

01b43ab

Add tests for apigen generator

5058f65

Create jinja template, split methods and add docstrings

8ee19e9

plaguss added the enhancement label (New feature or request) Aug 23, 2024

plaguss self-assigned this Aug 23, 2024

plaguss added 20 commits August 23, 2024 13:48

Update string format

d95375b

Simplify function setting and move it to load method

02d6803

Add tests for semantic checker

3371a37

Add prompt template for semantic checker

19b9576

Redirect import for semantic checker

9f90191

Fix docstrins for output columns

ef7f263

Add semantic checker task from apigen

76a6da0

Add notes for execution checker

050a744

Merge with develop and fix conflicts

e4be16d

Remove extra jump of line

b8c356c

Add first version of data sampler, step helper for apigen

94ef973

Add tests for data sampler

5cffc3f

Add integration test to check the sampler can be mixed with another generator step

952a640

Draft tests for new execution checker

f5994d8

Move helper functions

5c8974a

Draft for execution checker functionality

dab8a8b

Add first version of execution checker and tests

d17cbde

Add tests for utils module of apigen

a2ae5f2

Remove unnecessary step for transformation and rename files for clarity

18be0b8

Fix import

71c0729

plaguss added 27 commits September 25, 2024 15:12

Draft tutorial to replicate paper

0ea95a7

Allow number to be a dict with values and probabilities

d7c6a64

Update pipeline run call

21e0757

Add functionality to load functions from a folder with .py files

82aa352

Fix comment for arg

e70a258

Add example implementation

cbc288c

Add dependency for vllm

71a3517

Fix dependency name

8dceb11

Add setuptools-scm in the script with the dependencies to install it prior to vllm

f363292

Another attempt with system

a43b8e9

Add tests to take into account casting methods

7325cef

Avoid casting and update prompt to ensure argument order is respected

2f7418a

Inform error type on generator

4ac735c

Add extra checks and safeguards for failed answer generation

60c1cd9

Ensure the error is of the expected type

bf1baed

Fix unstructured generation

740d3fe

Remove json fences and fix semantic checker

841d985

Control case of functions without arguments

dcded6a

Add additional checks to run the execution checker

2a76812

Remove additional dependency

58b92be

Merge branch 'develop' of https://github.com/argilla-io/distilabel into apigen

f2eb160

Try fixing CI error with dependencies

8a9743f

Install dependency for the system

c26cca4

Undo fix attempt

9c756b2

Try fixing llvmlite dependency issue

55ccc1e

Remove additional dependency as it breaks other tests

c5ccf5a

Merge with develop and fix conflict

776b36c

plaguss merged commit 4b056ff into develop Oct 7, 2024
7 checks passed

plaguss deleted the apigen branch October 7, 2024 13:29

plaguss added this to the 1.4.0 milestone Oct 7, 2024
