Add Tasks to replicate APIGen #925

Merged
merged 74 commits into from
Oct 7, 2024

Conversation

plaguss
Contributor

@plaguss plaguss commented Aug 23, 2024

Description

This PR adds Tasks to replicate APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets, which yielded the following dataset: Salesforce/xlam-function-calling-60k.

The following is a draft pipeline that replicates the paper. It requires access to a Python library whose functions can be converted into tools. Building that library is left to a separate step/pipeline, since the paper gives no reference for it.

from pathlib import Path

from datasets import load_dataset

from distilabel.pipeline import Pipeline
from distilabel.steps.tasks import (
    APIGenGenerator,
    APIGenSemanticChecker,
    APIGenExecutionChecker
)
from distilabel.steps.tasks.apigen.utils import PrepareExamples
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps import (
    LoadDataFromDicts,
    DataSampler,
    CombineOutputs
)


libpath = Path(__file__).parent / "lib_apigen.py"

data = [
    {
        "func_name": "final_velocity",
        "func_desc": "Calculates the final velocity of an object given its initial velocity, acceleration, and time.",
    },
    {
        "func_name": "permutation_count",
        "func_desc": "Calculates the number of permutations of k elements from a set of n elements.",
    },
    {
        "func_name": "getdivision",
        "func_desc": "Divides two numbers by making an API call to a division service.",
    },
    {
        "func_name": "binary_addition",
        "func_desc": "Adds two binary numbers and returns the result as a binary string.",
    },
    {
        "func_name": "swapi_planet_resource",
        "func_desc": "get a specific planets resource",
    },
    {
        "func_name": "disney_character",
        "func_desc": "Find a specific character using this endpoint",
    }
]

# Ideally this stage would run dynamically in a dedicated step,
# but it is done inline here for simplicity.
from distilabel.steps.tasks.apigen.utils import load_module_from_path

libpath_module = load_module_from_path(libpath)
tools = libpath_module.get_tools()  # mapping of function name -> tool definition

# TODO: Add between 0 and 2 extra tools per row to make the task more challenging.
for row in data:
    # Each row should eventually mix the correct tool with irrelevant ones;
    # for now only the correct tool is included.
    row.update({"tools": [tools[row["func_name"]]]})


ds_og = (
    load_dataset("Salesforce/xlam-function-calling-60k", split="train")
    .shuffle(seed=42)
    .select(range(500))
    .to_list()
)


with Pipeline(name="APIGenPipeline") as pipeline:
    loader_seeds = LoadDataFromDicts(data=data)
    # Yields the 'func_name', 'func_desc' and 'tools' columns

    # The original dataset is used to sample few-shot examples
    sampler = DataSampler(
        data=ds_og,
        size=2,
        samples=len(data),
        batch_size=8,
    )

    prep_examples = PrepareExamples()
    model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"
    llm = InferenceEndpointsLLM(
        model_id=model_id,
        tokenizer_id=model_id,
        generation_kwargs={
            "temperature": 0.7,
            "max_new_tokens": 2048,
        },
    )
    apigen = APIGenGenerator(
        llm=llm,
        use_default_structured_output=True,
    )
    combine_steps = CombineOutputs()

    execution_checker = APIGenExecutionChecker(
        libpath=str(libpath)
    )

    semantic_checker = APIGenSemanticChecker(llm=llm)

    sampler >> prep_examples
    (
        [loader_seeds, prep_examples] 
        >> combine_steps 
        >> apigen
        >> execution_checker
        >> semantic_checker
    )
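
The pipeline above expects a `lib_apigen.py` module that exposes `get_tools()`, and its TODO mentions mixing 0 to 2 irrelevant tools into each row. The sketch below shows one way such a module and helper could look. It is an illustration under assumptions, not the code shipped in this PR: the OpenAI-style tool layout, the `function_to_tool` and `with_distractors` helpers, the type mapping, and the two sample function bodies are all hypothetical. Only the names `get_tools`, `final_velocity`, and `binary_addition` come from the script above.

```python
import inspect
import random
from typing import Any, Callable, Dict, List

# Assumed mapping from Python annotations to JSON-schema type names.
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def final_velocity(initial_velocity: float, acceleration: float, time: float) -> float:
    """Calculates the final velocity of an object given its initial velocity, acceleration, and time."""
    return initial_velocity + acceleration * time


def binary_addition(a: str, b: str) -> str:
    """Adds two binary numbers and returns the result as a binary string."""
    return bin(int(a, 2) + int(b, 2))[2:]


def function_to_tool(func: Callable[..., Any]) -> Dict[str, Any]:
    """Build a tool definition from a function's signature and docstring (hypothetical helper)."""
    params = inspect.signature(func).parameters
    properties = {
        name: {"type": _JSON_TYPES.get(param.annotation, "string")}
        for name, param in params.items()
    }
    # Parameters without a default value are treated as required.
    required = [
        name for name, param in params.items()
        if param.default is inspect.Parameter.empty
    ]
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": inspect.getdoc(func) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def get_tools() -> Dict[str, Dict[str, Any]]:
    """Return a mapping of function name to tool definition."""
    return {f.__name__: function_to_tool(f) for f in (final_velocity, binary_addition)}


def with_distractors(
    func_name: str,
    tools: Dict[str, Dict[str, Any]],
    rng: random.Random,
    max_extra: int = 2,
) -> List[Dict[str, Any]]:
    """Return the correct tool plus 0..max_extra randomly chosen irrelevant tools, shuffled."""
    others = [name for name in tools if name != func_name]
    n_extra = rng.randint(0, min(max_extra, len(others)))
    picked = [func_name] + rng.sample(others, n_extra)
    rng.shuffle(picked)
    return [tools[name] for name in picked]
```

With helpers like these, the row update in the script could become `row.update({"tools": with_distractors(row["func_name"], tools, random.Random(42))})`, so the correct tool is always present and the number of irrelevant ones varies per row.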

@plaguss plaguss added the enhancement New feature or request label Aug 23, 2024
@plaguss plaguss self-assigned this Aug 23, 2024

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-925/


codspeed-hq bot commented Aug 23, 2024

CodSpeed Performance Report

Merging #925 will not alter performance

Comparing apigen (776b36c) with develop (4b8903b)

Summary

✅ 1 untouched benchmarks

@plaguss plaguss merged commit 4b056ff into develop Oct 7, 2024
7 checks passed
@plaguss plaguss deleted the apigen branch October 7, 2024 13:29
@plaguss plaguss added this to the 1.4.0 milestone Oct 7, 2024
2 participants