Releases: argilla-io/distilabel
1.4.1
What's Changed
- Fix not handling list of all primitive types in `SignatureMixin` by @gabrielmbmb in #1037
Full Changelog: 1.4.0...1.4.1
1.4.0
✨ Release highlights
Offline Batch Generation and OpenAI Batch API
We've updated the `LLM` interface so that `LLM`s using an external platform that offers a batch service can be integrated in `distilabel`. In addition, `OpenAILLM` has been updated so it can use the OpenAI Batch API to get a 50% cost reduction.
(Video demo: distilabel-offline-batch-generation.mp4)
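A minimal sketch of enabling this on `OpenAILLM`; the `use_offline_batch_generation` flag follows the documented interface, but treat the exact parameter name as an assumption:

```python
from distilabel.llms import OpenAILLM

llm = OpenAILLM(
    model="gpt-4o",
    # Assumption: this flag routes generations through the OpenAI Batch API,
    # trading latency for the 50% cost reduction
    use_offline_batch_generation=True,
)
```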
Improved cache for maximum output reusability
We all know that running `LLM`s is costly, and most of the time we want to reuse their outputs as much as we can. Before this release, `distilabel`'s cache mechanism made it possible to recover a pipeline execution that was stopped before finishing, and to re-create the `Distiset` generated by one that finished its execution and was re-executed.

In this release, we've greatly improved the cache so that the outputs of all the `Step`s are cached and can therefore be reused in other pipeline executions, even if the pipeline has changed.
In addition, we've added a `use_cache` attribute to the `Step`s that allows toggling the use of the cache at the step level, as sketched below.
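A minimal sketch, assuming `use_cache` is set at init time like any other `Step` attribute:

```python
from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import TextGeneration

text_generation = TextGeneration(
    llm=OpenAILLM(model="gpt-4o-mini"),
    use_cache=False,  # this step's outputs will always be recomputed
)
```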
Steps can now generate artifacts
In some cases, a `Step` produces additional artifacts that are used to generate its outputs. These artifacts can take some time to generate, and they could be reused in the future. That's why we've added a new method called `Step.save_artifact` that can be called within the step to store the artifacts it generates. The artifacts generated by the `Step` will also get uploaded to the Hugging Face Hub.
```python
from typing import TYPE_CHECKING, List

import matplotlib.pyplot as plt

from distilabel.steps import GlobalStep, StepInput

if TYPE_CHECKING:
    from distilabel.steps import StepOutput


class CountTextCharacters(GlobalStep):
    @property
    def inputs(self) -> List[str]:
        return ["text"]

    @property
    def outputs(self) -> List[str]:
        return ["text_character_count"]

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        character_counts = []

        for input in inputs:
            text_character_count = len(input["text"])
            input["text_character_count"] = text_character_count
            character_counts.append(text_character_count)

        # Generate a plot with the distribution of text character counts
        plt.figure(figsize=(10, 6))
        plt.hist(character_counts, bins=30, edgecolor="black")
        plt.title("Distribution of Text Character Counts")
        plt.xlabel("Character Count")
        plt.ylabel("Frequency")

        # Save the plot as an artifact of the step
        self.save_artifact(
            name="text_character_count_distribution",
            write_function=lambda path: plt.savefig(path / "figure.png"),
            metadata={"type": "image", "library": "matplotlib"},
        )

        plt.close()

        yield inputs
```
New `Task`s: CLAIR, APIGen and many more!
- New CLAIR task: CLAIR uses an AI system to minimally revise a solution A→A′ such that the resulting preference A `preferred` A′ is much more contrastive and precise.
- New tasks to replicate the APIGen framework: `APIGenGenerator`, `APIGenSemanticChecker` and `APIGenExecutionChecker`. These tasks allow generating datasets like the one presented in the paper APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets.
- New URIAL task that allows using non-instruct models to generate a response for an instruction.
- New `TextClassification` task to perform zero-shot text classification based on a predefined but highly customizable prompt (a usage sketch follows this list).
- New `TextClustering` task to generate clusters from text and group your generations, discovering labels from your data. It comes with two steps to run the UMAP and DBSCAN algorithms.
- Updated `TextGeneration` to simplify the customization of tasks that don't require further post-processing.
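A minimal sketch of `TextClassification`; the `context` and `available_labels` parameter names are assumptions drawn from the docs:

```python
from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import TextClassification

text_classification = TextClassification(
    llm=OpenAILLM(model="gpt-4o-mini"),
    # Assumptions: `context` customizes the prompt, and `available_labels`
    # constrains the zero-shot label space
    context="Classify the topic of the user-provided news headline.",
    available_labels=["sports", "politics", "technology"],
)
```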
New Steps to sample data in your pipelines and remove duplicates
- New `DataSampler` step to sample data from other datasets, which can be useful to inject different few-shot examples into your prompts.
- New `EmbeddingDedup` step to remove duplicates based on embeddings and a distance metric.
- New `MinHashDedup` step to remove near-duplicates from text based on the MinHash and MinHashLSH algorithms (see the sketch after this list).
- New `TruncateTextColumn` step to truncate the length of your texts using either the character length or the number of tokens based on a tokenizer.
- New `CombineOutputs` step to combine the outputs of two or more steps into a single output.
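A minimal sketch of `MinHashDedup`; the parameter names are assumptions drawn from the docs, and the defaults may differ:

```python
from distilabel.steps import MinHashDedup

minhash_dedup = MinHashDedup(
    tokenizer="words",  # assumption: texts are tokenized into words before hashing
    threshold=0.9,      # assumption: Jaccard similarity above which rows count as near-duplicates
)
```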
Generate text embeddings using vLLM
- Now you can generate embeddings using `vLLMEmbeddings`!
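A minimal sketch, assuming `vLLMEmbeddings` follows the `Embeddings` interface with `load` and `encode`:

```python
from distilabel.embeddings import vLLMEmbeddings

embeddings = vLLMEmbeddings(model="intfloat/e5-mistral-7b-instruct")
embeddings.load()

# `encode` is assumed to return one embedding vector per input text
vectors = embeddings.encode(inputs=["distilabel is awesome!"])
```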
Extra things
- Easily visualize the tasks' prompts using the `Task.print` method (see the sketch after this list).
- New `use_default_structured_output` flag in tasks to automatically use structured generation in some tasks that can benefit from it.
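A minimal sketch of `Task.print`; loading the task (and its LLM credentials) first is an assumption:

```python
from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import UltraFeedback

task = UltraFeedback(llm=OpenAILLM(model="gpt-4o-mini"))
task.load()  # assumption: requires OPENAI_API_KEY to be set
task.print()  # renders the formatted prompt the task would send to the LLM
```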
What's Changed
- Make `ClientvLLM.model_name` a `cached_property` by @gabrielmbmb in #862
- Pass dataset to dry_run method by @plaguss in #863
- Add default structured output for `GenerateSentencePair` task by @plaguss in #868
- Complexity scorer default structured output by @plaguss in #870
- Quality scorer default structured output by @plaguss in #873
- Ultrafeedback default structured output by @plaguss in #876
- Remove use of `default_chat_template` by @gabrielmbmb in #888
- Temporary fix for installing `llama-cpp-python` by @gabrielmbmb in #886
- Fix unit tests after release of `transformers==4.44.0` by @gabrielmbmb in #891
- Fix default structured output by @plaguss in #892
- Send as many batches as possible to input queues by @gabrielmbmb in #895
- Exclude `repo_id` from `LoadDataFromFileSystem` by @plaguss in #898
- Fix loader to read from a glob pattern by @plaguss in #877
- Add `save_artifact` method to `_Step` by @gabrielmbmb in #871
- Add new `add_raw_input` argument to `_Task` so we can automatically include the formatted input by @plaguss in #903
- New `TruncateTextColumn` to truncate the length of texts using the number of tokens or characters by @plaguss in #902
- Update `inputs` and `outputs` interface to allow returning dict indicating optionality by @gabrielmbmb in #883
- Update mistrallm by @plaguss in #904
- Deepseek prover by @plaguss in #907
- Update `RewardModelScore.inputs` property by @gabrielmbmb in #908
- Add tutorial - generate data for training embeddings and reranking models by @davidberenstein1957 in #893
- Fix load data from disk by @plaguss in #910
- docs: minor fixes by @davidberenstein1957 in #913
- Add `URIAL` task by @gabrielmbmb in #921
- Add `vLLMEmbeddings` by @plaguss in #920
- docs: add tutorials preference and clean by @sdiazlor in #917
- Fix `StructuredGeneration` examples and internal check by @plaguss in #912
- Generate deterministic pipeline name when it's not given by @plaguss in #878
- Add custom errors by @plaguss in #911
- Docs/tutorials fix by @sdiazlor in #922
- Add `revision` runtime parameter to `LoadDataFromHub` by @gabrielmbmb in #928
- Add plausible as replacement for GA by @davidberenstein1957 in #929
- Add minhash related steps to deduplicate texts by @plaguss in #931
- docs: API reference review by @sdiazlor in #932
- Refactor of MinHash to work with a single class and fix the shelve backend by @plaguss in #937
- Update `make_generator_step` to set pipeline to step and add edge to steps in trophic level 1 by @gabrielmbmb in https://g...
1.3.2
What's Changed
- Deepseek prover task by @plaguss in #733
- Do not cancel in progress docs workflows by @gabrielmbmb in #919
- Fix creating Ray placement groups for vLLM by @gabrielmbmb in #918
- Fix passing `base_url` in `model_id` in `InferenceEndpointsLLM` by @gabrielmbmb in #924
Full Changelog: 1.3.1...1.3.2
1.3.1
What's Changed
- Create new `distilabel.constants` module to store constants and avoid circular imports by @plaguss in #861
- Add OpenAI request timeout by @ashim-mahara in #858
New Contributors
- @ashim-mahara made their first contribution in #858
Full Changelog: 1.3.0...1.3.1
1.3.0
What's Changed
- Add new step `CombineKeys` by @plaguss in #747
- Refactor naming columns steps combinecolumns combinekeys expandcolumns by @davidberenstein1957 in #758
- Drop remove deprecated `LoadHubDataset` by @davidberenstein1957 in #759
- Add `requirements` list for `Pipeline` by @plaguss in #720
- Add `StepResources` and step replicas in `Pipeline` by @gabrielmbmb in #750
- Add load stages by @gabrielmbmb in #760
- Update min required version to `python==3.9` by @gabrielmbmb in #770
- Optionally include the pipeline script in the hub when pushing your distiset by @plaguss in #762
- Add `docs-pr.yml` and `docs-pr-close.yml` workflows by @gabrielmbmb in #774
- Add `RayPipeline` class by @gabrielmbmb in #769
- Fixed closed PR workflow by @gabrielmbmb in #776
- Add `Magpie` and `MagpieGenerator` tasks by @gabrielmbmb in #778
- Fix some issues related to `Magpie` task by @gabrielmbmb in #783
- Add `end_with_user` and `include_system_prompt` flags to `Magpie` tasks and handle `None`s by @gabrielmbmb in #784
- Add workflow concurrency group for publishing docs by @gabrielmbmb in #796
- Add `_desired_num_gpus` attribute to `CudaDevicePlacementMixin` by @gabrielmbmb in #795
- Compatibility with `vLLM` with `tensor_parallel_size` argument by @gabrielmbmb in #805
- Update default names in `GroupColumns` by @plaguss in #808
- Request batches to `GeneratorStep` if only step in pipeline by @gabrielmbmb in #828
- Add default name for a pipeline by @plaguss in #809
- Update distilabel phrasing based on PR hugging face hub by @davidberenstein1957 in #821
- Some more `Magpie` improvements by @gabrielmbmb in #833
- Add `Embeddings` base class, `SentenceTransformerEmbeddings` class, `EmbeddingGeneration` and `FaissNearestNeighbour` steps by @gabrielmbmb in #830
- Create file per hostname in `CudaDevicePlacementMixin` by @gabrielmbmb in #814
- Create a `GeneratorStep` from a dataset using a helper function by @plaguss in #812
- Do not take into account `disable_cuda_device_placement` for pipeline signature by @gabrielmbmb in #838
- Add `RewardModelScore` step by @gabrielmbmb in #840
- Fix `LoadDataFromHub` attribute `_dataset` had `ellipsis` by default instead of `None` by @gabrielmbmb in #841
- Create `PlacementGroup` for steps using `vLLM` by @gabrielmbmb in #842
- Update `argilla` integration to use `argilla_sdk` v2 by @alvarobartt in #705
- Make `overall-rating` the default aspect for `UltraFeedback` task by @gabrielmbmb in #843
- fix typo index.md by @franperic in #844
- Use `CudaDevicePlacementMixin` in `RewardModelScore` step by @gabrielmbmb in #845
- Gather GPUs per Ray node to create placement groups by @gabrielmbmb in #848
- Fix typo in docs by @plaguss in #850
- Add `xfail` routing batch function tests by @gabrielmbmb in #852
- Fix creating placement group when `pipeline_parallel_size>1` by @gabrielmbmb in #851
- docs: 846 docs include google analytics by @davidberenstein1957 in #847
- Add `ClientvLLM` class by @gabrielmbmb in #854
- Add hard-negative flag to include similar challenging negatives on triplets by @plaguss in #856
- Add bibtex references in the docstrings to be shown in the README by @plaguss in #855
- `distilabel` 1.3.0 by @gabrielmbmb in #857
New Contributors
- @franperic made their first contribution in #844
Full Changelog: 1.2.4...1.3.0
1.2.4
What's Changed
- Update `InferenceEndpointsLLM` to use `chat_completion` method by @gabrielmbmb in #815
Full Changelog: 1.2.3...1.2.4
1.2.3
What's Changed
- Fix Import Error for KeepColumns in instruction_backtranslation.md (Issue #785) by @Hassaan-Qaisar in #786
- Correct variable name in dataset push example (in ultrafeedback.md file) (Issue #787) by @Hassaan-Qaisar in #791
- docs: update script for issue dashboard by @sdiazlor in #775
- Fix 404 model not found for private Serverless IE by @dvsrepo in #806
New Contributors
- @Hassaan-Qaisar made their first contribution in #786
Full Changelog: 1.2.2...1.2.3
1.2.2
What's Changed
- Fix passing `input` to `format_output` function by @gabrielmbmb in #781
Full Changelog: 1.2.1...1.2.2
1.2.1
What's Changed
- Fix docs for distiset.save_to_disk kwargs by @fpreiss in #745
- docs: change references by @sdiazlor in #754
- Fix `response_format` for `TogetherLLM` and `AnyScaleLLM` by @gabrielmbmb in #764
New Contributors
Full Changelog: 1.2.0...1.2.1
1.2.0
✨ Release highlights
Structured generation with `instructor`, structured generation support in `InferenceEndpointsLLM`, and the new `StructuredGeneration` task
- `instructor` has been integrated, bringing support for structured generation with `OpenAILLM`, `AnthropicLLM`, `LiteLLM`, `MistralLLM`, `CohereLLM` and `GroqLLM`:

Structured generation with `instructor` example
```python
from typing import List

from distilabel.llms import MistralLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, Field


class Node(BaseModel):
    id: int
    label: str
    color: str


class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"


class KnowledgeGraph(BaseModel):
    nodes: List[Node] = Field(default_factory=list)
    edges: List[Edge] = Field(default_factory=list)


with Pipeline(
    name="Knowledge-Graphs",
    description=(
        "Generate knowledge graphs to answer questions, this type of dataset can be used to "
        "steer a model to answer questions with a knowledge graph."
    ),
) as pipeline:
    sample_questions = [
        "Teach me about quantum mechanics",
        "Who is who in The Simpsons family?",
        "Tell me about the evolution of programming languages",
    ]

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": "You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.",
                "instruction": f"{question}",
            }
            for question in sample_questions
        ],
    )

    text_generation = TextGeneration(
        name="knowledge_graph_generation",
        llm=MistralLLM(
            model="open-mixtral-8x22b", structured_output={"schema": KnowledgeGraph}
        ),
    )

    load_dataset >> text_generation
```
- `InferenceEndpointsLLM` now supports structured generation.
- New `StructuredGeneration` task that allows defining the schema of the structured generation per input row (a sketch follows this list).
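A minimal sketch of `StructuredGeneration`; the per-row `structured_output` column with `format` and `schema` keys follows the documented pattern, but treat the exact keys as assumptions:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import StructuredGeneration

with Pipeline(name="structured-generation") as pipeline:
    load_dataset = LoadDataFromDicts(
        data=[
            {
                "instruction": "Create an RPG character profile in JSON format.",
                # Assumption: each row carries its own schema in `structured_output`
                "structured_output": {
                    "format": "json",
                    "schema": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "role": {"type": "string"},
                        },
                        "required": ["name", "role"],
                    },
                },
            }
        ],
    )

    structured_generation = StructuredGeneration(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3-70B-Instruct"),
    )

    load_dataset >> structured_generation
```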
New tasks for generating datasets for training embedding models
`sentence-transformers` v3 was recently released, and we couldn't resist the urge to add a few new tasks to allow creating datasets for training embedding models!
- New `GenerateSentencePair` task that allows generating a `positive` sentence for an input `anchor`, and optionally also a `negative` sentence. The task allows creating different kinds of data by specifying the `action` to perform with respect to the `anchor`: paraphrasing, generating a semantically similar sentence, generating a query, or generating an answer (a usage sketch follows this list).
- Implemented Improving Text Embeddings with Large Language Models, adding the following tasks derived from the paper:
  - `EmbeddingTaskGenerator`, which allows generating new embedding-related tasks using an `LLM`.
  - `GenerateTextRetrievalData`, which allows creating text retrieval data with an `LLM`.
  - `GenerateShortTextMatchingData`, which allows creating short texts matching the input data.
  - `GenerateLongTextMatchingData`, which allows creating long texts matching the input data.
  - `GenerateTextClassificationData`, which allows creating text classification data from the input data.
  - `MonolingualTripletGenerator`, which allows creating monolingual triplets from the input data.
  - `BitextRetrievalGenerator`, which allows creating bitext retrieval data from the input data.
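A minimal sketch of `GenerateSentencePair`; the `action` and `triplet` parameters match the description above, but treat their exact names as assumptions:

```python
from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import GenerateSentencePair

generate_pairs = GenerateSentencePair(
    llm=OpenAILLM(model="gpt-4o-mini"),
    action="paraphrase",  # what to generate with respect to the anchor
    triplet=True,         # assumption: also generate a negative sentence
)
```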
New `Step`s for loading data from different sources and saving/loading a `Distiset` to disk
We've added a few new steps that allow loading data from different sources:
- `LoadDataFromDisk` allows loading a `Distiset` or `datasets.Dataset` that was previously saved using the `save_to_disk` method.
- `LoadDataFromFileSystem` allows loading a `datasets.Dataset` from a file system.
Thanks to @rasdani for helping us test these new tasks!
In addition, we have added a `save_to_disk` method to `Distiset`, akin to `datasets.Dataset.save_to_disk`, that allows saving the generated distiset to disk, along with the `pipeline.yaml` and `pipeline.log`.
`save_to_disk` example
```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.run(...)
    distiset.save_to_disk(dataset_path="my-distiset")
```
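And a sketch of loading it back; the `Distiset.load_from_disk` classmethod is assumed from the matching load/save functionality mentioned in the changelog:

```python
from distilabel.distiset import Distiset

# Assumption: restores the distiset saved above with `save_to_disk`
distiset = Distiset.load_from_disk("my-distiset")
```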
`MixtureOfAgentsLLM` implementation
We've added a new `LLM` called `MixtureOfAgentsLLM`, derived from the paper Mixture-of-Agents Enhances Large Language Model Capabilities. This new `LLM` allows generating improved outputs thanks to the collective expertise of several `LLM`s.
`MixtureOfAgentsLLM` example
```python
from distilabel.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM

llm = MixtureOfAgentsLLM(
    aggregator_llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    proposers_llms=[
        InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
        InferenceEndpointsLLM(
            model_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
            tokenizer_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        ),
        InferenceEndpointsLLM(
            model_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
            tokenizer_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
        ),
    ],
    rounds=2,
)

llm.load()

output = llm.generate(
    inputs=[
        [
            {
                "role": "user",
                "content": "My favorite witty review of The Rings of Power series is this: Input:",
            }
        ]
    ]
)
```
Saving cache and passing batches to `GlobalStep`s optimizations
- The cache logic of the `_BatchManager` has been improved to incrementally update the cache, making the process much faster.
- The data of the input batches of the `GlobalStep`s will be passed to the step using the file system, as this is faster than passing it using the queue. This is possible thanks to the new integration of `fsspec`, which can be configured to use a file system or cloud storage as the backend for passing the data of the batches (see the sketch after this list).
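A minimal sketch of configuring the file-system backend; the `storage_parameters` and `use_fs_to_pass_data` arguments follow the documented interface, but treat their exact names as assumptions:

```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.run(
        # Assumption: any fsspec-compatible path works here, e.g. a GCS bucket
        storage_parameters={"path": "gcs://my-bucket"},
        use_fs_to_pass_data=True,
    )
```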
`BasePipeline` and `_BatchManager` refactor

The logic around `BasePipeline` and `_BatchManager` has been refactored, which will make it easier to implement new pipelines in the future.
Added `ArenaHard` as an example of how to use `distilabel` to implement a benchmark

`distilabel` can easily be used to create an `LLM` benchmark. To showcase this, we decided to implement Arena Hard as an example: Benchmarking with `distilabel`: Arena Hard.
📚 Improved documentation structure
We have updated the documentation structure to make it more clear and self-explanatory, as well as more visually appealing 😏.
What's Changed
- Add `prometheus.md` by @alvarobartt in #656
- Reduce time required to execute `_cache` method by @gabrielmbmb in #672
- [DOCS] Update theme styles and images by @leiyre in #667
- Fix circular import due to DISTILABEL_METADATA_KEY by @plaguss in #675
- Add `CITATION.cff` by @alvarobartt in #677
- Deprecate conversation support in `TextGeneration` in favour of `ChatGeneration` by @alvarobartt in #676
- Add functionality to load/save distisets to/from disk by @plaguss in #673
- Integration instructor by @plaguss in #654
- Fix docs of saving/loading distiset from disk by @plaguss in https://githu...