Releases: argilla-io/distilabel
1.4.1
What's Changed
- Fix not handling list of all primitive types in `SignatureMixin` by @gabrielmbmb in #1037
Full Changelog: 1.4.0...1.4.1
1.4.0
✨ Release highlights
Offline Batch Generation and OpenAI Batch API
We've updated the `LLM` interface so that `LLM`s using an external platform that offers a batch service can be integrated in `distilabel`. In addition, `OpenAILLM` has been updated so it can use the OpenAI Batch API to get a 50% cost reduction.
(Video demo: distilabel-offline-batch-generation.mp4)
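A minimal sketch of enabling this on `OpenAILLM`; the `use_offline_batch_generation` flag follows the documented interface, but treat the exact parameter name as an assumption:

```python
from distilabel.llms import OpenAILLM

llm = OpenAILLM(
    model="gpt-4o",
    # Assumption: this flag routes generations through the OpenAI Batch API,
    # trading latency for the 50% cost reduction
    use_offline_batch_generation=True,
)
```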
Improved cache for maximum output reusability
We all know that running `LLM`s is costly, and most of the time we want to reuse their outputs as much as we can. Before this release, `distilabel`'s cache mechanism made it possible to recover a pipeline execution that was stopped before finishing, and to re-create the `Distiset` generated by one that finished its execution and was re-executed.

In this release, we've greatly improved the cache so that the outputs of all the `Step`s are cached and can therefore be reused in other pipeline executions, even if the pipeline has changed.
In addition, we've added a `use_cache` attribute to the `Step`s that allows toggling the use of the cache at the step level, as sketched below.
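A minimal sketch, assuming `use_cache` is set at init time like any other `Step` attribute:

```python
from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import TextGeneration

text_generation = TextGeneration(
    llm=OpenAILLM(model="gpt-4o-mini"),
    use_cache=False,  # this step's outputs will always be recomputed
)
```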
Steps can now generate artifacts
In some cases, a `Step` produces additional artifacts that are used to generate its outputs. These artifacts can take some time to generate, and they could be reused in the future. That's why we've added a new method called `Step.save_artifact` that can be called within the step to store the artifacts it generates. The artifacts generated by the `Step` will also get uploaded to the Hugging Face Hub.
```python
from typing import TYPE_CHECKING, List

import matplotlib.pyplot as plt

from distilabel.steps import GlobalStep, StepInput

if TYPE_CHECKING:
    from distilabel.steps import StepOutput


class CountTextCharacters(GlobalStep):
    @property
    def inputs(self) -> List[str]:
        return ["text"]

    @property
    def outputs(self) -> List[str]:
        return ["text_character_count"]

    def process(self, inputs: StepInput) -> "StepOutput":  # type: ignore
        character_counts = []

        for input in inputs:
            text_character_count = len(input["text"])
            input["text_character_count"] = text_character_count
            character_counts.append(text_character_count)

        # Generate a plot with the distribution of text character counts
        plt.figure(figsize=(10, 6))
        plt.hist(character_counts, bins=30, edgecolor="black")
        plt.title("Distribution of Text Character Counts")
        plt.xlabel("Character Count")
        plt.ylabel("Frequency")

        # Save the plot as an artifact of the step
        self.save_artifact(
            name="text_character_count_distribution",
            write_function=lambda path: plt.savefig(path / "figure.png"),
            metadata={"type": "image", "library": "matplotlib"},
        )

        plt.close()

        yield inputs
```
New `Task`s: CLAIR, APIGen and many more!
- New CLAIR task: CLAIR uses an AI system to minimally revise a solution A→A′ such that the resulting preference A `preferred` A′ is much more contrastive and precise.
- New tasks to replicate the APIGen framework: `APIGenGenerator`, `APIGenSemanticChecker` and `APIGenExecutionChecker`. These tasks allow generating datasets like the one presented in the paper APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets.
- New URIAL task that allows using non-instruct models to generate a response for an instruction.
- New `TextClassification` task to perform zero-shot text classification based on a predefined but highly customizable prompt (a usage sketch follows this list).
- New `TextClustering` task to generate clusters from text and group your generations, discovering labels from your data. It comes with two steps to run the UMAP and DBSCAN algorithms.
- Updated `TextGeneration` to simplify the customization of tasks that don't require further post-processing.
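A minimal sketch of `TextClassification`; the `context` and `available_labels` parameter names are assumptions drawn from the docs:

```python
from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import TextClassification

text_classification = TextClassification(
    llm=OpenAILLM(model="gpt-4o-mini"),
    # Assumptions: `context` customizes the prompt, and `available_labels`
    # constrains the zero-shot label space
    context="Classify the topic of the user-provided news headline.",
    available_labels=["sports", "politics", "technology"],
)
```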
New Steps to sample data in your pipelines and remove duplicates
- New `DataSampler` step to sample data from other datasets, which can be useful to inject different few-shot examples into your prompts.
- New `EmbeddingDedup` step to remove duplicates based on embeddings and a distance metric.
- New `MinHashDedup` step to remove near-duplicates from text based on the MinHash and MinHashLSH algorithms (see the sketch after this list).
- New `TruncateTextColumn` step to truncate the length of your texts using either the character length or the number of tokens based on a tokenizer.
- New `CombineOutputs` step to combine the outputs of two or more steps into a single output.
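A minimal sketch of `MinHashDedup`; the parameter names are assumptions drawn from the docs, and the defaults may differ:

```python
from distilabel.steps import MinHashDedup

minhash_dedup = MinHashDedup(
    tokenizer="words",  # assumption: texts are tokenized into words before hashing
    threshold=0.9,      # assumption: Jaccard similarity above which rows count as near-duplicates
)
```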
Generate text embeddings using vLLM
- Now you can generate embeddings using `vLLMEmbeddings`!
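A minimal sketch, assuming `vLLMEmbeddings` follows the `Embeddings` interface with `load` and `encode`:

```python
from distilabel.embeddings import vLLMEmbeddings

embeddings = vLLMEmbeddings(model="intfloat/e5-mistral-7b-instruct")
embeddings.load()

# `encode` is assumed to return one embedding vector per input text
vectors = embeddings.encode(inputs=["distilabel is awesome!"])
```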
Extra things
- Easily visualize the tasks' prompts using the `Task.print` method (see the sketch after this list).
- New `use_default_structured_output` flag in tasks to automatically use structured generation in some tasks that can benefit from it.
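A minimal sketch of `Task.print`; loading the task (and its LLM credentials) first is an assumption:

```python
from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import UltraFeedback

task = UltraFeedback(llm=OpenAILLM(model="gpt-4o-mini"))
task.load()  # assumption: requires OPENAI_API_KEY to be set
task.print()  # renders the formatted prompt the task would send to the LLM
```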
What's Changed
- Make `ClientvLLM.model_name` a `cached_property` by @gabrielmbmb in #862
- Pass dataset to dry_run method by @plaguss in #863
- Add default structured output for `GenerateSentencePair` task by @plaguss in #868
- Complexity scorer default structured output by @plaguss in #870
- Quality scorer default structured output by @plaguss in #873
- Ultrafeedback default structured output by @plaguss in #876
- Remove use of `default_chat_template` by @gabrielmbmb in #888
- Temporary fix for installing `llama-cpp-python` by @gabrielmbmb in #886
- Fix unit tests after release of `transformers==4.44.0` by @gabrielmbmb in #891
- Fix default structured output by @plaguss in #892
- Send as many batches as possible to input queues by @gabrielmbmb in #895
- Exclude `repo_id` from `LoadDataFromFileSystem` by @plaguss in #898
- Fix loader to read from a glob pattern by @plaguss in #877
- Add `save_artifact` method to `_Step` by @gabrielmbmb in #871
- Add new `add_raw_input` argument to `_Task` so we can automatically include the formatted input by @plaguss in #903
- New `TruncateTextColumn` to truncate the length of texts using the number of tokens or characters by @plaguss in #902
- Update `inputs` and `outputs` interface to allow returning dict indicating optionality by @gabrielmbmb in #883
- Update mistrallm by @plaguss in #904
- Deepseek prover by @plaguss in #907
- Update `RewardModelScore.inputs` property by @gabrielmbmb in #908
- Add tutorial - generate data for training embeddings and reranking models by @davidberenstein1957 in #893
- Fix load data from disk by @plaguss in #910
- docs: minor fixes by @davidberenstein1957 in #913
- Add `URIAL` task by @gabrielmbmb in #921
- Add `vLLMEmbeddings` by @plaguss in #920
- docs: add tutorials preference and clean by @sdiazlor in #917
- Fix `StructuredGeneration` examples and internal check by @plaguss in #912
- Generate deterministic pipeline name when it's not given by @plaguss in #878
- Add custom errors by @plaguss in #911
- Docs/tutorials fix by @sdiazlor in #922
- Add `revision` runtime parameter to `LoadDataFromHub` by @gabrielmbmb in #928
- Add plausible as replacement for GA by @davidberenstein1957 in #929
- Add minhash related steps to deduplicate texts by @plaguss in #931
- docs: API reference review by @sdiazlor in #932
- Refactor of MinHash to work with a single class and fix the shelve backend by @plaguss in #937
- Update `make_generator_step` to set pipeline to step and add edge to steps in trophic level 1 by @gabrielmbmb in https://g...
1.3.2
What's Changed
- Deepseek prover task by @plaguss in #733
- Do not cancel in progress docs workflows by @gabrielmbmb in #919
- Fix creating Ray placement groups for vLLM by @gabrielmbmb in #918
- Fix passing `base_url` in `model_id` in `InferenceEndpointsLLM` by @gabrielmbmb in #924
Full Changelog: 1.3.1...1.3.2
1.3.1
What's Changed
- Create new `distilabel.constants` module to store constants and avoid circular imports by @plaguss in #861
- Add OpenAI request timeout by @ashim-mahara in #858
New Contributors
- @ashim-mahara made their first contribution in #858
Full Changelog: 1.3.0...1.3.1
1.3.0
What's Changed
- Add new step `CombineKeys` by @plaguss in #747
- Refactor naming columns steps combinecolumns combinekeys expandcolumns by @davidberenstein1957 in #758
- Drop remove deprecated `LoadHubDataset` by @davidberenstein1957 in #759
- Add `requirements` list for `Pipeline` by @plaguss in #720
- Add `StepResources` and step replicas in `Pipeline` by @gabrielmbmb in #750
- Add load stages by @gabrielmbmb in #760
- Update min required version to `python==3.9` by @gabrielmbmb in #770
- Optionally include the pipeline script in the hub when pushing your distiset by @plaguss in #762
- Add `docs-pr.yml` and `docs-pr-close.yml` workflows by @gabrielmbmb in #774
- Add `RayPipeline` class by @gabrielmbmb in #769
- Fixed closed PR workflow by @gabrielmbmb in #776
- Add `Magpie` and `MagpieGenerator` tasks by @gabrielmbmb in #778
- Fix some issues related to `Magpie` task by @gabrielmbmb in #783
- Add `end_with_user` and `include_system_prompt` flags to `Magpie` tasks and handle `None`s by @gabrielmbmb in #784
- Add workflow concurrency group for publishing docs by @gabrielmbmb in #796
- Add `_desired_num_gpus` attribute to `CudaDevicePlacementMixin` by @gabrielmbmb in #795
- Compatibility with `vLLM` with `tensor_parallel_size` argument by @gabrielmbmb in #805
- Update default names in `GroupColumns` by @plaguss in #808
- Request batches to `GeneratorStep` if only step in pipeline by @gabrielmbmb in #828
- Add default name for a pipeline by @plaguss in #809
- Update distilabel phrasing based on PR hugging face hub by @davidberenstein1957 in #821
- Some more `Magpie` improvements by @gabrielmbmb in #833
- Add `Embeddings` base class, `SentenceTransformerEmbeddings` class, `EmbeddingGeneration` and `FaissNearestNeighbour` steps by @gabrielmbmb in #830
- Create file per hostname in `CudaDevicePlacementMixin` by @gabrielmbmb in #814
- Create a `GeneratorStep` from a dataset using a helper function by @plaguss in #812
- Do not take into account `disable_cuda_device_placement` for pipeline signature by @gabrielmbmb in #838
- Add `RewardModelScore` step by @gabrielmbmb in #840
- Fix `LoadDataFromHub` attribute `_dataset` had `ellipsis` by default instead of `None` by @gabrielmbmb in #841
- Create `PlacementGroup` for steps using `vLLM` by @gabrielmbmb in #842
- Update `argilla` integration to use `argilla_sdk` v2 by @alvarobartt in #705
- Make `overall-rating` the default aspect for `UltraFeedback` task by @gabrielmbmb in #843
- fix typo index.md by @franperic in #844
- Use `CudaDevicePlacementMixin` in `RewardModelScore` step by @gabrielmbmb in #845
- Gather GPUs per Ray node to create placement groups by @gabrielmbmb in #848
- Fix typo in docs by @plaguss in #850
- Add `xfail` routing batch function tests by @gabrielmbmb in #852
- Fix creating placement group when `pipeline_parallel_size>1` by @gabrielmbmb in #851
- docs: 846 docs include google analytics by @davidberenstein1957 in #847
- Add `ClientvLLM` class by @gabrielmbmb in #854
- Add hard-negative flag to include similar challenging negatives on triplets by @plaguss in #856
- Add bibtex references in the docstrings to be shown in the README by @plaguss in #855
- `distilabel` 1.3.0 by @gabrielmbmb in #857
New Contributors
- @franperic made their first contribution in #844
Full Changelog: 1.2.4...1.3.0
1.2.4
What's Changed
- Update `InferenceEndpointsLLM` to use `chat_completion` method by @gabrielmbmb in #815
Full Changelog: 1.2.3...1.2.4
1.2.3
What's Changed
- Fix Import Error for KeepColumns in instruction_backtranslation.md (Issue #785) by @Hassaan-Qaisar in #786
- Correct variable name in dataset push example (in ultrafeedback.md file) (Issue #787) by @Hassaan-Qaisar in #791
- docs: update script for issue dashboard by @sdiazlor in #775
- Fix 404 model not found for private Serverless IE by @dvsrepo in #806
New Contributors
- @Hassaan-Qaisar made their first contribution in #786
Full Changelog: 1.2.2...1.2.3
1.2.2
What's Changed
- Fix passing `input` to `format_output` function by @gabrielmbmb in #781
Full Changelog: 1.2.1...1.2.2
1.2.1
What's Changed
- Fix docs for distiset.save_to_disk kwargs by @fpreiss in #745
- docs: change references by @sdiazlor in #754
- Fix `response_format` for `TogetherLLM` and `AnyScaleLLM` by @gabrielmbmb in #764
New Contributors
Full Changelog: 1.2.0...1.2.1
1.2.0
✨ Release highlights
Structured generation with `instructor`, structured generation support in `InferenceEndpointsLLM`, and the new `StructuredGeneration` task
- `instructor` has been integrated, bringing support for structured generation with `OpenAILLM`, `AnthropicLLM`, `LiteLLM`, `MistralLLM`, `CohereLLM` and `GroqLLM`:

Structured generation with `instructor` example
```python
from typing import List

from distilabel.llms import MistralLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration
from pydantic import BaseModel, Field


class Node(BaseModel):
    id: int
    label: str
    color: str


class Edge(BaseModel):
    source: int
    target: int
    label: str
    color: str = "black"


class KnowledgeGraph(BaseModel):
    nodes: List[Node] = Field(default_factory=list)
    edges: List[Edge] = Field(default_factory=list)


with Pipeline(
    name="Knowledge-Graphs",
    description=(
        "Generate knowledge graphs to answer questions, this type of dataset can be used to "
        "steer a model to answer questions with a knowledge graph."
    ),
) as pipeline:
    sample_questions = [
        "Teach me about quantum mechanics",
        "Who is who in The Simpsons family?",
        "Tell me about the evolution of programming languages",
    ]

    load_dataset = LoadDataFromDicts(
        name="load_instructions",
        data=[
            {
                "system_prompt": "You are a knowledge graph expert generator. Help me understand by describing everything as a detailed knowledge graph.",
                "instruction": f"{question}",
            }
            for question in sample_questions
        ],
    )

    text_generation = TextGeneration(
        name="knowledge_graph_generation",
        llm=MistralLLM(
            model="open-mixtral-8x22b", structured_output={"schema": KnowledgeGraph}
        ),
    )

    load_dataset >> text_generation
```
- `InferenceEndpointsLLM` now supports structured generation.
- New `StructuredGeneration` task that allows defining the schema of the structured generation per input row (a sketch follows this list).
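A minimal sketch of `StructuredGeneration`; the per-row `structured_output` column with `format` and `schema` keys follows the documented pattern, but treat the exact keys as assumptions:

```python
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import StructuredGeneration

with Pipeline(name="structured-generation") as pipeline:
    load_dataset = LoadDataFromDicts(
        data=[
            {
                "instruction": "Create an RPG character profile in JSON format.",
                # Assumption: each row carries its own schema in `structured_output`
                "structured_output": {
                    "format": "json",
                    "schema": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "role": {"type": "string"},
                        },
                        "required": ["name", "role"],
                    },
                },
            }
        ],
    )

    structured_generation = StructuredGeneration(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3-70B-Instruct"),
    )

    load_dataset >> structured_generation
```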
New tasks for generating datasets for training embedding models
`sentence-transformers` v3 was recently released, and we couldn't resist the urge to add a few new tasks to allow creating datasets for training embedding models!
- New `GenerateSentencePair` task that allows generating a `positive` sentence for an input `anchor`, and optionally also a `negative` sentence. The task allows creating different kinds of data by specifying the `action` to perform with respect to the `anchor`: paraphrasing, generating a semantically similar sentence, generating a query, or generating an answer (a usage sketch follows this list).
- Implemented Improving Text Embeddings with Large Language Models, adding the following tasks derived from the paper:
  - `EmbeddingTaskGenerator`, which allows generating new embedding-related tasks using an `LLM`.
  - `GenerateTextRetrievalData`, which allows creating text retrieval data with an `LLM`.
  - `GenerateShortTextMatchingData`, which allows creating short texts matching the input data.
  - `GenerateLongTextMatchingData`, which allows creating long texts matching the input data.
  - `GenerateTextClassificationData`, which allows creating text classification data from the input data.
  - `MonolingualTripletGenerator`, which allows creating monolingual triplets from the input data.
  - `BitextRetrievalGenerator`, which allows creating bitext retrieval data from the input data.
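A minimal sketch of `GenerateSentencePair`; the `action` and `triplet` parameters match the description above, but treat their exact names as assumptions:

```python
from distilabel.llms import OpenAILLM
from distilabel.steps.tasks import GenerateSentencePair

generate_pairs = GenerateSentencePair(
    llm=OpenAILLM(model="gpt-4o-mini"),
    action="paraphrase",  # what to generate with respect to the anchor
    triplet=True,         # assumption: also generate a negative sentence
)
```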
New `Step`s for loading data from different sources and saving/loading a `Distiset` to disk
We've added a few new steps that allow loading data from different sources:
- `LoadDataFromDisk` allows loading a `Distiset` or `datasets.Dataset` that was previously saved using the `save_to_disk` method.
- `LoadDataFromFileSystem` allows loading a `datasets.Dataset` from a file system.
Thanks to @rasdani for helping us test these new tasks!
In addition, we have added a `save_to_disk` method to `Distiset`, akin to `datasets.Dataset.save_to_disk`, that allows saving the generated distiset to disk, along with the `pipeline.yaml` and `pipeline.log`.
`save_to_disk` example
```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.run(...)
    distiset.save_to_disk(dataset_path="my-distiset")
```
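And a sketch of loading it back; the `Distiset.load_from_disk` classmethod is assumed from the matching load/save functionality mentioned in the changelog:

```python
from distilabel.distiset import Distiset

# Assumption: restores the distiset saved above with `save_to_disk`
distiset = Distiset.load_from_disk("my-distiset")
```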
`MixtureOfAgentsLLM` implementation
We've added a new `LLM` called `MixtureOfAgentsLLM`, derived from the paper Mixture-of-Agents Enhances Large Language Model Capabilities. This new `LLM` allows generating improved outputs thanks to the collective expertise of several `LLM`s.
`MixtureOfAgentsLLM` example
```python
from distilabel.llms import MixtureOfAgentsLLM, InferenceEndpointsLLM

llm = MixtureOfAgentsLLM(
    aggregator_llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    ),
    proposers_llms=[
        InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
        InferenceEndpointsLLM(
            model_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
            tokenizer_id="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
        ),
        InferenceEndpointsLLM(
            model_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
            tokenizer_id="HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1",
        ),
    ],
    rounds=2,
)

llm.load()

output = llm.generate(
    inputs=[
        [
            {
                "role": "user",
                "content": "My favorite witty review of The Rings of Power series is this: Input:",
            }
        ]
    ]
)
```
Saving cache and passing batches to `GlobalStep`s optimizations
- The cache logic of the `_BatchManager` has been improved to incrementally update the cache, making the process much faster.
- The data of the input batches of the `GlobalStep`s will be passed to the step using the file system, as this is faster than passing it using the queue. This is possible thanks to the new integration of `fsspec`, which can be configured to use a file system or cloud storage as the backend for passing the data of the batches (see the sketch after this list).
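A minimal sketch of configuring the file-system backend; the `storage_parameters` and `use_fs_to_pass_data` arguments follow the documented interface, but treat their exact names as assumptions:

```python
from distilabel.pipeline import Pipeline

with Pipeline(name="my-pipeline") as pipeline:
    ...

if __name__ == "__main__":
    distiset = pipeline.run(
        # Assumption: any fsspec-compatible path works here, e.g. a GCS bucket
        storage_parameters={"path": "gcs://my-bucket"},
        use_fs_to_pass_data=True,
    )
```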
`BasePipeline` and `_BatchManager` refactor

The logic around `BasePipeline` and `_BatchManager` has been refactored, which will make it easier to implement new pipelines in the future.
Added `ArenaHard` as an example of how to use `distilabel` to implement a benchmark

`distilabel` can easily be used to create an `LLM` benchmark. To showcase this, we decided to implement Arena Hard as an example: Benchmarking with `distilabel`: Arena Hard.
📚 Improved documentation structure
We have updated the documentation structure to make it more clear and self-explanatory, as well as more visually appealing 😏.
What's Changed
- Add `prometheus.md` by @alvarobartt in #656
- Reduce time required to execute `_cache` method by @gabrielmbmb in #672
- [DOCS] Update theme styles and images by @leiyre in #667
- Fix circular import due to DISTILABEL_METADATA_KEY by @plaguss in #675
- Add `CITATION.cff` by @alvarobartt in #677
- Deprecate conversation support in `TextGeneration` in favour of `ChatGeneration` by @alvarobartt in #676
- Add functionality to load/save distisets to/from disk by @plaguss in #673
- Integration instructor by @plaguss in #654
- Fix docs of saving/loading distiset from disk by @plaguss in https://githu...