Add TextClassification, UMAP, DBSCAN and TextClustering tasks #948

Merged
plaguss merged 27 commits into develop from text-clustering on Sep 16, 2024

Contributor

@plaguss plaguss commented Sep 5, 2024

Description

This PR adds a general TextClassification task, which can be used for both single-label and multi-label classification.

It also adds UMAP, DBSCAN and TextClustering, which are closely related to each other and are used together to generate clusters of texts.

The TextClassification task was defined as part of a text clustering pipeline, so it may be biased towards that use case, but it should be general enough to work on any common text classification task.
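To illustrate the clustering idea behind these steps, here is a minimal pure-Python sketch of density-based clustering in the style of DBSCAN. It is not the code added in this PR. It assumes the texts have already been embedded and reduced to 2D points (the role UMAP plays in the pipeline), and the names `region_query`, `dbscan` and the sample points are hypothetical, chosen only for this example.

```python
import math

NOISE = -1  # label used for points that belong to no cluster


def region_query(points, idx, eps):
    """Return the indices of all points within eps of points[idx] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_samples):
    """Assign a cluster id to each point, or NOISE if it is not density-reachable."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        # points[i] is a core point: start a new cluster and expand it
        cluster_id += 1
        labels[i] = cluster_id
        queue = [n for n in neighbors if n != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Hypothetical 2D projections of text embeddings: two dense groups and one outlier.
points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 0)]
labels = dbscan(points, eps=0.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

In a text clustering pipeline, each resulting cluster id could then be given a human-readable label by a classification step, while points labelled `-1` are treated as unclustered.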

TODO: Create an example in the docs of a full text clustering pipeline, including the label inference.

@plaguss plaguss requested review from gabrielmbmb and removed request for gabrielmbmb September 5, 2024 09:26
@plaguss plaguss self-assigned this Sep 5, 2024
@plaguss plaguss added the enhancement New feature or request label Sep 5, 2024
@plaguss plaguss linked an issue Sep 5, 2024 that may be closed by this pull request
github-actions bot commented Sep 5, 2024

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-948/

codspeed-hq bot commented Sep 5, 2024

CodSpeed Performance Report

Merging #948 will not alter performance

Comparing text-clustering (c0cbe15) with develop (28ecbc4)

Summary

✅ 1 untouched benchmarks

@plaguss plaguss changed the title Add TextClassification task Add TextClassification, UMAP, DBSCAN and TextClustering tasks Sep 9, 2024
@plaguss plaguss added this to the 1.4.0 milestone Sep 10, 2024
@plaguss plaguss marked this pull request as ready for review September 11, 2024 13:05
Member

@gabrielmbmb gabrielmbmb left a comment

LGTM!

src/distilabel/steps/clustering/dbscan.py (resolved)
src/distilabel/steps/clustering/umap.py (resolved)
@plaguss plaguss merged commit f0067b8 into develop Sep 16, 2024
5 of 6 checks passed
@plaguss plaguss deleted the text-clustering branch September 16, 2024 04:22
Development

Successfully merging this pull request may close these issues.

[FEATURE] add TextClassificationLabeler Task
2 participants