Clustering using LLMs #24

kitsamho · 2023-05-31T20:15:31Z

It would be super cool to see some form of sklearn clustering / topic modelling implementation, where the inputs are texts and the outputs are the clusters/topics found in the data.

OKUA1 · 2023-05-31T22:17:52Z

Collaborator

@kitsamho, I think this is a very interesting idea. Could you elaborate a bit more on how you envision it working?

Just doing the topic extraction does not seem too complicated, but in that case the model might produce similar but not identical topics, which in extreme cases could result in n_clusters = n_samples.

So maybe n_clusters could be a hyperparameter, and the algorithm would work in 2 steps:

1. Topic extraction -> produces a topic per sample
2. Topic clustering -> groups topics by their similarity into n clusters and produces a "central" topic for each cluster
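
A minimal sketch of the second step (topic clustering), assuming the per-sample topics have already been extracted as short strings. Everything here is illustrative: `similarity`, `cluster_topics` and `central_topic` are hypothetical helper names, the similarity measure is plain character overlap from Python's `difflib` (a stand-in for whatever embedding-based similarity a real implementation would likely use), and the "central" topic is approximated by the medoid of each cluster.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for embedding similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_topics(topics: list[str], n_clusters: int) -> list[list[int]]:
    """Average-linkage agglomerative clustering over topic strings.

    Starts with one cluster per sample and repeatedly merges the two most
    similar clusters until n_clusters remain. Returns lists of sample indices.
    """
    if not 1 <= n_clusters <= len(topics):
        raise ValueError("n_clusters must be between 1 and len(topics)")
    clusters = [[i] for i in range(len(topics))]

    def linkage(c1: list[int], c2: list[int]) -> float:
        # Mean pairwise similarity between the members of two clusters.
        total = sum(similarity(topics[i], topics[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [
            (a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        a, b = max(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because b > a
    return clusters


def central_topic(topics: list[str], members: list[int]) -> str:
    """Medoid: the member topic with the highest total similarity to its cluster."""
    best = max(
        members,
        key=lambda i: sum(similarity(topics[i], topics[j]) for j in members),
    )
    return topics[best]


if __name__ == "__main__":
    # Made-up topics, as step 1 might produce them for four samples.
    topics = ["billing issue", "billing issues", "login error", "login errors"]
    for members in cluster_topics(topics, n_clusters=2):
        print(central_topic(topics, members), members)
```

The merge loop is cubic in the number of samples, so this is only meant to show the shape of the idea. For real data, embedding the topics and running something like scikit-learn's `KMeans` or `AgglomerativeClustering` would be the more natural fit.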

Of course, an alternative could be to pack all samples into a single prompt and specify the constraint on the number of topics right away, but this will only work for relatively small datasets, since everything has to fit into the model's context window.
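
To make the single-prompt alternative concrete, here is a rough sketch of how such a prompt could be assembled. The prompt wording, the function names and the character budget are all assumptions made for illustration; character count is only a crude proxy for the token limit of an actual model.

```python
def build_topic_prompt(texts: list[str], n_topics: int) -> str:
    """Pack every sample into one prompt and state the topic limit up front."""
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(texts))
    return (
        f"Below are {len(texts)} texts. Identify at most {n_topics} topics "
        "that cover them, then assign each text to exactly one topic.\n\n"
        f"{numbered}\n\n"
        "Answer with one line per text in the form '<text number>: <topic>'."
    )


def fits_in_budget(prompt: str, max_chars: int) -> bool:
    """Crude size check; a real implementation would count tokens instead."""
    return len(prompt) <= max_chars


if __name__ == "__main__":
    texts = ["Refund still not received", "App crashes on login"]
    prompt = build_topic_prompt(texts, n_topics=2)
    print(prompt)
    # The budget below is a made-up number, not a real model limit.
    print(fits_in_budget(prompt, max_chars=8000))
```

Because the prompt grows linearly with the number of samples, the size check starts failing as soon as the dataset gets large, which is the limitation mentioned above.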

What do you think? Or maybe you have a specific dataset in mind on which you could demonstrate the desired result?

1 reply

kitsamho Jun 1, 2023
Author

@OKUA1 let me think on this!

jorgeston · 2023-06-14T04:47:09Z

Maybe a good idea is to use the GPTSummarizer model with max_words=3 (or 2, or 1) to get "topics", i.e., a specific word or phrase that summarizes the idea of the text in a few words.
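
A small sketch of what treating very short summaries as topic labels could look like. It does not call GPTSummarizer itself: `summarize` is a hypothetical callable standing in for whatever model produces the summary, and `fake_summarize` is a dummy used only so the example runs without an API key. The word cap is enforced again after the call, on the assumption that a model will not always respect the limit it is given.

```python
import string
from collections import Counter
from typing import Callable


def extract_topics(
    texts: list[str],
    summarize: Callable[[str, int], str],
    max_words: int = 3,
) -> list[str]:
    """Turn each text into a short, normalized topic label."""
    topics = []
    for text in texts:
        summary = summarize(text, max_words)
        # Lowercase and strip punctuation so near-identical labels match.
        words = [w.strip(string.punctuation) for w in summary.lower().split()]
        words = [w for w in words if w][:max_words]  # enforce the cap ourselves
        topics.append(" ".join(words))
    return topics


def fake_summarize(text: str, max_words: int) -> str:
    """Dummy stand-in for an LLM call: returns the first few words of the text."""
    return " ".join(text.split()[:max_words])


if __name__ == "__main__":
    texts = [
        "Refund not received yet",
        "Refund not received after a week",
        "App crashes on login",
    ]
    topics = extract_topics(texts, fake_summarize, max_words=2)
    print(Counter(topics))
```

Counting exact label matches like this only groups samples whose labels come out identical, so a similarity-based grouping step would still be needed when the model words the same topic differently.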
