
Update _langchain.py with [KEYWORDS] tag option #1871

Merged (2 commits) on Apr 1, 2024

Conversation

@mcantimmy (Contributor):

Updates the LangChain representation model so that topic keywords can be injected into the prompt via the '[KEYWORDS]' tag.
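To illustrate the idea behind the PR: a prompt containing the `[KEYWORDS]` tag has that tag replaced with the topic's top words before being sent to the LLM. This is only a minimal sketch of the tag-substitution concept, not BERTopic's actual implementation; the `render_prompt` helper below is hypothetical.

```python
# Hypothetical sketch of [KEYWORDS] tag substitution (not BERTopic's real code).
def render_prompt(prompt: str, keywords: list[str]) -> str:
    """Replace the [KEYWORDS] tag with a comma-separated list of topic keywords."""
    return prompt.replace("[KEYWORDS]", ", ".join(keywords))

prompt = "Given the associated [KEYWORDS], what are the preceding documents about?"
print(render_prompt(prompt, ["space", "nasa", "orbit"]))
# → "Given the associated space, nasa, orbit, what are the preceding documents about?"
```

If the prompt contains no `[KEYWORDS]` tag, it passes through unchanged, which is why the feature is opt-in.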

@MaartenGr (Owner):

Thanks for the PR! Could you perhaps make a brief mention in the docstrings that it only uses [KEYWORDS] (if used) and not [DOCUMENTS] like all other representation models do?

Commit: update docstring with KEYWORD tag use
@mcantimmy (Contributor, Author):

Okay, I added a mention, let me know if that works.

@MaartenGr (Owner):

Awesome, LGTM! Quick question: have you tested how it works with vs. without the `[KEYWORDS]` tag? It would be nice to see the effect of this PR before merging it.

@mcantimmy (Contributor, Author):

@MaartenGr

I haven't done extensive testing, but here is a quick comparison:

```python
from bertopic import BERTopic
from bertopic.representation import LangChain, MaximalMarginalRelevance
from bertopic.vectorizers import ClassTfidfTransformer
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
cats = ['sci.crypt', 'sci.electronics', 'sci.med', 'sci.space']
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'),
                          categories=cats, random_state=40)['data']
embeddings = embedding_model.encode(docs, show_progress_bar=True)

vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))
my_openai_api_key = "---"
chain = load_qa_chain(OpenAI(temperature=0, openai_api_key=my_openai_api_key),
                      chain_type="stuff")

# Create your representation model
# prompt = "What are the preceding documents about? Provide a single label in less than 5 words."
prompt = "Given the associated [KEYWORDS], what are the preceding documents about? Provide a single topic label in less than 5 words."
representation_model = {'MMR': MaximalMarginalRelevance(diversity=0.7),
                        'LLM': LangChain(chain, prompt=prompt, nr_docs=3,
                                         doc_length=300, tokenizer='vectorizer')}
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(embedding_model=embedding_model,
                       ctfidf_model=ctfidf_model,
                       nr_topics=8,
                       min_topic_size=10,
                       vectorizer_model=vectorizer_model,
                       representation_model=representation_model,
                       verbose=True)
topics, probs = topic_model.fit_transform(docs, embeddings)
```

Topics with keywords:

| Topic | Count | Name | MMR | LLM |
|---|---|---|---|---|
| -1 | 1557 | -1_use_edu_like_people | use, edu, like, people, com, know, don, time, just, government | Typing injuries and related information. |
| 0 | 1113 | 0_key_db_chip_encryption | key, db, chip, encryption, clipper, keys, des, use, bit, phone | Encryption and key management. |
| 1 | 524 | 1_space_nasa_earth_spacecraft | space, nasa, earth, spacecraft, orbit, mission, shuttle, solar, moon, venus | Planetary Probes and Space Exploration |
| 2 | 518 | 2_medical_cancer_hiv_patients | medical, cancer, hiv, patients, disease, health, 92, msg, aids, food | Medical information and resources. |
| 3 | 123 | 3_kirlian_photography_science_blue | kirlian, photography, science, blue, eye, uv, kirlian photography, methodology, leaf, scientific | Science and Kirlian Photography |
| 4 | 53 | 4_printer_ear_ink_deskjet | printer, ear, ink, deskjet, laser, printers, hp, wax, toner, laser printers | Printer technology |
| 5 | 49 | 5_battery_water_concrete_cooling | battery, water, concrete, cooling, steam, heat, temperature, discharge, batteries, towers | Lead acid battery discharge. |
| 6 | 15 | 6_ground_grounding_neutral_conductor | ground, grounding, neutral, conductor, grounding conductor, wire, connected, panel, grounded, current | Electrical wiring and grounding practices. |

Topics without keywords:

| Topic | Count | MMR | LLM |
|---|---|---|---|
| -1 | 1466 | db, use, edu, like, space, don, know, time, com, people | Commercial space news and technology. |
| 0 | 585 | space, nasa, earth, spacecraft, mission, solar, orbit, venus, moon, shuttle | Space mission FAQs and acronyms. |
| 1 | 578 | use, amp, power, radio, just, voltage, output, like, ve, line | Audio switches and interference. |
| 2 | 558 | cancer, medical, hiv, health, patients, disease, 92, food, doctor, aids | Medical information and resources. |
| 3 | 547 | key, encryption, keys, government, chip, clipper, security, des, privacy, law | Government and industry collaboration. |
| 4 | 154 | thanks lot, thanks, yxy4145, yxy4145 usl, quoting, edu thanks, usl edu, usl, 1993apr26, article 1993apr26 | Email exchange about paper reference. |
| 5 | 49 | battery, water, concrete, cooling, steam, heat, temperature, discharge, batteries, towers | Lead acid battery chemistry. |
| 6 | 15 | ground, grounding, neutral, conductor, grounding conductor, wire, connected, panel, grounded, current | Electrical wiring and grounding practices. |

Let me know if you need anything else.

@MaartenGr (Owner):

Thanks for testing! Could you perhaps do one last test using the same example but making sure that the same topics are created? You can do so by fixing the seed of UMAP. That way, it will be a fair comparison between two identical topics that only have different representations.

@mcantimmy (Contributor, Author):

@MaartenGr

Sure, I included this UMAP model with a fixed seed:

```python
from umap import UMAP

# Fixed random_state so both runs produce identical topics;
# passed to BERTopic via umap_model=umap_model.
umap_model = UMAP(n_neighbors=15, n_components=5,
                  min_dist=0.0, metric='cosine', random_state=42)
```

With keywords:

| Topic | Count | MMR | LLM |
|---|---|---|---|
| -1 | 1539 | use, edu, like, new, just, know, don, information, time, com | Internet privacy and anonymity. |
| 0 | 573 | key, encryption, keys, chip, des, government, clipper, security, people, bit | Encryption technology and government involvement. |
| 1 | 565 | space, nasa, shuttle, orbit, earth, mission, science, moon, launch, lunar | Space exploration |
| 2 | 533 | db, mov, bh, si, cs, al, byte, use, power, cs si | Computer hardware upgrades |
| 3 | 513 | cancer, medical, 92, msg, food, patients, doctor, diet, treatment, candida | Medical treatments for cancer. |
| 4 | 127 | reversed good, hello got, got reversed, good exit, exit does, know tell, reversed, exit, hello, does know | Reversing document order. |
| 5 | 87 | battery, water, concrete, cooling, heat, temperature, steam, discharge, batteries, towers | Battery discharge and cooling towers. |
| 6 | 15 | ground, grounding, neutral, conductor, grounding conductor, wire, connected, panel, grounded, current | Electrical wiring and grounding practices. |

Without keywords:

| Topic | Count | MMR | LLM |
|---|---|---|---|
| -1 | 1539 | use, edu, like, new, just, know, don, information, time, com | Internet privacy and anonymity. |
| 0 | 573 | key, encryption, keys, chip, des, government, clipper, security, people, bit | Public key encryption and privacy. |
| 1 | 565 | space, nasa, shuttle, orbit, earth, mission, science, moon, launch, lunar | Space missions and membership information. |
| 2 | 533 | db, mov, bh, si, cs, al, byte, use, power, cs si | "Technical documents for printer upgrades" |
| 3 | 513 | cancer, medical, 92, msg, food, patients, doctor, diet, treatment, candida | Medical guidelines and resources. |
| 4 | 127 | reversed good, hello got, got reversed, good exit, exit does, know tell, reversed, exit, hello, does know | Reversing document order. |
| 5 | 87 | battery, water, concrete, cooling, heat, temperature, steam, discharge, batteries, towers | Chemical reactions in batteries. |
| 6 | 15 | ground, grounding, neutral, conductor, grounding conductor, wire, connected, panel, grounded, current | Electrical wiring and grounding practices. |

@MaartenGr (Owner):

Awesome, it seems there is a small difference between the representations, but since the differences are minor, that could also be explained simply by the temperature. I can imagine the difference becoming larger if the documents are less representative, or if more domain-specific keywords are used that might not be present in the main documents.

LGTM! Thank you for your work on this!

@MaartenGr MaartenGr merged commit 424cefc into MaartenGr:master Apr 1, 2024
2 checks passed