Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PartOfSpeech representation reproducibility and word with index 0 #1981

Closed
Greenpp opened this issue May 8, 2024 · 4 comments
Closed

PartOfSpeech representation reproducibility and word with index 0 #1981

Greenpp opened this issue May 8, 2024 · 4 comments

Comments

@Greenpp
Copy link
Contributor

Greenpp commented May 8, 2024

Hi Maarten

Problem

Part 1

While working with BERTopic, I encountered a problem with reproducibility of representations. I made sure to set random_state wherever possible. After reviewing all similar issues and trying things like disabling MST approximation in HDBSCAN (approx_min_span_tree=False) or setting global random state with numpy (numpy.random.seed), I started digging into the library.

I found that the values remained constant until the PartOfSpeech representation module and switching it to another one resolved the issue. The problem appears to be initially caused by the deduplication method (list(set())) used at lines 121 and 130. Because the hash function used for generating set keys is seeded at the start of the interpreter (seed can be overridden using PYTHONHASHSEED env variable), the output of such deduplication is different with each run. This behavior causes word_indices at line 144 to change with each run. Which is later problematic when sorting keywords with the same c-TF-IDF as they are arranged differently.

Part 2

When looking at the PoS code, I noticed that word_indices at line 144 are generated using the following condition if words_lookup.get(keyword) which ignores the first word returned by get_feature_names_out. It looks like an error.

MRE

from bertopic import BERTopic
from bertopic.representation import PartOfSpeech
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'), random_state=42)['data'][:100]

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))
representation_model = PartOfSpeech()
topic_model = BERTopic(
    umap_model=umap_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model
)
topic_model.fit_transform(docs)

topic_model.get_topic_info()

Running this example can produce different representations e.g. 2_season_hockey_player_active, 2_season_hockey_active_player, as player and active both have c-TF-IDF of 0.009478917304532486.

Solution

As the contribution guide suggests starting with an issue, I will post my suggestions here.

Part 1

  1. Sort the word_indices at line 144 using numpy. This will ensure consistent ordering of words, should be faster than built-in sort, and will transform them into numpy array for further operations.
  2. Remove the numpy array creation at line 145, as its handled by previous step.

Part 2

  1. Change the if words_lookup.get(keyword) condition to if keyword in words_lookup.
@MaartenGr
Copy link
Owner

I found that the values remained constant until the PartOfSpeech representation module and switching it to another one resolved the issue. The problem appears to be initially caused by the deduplication method (list(set())) used at lines 121 and 130. Because the hash function used for generating set keys is seeded at the start of the interpreter (seed can be overridden using PYTHONHASHSEED env variable), the output of such deduplication is different with each run. This behavior causes word_indices at line 144 to change with each run. Which is later problematic when sorting keywords with the same c-TF-IDF as they are arranged differently.

Amazing, great catch! That's also a nasty habit of mine so I wouldn't be surprised if that happens in other places as well.

Part 1
Sort the word_indices at line 144 using numpy. This will ensure consistent ordering of words, should be faster than built-in sort, and will transform them into numpy array for further operations.
Remove the numpy array creation at line 145, as its handled by previous step.

Sounds good and a minimal change as well, which I prefer!

When looking at the PoS code, I noticed that word_indices at line 144 are generated using the following condition if words_lookup.get(keyword) which ignores the first word returned by get_feature_names_out. It looks like an error.

I'm not sure if I understand correctly. Why would the first word be ignored?

@Greenpp
Copy link
Contributor Author

Greenpp commented May 10, 2024

I'm not sure if I understand correctly. Why would the first word be ignored?

That's because of how the values are converted to booleans. At line 140 a lookup is created that maps each word to its index (0 based), which is later used at line 144 to extract the indices. This lookup output is filtered using the condition if words_lookup.get(keyword) to prevent None values, as .get method on a dictionary returns None as a default value if the key isn't found. However, the same condition will also evaluate to False for the index 0, causing it to be ignored.

Example

[v for v in [1, None, 2, 0, 3] if v]
# [1, 2, 3]

@Greenpp
Copy link
Contributor Author

Greenpp commented May 29, 2024

Hi @MaartenGr,

I see the PR for part 1 was accepted - great!
What about the second part? Is this behavior intended? Should I open a new issue for that?

@MaartenGr
Copy link
Owner

@Greenpp Ah right, totally missed that! That indeed looks like an error which should be fixed. If you want, a PR would be appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants