PartOfSpeech representation reproducibility and word with index 0 #1981
Comments
Amazing, great catch! That's also a nasty habit of mine so I wouldn't be surprised if that happens in other places as well.
Sounds good and a minimal change as well, which I prefer!
I'm not sure if I understand correctly. Why would the first word be ignored?
That's because of how the values are converted to booleans. At line 140 a lookup is created that maps each word to its index (0-based), which is later used at line 144 to extract the indices. This lookup output is filtered using the condition `if words_lookup.get(keyword)`, so index 0 is treated as false and dropped along with the missing words. Example:
[v for v in [1, None, 2, 0, 3] if v]
# [1, 2, 3]
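The same effect with a dictionary lookup, as a minimal sketch. The vocabulary and keywords here are made up for illustration; only the name `words_lookup` and the two conditions come from the issue.

```python
# Hypothetical vocabulary standing in for the output of get_feature_names_out().
vocabulary = ["active", "hockey", "player", "season"]

# 0-based lookup from word to index, similar in spirit to the one described at line 140.
words_lookup = {word: index for index, word in enumerate(vocabulary)}

# "goalie" is deliberately not in the vocabulary.
keywords = ["active", "season", "goalie"]

# Truthiness check: .get() returns 0 for "active", and 0 is falsy, so it is dropped.
with_get = [words_lookup[k] for k in keywords if words_lookup.get(k)]

# Membership check: only words that are really missing are dropped.
with_in = [words_lookup[k] for k in keywords if k in words_lookup]

print(with_get)  # [3]
print(with_in)   # [0, 3]
```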
Hi @MaartenGr, I see the PR for part 1 was accepted - great!
@Greenpp Ah right, totally missed that! That indeed looks like an error which should be fixed. If you want, a PR would be appreciated!
Hi Maarten
Problem
Part 1
While working with BERTopic, I encountered a problem with the reproducibility of representations. I made sure to set `random_state` wherever possible. After reviewing all similar issues and trying things like disabling MST approximation in HDBSCAN (`approx_min_span_tree=False`) or setting the global random state with numpy (`numpy.random.seed`), I started digging into the library.
I found that the values remained constant until the `PartOfSpeech` representation module, and switching it to another one resolved the issue. The problem appears to be initially caused by the deduplication method (`list(set())`) used at lines 121 and 130. Because the `hash` function used for generating set keys is seeded at the start of the interpreter (the seed can be overridden using the `PYTHONHASHSEED` env variable), the output of such deduplication is different with each run. This behavior causes `word_indices` at line 144 to change with each run, which is later problematic when sorting keywords with the same c-TF-IDF, as they are arranged differently.
Part 2
When looking at the PoS code, I noticed that `word_indices` at line 144 are generated using the condition `if words_lookup.get(keyword)`, which ignores the first word returned by `get_feature_names_out`. It looks like an error.
MRE
Running this example can produce different representations, e.g. `2_season_hockey_player_active` and `2_season_hockey_active_player`, as `player` and `active` both have a c-TF-IDF of `0.009478917304532486`.
Solution
As the contribution guide suggests starting with an issue, I will post my suggestions here.
Part 1
Sort `word_indices` at line 144 using numpy. This will ensure consistent ordering of words, should be faster than the built-in sort, and will transform them into a numpy array for further operations.
Part 2
Change the `if words_lookup.get(keyword)` condition to `if keyword in words_lookup`.
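Both suggestions can be sketched together as follows. This is only an illustration under assumptions, not the library's actual code: the vocabulary, the keywords, and the helper `word_indices_for` are hypothetical, and the varying iteration order of `set()` is simulated by trying every permutation of the keywords instead of restarting the interpreter with different hash seeds.

```python
import itertools

import numpy as np

# Hypothetical vocabulary standing in for the output of get_feature_names_out().
vocabulary = ["active", "hockey", "player", "season"]
words_lookup = {word: index for index, word in enumerate(vocabulary)}


def word_indices_for(keywords):
    """Hypothetical helper combining both suggested changes.

    Part 2: the membership test keeps the word at index 0.
    Part 1: np.sort makes the result independent of the input order
    and returns a numpy array.
    """
    return np.sort([words_lookup[k] for k in keywords if k in words_lookup])


# Deduplicated keywords; "goalie" is deliberately not in the vocabulary.
unique_keywords = ["season", "player", "active", "goalie"]

# Every possible ordering of the keywords yields the same sorted indices.
results = {
    tuple(word_indices_for(order).tolist())
    for order in itertools.permutations(unique_keywords)
}
print(results)  # {(0, 2, 3)}
```

Since the indices come out in the same order no matter how the deduplicated keywords were arranged, keywords with identical c-TF-IDF values would no longer swap places between runs.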