Better error message when passing a df to docs #1589

Open
zilch42 opened this issue Oct 23, 2023 · 2 comments

@zilch42
Contributor

zilch42 commented Oct 23, 2023

Many methods, such as visualize_documents, take a docs argument that should be a list. Most of the time my documents are stored in a pd.DataFrame because they have other metadata associated with them, and I often inadvertently pass the DataFrame to these methods instead of the list. Even though there is type checking on the input:

https://github.com/MaartenGr/BERTopic/blob/62e97ddea6cdcf9e4da25f9eaed478b22a9f9e20/bertopic/plotting/_documents.py#L9C1-L11C50

It doesn't throw a helpful error (e.g. 'docs should be type List'). Instead it throws a KeyError with a seemingly random number, which often takes me an embarrassing amount of time to debug before I remember that I've been here before and know what I've done wrong.
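
A minimal sketch of why the failure surfaces as a KeyError, assuming only standard pandas indexing behaviour: a list is indexed by position, whereas indexing a DataFrame with a scalar looks up a column label. The traceback in this report shows `docs[index]` being evaluated inside visualize_documents, so the number in the error is presumably just the first document index that was looked up. The three-document data here is made up for illustration.

```python
import pandas as pd

docs = ["first document", "second document", "third document"]
docs_df = pd.DataFrame({"docs": docs, "year": 2000})

# A list is indexed by position.
assert docs[1] == "second document"

# A DataFrame is indexed by column label, so the integer 1 is treated as a
# column name. There is no column called 1, hence the KeyError.
try:
    docs_df[1]
except KeyError as err:
    message = f"KeyError: {err}"
    print(message)

# Passing the column as a plain list restores positional indexing.
docs_list = docs_df["docs"].tolist()
assert docs_list[1] == "second document"
```

In the reproducer, the equivalent workaround would be to pass `docs_df["docs"].tolist()` instead of `docs_df`.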

Is there any chance that all of these methods could either do some stricter type checking or check for a dataframe on that input?
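
As a rough sketch of the kind of check being asked for, and not something that exists in BERTopic, a small guard could reject anything that is not a list-like of strings and say so explicitly. The name `validate_docs` and the error wording are hypothetical. The check relies on `collections.abc.Sequence`, which lists and tuples satisfy but a pandas DataFrame does not; note that it would also reject a pandas Series.

```python
from collections.abc import Sequence


def validate_docs(docs):
    """Hypothetical helper: raise a descriptive TypeError unless docs is a list-like of strings."""
    # A str is itself a Sequence, so reject it explicitly.
    if isinstance(docs, str) or not isinstance(docs, Sequence):
        raise TypeError(
            f"docs should be a list of strings, got {type(docs).__name__}. "
            "If your documents are in a DataFrame, pass one column as a list, "
            "e.g. df['docs'].tolist()."
        )
    # Checking every item is linear in the number of documents.
    if not all(isinstance(doc, str) for doc in docs):
        raise TypeError("every item in docs should be a string")
    return docs
```

Called at the top of a plotting method, a guard like this would fail immediately with a message naming the offending type, rather than failing later inside an indexing call.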

from sklearn.datasets import fetch_20newsgroups
import pandas as pd
from bertopic import BERTopic

data = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))
docs = data['data'][0:500]
docs_df = pd.DataFrame({"docs": docs, "year": 2000})

topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)

topic_model.visualize_documents(docs_df)
KeyError                                  Traceback (most recent call last)
File c:\Users\abb064\AppData\Local\miniconda3\envs\csiro-horizon-scanning39\lib\site-packages\pandas\core\indexes\base.py:3790, in Index.get_loc(self, key)
   3789 try:
-> 3790     return self._engine.get_loc(casted_key)
   3791 except KeyError as err:

File index.pyx:152, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:181, in pandas._libs.index.IndexEngine.get_loc()

File pandas\_libs\hashtable_class_helper.pxi:7080, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas\_libs\hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 413

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)

File c:\Users\XXXAppData\Local\miniconda3\envs\csiro-horizon-scanning39\lib\site-packages\bertopic\_bertopic.py:2286, in BERTopic.visualize_documents(self, docs, topics, embeddings, reduced_embeddings, sample, hide_annotations, hide_document_hover, custom_labels, title, width, height)
   2216 """ Visualize documents and their topics in 2D
   2217 
   2218 Arguments:
   (...)
   2283 style="width:1000px; height: 800px; border: 0px;""></iframe>
   2284 """
   2285 check_is_fitted(self)
-> 2286 return plotting.visualize_documents(self,
   2287                                     docs=docs,
   2288                                     topics=topics,
   2289                                     embeddings=embeddings,
   2290                                     reduced_embeddings=reduced_embeddings,
   2291                                     sample=sample,
   2292                                     hide_annotations=hide_annotations,
   2293                                     hide_document_hover=hide_document_hover,
   2294                                     custom_labels=custom_labels,
   2295                                     title=title,
   2296                                     width=width,
   2297                                     height=height)

File c:\Users\XXXAppData\Local\miniconda3\envs\csiro-horizon-scanning39\lib\site-packages\bertopic\plotting\_documents.py:105, in visualize_documents(topic_model, docs, topics, embeddings, reduced_embeddings, sample, hide_annotations, hide_document_hover, custom_labels, title, width, height)
    102 indices = np.array(indices)
    104 df = pd.DataFrame({"topic": np.array(topic_per_doc)[indices]})
--> 105 df["doc"] = [docs[index] for index in indices]
    106 df["topic"] = [topic_per_doc[index] for index in indices]
    108 # Extract embeddings if not already done

File c:\Users\XXXAppData\Local\miniconda3\envs\csiro-horizon-scanning39\lib\site-packages\bertopic\plotting\_documents.py:105, in <listcomp>(.0)
    102 indices = np.array(indices)
    104 df = pd.DataFrame({"topic": np.array(topic_per_doc)[indices]})
--> 105 df["doc"] = [docs[index] for index in indices]
    106 df["topic"] = [topic_per_doc[index] for index in indices]
    108 # Extract embeddings if not already done

File c:\Users\XXXAppData\Local\miniconda3\envs\csiro-horizon-scanning39\lib\site-packages\pandas\core\frame.py:3896, in DataFrame.__getitem__(self, key)
   3894 if self.columns.nlevels > 1:
   3895     return self._getitem_multilevel(key)
-> 3896 indexer = self.columns.get_loc(key)
   3897 if is_integer(indexer):
   3898     indexer = [indexer]

File c:\Users\XXX\AppData\Local\miniconda3\envs\csiro-horizon-scanning39\lib\site-packages\pandas\core\indexes\base.py:3797, in Index.get_loc(self, key)
   3792     if isinstance(casted_key, slice) or (
   3793         isinstance(casted_key, abc.Iterable)
   3794         and any(isinstance(x, slice) for x in casted_key)
   3795     ):
   3796         raise InvalidIndexError(key)
-> 3797     raise KeyError(key) from err
   3798 except TypeError:
   3799     # If we have a listlike key, _check_indexing_error will raise
   3800     #  InvalidIndexError. Otherwise we fall through and re-raise
   3801     #  the TypeError.
   3802     self._check_indexing_error(key)

KeyError: 413
@MaartenGr
Owner

Even though there is type checking on the input:

Technically, there is no type checking on the input, only type hinting. The difference lies, in part, in Python's design philosophy, where typing is not enforced.
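
A small illustration of that distinction, using a made-up function rather than anything from BERTopic: the annotation says `List[str]`, but Python stores it as metadata and never checks it when the function is called.

```python
from typing import List


def count_docs(docs: List[str]) -> int:
    # The annotation documents intent; Python does not check it at call time.
    return len(docs)


# A dict is not a List[str], yet the call runs without any error.
result = count_docs({"a": 1, "b": 2})
print(result)

# The hint is only stored as metadata on the function object.
print(count_docs.__annotations__)
```

Static checkers such as mypy can flag a call like this before the code runs, but nothing happens at runtime unless the function validates its arguments itself.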

Is there any chance that all of these methods could either do some stricter type checking or check for a dataframe on that input?

Sure, I would have to check for the best implementation though. I do not think stricter type checking is the solution here since if I were to start doing that here, shouldn't I do it for all other variables? However, allowing for a pandas series to be passed should be possible. An entire pandas dataframe is a different story though as that would also need to involve checking the columns to use which can open up some problems.
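
One possible shape for this, sketched under the assumption that pandas is importable wherever the check runs: convert a Series to a plain list and refuse a DataFrame with a clear message. The helper name `as_doc_list` is hypothetical and does not exist in BERTopic.

```python
import pandas as pd


def as_doc_list(docs):
    """Hypothetical helper: accept a list or a pandas Series, reject a DataFrame."""
    if isinstance(docs, pd.DataFrame):
        raise TypeError(
            "docs should be a list of strings or a pandas Series, not a DataFrame; "
            "select a single column first, e.g. df['docs']"
        )
    if isinstance(docs, pd.Series):
        # tolist() drops the index, so later positional lookups behave like a list.
        return docs.tolist()
    return list(docs)


# A Series whose index is not 0..n-1, as happens after filtering rows.
series = pd.Series(["doc a", "doc b"], index=[10, 20])
print(as_doc_list(series))
```

Converting to a list matters because a filtered Series keeps its original index labels, so positional indexing such as `docs[0]` would no longer be reliable without the conversion.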

@zilch42
Contributor Author

zilch42 commented Oct 24, 2023

Technically, there is no type checking on the input, only type hinting. The difference lies, in part, in Python's design philosophy, where typing is not enforced.

Ah thanks, I hadn't actually realised that distinction.

shouldn't I do it for all other variables?

I see your point. Maybe. But I think most other variables are pretty specific to BERTopic (i.e. a user would create them specifically to pass to a BERTopic function), so they are more likely to be in the right format from the beginning. docs is the exception: documents are the start of any text analysis pipeline, and a user may be doing more with them than just running BERTopic, so I think that's the variable with the highest risk of confusion and ambiguity. And when a user gets that wrong, a KeyError doesn't give much direction on how to fix it.

An entire pandas dataframe is a different story though as that would also need to involve checking the columns to use which can open up some problems.

Certainly not suggesting you allow a dataframe. Agree, way too complicated.
