Excessive Duplicated Sentences in LIME Text Output #1439
@RyanMullins: Hi @smbslt3! This looks like a problem with the `lime` library itself. You can reach out to Marco about this issue, but I'm not sure how actively maintained that library is anymore. Alternatively, you could try using LIT's own implementation of LIME (we re-implemented it for historical reasons), either via Python (e.g., in Colab) or via the LIT UI. However, this may add more overhead than you desire, because your model and dataset will need to be wrapped to work with LIT (see the LIT documentation).
@smbslt3: Hi @RyanMullins. When you mentioned 'historical reasons', does that mean LIT implemented LIME just for legacy purposes? So, if there is an issue with LIME, would the same issue exist in LIT? If so, it's too sad that using LIT still does not solve this problem. :( I was asking about LIT itself, not about an alternative implementation of LIME. Thanks.
@RyanMullins: At this point, "historical reasons" most honestly means "we don't quite remember", because we wrote that code for the original LIT release over 4 years ago... As best the team can recall, we think it was because of a dependency conflict inside Google at the time that has since been resolved. It's possible that the same issue exists, but also quite possible it does not; it's very hard to tell without a specific root cause for the issue in the `lime` library. If you have a fully runnable example of this behavior that's shareable (e.g., in a Colab), I would be happy to take a look and add some code to compare LIT's implementation with the one from `lime`.
Original issue (@smbslt3): I'm using LIME text to explain the results of a sentiment analysis model. When testing various sentences, I've noticed an excessive number of duplicated sentences being used as inputs to LIME text. The code below shows my settings (the model is an ELECTRA model, not fine-tuned).
I have set `bow=False` to treat the same word differently depending on its position in the sentence, and set `mask_string='_'` for the subsequent masking validation.
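The original snippet was attached as an image and is not recoverable, so the following is a minimal sketch of the configuration described above; the checkpoint name, class names, and example sentence are assumptions, not taken from the original post.

```python
# Hypothetical reconstruction of the reported setup; the checkpoint,
# class names, and example sentence are assumptions, not the OP's code.
import torch
from lime.lime_text import LimeTextExplainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/electra-base-discriminator")
model = AutoModelForSequenceClassification.from_pretrained(
    "google/electra-base-discriminator", num_labels=2  # not fine-tuned
)
model.eval()

def predict_proba(texts):
    """classifier_fn for LIME: list[str] -> (n_samples, n_classes) probabilities."""
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return torch.softmax(logits, dim=-1).numpy()

explainer = LimeTextExplainer(
    class_names=["negative", "positive"],
    bow=False,        # distinguish the same word at different positions
    mask_string="_",  # replace masked tokens with '_'
)

explanation = explainer.explain_instance(
    "This movie was surprisingly good.",  # placeholder example sentence
    predict_proba,
    num_features=10,
    num_samples=1000,
)
```

For instance, here is a short sentence example: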
In this case, there were 599 duplicates among the masked sentences generated. Even more concerning, the most frequently duplicated sentence contained no tokens at all, i.e., every token was masked.
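One way to reproduce this count, assuming the setup sketched above, is to wrap the classifier function and tally every perturbed sentence LIME sends to the model; the wrapper name here is illustrative.

```python
from collections import Counter

seen = Counter()

def counting_predict_proba(texts):
    """Wrap the classifier to tally every perturbed sentence LIME generates."""
    seen.update(texts)
    return predict_proba(texts)

explainer.explain_instance(
    "This movie was surprisingly good.",  # same placeholder sentence as above
    counting_predict_proba,
    num_features=10,
    num_samples=1000,
)

duplicated = {t: n for t, n in seen.items() if n > 1}
print("duplicate samples:", sum(duplicated.values()) - len(duplicated))
print("most common:", seen.most_common(3))  # often the fully masked sentence
```

Because LIME passes all `num_samples` perturbations to the `classifier_fn`, the counter sees every generated sentence.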
Additionally, here is an example with a longer sentence:
While the frequency of duplication decreased with the longer sentence, a significant number of sentences were still duplicated. Notably, the most duplicated cases included sentences that, aside from newline (\n) and backtick characters, contained no tokens at all.
LIME is expected to mask n tokens randomly, but the outcomes don't seem random. Is this normal behavior or a malfunction? If it's a malfunction, is it okay to manually deduplicate the masked sentences into a unique set? Skipping reruns of duplicate sentences could significantly cut down LIME's execution time; see the caching sketch below.
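Whether or not deduplication inside `lime` is safe, the redundant model calls can be skipped without touching the library by memoizing the classifier function. A minimal sketch, assuming the `predict_proba` defined earlier (the caching helper is hypothetical, not a `lime` feature):

```python
import numpy as np

_cache = {}  # perturbed sentence -> probability row

def cached_predict_proba(texts):
    """Run the model only on sentences not seen before; reuse cached rows."""
    new = [t for t in dict.fromkeys(texts) if t not in _cache]  # order-preserving dedupe
    if new:
        _cache.update(zip(new, predict_proba(new)))
    return np.stack([_cache[t] for t in texts])
```

This leaves LIME's sampling and weighting untouched: duplicated sentences still enter the local regression with their full counts, but each unique sentence hits the model only once.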