Fix : emoji length cause overlapping #2621

Closed
wants to merge 28 commits into from

Conversation

keithCuniah
Contributor

@keithCuniah keithCuniah commented Mar 30, 2023

Description

Because an emoji does not always have a length of one, it can cause overlapping spans in predictions and annotations.
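As a minimal illustration of the length mismatch (a sketch only; `measure` is a hypothetical helper and not code from this PR), the same string reports different lengths depending on whether UTF-16 code units or code points are counted:

```javascript
// Hypothetical helper (not from the PR): measure a string in two ways.
function measure(text) {
  return {
    utf16Units: text.length,       // what String.prototype.length reports
    codePoints: [...text].length,  // what iterating with the spread operator yields
  };
}

// Escapes are used so the example does not depend on file encoding.
const thumbsUp = "\u{1F44D}"; // a single emoji outside the Basic Multilingual Plane
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // three emojis joined by zero-width joiners

console.log(measure("a"));      // { utf16Units: 1, codePoints: 1 }
console.log(measure(thumbsUp)); // { utf16Units: 2, codePoints: 1 }
console.log(measure(family));   // { utf16Units: 8, codePoints: 5 }
```

The last case suggests that even counting code points may not match what a user sees as one character, so any span offsets need to be computed with one agreed unit on both sides.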

Please include a summary of the changes and the related issue. Please also include relevant motivation and context. List any dependencies that are required for this change.

Closes #2353

Type of change
Changes are needed in both the back end and the front end.

Front

  • Repair the recordHasEmoji test in RecordTokenClassification
  • Performance: move the computed property to a static variable in RecordTokenClassification
  • Ensure that offsets do not depend on the emoji length

Back

  • Ensure that offsets do not depend on the emoji length

(Please delete options that are not relevant. Remember to title the PR according to the type of change)

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (change restructuring the codebase without changing functionality)
  • Improvement (change adding some improvement to an existing functionality)
  • Documentation update

How Has This Been Tested

(Please describe the tests that you ran to verify your changes. And ideally, reference tests)

  • Token classification

Checklist

  • I have merged the original branch into my forked branch
  • I added relevant documentation
  • My code follows the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

keithCuniah added 28 commits January 30, 2023 16:10
@keithCuniah keithCuniah requested a review from tomaarsen March 30, 2023 10:35
@cceyda
Contributor

cceyda commented May 10, 2023

I'm looking forward to the resolution of this bug, but I don't think this PR is exactly the solution, as the issue is not limited to emojis.

@keithCuniah
Contributor Author

Hi @cceyda, sorry for the delayed reply. You are right; the solution will be updated. We will tackle this fix as soon as possible :)

@keithCuniah
Contributor Author

Hello @cceyda, sorry again for the delayed reply.
The problem on the front end is how we calculate the token length with the stringz library in the RecordTextClassification component: with this library, every character, whether a regular character or an emoji, is always counted as having a length of one. So to implement the solution you suggested with the spread operator [...emoji_or_character], we would need to calculate predictions at the character level rather than at the token level, as is done now.
This is too complex to tackle right now, so we will address it in a separate issue.
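For context, a minimal sketch of what working at the character level with the spread-style iteration could look like. It assumes span offsets are expressed in code points and need to be mapped to the UTF-16 offsets that JavaScript string methods use; `codePointToUtf16Offset` is a hypothetical helper, not part of this PR or of stringz.

```javascript
// Hypothetical helper: convert an offset counted in code points into
// the equivalent offset counted in UTF-16 code units.
function codePointToUtf16Offset(text, codePointOffset) {
  let utf16Offset = 0;
  let seen = 0;
  // for...of iterates by code point, like the spread operator [...text].
  for (const ch of text) {
    if (seen === codePointOffset) break;
    utf16Offset += ch.length; // 1 for BMP characters, 2 for surrogate pairs
    seen += 1;
  }
  return utf16Offset;
}

const text = "I \u{1F44D} it"; // 6 code points, 7 UTF-16 code units

// Assumed span for the token "it", expressed in code points.
const span = { start: 4, end: 6 };

const start = codePointToUtf16Offset(text, span.start);
const end = codePointToUtf16Offset(text, span.end);
console.log(text.slice(start, end)); // "it"
```

Without the conversion, `text.slice(4, 6)` would start one code unit too early, which is the kind of shifted span described in the linked issue.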

@frascuchon frascuchon closed this Apr 16, 2024
@frascuchon frascuchon deleted the fix/2353-emoji-overlapping branch April 16, 2024 10:08
Development

Successfully merging this pull request may close these issues.

[Bug] Token Classification emojis cause overlapping spans error & wrong annotations
4 participants