[Bug] Token Classification emojis cause overlapping spans error & wrong annotations #2353
I've found a related issue:

```python
import argilla as rg

records = [
    rg.TokenClassificationRecord(
        text="I ❤️ you", tokens=["I", "❤️", "you"], prediction=[("I", 0, 1), ("emoji", 2, 4), ("you", 5, 8)]
    ),
    rg.TokenClassificationRecord(
        text="I 💚 you", tokens=["I", "💚", "you"], prediction=[("I", 0, 1), ("emoji", 2, 3), ("you", 4, 7)]
    ),
    rg.TokenClassificationRecord(
        text="I h you", tokens=["I", "h", "you"], prediction=[("I", 0, 1), ("emoji", 2, 3), ("you", 4, 7)]
    ),
]
rg.delete("issue_2353_emoji")
rg.log(records, "issue_2353_emoji")
```

Note also the following awkward Python behaviour:

```python
>>> len("h")
1
>>> len("💚")
1
>>> len("❤️")
2
```
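As an aside (not from the thread), the surprising `len` result above comes from the red heart being composed of two code points, while the green heart is a single one. A minimal sketch using the standard-library `unicodedata` module:

```python
# Why len("❤️") == 2 but len("💚") == 1: the red heart is U+2764
# followed by the variation selector U+FE0F, while the green heart
# is a single code point, U+1F49A.
import unicodedata

red_heart = "\u2764\ufe0f"    # ❤️  (two code points)
green_heart = "\U0001F49A"    # 💚  (one code point)

for ch in red_heart:
    print(hex(ord(ch)), unicodedata.name(ch))
# 0x2764 HEAVY BLACK HEART
# 0xfe0f VARIATION SELECTOR-16

assert len(red_heart) == 2
assert len(green_heart) == 1
```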
Emoji, halfwidth chars, and double-width chars are never easy to work with 🥲
I've expanded on @cceyda's useful example with the red heart to show that the issue only exists with the colored one:

```python
import argilla as rg

rg.delete("issue_2353_emoji")

tokens = ["💚", "a", "b", "c", "d", "💚", "e", "f", "g", "j", "k", "l"]
text = "💚abcd💚efgjkl"
entities = [("A", 6, 9)]
assert text[6:9] == "efg"
record = rg.TokenClassificationRecord(
    text=text,
    tokens=tokens,
    prediction=entities,
)
rg.log(record, "issue_2353_emoji")

tokens = ["❤️", "a", "b", "c", "d", "❤️", "e", "f", "g", "j", "k", "l"]
text = "❤️abcd❤️efgjkl"
entities = [("A", 8, 11)]
assert text[8:11] == "efg"
record = rg.TokenClassificationRecord(
    text=text,
    tokens=tokens,
    prediction=entities,
)
rg.log(record, "issue_2353_emoji")
```
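As a side note (not from the thread): the offset shift in these examples can be predicted by comparing Python's code-point length with JavaScript's UTF-16 `.length`. A minimal sketch, where `utf16_len` is a hypothetical helper, not an Argilla API:

```python
def utf16_len(s: str) -> int:
    """Number of UTF-16 code units, i.e. what JavaScript's `"str".length` reports."""
    return len(s.encode("utf-16-le")) // 2

text = "💚abcd💚efgjkl"
# Python slices by code points, so "efg" starts at offset 6 here...
assert len(text) == 12
# ...but 💚 (U+1F49A) is a surrogate pair in UTF-16, so JS sees a longer
# string and would place "efg" at offset 8 instead.
assert utf16_len(text) == 14
```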
I saw there was a related PR a year ago: 8b570fb
Maybe @leiyre can give more context.
https://hsivonen.fi/string-length/ Problematic examples: "🤦🏼♂️", "🤦🏼", "💖", "💘", "💝", "💞", "❣️", "✨".
Hi, in 8b570fb we included stringz in the frontend to avoid the problem with Unicode in JavaScript. @tomaarsen, could I see your example in dev or pre, to explore the behavior a bit more?
@leiyre I've pushed my latest example to dev under
Here is another thing: Screen.Recording.2023-02-17.at.0.38.25.mov

I mean, I understand why it happens, but not how to fix it 🤣. I don't really care about the visual artifact on the UI as long as the annotation boundaries are passed correctly when used in Python! Like:

```python
label_map = {
    "A": "location", "B": "organization", "C": "building", "D": "vegetable", "E": "fruit", "F": "street", "G": "apt", "J": "no",
    "K": "food", "L": "drink", "M": "alcohol", "N": "name", "O": "phone", "P": "etc", "R": "blah", "S": "something", "T": "somethingelse",
}
labels = list(label_map.keys())
texts = [
    'ABCDEFG 🤦🏼♂️ jklmnop',
    'ABCDEFG 💞 jklmnop',
    'ABCDEFG 💚jklmnop',
    'ABCDEFG (❣️) jklmnop',
]
records = []
for i, text in enumerate(texts):
    record = rg.TokenClassificationRecord(
        text=text,
        tokens=list(text.replace(" ", "")),
        prediction=[(labels[i], i, i + 1)],
        prediction_agent="model",
        # annotation=entities,
        # annotation_agent="old",
        metadata={},
        status='Default',
        id=i,
    )
    records.append(record)
rg.log(records=records, name="test-emoji")
```

Also, a question: would this
Hola! When looking at the problem, we saw with @leiyre that some updates need to be done in the frontend.
Yes, JavaScript uses the UTF-16 encoding to calculate string lengths, while Python counts code points:

```python
import grapheme

# Example problematic strings gathered from around the internet; they just
# need to be composed of more than one Unicode code point to cause problems.
string = "👩❤️💋👩🐦한ABதமிழ்💚🤦🏼♂️"
chars = list(grapheme.graphemes(string))
for char in chars:
    print(f"original:{char}")
    print(f"codepoints:{list(char)}")
    for l in list(char):
        print(l, ord(l))
        for i, b in enumerate(bytes(l, encoding='utf-8')):
            print(i, b)
        print("-----")
    print("#############")
```
Also, I learned that in JS, if you use array expansion (?), you can get the number of code points accurately (same as Python).
I think you mean the spread operator: `[..."👩💻"].length` = 3
But that is how Python counts too! It counts code points. How humans perceive a single letter (A, B, C, etc.; think of this as the grapheme), how a single grapheme is represented (by one or more Unicode code points), and how those points are encoded (UTF-16, UTF-8) are all different things. This was also helpful: https://stackoverflow.com/a/51422499/3726119

On the UI side we want graphemes (using an Intl.Segmenter() polyfill can work for this); on the backend we want to count code points, like Python. JS natively (just doing "str".length) calculates the UTF-16 encoded length.
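To make the grapheme / code point / encoding distinction concrete, here is a small sketch (an illustration, not from the thread; the ZWJ emoji sequence is built explicitly from escapes, and counting it as one grapheme would need a library such as the `grapheme` package used earlier):

```python
# "Man facepalming, medium-light skin tone" as an explicit ZWJ sequence:
# U+1F926 (🤦) + U+1F3FC (skin tone) + U+200D (ZWJ) + U+2642 (♂) + U+FE0F.
s = "\U0001F926\U0001F3FC\u200d\u2642\ufe0f"

# One grapheme to human eyes, but:
assert len(s) == 5                            # Python: 5 code points
assert len(s.encode("utf-16-le")) // 2 == 7   # JS "str".length: 7 UTF-16 units
assert len(s.encode("utf-8")) == 17           # 17 UTF-8 bytes
```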
@cceyda If that is indeed consistently equivalent to the Python lengths, then perhaps we can use that to place the spans correctly by only changing the frontend? Or would we strictly need to use e.g. the
I would heavily suggest not straying from the norm of using
We should use graphemes only on the UI side, because that is what makes sense to human eyes 👀 😄
This issue is stale because it has been open for 90 days with no activity.

bump as still important

This issue is stale because it has been open for 90 days with no activity.

bump

This issue is stale because it has been open for 90 days with no activity.
This issue was closed because it has been inactive for 30 days since being marked as stale. |
Describe the bug
If there is a prediction/annotation mismatch plus an emoji 💚 (I haven't tested with other emojis), the UI shows an error.
I was told clearing all annotations and then annotating and saving works sometimes!
On the server side this is caused by:
ValueError: IOB tags cannot handle overlapping spans!
Steps to reproduce
I'm using char tokens:
Environment (please complete the following information):
Additional context
If you are validating multiple records at once and one of them fails, the others fail too. And if the annotator doesn't notice the toast error message and moves on to the next page, their annotations on the previous page are lost.
Maybe we can show an alert() popup when there are unsaved annotations and the user is trying to navigate away?