Currently I have the following process:
1. user provides some text possibly containing double white space, newlines, etc.
2. apply preprocess.normalize_whitespace
3. use the NER from spacy
4. highlight found entities in the unnormalized text
However, doing (4) is kind of hard, as the character coordinates (doc = nlp(text); doc.ents[0].start_char) match up with the normalised text. Any bright ideas for how to transform the coordinates back to the original string? It would be nice not to have to reformat the text the user typed in ("Hey user, we reformatted your text for you, you better like it!").
Hi @betatim, I understand the problem, although I don't know of a "good" way to solve it. The preprocessing functions are destructive and one-way, so not a lot of thought has been given to recovering the changes. Basic question: Do you need to normalize the white space before using spacy's NER? It seems like weird spacing shouldn't affect the model's performance, in which case, I'd just skip the normalization.
The only solution that comes to mind is iterating over the resulting entities and re-locating them in the original text, a process which can be made more efficient than the simplest implementation but not, like, great.
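A minimal sketch of that re-locating idea, under some assumptions: the entities arrive in document order as (text, label) pairs (for example built from ent.text and ent.label_), and normalization only changed whitespace. The helper name relocate_entities is hypothetical and not part of textacy or spacy. Each entity is turned into a regex in which any whitespace run may match, and the search resumes from the end of the previous hit so the original is scanned only once.

```python
import re


def relocate_entities(original, entities):
    """Find each entity in `original`, tolerating whitespace differences.

    `entities` is a list of (text, label) pairs in document order.
    Returns (start, end, label) character spans into `original`.
    """
    spans = []
    cursor = 0  # never search before the previous match
    for text, label in entities:
        parts = [re.escape(part) for part in text.split()]
        if not parts:
            continue  # entity was pure whitespace; nothing to find
        # Allow any run of whitespace wherever the entity had a gap.
        pattern = re.compile(r"\s+".join(parts))
        match = pattern.search(original, cursor)
        if match is None:
            continue  # normalization changed more than whitespace
        spans.append((match.start(), match.end(), label))
        cursor = match.end()
    return spans


original = "Meeting  on 07\n Feb 2017 in   New\tYork"
entities = [("07 Feb 2017", "DATE"), ("New York", "GPE")]
spans = relocate_entities(original, entities)
```

Slicing the original with each returned span gives back the unnormalized entity text, which is what the highlighting step needs. Entities that cannot be found are skipped silently here; a real UI might want to report them instead.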
This reminds me of annotating, say, keyterms visually in a PDF document while using the extracted/processed text in the analysis. It's definitely a thing I've seen done. (Unfortunately, my google-fu failed me; I couldn't find a concrete example.) Might be worth trying to track down...
> Do you need to normalize the white space before using spacy's NER? It seems like weird spacing shouldn't affect the model's performance, in which case, I'd just skip the normalization.
It seems to help with things like "07\n Feb 2017" being found as a DATE and not as a CARDINAL and a DATE.
Was hoping you had found a nice way to transport things back. Will think about whether we can solve it by tweaking the UI a bit.
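One possible way to transport coordinates back, sketched here as an assumption rather than anything textacy provides: do the normalization yourself and record, for every character of the normalized string, the index it came from in the original. The functions normalize_with_offsets and to_original_span are hypothetical names, and this stand-in collapses every whitespace run to a single space and trims the ends, which may not match preprocess.normalize_whitespace exactly.

```python
def normalize_with_offsets(text):
    """Collapse each whitespace run to one space and trim both ends.

    Returns (normalized, offsets), where offsets[i] is the index in
    `text` of the character that produced normalized[i].
    """
    chars = []
    offsets = []
    pending_space = None  # start index of an unflushed whitespace run
    for i, ch in enumerate(text):
        if ch.isspace():
            # Remember the run's start, but ignore leading whitespace.
            if pending_space is None and chars:
                pending_space = i
            continue
        if pending_space is not None:
            chars.append(" ")
            offsets.append(pending_space)
            pending_space = None
        chars.append(ch)
        offsets.append(i)
    # A trailing run is never flushed, which trims the end.
    return "".join(chars), offsets


def to_original_span(offsets, start, end):
    """Map a half-open [start, end) span of the normalized text back."""
    return offsets[start], offsets[end - 1] + 1


text = "Meeting  on 07\n Feb 2017  "
normalized, offsets = normalize_with_offsets(text)
start = normalized.index("07 Feb 2017")
orig_start, orig_end = to_original_span(offsets, start, start + len("07 Feb 2017"))
```

With spacy, start and end would come from an entity's start_char and end_char on the Doc built from the normalized string, and the mapped span can then be highlighted in the text exactly as the user typed it.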