Fixes #231
Allow searching beyond only whitespace for the missing style tag. This helps find the missing tag when a party name or an intro word (e.g. "see") is included in the style span. Added tests.
Right now `maybe_balance_style_tags` only does the balancing if the tag is immediately before the start of the token, allowing for some whitespace. However, from running the citation extractor we have seen the following failure cases and possible improvements:

- introductory words included in the style tags, mostly "see":
  - "<em>See id.</em> at 642"
  - "(see id. at 648-650"
  - "<i>see Luperon</i>"
- party names included in the style tags:
  - "<i>AT&T, supra</i>"
  - "<i>South Seas Yacht Club, supra</i>"
- full case names included in the style tags, when we are only looking for a party name:
  - "it established in <i>State v. Wingler</i>"
From these examples, I think we could search for the matching style tag within TOLERANCE characters of the token, allowing any character rather than only whitespace, and include the tag in the span if we find it. That will help catch case names and other "introductory" words that are not "see".
This should not create more overlap issues than already exist, because full span overlaps have already been resolved before this annotation step.
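
To make the proposal concrete, here is a minimal sketch of the tolerance search. It is not the current `maybe_balance_style_tags` code: the function name `balance_span`, the value of `TOLERANCE`, and the set of style tags are all assumptions made for illustration. It only covers the case where the closing tag sits right after the token and the opening tag is somewhere before it.

```python
import re

# Hypothetical value; the real tolerance would need tuning against real opinions.
TOLERANCE = 40

# A closing style tag right after the token, allowing some whitespace.
# The tag set (em, i, b) is an assumption for this sketch.
CLOSE_TAG = re.compile(r"\s*</(em|i|b)>")


def balance_span(text, start, end, tolerance=TOLERANCE):
    """Extend (start, end) so that it covers a matching pair of style tags.

    If a closing tag follows the token and its opening tag begins within
    `tolerance` characters before the token start (any characters allowed,
    not only whitespace), return the widened span. Otherwise return the
    span unchanged.
    """
    close = CLOSE_TAG.match(text, end)
    if close is None:
        return start, end

    tag = close.group(1)
    window_start = max(0, start - tolerance)
    window = text[window_start:start]

    # Take the nearest opening tag of the same kind before the token.
    idx = window.rfind(f"<{tag}>")
    if idx == -1:
        return start, end

    # If the same tag is closed again before the token, the opener we
    # found belongs to a different styled run, so leave the span alone.
    if f"</{tag}>" in window[idx:]:
        return start, end

    return window_start + idx, close.end()
```

For the token "Wingler" in "it established in <i>State v. Wingler</i>", the widened span covers the whole "<i>State v. Wingler</i>"; for "supra" in "<i>AT&T, supra</i>", it covers the full tagged string. A mirrored search would be needed when the opening tag is adjacent to the token and the closing tag is the one further away.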