Improvements to maybe_balance_style_tags #231

Open
grossir opened this issue Feb 21, 2025 · 1 comment · May be fixed by #239
grossir commented Feb 21, 2025

Right now maybe_balance_style_tags only balances a tag if it sits immediately before the start of the token, with some tolerance for whitespace. However, running the citation extractor has surfaced the following failure cases and possible improvements:

  • introductory words included in the style tags, mostly "see"

  • party names included in the style tags

    • before supra; examples: "<i>AT&T, supra</i>"; "<i>South Seas Yacht Club, supra</i>"
  • full case names included in the style tags, when we are only looking for a party name

    • Example of a reference: "it established in <i>State v. Wingler</i>"

From these examples, I think we could search for the matching style tag within a TOLERANCE number of characters of any kind (not only whitespace), and include it in the span if we find it. That would help catch case names and other "introductory" words that are not "see".

This should not create more overlapping issues than already exist, because full span overlaps have already been resolved before this annotation step.
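
A minimal sketch of the proposed search, for illustration only. It is not the existing maybe_balance_style_tags implementation: the function name balance_style_tags_sketch, the TOLERANCE value of 30, and the tag list are all assumptions made for this example. It assumes the span already contains one half of a style tag pair and looks for the other half within TOLERANCE characters of any kind.

```python
# Hypothetical values: the issue does not specify a tolerance or a tag list.
TOLERANCE = 30
STYLE_TAGS = ("i", "em", "b")


def balance_style_tags_sketch(
    text: str, start: int, end: int, tolerance: int = TOLERANCE
) -> tuple[int, int]:
    """Widen (start, end) to take in a missing style tag, if one is nearby.

    The search window counts every character (not only whitespace) and must
    contain the whole tag for it to be found.
    """
    span = text[start:end]
    for tag in STYLE_TAGS:
        opening, closing = f"<{tag}>", f"</{tag}>"
        has_open, has_close = opening in span, closing in span
        if has_close and not has_open:
            # Closing tag without its opener: look backwards from the span
            # start and take the nearest opening tag inside the window.
            idx = text.rfind(opening, max(0, start - tolerance), start)
            if idx != -1:
                return idx, end
        elif has_open and not has_close:
            # Opening tag without its closer: look forwards from the span end.
            idx = text.find(closing, end, min(len(text), end + tolerance))
            if idx != -1:
                return start, idx + len(closing)
    # Already balanced, or no matching tag within the tolerance window.
    return start, end


text = "as held in <i>AT&T, supra</i>, the court"
start = text.index("supra")
end = text.index("</i>") + len("</i>")
new_start, new_end = balance_style_tags_sketch(text, start, end)
print(text[new_start:new_end])
```

With the "supra" example above, the span grows backwards to cover the party name inside the italic tags. If no matching tag falls inside the window, the span is returned unchanged, so a tolerance that is too small reproduces the failure cases listed earlier.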

@flooie flooie moved this to Backlog Feb 24 to March 7 in Case Law Sprint Feb 24, 2025
@grossir grossir self-assigned this Feb 24, 2025
@flooie flooie moved this from Backlog Feb 24 to March 7 to To Do in Case Law Sprint Feb 27, 2025
@grossir grossir moved this from To Do to In progress in Case Law Sprint Mar 3, 2025

sentry-io bot commented Mar 7, 2025

We briefly collected some data when the logger was active

Sentry Issue: COURTLISTENER-96A

grossir added a commit that referenced this issue Mar 7, 2025
Fixes #231

Allow searching beyond only whitespace for the missing style tag. This helps find the missing tag when a party name or an intro word ("see") is included in the style span. Added tests.
@grossir grossir moved this from In progress to PR'd Issues 🤞 in Case Law Sprint Mar 7, 2025