Add date format validation to test_extract_from_text_properly_implemented on test_ScraperExtractFromTextTest.py #838

Open
grossir opened this issue Dec 28, 2023 · 6 comments

@grossir
Contributor

grossir commented Dec 28, 2023

We had an error on CourtListener when extracting date_filed using extract_from_text from the recently added bap1 scraper:

{
    "OpinionCluster": {"date_filed": "July 29, 2022"},
},
...

File "/opt/courtlistener/cl/scrapers/tasks.py", line 179, in extract_doc_content
    opinion.cluster.save(index=False)
....
django.core.exceptions.ValidationError: ['“July 29, 2022” value has an invalid date format. It must be in YYYY-MM-DD format.']

The function test_extract_from_text_properly_implemented in test_ScraperExtractFromTextTest.py should force scraper authors to use the proper format (YYYY-MM-DD) when dealing with date fields.
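A minimal sketch of what such a check could look like, using only the standard library. The helper names `is_iso_date` and `find_invalid_dates` are hypothetical and not part of juriscraper, and treating every non-boolean field whose name starts with `date_` as a date is an assumption made for illustration.

```python
import re
from datetime import date

# Hypothetical helpers, not part of juriscraper.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value):
    """Return True only for strings that are real dates in YYYY-MM-DD form."""
    if not isinstance(value, str) or not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2022-02-30
    except ValueError:
        return False
    return True


def find_invalid_dates(metadata):
    """Collect (model, field, value) for every badly formatted date field.

    Assumes the nested {"ModelName": {"field": value}} shape shown above.
    """
    invalid = []
    for model_name, fields in metadata.items():
        for field_name, value in fields.items():
            # Skip boolean flags such as date_filed_is_approximate.
            if field_name.startswith("date_") and not isinstance(value, bool):
                if not is_iso_date(value):
                    invalid.append((model_name, field_name, value))
    return invalid
```

A test could then assert that `find_invalid_dates(...)` returns an empty list for every example output of extract_from_text.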

@grossir grossir self-assigned this Dec 28, 2023
@flooie
Contributor

flooie commented Dec 28, 2023

I'm going to go ahead and just push the update/fix for BAP1 but I like the idea of implementing the correct format. Thanks for tackling this.

@mlissner
Member

I put this in the PR, but I think the proper fix for this is to define a JSON schema for our outputs. It'd help folks understand the code too, if we had schemas for all our scrapers that had to pass all tests before PRs were merged.

@grossir
Contributor Author

grossir commented Jan 12, 2024

Another related error caused by the lack of validation:

https://freelawproject.sentry.io/issues/4772622463/?project=5257254&query=is%3Aunresolved&referrer=issue-stream&statsPeriod=14d&stream_index=8

As you say, @mlissner, a JSON schema to validate against would definitely help.

@mlissner
Member

Yeah, let's get that prioritized. It shouldn't be terribly hard. Maybe a day or two, I'd guess.

@grossir
Contributor Author

grossir commented Jan 27, 2024

I have been trying this implementation (docs), which seems like a healthy project.

Here is a small sample schema for the scrapers:

validation_schema = {
    "type": "object",
    "properties": {
        "case_names": {"type": "string"},
        "case_dates": {"type": "string", "format": "date-time"},
        "download_urls": {"type": "string"},
        "precedential_statuses": {"enum": ["Published", "Unpublished"]},
        "blocked_statuses": {"type": "boolean"},
        "date_filed_is_approximate": {"type": "boolean"},
        "citation": {"type": "string"},
        "docket": {"type": "string"},
    },
    "required": [
        "case_dates",
        "case_names",
        "download_urls",
        "precedential_statuses",
        "date_filed_is_approximate",
    ],
}

from jsonschema import Draft7Validator, FormatChecker

validator = Draft7Validator(validation_schema, format_checker=FormatChecker())
validator.validate({...})

Some nice things:

  • support for "enum" / limited options: see "precedential_statuses"
  • support for "required" fields
  • flexible type checking, for example, date-time strings
  • extensible validators for custom value types. This could be used for deferred values that are functions until consumed
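To illustrate the last bullet, here is a rough sketch of how a custom type could let deferred values pass validation while they are still functions. It relies on `jsonschema.validators.extend` and `TypeChecker.redefine`; the type name `"callable"` and the class name `DeferredValidator` are made up for this example, and the trimmed schema is only illustrative.

```python
from jsonschema import Draft7Validator, validators

# Register a hypothetical "callable" type on top of the Draft 7 type checker.
# The checker function receives (type_checker, instance) and returns a bool.
DeferredValidator = validators.extend(
    Draft7Validator,
    type_checker=Draft7Validator.TYPE_CHECKER.redefine(
        "callable", lambda checker, instance: callable(instance)
    ),
)

# Illustrative schema: a case name may be a string or a not-yet-consumed function.
deferred_schema = {
    "type": "object",
    "properties": {
        "case_names": {"type": ["string", "callable"]},
        "precedential_statuses": {"enum": ["Published", "Unpublished"]},
    },
    "required": ["case_names", "precedential_statuses"],
}

deferred_validator = DeferredValidator(deferred_schema)


def error_messages(item):
    """Return every validation message for item instead of stopping at the first."""
    return sorted(error.message for error in deferred_validator.iter_errors(item))
```

Because `"callable"` is not a standard JSON Schema type, this schema would not pass `check_schema` against the Draft 7 metaschema; the sketch only uses `iter_errors` and `is_valid`, which do not run that check.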

In the end I think it will be faster to write these schemas by hand. At least for the scrapers, the scraped field names that CourtListener expects are different from the model names proper, so changing that on the scraping side would require changes on the CL side.

This schema validation could replace a part of AbstractSite._check_sanity. A separate schema can be created for the output of extract_from_text() functions.
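As a rough sketch of that second schema, the following would have caught the date_filed error from the start of this issue. It covers only the one field seen in the traceback; the names `extract_from_text_schema` and `validation_errors` are hypothetical. It uses `"format": "date"`, which expects YYYY-MM-DD, rather than `"date-time"`, which expects a full RFC 3339 timestamp.

```python
from jsonschema import Draft7Validator, FormatChecker

# Illustrative only: a real schema would list every model and field
# that extract_from_text() is allowed to return.
extract_from_text_schema = {
    "type": "object",
    "properties": {
        "OpinionCluster": {
            "type": "object",
            "properties": {
                "date_filed": {"type": "string", "format": "date"},
            },
        },
    },
}

# Formats are only enforced when a format checker is passed in.
extract_validator = Draft7Validator(
    extract_from_text_schema, format_checker=FormatChecker()
)


def validation_errors(output):
    """Return the message of every schema violation found in output."""
    return [error.message for error in extract_validator.iter_errors(output)]
```

Calling `validation_errors({"OpinionCluster": {"date_filed": "July 29, 2022"}})` should report the bad date, while the same dict with `"2022-07-29"` should produce no errors.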

@mlissner
Member

Looks great to me.

grossir added a commit to grossir/juriscraper that referenced this issue Mar 13, 2024
Add tests for required properties, for types and formats, and for additional properties, to ensure the validator and the schemas work as expected

Related to freelawproject#838