new colo and coloctapp problems with opinion content display rendering #1094

Open
grossir opened this issue Jul 30, 2024 · 4 comments

grossir commented Jul 30, 2024

We recently changed the colo scraper, and some newly scraped documents are rendering as raw HTML. I think it's because doctor is tagging them as txt files. I am not sure whether this can be fixed on the scraper side.

They look like this, and the opinion backup is in the txt folder:

image


On the other hand, some documents that were properly identified as HTML look strange, which may be fixable using Site.cleanup_content:

image


flooie commented Aug 7, 2024

Is there no PDF option for these new courts? I suppose not, huh.

grossir added a commit that referenced this issue Sep 13, 2024
Helps solve:
- https://github.com/freelawproject/courtlistener/issues/4443
- #1094

Implement cleanup_content, only for coloctapp:
- to remove tags with tokens that changed on every scrape, altering the hash and causing duplicates
- to remove classes that conflicted with CL classes and messed up the display
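
The two cleanup steps in this commit message can be sketched roughly as below. This is a minimal illustration, not the code from the commit. The `request_token` input tag, the sample documents, and the SHA-1 hash are assumptions made for the example, since the thread does not name the real volatile tags or the hash function. Stripping every class attribute is also a simplification of "remove classes that conflicted with CL classes".

```python
import hashlib
import re

# Hypothetical patterns: the real volatile tags and conflicting class names
# are not given in this thread.
VOLATILE_TAG = re.compile(r'<input[^>]*name="request_token"[^>]*>', re.IGNORECASE)
CLASS_ATTRIBUTE = re.compile(r'\sclass="[^"]*"')


def cleanup_content(content: bytes) -> bytes:
    """Strip per-request token tags and class attributes from scraped HTML."""
    text = content.decode("utf-8")
    text = VOLATILE_TAG.sub("", text)  # tokens that change on every scrape
    text = CLASS_ATTRIBUTE.sub("", text)  # classes that may clash with site CSS
    return text.encode("utf-8")


def content_hash(content: bytes) -> str:
    """Hash used for duplicate detection (SHA-1 is assumed here)."""
    return hashlib.sha1(content).hexdigest()


# Two scrapes of the same opinion that differ only in the token value.
scrape_1 = b'<div class="container"><input type="hidden" name="request_token" value="abc123"><p>Opinion text</p></div>'
scrape_2 = b'<div class="container"><input type="hidden" name="request_token" value="xyz789"><p>Opinion text</p></div>'

raw_hashes_match = content_hash(scrape_1) == content_hash(scrape_2)
clean_hashes_match = content_hash(cleanup_content(scrape_1)) == content_hash(
    cleanup_content(scrape_2)
)
```

With the raw content the two hashes differ, so the same opinion would be stored twice. After cleanup both scrapes reduce to the same bytes and hash identically. A real implementation would more likely use an HTML parser such as lxml than regular expressions.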

grossir commented Sep 13, 2024

Getting the opinions as PDFs used to be more difficult; now it seems to take a single request. I haven't checked whether the PDFs contain time-related tags or similar, though. For now, I have implemented cleanup_content for coloctapp.


grossir commented Oct 9, 2024

Now doctor is interpreting the content as "txt", which makes the rendering look bad
image


flooie commented Oct 22, 2024

@grossir can you push a fix for this? The problem stems from doctor being unable to identify HTML if the HTML tag is not present. We should simply wrap the final extracted content inside an HTML tag, and that should fix this issue.
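
A minimal sketch of the wrapping step suggested here, assuming the extracted content is available as a string. The helper name `ensure_html_wrapper` is hypothetical, and where it would be called in the scraper is not specified in this thread.

```python
def ensure_html_wrapper(content: str) -> str:
    """Wrap extracted markup in <html> tags unless an <html> tag already exists."""
    if "<html" in content.lower():
        # Already a full document (or already wrapped): leave it untouched.
        return content
    return f"<html>{content}</html>"


fragment = "<div><p>Opinion text</p></div>"
wrapped = ensure_html_wrapper(fragment)
```

The presence check keeps the helper idempotent, so content that is already wrapped does not gain a second `<html>` element if the cleanup runs more than once.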

Projects
Status: State Supreme/Appellate