new `colo` and `coloctapp` problems with opinion content display rendering #1094

grossir · 2024-07-30T01:30:53Z

We recently changed the colo scraper, and some newly scraped documents are rendering as raw HTML. I think it's because doctor is tagging them as txt files. I am not sure if this can be fixed on the scraper side.

They look like this, and the opinion backup is in the txt folder.

On the other hand, some that have been properly identified as HTML look strange, which may be fixable using Site.cleanup_content.


flooie · 2024-08-07T19:34:24Z

No PDF option for these new courts? I suppose huh.

Helps solve:
- https://github.com/freelawproject/courtlistener/issues/4443
- #1094

Implement cleanup_content, only for coloctapp:
- to remove tags with tokens that changed every scrape, altering the hash and causing duplicates
- to remove classes that conflicted with CL classes and messed up the display
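For anyone following along, here is a rough sketch of the idea behind that cleanup, not the code that was merged. The thread does not say which tags carry the changing tokens or which classes clash with CL, so the `<script>` and hidden `<input>` patterns, the blanket removal of `class` attributes, the `bytes` signature, and the `content_hash` helper (SHA-1 is an assumption here) are all hypothetical.

```python
import hashlib
import re

# Hypothetical patterns: the thread does not name the tags that carry
# per-scrape tokens, so <script> and hidden <input> are assumptions.
VOLATILE_TAGS = re.compile(
    r"<script\b.*?</script>|<input\b[^>]*type=[\"']hidden[\"'][^>]*>",
    flags=re.IGNORECASE | re.DOTALL,
)
# Assumption: dropping every class attribute is enough to avoid clashes
# with the classes used on the rendering side.
CLASS_ATTR = re.compile(r"\sclass=(\"[^\"]*\"|'[^']*')", flags=re.IGNORECASE)


def cleanup_content(content: bytes) -> bytes:
    """Return content with volatile tags and class attributes removed."""
    html = content.decode("utf-8", errors="replace")
    html = VOLATILE_TAGS.sub("", html)
    html = CLASS_ATTR.sub("", html)
    return html.encode("utf-8")


def content_hash(content: bytes) -> str:
    """Hash the content; hypothetical stand-in for the duplicate check."""
    return hashlib.sha1(content).hexdigest()


# Two scrapes of the same opinion that differ only in a token value.
first = (
    b'<div class="container"><input type="hidden" name="token" '
    b'value="abc"><p>Opinion text</p></div>'
)
second = (
    b'<div class="container"><input type="hidden" name="token" '
    b'value="xyz"><p>Opinion text</p></div>'
)

raw_hashes_match = content_hash(first) == content_hash(second)
clean_hashes_match = content_hash(cleanup_content(first)) == content_hash(
    cleanup_content(second)
)
```

The point of hashing after the cleanup is that two scrapes of the same opinion stop looking like different documents. A real implementation would probably use an HTML parser rather than regular expressions.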

grossir · 2024-09-13T04:11:19Z

It was more difficult to get the opinions as PDFs before; now it seems to take a single request. I haven't checked whether the PDFs have time-related tags or something similar, though. I have implemented cleanup_content for coloctapp, for now.

grossir · 2024-10-09T13:19:50Z

Now doctor is interpreting the content as "txt", which makes the rendering look bad.

flooie · 2024-10-22T20:51:23Z

@grossir can you push a fix for this? The problem stems from doctor being unable to identify HTML if the HTML tag is not present. We should simply wrap the final extracted content inside an HTML tag, and that should fix this issue.
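A minimal sketch of that wrapping step, assuming the extracted content is available as a string. The helper name `ensure_html_wrapper` is hypothetical, and how doctor actually detects HTML is not shown in this thread, so this only illustrates the "wrap it if the tag is missing" part.

```python
import re

# Matches an opening <html> tag, with or without attributes.
HTML_TAG = re.compile(r"<html[\s>]", flags=re.IGNORECASE)


def ensure_html_wrapper(content: str) -> str:
    """Wrap content in <html> tags unless it already has an <html> tag."""
    if HTML_TAG.search(content):
        return content
    return f"<html>{content}</html>"


fragment = "<div><p>Opinion text</p></div>"
wrapped = ensure_html_wrapper(fragment)
```

Checking for an existing tag first keeps the call idempotent, so running it on content that is already a full document leaves it unchanged.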

grossir mentioned this issue Sep 13, 2024

Identical opinions get different hashes in coloctapp #1215

Closed

grossir mentioned this issue Sep 13, 2024

feat(coloctapp): implement cleanup_content #1170

Merged

grossir mentioned this issue Oct 16, 2024

DataError: value too long for type character varying(100) freelawproject/courtlistener#4570

Open

flooie assigned grossir Oct 22, 2024
