new colo and coloctapp problems with opinion content display rendering #1094
Comments
No PDF option for these new courts? I suppose huh.
Helps solve:
- https://github.com/freelawproject/courtlistener/issues/4443
- #1094

Implement cleanup_content, only for coloctapp:
- to remove tags with tokens that changed on every scrape, altering the hash and causing duplicates
- to remove classes that conflicted with CL classes and broke the display
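To illustrate the two cleanup goals above, here is a minimal sketch using only the Python standard library. It is not the actual coloctapp `cleanup_content`, which is not shown in this thread and may work differently. The attribute names in `VOLATILE_ATTRS` and the sample markup are hypothetical stand-ins for whatever token-bearing tags the court site emits.

```python
import hashlib
from html import escape
from html.parser import HTMLParser

# Hypothetical names: the real volatile attributes on the court site may differ.
VOLATILE_ATTRS = {"data-token", "nonce"}
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class _Cleaner(HTMLParser):
    """Re-emit HTML without token-bearing elements or class attributes.

    Comments and doctype declarations are dropped, since their handlers
    are not overridden.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def _emit(self, tag, attrs):
        # Keep every attribute except "class", which clashed with CL styles.
        kept = ""
        for name, value in attrs:
            if name != "class":
                kept += ' {}="{}"'.format(name, escape(value or "", quote=True))
        self.out.append("<{}{}>".format(tag, kept))

    def _is_volatile(self, attrs):
        return any(name in VOLATILE_ATTRS for name, _ in attrs)

    def handle_starttag(self, tag, attrs):
        is_void = tag in VOID_TAGS
        if self.skip_depth:
            if not is_void:
                self.skip_depth += 1
            return
        if self._is_volatile(attrs):
            if not is_void:
                self.skip_depth = 1  # drop the element and its children
            return
        self._emit(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Self-closed tags such as <br/> have no separate end tag to track.
        if self.skip_depth or self._is_volatile(attrs):
            return
        self._emit(tag, attrs)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def cleanup_content(content: str) -> str:
    """Return content with volatile elements and class attributes removed."""
    cleaner = _Cleaner()
    cleaner.feed(content)
    cleaner.close()  # flush any buffered text
    return "".join(cleaner.out)


# Two scrapes of the same opinion that differ only in the per-request token.
scrape_a = '<div class="container"><input type="hidden" data-token="abc123"><p class="lead">Held: affirmed.</p></div>'
scrape_b = '<div class="container"><input type="hidden" data-token="zzz999"><p class="lead">Held: affirmed.</p></div>'

hash_a = hashlib.sha1(cleanup_content(scrape_a).encode()).hexdigest()
hash_b = hashlib.sha1(cleanup_content(scrape_b).encode()).hexdigest()
```

With the volatile element stripped, both scrapes hash to the same value, so the second one would no longer be treated as a new document.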
It was more difficult to get the opinions as PDFs before; now it seems to take a single request. I haven't checked whether the PDFs contain time-related tags or anything similar, though. I have implemented cleanup_content for
Now doctor is interpreting the content as "txt", which makes the rendering look bad
@grossir can you push a fix for this? The problem stems from doctor being unable to identify HTML when the HTML tag is not present. We should simply wrap the final extracted content inside an HTML tag, and that should fix this issue.
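The wrapping suggested here could look roughly like the sketch below. The function name `ensure_html_wrapper` is hypothetical, and the sketch assumes the extracted content is already a `str`; the fix that was actually pushed may differ.

```python
import re


def ensure_html_wrapper(content: str) -> str:
    """Wrap an HTML fragment in <html> tags unless one is already present."""
    # Match an opening <html> tag, case-insensitively, with or without attributes.
    if re.search(r"<html[\s>]", content, flags=re.IGNORECASE):
        return content
    return "<html>{}</html>".format(content)


fragment = "<div><p>Opinion text</p></div>"
wrapped = ensure_html_wrapper(fragment)
```

The check makes the function idempotent, so calling it on content that already has an `<html>` tag leaves it unchanged.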
We recently changed the colo scraper, and some newly scraped documents are rendering as raw HTML. I think it's because doctor is tagging them as txt files. I am not sure if this can be fixed on the scraper side. They look like this, and the opinion backup is in the txt folder.

On the other hand, some that have been identified properly as HTML look strange, which may be fixed using Site.cleanup_content