-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NY Times articles don't return full text #2
Comments
Nice summary. |
Update on this: I tested with the modification as you suggested, and it does fix NY Times articles. However it now completely breaks Wikipedia articles (no text renders at all.) I may do some more digging to see if I can figure out why, but I now suspect I just didn't fully understand the scoring algorithm and there's actually a good reason why that line was negated. |
Oh well. Not sure what to do in this case. We can try a different readability library maybe? |
Good idea. I took a look around and found this one, it's significantly older but also looks to be higher-quality than the other one. For example it has a much more comprehensive test suite, and the code looks more comprehensible at a glance. I tested on my fork and it seems to work well (by which I mean none of the pages I looked at are obviously broken.) It would require a little extra work to get fully up and running: loyd/readability.rs doesn't extract page metadata, so we would need to do that somewhere else. |
Nice find indeed! It's under MIT license. If you want we can create an organization and fork the repo to put it back in shape (update the dependencies, Rust 2018 edition, clippy lints, CI,....). I could make you a co-maintainer. Would you be interested? |
Org sounds good, that way if I disappear in a puff of smoke you retain access to it and won't have to fork again. 😛 FWIW I did also discover Paperoni which includes its own port of mozilla-readability. It looks very full-featured but unfortunately is not exposed in library form. The author has mentioned in a couple of issues that he plans to release it as a standalone lib eventually, but the last mention of that was in late 2021 and it doesn't look like there's been any activity since then. Just forking loyd/readability.rs seems more approachable, and I don't think it would be too hard to add metadata extraction since that's mostly just looking for a few specific meta tags. |
Works for me. To be honest I didn't like the metadata handling of the current library anyway. It is too opinionated. I'd rather like the metatags (the title, mostly) to be returned unaltered. |
Moved to org now and gave you full permissions. |
Great initiative! :) |
Thanks to the latest changes by @jfmontanaro, I think it's much better now and we can close this, right? |
If you run an NY Times article through Readable like so, you'll notice that only 9 (in this case) paragraphs are captured, out of the 30 or so in the original article.
Clearly this is an upstream bug in readability-rs, not Readable, so why am I raising an issue here? Unfortunately it looks like readable-rs may be abandoned: there's been no activity from the maintainer since April of 2021, and there are a couple of trivial pull requests still outstanding from June and September of that year, which makes me think the maintainer is unlikely to show up again any time soon.
I think I've identified the bug in readability-rs: the value of this line is negated when it shouldn't be, or at least I don't think it should be. I haven't fully grokked the scoring algorithm, but when I remove the negation and test against a NY Times article it seems to extract the whole article as I would expect.
So I see three possible courses of action:
I'll probably go ahead and submit the PR to readability-rs regardless, just in case it gets a response, but I thought I'd check with you here first since I only care about readability-rs insofar as it affects this project.
The text was updated successfully, but these errors were encountered: