Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NY Times articles don't return full text #2

Open
jfmontanaro opened this issue Nov 5, 2022 · 10 comments
Open

NY Times articles don't return full text #2

jfmontanaro opened this issue Nov 5, 2022 · 10 comments

Comments

@jfmontanaro
Copy link
Contributor

jfmontanaro commented Nov 5, 2022

If you run an NY Times article through Readable like so, you'll notice that only 9 (in this case) paragraphs are captured, out of the 30 or so in the original article.

Clearly this is an upstream bug in readability-rs, not Readable, so why am I raising an issue here? Unfortunately it looks like readable-rs may be abandoned: there's been no activity from the maintainer since April of 2021, and there are a couple of trivial pull requests still outstanding from June and September of that year, which makes me think the maintainer is unlikely to show up again any time soon.

I think I've identified the bug in readability-rs: the value of this line is negated when it shouldn't be, or at least I don't think it should be. I haven't fully grokked the scoring algorithm, but when I remove the negation and test against a NY Times article it seems to extract the whole article as I would expect.

So I see three possible courses of action:

  • Submit a pull request to readability-rs and see if it gets a response (this has the advantage that if I'm wrong and there's actually some good reason the algorithm is done this way, the maintainer will presumably know)
  • Fork readability-rs and fix the bug
  • Not bother with any of this since it seems to work okay on most sites and it is, after all, just a fun side project that you threw together for the heck of it. :)

I'll probably go ahead and submit the PR to readability-rs regardless, just in case it gets a response, but I thought I'd check with you here first since I only care about readability-rs insofar as it affects this project.

@mre
Copy link
Collaborator

mre commented Nov 5, 2022

Nice summary.
I noticed that, too but I thought it was because of the paywall on New York Times.
Looking at the code your suspicion might be correct. Guess you can easily test it out by forking the repo and using the fork in your local checkout of readable and starting it with make dev.
Always here to help if you need someone to try it out.

@jfmontanaro
Copy link
Contributor Author

Update on this: I tested with the modification as you suggested, and it does fix NY Times articles. However it now completely breaks Wikipedia articles (no text renders at all.) I may do some more digging to see if I can figure out why, but I now suspect I just didn't fully understand the scoring algorithm and there's actually a good reason why that line was negated.

@mre
Copy link
Collaborator

mre commented Nov 7, 2022

Oh well. Not sure what to do in this case. We can try a different readability library maybe?

@jfmontanaro
Copy link
Contributor Author

Good idea. I took a look around and found this one, it's significantly older but also looks to be higher-quality than the other one. For example it has a much more comprehensive test suite, and the code looks more comprehensible at a glance.

I tested on my fork and it seems to work well (by which I mean none of the pages I looked at are obviously broken.)

It would require a little extra work to get fully up and running: loyd/readability.rs doesn't extract page metadata, so we would need to do that somewhere else.

@mre
Copy link
Collaborator

mre commented Nov 7, 2022

Nice find indeed! It's under MIT license. If you want we can create an organization and fork the repo to put it back in shape (update the dependencies, Rust 2018 edition, clippy lints, CI,....). I could make you a co-maintainer. Would you be interested?
Alternatively you could fork it in your own namespace and create a PR to make it a dependency of readable.
Whatever you prefer. 😉

@jfmontanaro
Copy link
Contributor Author

Org sounds good, that way if I disappear in a puff of smoke you retain access to it and won't have to fork again. 😛

FWIW I did also discover Paperoni which includes its own port of mozilla-readability. It looks very full-featured but unfortunately is not exposed in library form. The author has mentioned in a couple of issues that he plans to release it as a standalone lib eventually, but the last mention of that was in late 2021 and it doesn't look like there's been any activity since then.

Just forking loyd/readability.rs seems more approachable, and I don't think it would be too hard to add metadata extraction since that's mostly just looking for a few specific meta tags.

@mre
Copy link
Collaborator

mre commented Nov 8, 2022

Works for me. To be honest I didn't like the metadata handling of the current library anyway. It is too opinionated. I'd rather like the metatags (the title, mostly) to be returned unaltered.
I'm also curious how loyd/readability.rs handles code blocks. The current library modifies them (indentation and all) and that's another thing I'd love to see fixed.
Let's see how it goes.

@mre
Copy link
Collaborator

mre commented Nov 8, 2022

Moved to org now and gave you full permissions.

@simonsan
Copy link

simonsan commented Nov 9, 2022

Great initiative! :)

@mre
Copy link
Collaborator

mre commented Nov 19, 2022

Thanks to the latest changes by @jfmontanaro, I think it's much better now and we can close this, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants