Unable to read individual articles for Atom and RSS 1.0 feeds #10

Open
yonas opened this issue Dec 29, 2022 · 6 comments
Labels: help wanted (extra attention is needed)

Comments

@yonas

yonas commented Dec 29, 2022

nom successfully lists the feed items, but attempting to read an individual article only shows the title and date:

   Moscow on the Med: A Faraway War Transforms a Turkish Resort Town          
                                                                              
   2022-12-29 10:00:26 +0000 UTC     

Tested with https://rss.nytimes.com/services/xml/rss/nyt/World.xml

@yonas
Author

yonas commented Dec 29, 2022

It doesn't work for this RSS v1.0 feed either: http://feeds.bbci.co.uk/news/rss.xml

yonas changed the title from "Unable to read individual articles for Atom feeds" to "Unable to read individual articles for Atom and RSS 1.0 feeds" on Dec 29, 2022
@guyfedwards
Owner

Hmm, this is an interesting case. These feeds are just links to the articles without containing the content. We could attempt to fetch the content from the URL but the markdown conversion from a full page html is likely to be funky.

One option here would be to just open the links that have no content in the browser using xdg-open or similar.

Would that suit your use case? I can try to look at parsing, but this opens up a large can of worms. Even the Times site in your example requires a Times account to actually get the content.
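
For illustration, the browser fallback could be sketched roughly as below. This is a minimal sketch under assumptions, not nom's actual code: the `Item` type and the helpers `needsBrowser`, `openCommand`, and `openInBrowser` are hypothetical names, and the macOS and Windows opener commands are additions beyond the `xdg-open` mentioned above.

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"strings"
)

// Item is a hypothetical, simplified feed entry; nom's real type may differ.
type Item struct {
	Title   string
	Link    string
	Content string
}

// needsBrowser reports whether an item has a link but no body to render.
func needsBrowser(it Item) bool {
	return strings.TrimSpace(it.Content) == "" && it.Link != ""
}

// openCommand returns the URL opener for the given OS name
// (xdg-open on Linux and other Unix-like systems).
func openCommand(goos, url string) (string, []string) {
	switch goos {
	case "darwin":
		return "open", []string{url}
	case "windows":
		return "rundll32", []string{"url.dll,FileProtocolHandler", url}
	default:
		return "xdg-open", []string{url}
	}
}

// openInBrowser starts the opener without waiting for it to exit.
func openInBrowser(url string) error {
	name, args := openCommand(runtime.GOOS, url)
	return exec.Command(name, args...).Start()
}

func main() {
	it := Item{Title: "Example", Link: "https://example.com/article"}
	if needsBrowser(it) {
		// Print instead of launching, so the sketch is safe to run anywhere.
		name, args := openCommand(runtime.GOOS, it.Link)
		fmt.Println("would run:", name, strings.Join(args, " "))
	}
}
```

Items that do carry content would keep going through the existing rendering path; only link-only items would be handed to the opener.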

@yonas
Author

yonas commented Dec 30, 2022

Hi @guyfedwards

> We could attempt to fetch the content from the URL but the markdown conversion from a full page html is likely to be funky.

Yes, it might be a bit of a challenge. You'll want to find a library to strip script, style, and noscript tags, and another to convert html to markdown. I see you're already making use of html-to-markdown.
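
As a rough illustration of the stripping step only, here is a minimal sketch using Go's standard `regexp` package. The function name `stripNonContent` is hypothetical, and a regular expression is a shortcut: a real HTML parser would handle malformed or nested markup more reliably. The cleaned output would then be passed to the HTML-to-markdown converter.

```go
package main

import (
	"fmt"
	"regexp"
)

// nonContent matches <script>, <style>, and <noscript> elements together with
// their bodies. (?i) ignores case; (?s) lets "." span newlines.
// Assumption: the elements are well-formed and not nested inside each other.
var nonContent = regexp.MustCompile(
	`(?is)<(script|style|noscript)\b[^>]*>.*?</(script|style|noscript)\s*>`)

// stripNonContent removes those elements and leaves the rest untouched.
func stripNonContent(html string) string {
	return nonContent.ReplaceAllString(html, "")
}

func main() {
	page := `<p>Hello</p><script>alert(1)</script><style>p{color:red}</style><p>World</p>`
	fmt.Println(stripNonContent(page)) // prints <p>Hello</p><p>World</p>
}
```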

> One option here would be to just open the links that have no content in the browser using xdg-open or similar.
> Would that suit your use case?

Not quite. I'm interested in reading the article in the terminal via the RSS reader without leaving the app.

> ...even the Times site in your example requires a Times account to actually get the content.

I'm able to get the article content via w3m https://www.nytimes.com/live/2022/12/29/world/russia-ukraine-news. Does this work for you?

guyfedwards added the "help wanted" label on Jan 2, 2023
@Nemoden
Contributor

Nemoden commented Jan 28, 2023

Well, w3m is a complete browser, so of course it displays the article: it renders practically all of the page's elements.

Fetching ONLY the article from a web page is a little tricky. I'm not sure there is any good way of scraping only the article contents from HTML while stripping out headers, navigation, sidebars, footers, etc., as the structure of a web page containing an article isn't standardized. In fact, I can build a whole page with any of those blocks out of styled div elements (not even classes). On top of that, many pages contain skeletons for things like modal windows, or hidden content that becomes visible on click (or other user interaction with the interface), etc. Some also make the full content visible only with a subscription.

Coming back to how to fetch the article contents from a web page... One way I can think of is finding the element with the highest word density (after all the tags are stripped), maybe?
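
To make the word-density idea concrete, here is a minimal sketch using only Go's standard library. It rests on loud assumptions: the page has already been split into candidate block fragments by some other step, "density" is defined here as words per tag (one of several plausible definitions), and the names `wordDensity` and `densest` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagRe matches anything that looks like an HTML tag.
var tagRe = regexp.MustCompile(`<[^>]*>`)

// wordDensity returns the number of words per tag in a fragment.
// Adding 1 to the tag count avoids division by zero for plain text.
func wordDensity(fragment string) float64 {
	tags := len(tagRe.FindAllString(fragment, -1))
	text := tagRe.ReplaceAllString(fragment, " ")
	words := len(strings.Fields(text))
	return float64(words) / float64(tags+1)
}

// densest returns the index of the fragment with the highest density,
// or -1 if there are no fragments.
func densest(fragments []string) int {
	best, bestScore := -1, -1.0
	for i, f := range fragments {
		if s := wordDensity(f); s > bestScore {
			best, bestScore = i, s
		}
	}
	return best
}

func main() {
	fragments := []string{
		`<nav><a href="/">Home</a><a href="/world">World</a><a href="/tech">Tech</a></nav>`,
		`<div><p>A faraway war is transforming a resort town on the Mediterranean coast.</p></div>`,
		`<footer><a href="/terms">Terms</a> <a href="/privacy">Privacy</a></footer>`,
	}
	fmt.Println("most article-like fragment:", densest(fragments))
}
```

Link-heavy blocks such as navigation and footers score low because almost every word is wrapped in its own tag, while running prose scores high. The heuristic would still be fooled by things like long cookie banners, so it is a starting point at best.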

All in all, this is quite a hefty task. If something like this is implemented that'd be great. I personally have some feeds like that in my newsboat config, and those are displayed with only a link to the article.

@guyfedwards
Owner

I think that, short-term, opening the link is a sufficient solution. Longer term, we can look at adding HTML parsing capability, but that will be a bit more of a challenge.

@apainintheneck

The circumflex program allows you to read Hacker News articles in the terminal. It does that by using the Go-Readability package to find the main readable content and the metadata of an HTML page. After that comes some post-processing to turn the result into markdown that can be displayed in the terminal (relevant code). My understanding is that this is not 100% reliable, but anecdotally it seems to work for most articles that get posted on HN.

Maybe a similar approach would work here?
