Unable to read individual articles for Atom and RSS 1.0 feeds #10

yonas · 2022-12-29T21:09:13Z

nom successfully lists the feed items, but attempting to read an individual article only shows the title and date:

   Moscow on the Med: A Faraway War Transforms a Turkish Resort Town          
                                                                              
   2022-12-29 10:00:26 +0000 UTC

Tested with https://rss.nytimes.com/services/xml/rss/nyt/World.xml


yonas · 2022-12-29T21:13:09Z

It doesn't work for this RSS v1.0 feed either: http://feeds.bbci.co.uk/news/rss.xml

guyfedwards · 2022-12-29T23:59:02Z

Hmm, this is an interesting case. These feeds are just links to the articles without containing the content. We could attempt to fetch the content from the URL but the markdown conversion from a full page html is likely to be funky.

One option here would be to just open the links that have no content in the browser using xdg-open or similar.

Would that suit your use case? I can try and look at parsing, but this opens up a large can of worms; even the Times site in your example requires a Times account to actually get the content.
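The browser fallback suggested above could be sketched in Go roughly as follows. This is a minimal sketch under assumptions, not nom's actual code: the `Item` struct and the `needsBrowser`, `openerFor`, and `openInBrowser` helpers are hypothetical names, and the Linux branch assumes `xdg-open` is installed.

```go
package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"strings"
)

// Item is a hypothetical, simplified feed item; nom's real types may differ.
type Item struct {
	Title   string
	Link    string
	Content string
}

// needsBrowser reports whether an item has a link but no readable body.
func needsBrowser(it Item) bool {
	return strings.TrimSpace(it.Content) == "" && it.Link != ""
}

// openerFor returns the command and arguments that open url with the
// default handler for the given GOOS value.
func openerFor(goos, url string) (string, []string) {
	switch goos {
	case "darwin":
		return "open", []string{url}
	case "windows":
		return "rundll32", []string{"url.dll,FileProtocolHandler", url}
	default: // Linux and the BSDs; assumes xdg-open is available
		return "xdg-open", []string{url}
	}
}

// openInBrowser starts the opener and returns without waiting for it to exit.
func openInBrowser(url string) error {
	name, args := openerFor(runtime.GOOS, url)
	return exec.Command(name, args...).Start()
}

func main() {
	it := Item{Title: "Example", Link: "https://example.com/article"}
	if needsBrowser(it) {
		// Print instead of launching so the sketch has no side effects.
		name, args := openerFor(runtime.GOOS, it.Link)
		fmt.Println("would run:", name, strings.Join(args, " "))
	}
}
```

Keeping the command selection in a separate function from the launch makes the platform logic easy to check without actually starting a browser.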

yonas · 2022-12-30T00:24:43Z

Hi @guyfedwards

> We could attempt to fetch the content from the URL but the markdown conversion from a full page html is likely to be funky.

Yes, it might be a bit of a challenge. You'll want to find a library to strip script, style, and noscript tags, and another to convert html to markdown. I see you're already making use of html-to-markdown.
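As a rough illustration of the stripping step, the sketch below removes `script`, `style`, and `noscript` elements using only the Go standard library. It is an assumption-laden shortcut: regular expressions are not a robust way to process HTML, and a real implementation would more likely walk a parsed document tree. The name `stripNoisyElements` is hypothetical. The cleaned HTML could then be handed to whatever HTML-to-markdown converter is already in use.

```go
package main

import (
	"fmt"
	"regexp"
)

// noisyElements holds one pattern per element whose contents are never
// article text. (?is) makes matching case-insensitive and lets . span newlines.
var noisyElements = compileNoisy("script", "style", "noscript")

// compileNoisy builds a pattern matching each tag together with its body.
func compileNoisy(tags ...string) []*regexp.Regexp {
	res := make([]*regexp.Regexp, 0, len(tags))
	for _, tag := range tags {
		res = append(res, regexp.MustCompile(`(?is)<`+tag+`\b[^>]*>.*?</`+tag+`\s*>`))
	}
	return res
}

// stripNoisyElements deletes every matched element, including its contents.
func stripNoisyElements(html string) string {
	for _, re := range noisyElements {
		html = re.ReplaceAllString(html, "")
	}
	return html
}

func main() {
	page := `<p>Hello</p><script>alert(1)</script><p>World</p>`
	fmt.Println(stripNoisyElements(page))
}
```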

> One option here would be to just open the links that have no content in the browser using xdg-open or similar.
> Would that suit your usecase?

Not quite. I'm interested in reading the article in the terminal via the RSS reader without leaving the app.

> ...even the times site in your example requires a times account to actually get the content.

I'm able to get the article content via `w3m https://www.nytimes.com/live/2022/12/29/world/russia-ukraine-news`. Does this work for you?

Nemoden · 2023-01-28T07:58:24Z

Well, w3m is a complete browser, so of course it displays the article: it renders practically all the elements on the page.

Fetching ONLY an article from a web page is a little tricky. I'm not sure there is any good way of scraping only the article contents from HTML while stripping out headers, navigation, sidebars, footers, etc., as the structure of a web page containing an article isn't standardized. In fact, I can build a whole page with any of those blocks out of styled div elements (without even using classes). Not only that, many pages contain skeletons for things like modal windows, or hidden content that becomes visible on click (or some other user interaction with the interface). Some will also have the full content visible only with a subscription.

Returning to the question of how to fetch article contents from a webpage... One way I can think of is finding the element with the highest word density (after all the tags are stripped), maybe?
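One possible reading of the word-density idea is sketched below. It assumes the page has already been split into candidate HTML fragments (for example, one per top-level block), which is itself a non-trivial step not shown here. The scoring formula and the names `wordDensityScore` and `densestBlock` are assumptions made for illustration, not an established algorithm.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagPattern matches anything that looks like an HTML tag.
var tagPattern = regexp.MustCompile(`<[^>]*>`)

// wordDensityScore weights a fragment's word count by the share of its
// bytes that are text rather than markup.
func wordDensityScore(fragment string) float64 {
	if len(fragment) == 0 {
		return 0
	}
	words := strings.Fields(tagPattern.ReplaceAllString(fragment, " "))
	textLen := len(strings.Join(words, " "))
	return float64(len(words)) * float64(textLen) / float64(len(fragment))
}

// densestBlock returns the index of the highest-scoring fragment,
// or -1 when there are no fragments.
func densestBlock(blocks []string) int {
	best, bestScore := -1, -1.0
	for i, b := range blocks {
		if s := wordDensityScore(b); s > bestScore {
			best, bestScore = i, s
		}
	}
	return best
}

func main() {
	blocks := []string{
		`<nav><a href="/">Home</a><a href="/world">World</a></nav>`,
		`<div><p>The quick brown fox jumps over the lazy dog near the quiet harbour.</p></div>`,
		`<footer><span>Copyright Example</span></footer>`,
	}
	fmt.Println("densest block:", densestBlock(blocks))
}
```

Link-heavy navigation scores low because most of its bytes are markup, while a long paragraph scores high on both word count and text share. Hidden or paywalled content, as noted above, would still defeat a heuristic like this.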

All in all, this is quite a hefty task. If something like this is implemented that'd be great. I personally have some feeds like that in my newsboat config, and those are displayed with only a link to the article.

guyfedwards · 2023-01-28T10:08:33Z

I think that, short term, opening the link is a sufficient solution. Longer term we can look at adding HTML parsing capability, but that will be a bit more of a challenge.

apainintheneck · 2024-09-22T00:42:21Z

The circumflex program allows you to read Hacker News articles in the terminal. It accomplishes that by using the Go-Readability package to find the main readable content and the metadata of an HTML page. After that comes some post-processing to turn it into markdown that can be displayed in the terminal (relevant code). My understanding is that this is not 100% reliable, but anecdotally it seems to work for most articles that get posted on HN.

Maybe a similar approach would work here?

yonas changed the title ~~Unable to read individual articles for Atom feeds~~ Unable to read individual articles for Atom and RSS 1.0 feeds Dec 29, 2022

guyfedwards added the help wanted Extra attention is needed label Jan 2, 2023

Lovebird-Connoisseur mentioned this issue Jan 29, 2023

A couple of ideas #34

Open
