Allow to customise internal HTML parser/decoder #622
Comments
An example would help me understand better. I believe you ran into this problem on one of your feeds.
Sure. Consider https://www.girlswithslingshots.com/comic/rss. The feed view that I see is: I can see that comes from the content of the page but, since the site is a webcomic, that text content isn't the most relevant detail. I don't expect FeedMe to know this automatically, but it would be useful if there was a way for me to tell FeedMe to include additional content on a feed-by-feed basis.
The Google Web Light version is better, but gets low-resolution images, which I guess is because of the target server's reaction to the Google user agent, and not something you can change. If I could select the content I wanted, I might also need to change the User-Agent, in case the server chooses the low-resolution images for FeedMe too.
Oh, I just realised that the HTML parser is what FeedMe calls the mobilizer. The newest version, 4.0.4, fixed a Feedbin parser issue. Please try it.
Perhaps I am not explaining this well enough. I would like to have more control over content extraction done by the mobilizer. I assume that, from the existing three mobilizer options, the one called FeedMe mobilizer is written by you, so is the one that you have the most control over. For example, given the following RSS XML item (taken from the example feed above):
This results in one of four possible views:
The feed view: I presume this is directly taken from the The image is low-quality, because the embedded In this case, FeedMe does the best it can with what it has been given -- there is not much more that can be done to improve it.
The web view using the Feedbin mobilizer: I presume this is Feedbin's parsing of the response it got from the However, since this is a third-party mobilizer, I presume that there is nothing more that you can do to improve this.
The web view using the Google Web Light mobilizer: The Google mobilizer does a better job of extracting content from the page behind the Again, since this is a third-party mobilizer, I presume that there is nothing more that you can do to improve this.
The web view using the FeedMe mobilizer: I presume the FeedMe mobilizer uses code that you have written to extract the HTML content from the web page behind the I don't know how your code tries to identify the relevant part of the content, but I would like to be able to override (or at least influence) your logic. I don't expect your code to understand how every webpage is created, so I would like to be able to configure what is relevant for the feeds where your existing code does not work as well.
As such, I would like to have a feed setting where I can enter selectors for additional content that you should include in the mobilizer result. It would be an advanced setting, and would only affect the FeedMe mobilizer. Something like (crude mockup): I have suggested using the syntax of the modern browsers' JavaScript
I would also like to be able to configure the User Agent header that the FeedMe mobilizer uses to request HTML from servers, as a separate setting. This would let me work around cases where the target server uses this header to restrict the content that is sent.
I have only installed FeedMe 4.04 today (and I look forward to that fix for (null) content). The above screenshots were taken from 4.04, but the cache content may well be from when it was 4.03. I will report back if 4.04 significantly improves the content I see, but I expect the Feedbin mobilizer will not have improved its own content recognition significantly.
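To make the two requested settings concrete, here is a minimal sketch of how per-feed overrides might be stored and applied when fetching an article page. It is only an illustration: `FeedSettings`, `build_request`, `settings_by_feed`, and `DEFAULT_USER_AGENT` are hypothetical names, the User-Agent strings are made up, and none of this reflects how FeedMe is actually implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from urllib.request import Request

# Hypothetical default; FeedMe's real User-Agent string is not given in this thread.
DEFAULT_USER_AGENT = "ExampleReader/1.0"


@dataclass
class FeedSettings:
    """Hypothetical per-feed advanced settings (all names are illustrative)."""
    extra_selectors: List[str] = field(default_factory=list)
    user_agent: Optional[str] = None  # None means "use the app default"


def build_request(article_url: str, settings: FeedSettings) -> Request:
    """Build (but do not send) the HTTP request for an article page."""
    user_agent = settings.user_agent or DEFAULT_USER_AGENT
    return Request(article_url, headers={"User-Agent": user_agent})


# Settings are keyed by feed URL, so an override only affects that one feed.
settings_by_feed: Dict[str, FeedSettings] = {
    "https://www.girlswithslingshots.com/comic/rss": FeedSettings(
        extra_selectors=["img#main"],
        user_agent="Mozilla/5.0 (X11; Linux x86_64)",
    ),
}
```

Keeping the User-Agent separate from the selectors matches the request above: either setting can be left at its default without affecting the other.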
I'm pretty sure this is the longest comment I've ever seen in FeedMe issues. Give me some time to read.
I read your comment; this is actually what FeedMe mobilizer 2.0 will do. The difference from your idea is that I would like to provide a preview window to help users pick the area they want, rather than a technical selector. But this is not easy work, so it is not implemented yet. I can try to implement your idea first; this won't take much time.
Show new input to enter the Here is my demo: Lastly, this will be supported in 4.1.
The documentation and demo both look great. I can imagine how difficult providing an interactive preview would be, so I think using a CSS selector is a reasonable alternative, at least to start with.
4.1 released.
I realise that this is an old thread, but since some good ideas came out of it, I thought I'd share my two cents. I know that this would be a lot of work, but would it be possible to have a parsing ability using a method similar to the one Feed43 uses? It would be similar to the current system of '#' for id and '.' for class, but with the added ability to tag multiple areas and rearrange them before they are parsed into the final output/article. Sorry if this doesn't make sense; I'm not a developer, so my understanding of things might be too simplistic.
The bonus of having a method/workflow similar to Feed43's is that FeedMe users would be able to test out parsing 'recipes' for a webpage on Feed43's site (on either desktop or mobile) before implementing them in the FeedMe app. Finally, perhaps after this we could have crowdsourced 'parsing recipes', if you will, posted in the documentation area of this project. Who knows, once a modular approach is implemented, FeedMe users could simply get a recipe for whatever websites they're using and add it to the mobiliser setting, or perhaps a plugin? I would like to help with the UX/UI as well as the documentation.
Feed43 is too technical for normal users. As I mentioned before, tapping to select the text area is what the FeedMe mobilizer aims to do.
The built-in HTML decoder does not always recognise images as important content, and excludes them from the downloads. It would be nice to be able to configure this in some way.
For instance, it may be useful to allow for specifying selectors that are always included, either at a top level or for specific feeds, e.g. img#main or div.main > img:nth-child(1). These would be patterns of extra content, in addition to that which you already extract.
It may also be useful to allow overriding the User-Agent string used when fetching a particular feed's content, in case the server sends different image qualities in response to UA.
(I note that the other parsers also sometimes ignore images, or download lower-resolution images - I figured being able to configure the internal parser might be more achievable than changing the behaviour of those external services.)
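As a rough illustration of the "always include" selectors suggested above, the sketch below collects the src of any element matching a configured selector, using only Python's standard library. It is not FeedMe's parser: `parse_simple_selector`, `SourceCollector`, and `extra_sources` are hypothetical names, and only the tiny tag#id / tag.class subset is handled. Combinators and pseudo-classes such as div.main > img:nth-child(1) would need a real CSS selector engine.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

Rule = Tuple[Optional[str], Optional[str], Optional[str]]


def parse_simple_selector(selector: str) -> Rule:
    """Split 'tag#id' or 'tag.class' into (tag, id, class).

    Only this small subset of CSS is supported in this sketch.
    """
    tag, elem_id, cls = selector, None, None
    if "#" in selector:
        tag, elem_id = selector.split("#", 1)
    elif "." in selector:
        tag, cls = selector.split(".", 1)
    return (tag or None, elem_id, cls)


class SourceCollector(HTMLParser):
    """Collects the src attribute of every element matching any rule."""

    def __init__(self, selectors: List[str]) -> None:
        super().__init__()
        self.rules = [parse_simple_selector(s) for s in selectors]
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        for want_tag, want_id, want_cls in self.rules:
            if want_tag and want_tag != tag:
                continue
            if want_id and attr.get("id") != want_id:
                continue
            if want_cls and want_cls not in classes:
                continue
            if attr.get("src"):
                self.sources.append(attr["src"])
            break  # record each element at most once


def extra_sources(html: str, selectors: List[str]) -> List[str]:
    """Return src values of elements matched by the configured selectors."""
    collector = SourceCollector(selectors)
    collector.feed(html)
    collector.close()
    return collector.sources
```

The matched sources would be added to whatever the existing extraction already found, so the selectors only ever add content and never replace it.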
Again, thanks for continuing to maintain this app!