Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSS Selector Bridge 2 #3626

Merged
merged 8 commits into from
Aug 22, 2023
Merged

Conversation

LarsStegman
Copy link
Contributor

Hey maintainers,

I was working with the CSS Selector bridge, but I was missing some features. This is a pretty big change (breaking) in how the bridge works, so I made it CssSelector2Bridge.

It adds support for:

  • Cookies
  • Timestamps
  • Categories
  • Author

It works by this principle:

  1. From the page containing the collection of articles, select all elements describing the articles
  2. From each article element, select the article URLs
  3. If needed, download every article page and use the provided article selector to find the article content element
  4. Parse the article contents using the provided selectors and filters

I made this because I found the original parser a bit too limited for my needs.

@github-actions
Copy link

github-actions bot commented Aug 20, 2023

Pull request artifacts

file last change
CssSelectorComplexBridge-pr-context1 2023-08-22, 19:05:53

@dvikan
Copy link
Contributor

dvikan commented Aug 21, 2023

@ORelio

@ORelio
Copy link
Contributor

ORelio commented Aug 21, 2023

Looks nice!

I designed the CSS Selector bridge to be as simple as possible because we can't address everything using a selector bridge: each site may have its little quirks requiring special handling, so CssSelectorBridge bridge was not designed to be a full replacement of the regular way of writing bridges in PHP, but rather a quick way of generating a feed for simple sites that don't have any specificity to take into account. Addressing everything may be frustrating and/or lead to feature bloat, difficulty to maintain the bridge to accommodate fixes for each user without breaking existing feeds.

Note that if the site has a truncated RSS feed, you can use WordPressBridge to expand it and retain metadata such as author and timestamp from the feed. The site does not need to run WordPress for the bridge to work as you can specify a custom content selector if needed, so WordPressBridge should be preferred over CssSelectorBridge when possible.

Now, let's assume that we still want to generate complex feeds using CSS Selectors anyway and not write a PHP bridge.
I see that the changes will indeed break existing feeds, so CssSelector2Bridge makes sense.
Here are some thoughts about the added features:

  • Cookies: Are you sure your cookie will not expire after some time, requiring you to recreate/update your feed? Seems more convenient to write a bridge in PHP that would first fetch the cookie automatically, then use it. I'd be curious of your use case for that feature in Css Selector Bridge 🤔

  • Timestamps: After giving it some thought, I did not implement them in the CSS Selector Bridge because of so many sites that use timestamps not compatible with strtotime(). Adding support for timestamps would work better if also adding support for custom date formats as additional setting. Also, as if it wasn't already complicated enough, the timestamp may be either on home page or on article page. So an additional switch would be needed (that makes 3 settings just for the timestamp). But then I realized that when timestamp is missing, feed aggregators take the time where the item is first discovered, which does the job well in most cases. If you prefer adding support for timestamps, consider adding format setting. That will surely be handy.

  • Categories/Author: Note that in some case, they may be present on home page only.

  • remove_styling: Note that this is usually up to the feed reader application to do this.

The CssSelector2Bridge looks nice as is, so you are not required to make further changes. I'm just making suggestions.
Just let @dvikan know when you are ready for merging.

@LarsStegman
Copy link
Contributor Author

Thanks for your detailed reply!

I designed the CSS Selector bridge to be as simple as possible because we can't address everything using a selector bridge: each site may have its little quirks requiring special handling, so CssSelectorBridge bridge was not designed to be a full replacement of the regular way of writing bridges in PHP, but rather a quick way of generating a feed for simple sites that don't have any specificity to take into account. Addressing everything may be frustrating and/or lead to feature bloat, difficulty to maintain the bridge to accommodate fixes for each user without breaking existing feeds.

I agree there is something to say for keeping it more simple. I do think this will add valuable functionality, since the way to select content is more powerful.

Note that if the site has a truncated RSS feed, you can use WordPressBridge to expand it and retain metadata such as author and timestamp from the feed. The site does not need to run WordPress for the bridge to work as you can specify a custom content selector if needed, so WordPressBridge should be preferred over CssSelectorBridge when possible.

I am not sure what you mean. I don't think I did anything to expand truncated feeds.

Now, let's assume that we still want to generate complex feeds using CSS Selectors anyway and not write a PHP bridge. I see that the changes will indeed break existing feeds, so CssSelector2Bridge makes sense. Here are some thoughts about the added features:

  • Cookies: Are you sure your cookie will not expire after some time, requiring you to recreate/update your feed? Seems more convenient to write a bridge in PHP that would first fetch the cookie automatically, then use it. I'd be curious of your use case for that feature in Css Selector Bridge 🤔

The website I was using to test this just required a cookie to enable loading the content. Just passing a cookie with a certain string would enable loading the content already. It didn't have to be valid. See issue #3612.

  • Timestamps: After giving it some thought, I did not implement them in the CSS Selector Bridge because of so many sites that use timestamps not compatible with strtotime(). Adding support for timestamps would work better if also adding support for custom date formats as additional setting. Also, as if it wasn't already complicated enough, the timestamp may be either on home page or on article page. So an additional switch would be needed (that makes 3 settings just for the timestamp). But then I realized that when timestamp is missing, feed aggregators take the time where the item is first discovered, which does the job well in most cases. If you prefer adding support for timestamps, consider adding format setting. That will surely be handy.

Good idea. I didn't feel like implementing it, but I can spend some time doing it.

  • Categories/Author: Note that in some case, they may be present on home page only.

True, but I don't think this is worth supporting right now.

  • remove_styling: Note that this is usually up to the feed reader application to do this.

Interesting, I didn't know that as I just starting using RSS. Do you recommend removing this feature?

The CssSelector2Bridge looks nice as is, so you are not required to make further changes. I'm just making suggestions. Just let @dvikan know when you are ready for merging.

👍

Copy link
Contributor

@ORelio ORelio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is some additional feedback about the code:

bridges/CssSelector2Bridge.php Outdated Show resolved Hide resolved
bridges/CssSelector2Bridge.php Outdated Show resolved Hide resolved
bridges/CssSelector2Bridge.php Outdated Show resolved Hide resolved
bridges/CssSelector2Bridge.php Outdated Show resolved Hide resolved
bridges/CssSelector2Bridge.php Outdated Show resolved Hide resolved
bridges/CssSelector2Bridge.php Outdated Show resolved Hide resolved
bridges/CssSelector2Bridge.php Outdated Show resolved Hide resolved
bridges/CssSelector2Bridge.php Outdated Show resolved Hide resolved
bridges/CssSelector2Bridge.php Outdated Show resolved Hide resolved
@ORelio
Copy link
Contributor

ORelio commented Aug 21, 2023

Note that if the site has a truncated RSS feed, you can use WordPressBridge to expand it and retain metadata such as author and timestamp from the feed. The site does not need to run WordPress for the bridge to work as you can specify a custom content selector if needed, so WordPressBridge should be preferred over CssSelectorBridge when possible.

I am not sure what you mean. I don't think I did anything to expand truncated feeds.

I meant, as a user, you may use WordPressBridge rather than CssSelectorBridge when the site has a truncated feed.

The website I was using to test this just required a cookie to enable loading the content. Just passing a cookie with a certain string would enable loading the content already. It didn't have to be valid. See issue #3612.

Oh, OK, got it. I did not see your issue, feel free to ping me whenever your issues are related to a bridge I maintain. This is rather strange, because the site may fail to be indexed properly in search engines if it does require a cookie.

Good idea. I didn't feel like implementing it, but I can spend some time doing it.

No pressure 😉

Interesting, I didn't know that as I just starting using RSS. Do you recommend removing this feature?

Then welcome! Feel free to first test your feed with and without styling in your feed reader to see if that makes a difference, and decide based on your results. From my experience, I can just say that most bridges in rss-bridge will not clean style/class attributes from the HTML content, but all RSS readers I tried will just ignore them.

@LarsStegman
Copy link
Contributor Author

I added the possibility to specify the time format.

I also tested not removing the styling, but my own RSS Reader (NetNewsWire) did not ignore it and it caused a large amount of whitespace to be contained in the article, so I kept that functionality.

@LarsStegman
Copy link
Contributor Author

@dvikan I am ready with the PR.

Co-authored-by: ORelio <ORelio@users.noreply.github.com>
@dvikan
Copy link
Contributor

dvikan commented Aug 22, 2023

find a less stupid name e.g. CssSelectorComplexBridge

@LarsStegman
Copy link
Contributor Author

Done

@dvikan dvikan merged commit eb4ff70 into RSS-Bridge:master Aug 22, 2023
@dvikan
Copy link
Contributor

dvikan commented Sep 14, 2023

Trying to get property 'innertext' of non-object at bridges/CssSelectorComplexBridge.php line 426

@ORelio
Copy link
Contributor

ORelio commented Sep 14, 2023

@dvikan Check your Author selector

$author = null;
if (!is_null($author_selector) && $author_selector != '') {
$author = trim($entry_html->find($author_selector, 0)->innertext);
}

@dvikan
Copy link
Contributor

dvikan commented Sep 21, 2023

please fix:

Trying to get property 'innertext' of non-object at bridges/CssSelectorComplexBridge.php line 421

@dvikan
Copy link
Contributor

dvikan commented Sep 22, 2023

@LarsStegman

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants