[FabriceBellard]: New Bridge #1220

Merged Jul 29, 2019 (3 commits)

somini (Contributor) commented Jul 22, 2019

A feed for the home page of Fabrice Bellard, which is particularly barebones (in terms of HTML; the content is the opposite).

https://bellard.org/

This is the simplest HTML possible. I thought this would be very easy, but it ended up stress-testing the HTML parser library more than I expected.

  • Relative links are not "absolutized" at all.

If a link's href property is apage.html, that's the exact string returned by ->href. At the very least, there should be a global function to do the equivalent of this, given a root domain:
https://github.com/somini/rss-bridge/blob/99a39ff2633dfa8e71f3cc2edbd17cd0575a14e0/bridges/FabriceBellardBridge.php#L19-L24

Worse, if a link is something like "http://gihub.com" (note the missing trailing slash), it is not parsed as a URI. This is hacked around here: https://github.com/somini/rss-bridge/blob/99a39ff2633dfa8e71f3cc2edbd17cd0575a14e0/bridges/FabriceBellardBridge.php#L26-L28
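
For illustration, here is a minimal sketch of what such a global helper could look like. The name `absolutizeUrl` and its signature are hypothetical (they are not part of RSS-Bridge or the parser library), and the sketch assumes the root is an absolute http(s) URL. It deliberately leaves `../` and `./` segments unresolved.

```php
<?php
// Hypothetical helper for illustration; not part of RSS-Bridge or the parser library.
// Assumes $root is an absolute http(s) URL such as "https://bellard.org/".
function absolutizeUrl(string $href, string $root): string
{
    // A leading scheme ("https:", "mailto:", ...) means the link is already absolute.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href) === 1) {
        return $href;
    }

    $base = parse_url($root);
    $origin = $base['scheme'] . '://' . $base['host'];
    if (isset($base['port'])) {
        $origin .= ':' . $base['port'];
    }

    // Protocol-relative link: "//example.org/x" reuses the root's scheme.
    if (strpos($href, '//') === 0) {
        return $base['scheme'] . ':' . $href;
    }

    // Root-relative link: "/x" is resolved against the origin.
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }

    // Document-relative link: resolve against the directory of the root's path.
    // "../" and "./" segments are left untouched in this sketch.
    $path = $base['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

echo absolutizeUrl('apage.html', 'https://bellard.org/'), PHP_EOL;
```

A link that already carries a scheme, with or without a trailing slash, is returned unchanged, so it needs no special case in this sketch.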

  • There is a strange issue with the ->find function, but only sometimes.

Consider the following HTML snippet:

<body>
   <p>This is paragraph 1
   <p>This is paragraph 2
   <p>This is paragraph 3</p>
</body>

$html->find('p') returns 3 elements, as expected, and the innertext property of each works as expected (This is paragraph #). However, if you do another find on one of the returned objects, the "implied" paragraph end is ignored and the search continues to find elements until the end of the page. This bug stays hidden in this implementation, since I'm only interested in the first link of each element.
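
For comparison, the expected per-paragraph scoping can be sketched with PHP's bundled DOMDocument (libxml2). This is a different parser from the one the bridge uses, swapped in only to show the intended result, and the links in the sample markup are made up for illustration.

```php
<?php
// Sample markup with implied paragraph ends; the links are invented for this sketch.
$html = <<<HTML
<body>
   <p>This is paragraph 1 <a href="one.html">one</a>
   <p>This is paragraph 2 <a href="two.html">two</a>
   <p>This is paragraph 3 <a href="three.html">three</a></p>
</body>
HTML;

// DOMDocument is PHP's bundled libxml2 parser, NOT the parser used by the bridge.
libxml_use_internal_errors(true); // keep parse warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);

$firstLinks = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    // Search only within this paragraph's own subtree.
    $links = $p->getElementsByTagName('a');
    $firstLinks[] = $links->length > 0 ? $links->item(0)->getAttribute('href') : null;
}

print_r($firstLinks);
```

If the implied paragraph ends are honoured, each nested search should see only the link inside its own paragraph.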

teromene (Member) commented Jul 26, 2019

Concerning the relative links, we do have the defaultLinkTo function that handles this.

There are still hacks needed, and it still fails sometimes...

somini (Contributor, Author) commented Jul 26, 2019

This is even documented: https://github.com/RSS-Bridge/rss-bridge/wiki/defaultLinkTo

mea culpa 😞

teromene merged commit 990719d into RSS-Bridge:master on Jul 29, 2019

teromene (Member) commented

Merged, thanks!

somini deleted the new-bridge/bellard branch on February 27, 2020 01:30
infominer33 pushed a commit to web-work-tools/rss-bridge that referenced this pull request Apr 17, 2020