[FabriceBellard]: New Bridge #1220

somini · 2019-07-22T00:12:36Z

A feed for the home page of Fabrice Bellard, which is particularly barebones in terms of HTML (the opposite of its content).

https://bellard.org/

This is the simplest HTML possible, so I thought this would be very easy, but it stress-tested the HTML parser library more than I expected.

Relative links are not "absolutized" at all.

If a link's href property is apage.html, that's the exact string returned by ->href. At the very least, there should be a global function to do the equivalent of this, given a root domain:
https://github.com/somini/rss-bridge/blob/99a39ff2633dfa8e71f3cc2edbd17cd0575a14e0/bridges/FabriceBellardBridge.php#L19-L24
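To illustrate the kind of helper meant here, a minimal sketch follows. It is written in Python for brevity, not in the bridge's PHP, and the function name `absolutize` is hypothetical; it is not the code at the link above. It only shows the idea of resolving an href against a root URL.

```python
from urllib.parse import urljoin


def absolutize(href: str, root: str) -> str:
    """Resolve href against root (hypothetical helper name).

    Relative hrefs are joined onto the root; hrefs that are already
    absolute are returned unchanged.
    """
    return urljoin(root, href)


# Usage: a relative link taken from a page served at the given root.
print(absolutize("apage.html", "https://bellard.org/"))
```

Resolving against the page URL rather than concatenating strings also covers root-relative links such as "/dir/page.html".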

Worse, if a link is something like "http://gihub.com" (note the missing trailing slash), it is not parsed as a URI. This is hacked around here: https://github.com/somini/rss-bridge/blob/99a39ff2633dfa8e71f3cc2edbd17cd0575a14e0/bridges/FabriceBellardBridge.php#L26-L28
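What the linked workaround does is not reproduced here. As an assumption, one plausible normalization is to give a bare "scheme://host" URL an explicit "/" path before handing it to a stricter parser. The sketch below is in Python, not the bridge's PHP, and the name `ensure_root_path` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_root_path(url: str) -> str:
    """Add a "/" path to absolute URLs that have none (hypothetical helper).

    Relative references and URLs that already have a path are
    returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme and parts.netloc and not parts.path:
        parts = parts._replace(path="/")
    return urlunsplit(parts)


# Usage: the slash-less link from the example above.
print(ensure_root_path("http://gihub.com"))
```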

There is a strange issue with the ->find function, but only sometimes.

Consider the following HTML snippet:

<body>
   <p>This is paragraph 1
   <p>This is paragraph 2
   <p>This is paragraph 3</p>
</body>

$html->find('p') returns 3 elements, as expected, and the innertext property of each works as expected (This is paragraph #). However, if you do another find on one of the returned objects, the "implied" paragraph end is ignored and the search continues to find elements until the end of the page. This bug stays hidden in this implementation, since I'm only interested in the first link of each element.
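For comparison, here is a small sketch of the behaviour one would expect, where a new `<p>` start tag implicitly closes the paragraph that is still open. It uses Python's standard `html.parser`, not the PHP library the bridge uses, and the class and function names are hypothetical. It handles only the `<p>` followed by `<p>` case from the snippet above; other elements that also end a paragraph implicitly are not covered.

```python
from html.parser import HTMLParser


class FirstLinkPerParagraph(HTMLParser):
    """Collect the first link href of each <p> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.first_links = []  # one entry per <p>; None if it has no link
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> implicitly ends any paragraph that is still open.
            self._in_p = True
            self.first_links.append(None)
        elif tag == "a" and self._in_p and self.first_links[-1] is None:
            self.first_links[-1] = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in ("p", "body"):
            self._in_p = False


def first_links(html: str) -> list:
    parser = FirstLinkPerParagraph()
    parser.feed(html)
    parser.close()
    return parser.first_links
```

With this approach a paragraph without a link yields None, instead of picking up a link that belongs to a later paragraph.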

teromene · 2019-07-26T08:46:00Z

Concerning the link page, we do have the defaultLinkTo function that handles this.

There are still hacks needed, and it still fails sometimes...

somini · 2019-07-26T23:16:48Z

This is even documented: https://github.com/RSS-Bridge/rss-bridge/wiki/defaultLinkTo

mea culpa 😞

teromene · 2019-07-29T10:13:06Z

Merged, thanks!

[FabriceBellard]: New Bridge (99a39ff)

Use defaultLinkTo to improve HTML parsing (ec386dd)
There are still hacks needed, and it still fails sometimes...

Appease Travis (8295662)

teromene merged commit 990719d into RSS-Bridge:master Jul 29, 2019

somini deleted the new-bridge/bellard branch February 27, 2020 01:30

infominer33 pushed a commit to web-work-tools/rss-bridge that referenced this pull request Apr 17, 2020

[FabriceBellard]: New Bridge (RSS-Bridge#1220)

680f2f5
