Cannot parse a malformed feed #91

Open
anewuser opened this issue Apr 22, 2022 · 3 comments


@anewuser
Contributor

This feed currently contains an encoding issue and Pipes cannot parse it: http://feedrinse.com/services/channel/?chanurl=882eadeaf9ef2636f65d656793114983 .

I've reported this to Feed Rinse, but could you take a look to see if you can find a workaround? SimplePie can handle it and just replaces the broken character with a normal question mark: https://www.simplepie.org/demo/?feed=http%3A%2F%2Ffeedrinse.com%2Fservices%2Fchannel%2F%3Fchanurl%3D882eadeaf9ef2636f65d656793114983

@onli
Member

onli commented Apr 22, 2022

Hey. Not sure how to solve this. This is the error we get:

2022-04-22 20:27:29 - ArgumentError - invalid byte sequence in UTF-8:
	.../pipes/vendor/bundle/ruby/3.1.0/gems/rss-0.2.9/lib/rss/parser.rb:132:in `maybe_xml?'

The problematic String seems to be this:

<title>UNDER FALL JUSTICE オンライン限定シングル「壊れたオモチャ」2022年3月31日� ...</title>

I assume the � marks a multi-byte character that was cut off?

There are some workarounds, but they involve re-encoding that string from the RSS feed with something like https://ruby-doc.org/core-2.7.0/String.html#method-i-encode, or guessing the encoding with a gem. I'd be very worried about breaking a lot of things with that.
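
For illustration, here is a minimal sketch of the replace-the-broken-bytes kind of workaround, assuming the feed is meant to be UTF-8. `scrub_feed_body` is a hypothetical helper name, not something that exists in Pipes or the rss gem. It relies only on Ruby's built-in `String#scrub` and mimics the SimplePie behaviour of substituting a question mark. Whether applying this to every downloaded feed is safe is exactly the open question.

```ruby
# Hypothetical helper (not part of Pipes or the rss gem).
# Assumption: the feed body is supposed to be UTF-8.
def scrub_feed_body(raw, replacement = "?")
  # dup so the caller's string is untouched; force_encoding only relabels
  # the bytes as UTF-8, it does not convert them.
  text = raw.dup.force_encoding(Encoding::UTF_8)
  # Only touch the string when it actually contains invalid byte sequences.
  text.valid_encoding? ? text : text.scrub(replacement)
end

# A title whose last multi-byte character was cut after its first byte.
broken = "2022年3月31日\xE7 ...".b
puts scrub_feed_body(broken)
```

Feeds that are already valid UTF-8 pass through unchanged, so the risk is limited to feeds that would currently fail to parse anyway. Feeds that declare a different encoding would need separate handling.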

In general I agree that Pipes should just handle this feed somehow, but this is such a wide field of potential issues that hoping the input feed gets fixed seems like a better option to me. But input on how to safely solve this is welcome, I certainly might be wrong here.

@anewuser
Contributor Author

anewuser commented Apr 24, 2022

Thank you for looking into it.

> The problematic String seems to be this:

Yes, the original title says 31日発売.

> hoping the input feed gets fixed seems like a better option to me

Alright. This issue is too specific to worry too much about it. FeedRinse has some internal filters too, and I can use them to remove problematic posts in case they don't fix their parser soon.

@anewuser
Contributor Author

anewuser commented Aug 5, 2022

@onli What about this other case?

Something is causing one of my subscriptions to add invalid XML entities to item titles. I believe it's always a series of &#0;. I've contacted the site owner about it, but he never replied. The problem goes away for a while but then returns.

Removing the invalid character entities from the feed source at the moment the feed is downloaded would be enough to fix it. I can't do it with a replace block, though: as soon as the feed goes through a block, Pipes stops parsing it as XML.

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
  <title>Example</title>
  <link>https://example.com/</link>
  <description>Example feed</description>
  <item>
    <title>Bad title &#0;&#0;&#0;&#0;&#0;</title>
    <link>https://example.com/8325262</link>
    <description>An item with a bad title</description>
  </item>
  <item>
    <title>Good title</title>
    <link>https://example.com/4325262</link>
    <description>An item with a good title</description>
  </item>
</channel>
</rss>
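
A minimal sketch of that kind of cleanup step, run on the raw feed text before it reaches the XML parser. `xml_char?` and `strip_invalid_char_refs` are hypothetical names, not existing Pipes functions. The allowed ranges follow the `Char` production of XML 1.0. This is a plain text substitution, so it would also rewrite such sequences inside CDATA sections, where they are literal text rather than references.

```ruby
# Hypothetical helpers (not part of Pipes).

# True if the code point is a legal character in an XML 1.0 document.
def xml_char?(codepoint)
  codepoint == 0x9 || codepoint == 0xA || codepoint == 0xD ||
    (0x20..0xD7FF).cover?(codepoint) ||
    (0xE000..0xFFFD).cover?(codepoint) ||
    (0x10000..0x10FFFF).cover?(codepoint)
end

# Removes decimal (&#0;) and hexadecimal (&#x0;) character references that
# point at code points XML 1.0 forbids; all other references are kept as-is.
def strip_invalid_char_refs(xml)
  xml.gsub(/&#(?:x([0-9a-fA-F]+)|([0-9]+));/) do |ref|
    codepoint = $1 ? $1.to_i(16) : $2.to_i
    xml_char?(codepoint) ? ref : ""
  end
end

puts strip_invalid_char_refs("<title>Bad title &#0;&#0;&#0;&#0;&#0;</title>")
```

The cleaned string could then be handed to the feed parser as usual. Valid references such as `&#233;` and named entities such as `&amp;` are left alone.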

anewuser reopened this Aug 5, 2022