Add encoding key to TorrentSpider #3584

zhzero-hub · 2024-12-21T06:57:12Z

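When TorrentSpider parses XML with PyQuery, an XML document that contains an "encoding=xxx" declaration raises an error because of the declared encoding. Re-encoding the html_text passed to PyQuery fixes this, i.e. html_doc = PyQuery(html_text) -> html_doc = PyQuery(html_text.encode('utf-8')).
This PR therefore adds an encoding field to the indexer: when it is True, the text is encoded before parsing. It defaults to False so that existing code is unaffected.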
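
A minimal sketch of the switch described above, assuming a hypothetical helper named `prepare_parser_input`; the actual change lives inside TorrentSpider and may look different. It only shows the flag-gated encode step, whose result would then be handed to PyQuery.

```python
from typing import Union


def prepare_parser_input(html_text: str, encoding: bool = False) -> Union[str, bytes]:
    """Return the value to hand to PyQuery (hypothetical helper, not from the PR).

    encoding=False keeps the existing behaviour and passes the str through.
    encoding=True re-encodes to UTF-8 bytes, so an XML encoding declaration
    inside the text no longer triggers the error.
    """
    if encoding:
        return html_text.encode("utf-8")
    return html_text


xml_text = '<?xml version="1.0" encoding="utf-8"?><root><data>Example</data></root>'
print(type(prepare_parser_input(xml_text)).__name__)                 # str
print(type(prepare_parser_input(xml_text, encoding=True)).__name__)  # bytes
```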

InfinityPacer · 2024-12-21T07:08:50Z

@zhzero-hub Could you provide a concrete site scenario?

zhzero-hub · 2024-12-21T07:32:19Z

For example, this one:

<title>Mikan</title> Mikan is a CHINESE Public torrent tracker for ANIME https://maouzhkami.xyz/ en-US search
Any site whose XML response carries an encoding declaration has this problem.

InfinityPacer · 2024-12-21T07:33:13Z

What is the original encoding?

zhzero-hub · 2024-12-21T07:45:34Z

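The original encoding is simply utf-8. This has nothing to do with the content's original encoding; get_decoded_html_content decodes it successfully.
https://github.com/jxxghp/MoviePilot/blob/v2/app/modules/indexer/spider/__init__.py#L245
If ret here is XML content that contains an encoding declaration, self.parse(page_source) is guaranteed to fail.
The problem lies purely in how PyQuery parses XML content; even passing parser=xml to PyQuery does not fix it.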

zhzero-hub · 2024-12-21T07:56:40Z

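Pasting an answer from GPT:
PyQuery raises None Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration. because you are passing PyQuery a Unicode string that contains an encoding declaration (for example, <?xml version="1.0" encoding="utf-8"?>). PyQuery does not support this format.

Cause
PyQuery expects its input to be one of the following:

A bytes object.
A Unicode string without an encoding declaration.
A path to an HTML or XML file.
If the string you pass contains an XML encoding declaration, for example:

<?xml version="1.0" encoding="utf-8"?> <root> <data>Example</data> </root>

PyQuery raises the error above.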
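
To make the failing condition concrete, here is a small stdlib-only sketch that checks whether a string starts with an XML declaration naming an encoding. The function name and the regular expression are assumptions for illustration; the check that lxml performs internally may differ in detail.

```python
import re

# Approximation of "starts with an XML declaration that names an encoding".
_XML_ENCODING_DECL = re.compile(
    r"""^<\?xml[^>]*\sencoding\s*=\s*["'][^"']*["']""",
)


def has_encoding_declaration(text: str) -> bool:
    """True if text begins with <?xml ... encoding="..." (hypothetical helper)."""
    return _XML_ENCODING_DECL.match(text) is not None


print(has_encoding_declaration('<?xml version="1.0" encoding="utf-8"?><root/>'))  # True
print(has_encoding_declaration('<!DOCTYPE html><html></html>'))                   # False
```

Because the pattern is anchored at the very first character, a string that begins with a newline is not reported as carrying a declaration.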

InfinityPacer · 2024-12-28T16:23:42Z

OurBits shows garbled text; please re-evaluate the approach.

zhzero-hub · 2024-12-28T17:07:01Z

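Is this a regression (it worked before and this MR introduced the garbling), or is it a new-site adaptation issue?
If the former, my understanding is that the original logic should still apply everywhere: with existing TorrentSpider configurations unchanged, this MR should have no effect.
If the latter, what format is the content passed as the html_text argument of TorrentSpider.parse? If convenient, could you share the actual content?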

InfinityPacer · 2024-12-28T17:29:16Z

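Is this a regression (it worked before and this MR introduced the garbling), or is it a new-site adaptation issue? If the former, my understanding is that the original logic should still apply everywhere: with existing TorrentSpider configurations unchanged, this MR should have no effect. If the latter, what format is the content passed as the html_text argument of TorrentSpider.parse? If convenient, could you share the actual content?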

from pyquery import PyQuery

html_text = """
<!DOCTYPE html>
<html lang="zh">
<head>
    <meta name="generator" content="NexusPHP"/>
    <meta name="robots" content="noindex, nofollow, noarchive, nosnippet">
        <title>OurBits :: 登录 - Powered by NexusPHP</title>
</head>
"""

html_doc = PyQuery(html_text.encode('utf-8'))

print(str(html_doc))

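Existing sites are affected.
This can be reproduced whenever the HTML head lacks a charset declaration such as <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />.
At this entry point, normal site responses are all HTML.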
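
The garbling can be illustrated without any parser. The sketch below assumes that, when handed bytes with no declared charset, the HTML parser falls back to a single-byte encoding such as ISO-8859-1. That fallback is an assumption about the parser, not something stated in this thread.

```python
title = "OurBits :: 登录"

# Bytes as produced by html_text.encode('utf-8').
raw = title.encode("utf-8")

# Assumed fallback: the bytes are decoded as ISO-8859-1 because no charset is declared.
garbled = raw.decode("iso-8859-1")

print(garbled == title)  # False: the Chinese characters are now mojibake
# The bytes themselves are intact, so decoding them as UTF-8 restores the text.
print(garbled.encode("iso-8859-1").decode("utf-8") == title)  # True
```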

zhzero-hub · 2024-12-29T03:34:51Z

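Thanks for the detailed explanation. In this case, setting the indexer's encoding field to false is enough: the code then takes the html_doc = PyQuery(html_text) path and no longer re-encodes. lxml has a separate code path for parsing HTML that supports detecting the encoding.
The encoding=true logic targets non-HTML strings that specify an encoding. Taking XML as an example, its first line is usually <?xml version="1.0" encoding="utf-8"?>, which declares utf-8. Handing such a string to PyQuery triggers the problem this MR set out to solve, because lxml, which PyQuery uses, is very sensitive to the first line when parsing XML: it does not allow an encoding declaration there, and even prepending a newline to the XML string is enough to avoid the problem. For example: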

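from pyquery import PyQuery

html_text = """<?xml version="1.0" encoding="utf-8"?><root><data>Example</data></root>"""
html_doc = PyQuery(html_text)  # raises: the first line carries an encoding declaration
html_doc = PyQuery("\n" + html_text)  # no error: the first line is empty, so the XML declaration is parsed as an ordinary node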

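However, since I cannot tell whether adding a newline would affect parsing for existing sites, I chose to add a field to control this, so that the existing behaviour stays intact.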
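
One alternative that avoids a per-indexer flag is to drop the XML declaration before parsing, which is one of the two options the error message itself suggests ("XML fragments without declaration"). This is a hedged sketch only: `strip_xml_declaration` is a hypothetical helper, it has not been tested against existing sites, and it was not part of this PR.

```python
import re

# Leading XML declaration, e.g. <?xml version="1.0" encoding="utf-8"?>
_XML_DECL = re.compile(r"^\s*<\?xml\s[^>]*\?>\s*")


def strip_xml_declaration(text: str) -> str:
    """Remove a leading XML declaration; any other text is returned unchanged."""
    return _XML_DECL.sub("", text, count=1)


xml_text = '<?xml version="1.0" encoding="utf-8"?>\n<root><data>Example</data></root>'
print(strip_xml_declaration(xml_text))  # <root><data>Example</data></root>
```

The returned str could then be passed to PyQuery as before. HTML pages that do not start with an XML declaration pass through untouched.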

Add encoding key to TorrentSpider

cb07550

jxxghp merged commit fd5fbd7 into jxxghp:v2 Dec 21, 2024

jxxghp added a commit that referenced this pull request Dec 29, 2024

rollback #3584

c49e79d
