Some expected pages are missing from "Mathematical special functions" in the CHM provided in the release #28

Open
tatianyi opened this issue Jan 10, 2024 · 3 comments
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@tatianyi

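Taking the "beta function" as an example, opening this link shows the error "Can't reach this page. Make sure the web address //ieframe.dll/dnserrordiagoff.htm# is correct".
The pages describing these functions are also absent from the table of contents on the left, so they appear not to have been integrated into the CHM correctly.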

@myfreeer added the bug and help wanted labels on Jan 14, 2024
@myfreeer
Owner

This seems to be a redirection issue.
Take the page cpp/numeric/special_math/beta as an example: the upstream website has redirected it to cpp/numeric/special_functions/beta, but in the downloaded files it appears only under special_math, not under special_functions.
Wget should download the page for both paths, since both are referenced. This could be an issue with wget's internal deduplication, where the links before and after redirection are both marked as duplicates but only the path before redirection is downloaded.
Without patching wget itself, it should be possible to parse cppreference-export-ns0,4,8,10.xml (it is exported here), find all redirections, and fix the pages manually, but that would require a lot of work.
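As a rough illustration of the "find all redirections" step, here is a minimal Node.js sketch. It assumes the export follows the usual MediaWiki XML layout, where a redirecting `<page>` carries a `<title>` and a `<redirect title="..." />` element. The helper names `findRedirects` and `decodeXml` are hypothetical, and the sample input is made up for the demo rather than taken from the real export.

```javascript
// Sketch: list redirect pairs found in a MediaWiki XML export string.
// Assumption: a redirecting <page> contains <title>...</title> and
// <redirect title="..." />. Helper names are hypothetical.
function decodeXml(text) {
  // Decode the five predefined XML entities; &amp; goes last to avoid double decoding.
  return text
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&apos;/g, "'")
    .replace(/&amp;/g, '&');
}

function findRedirects(xml) {
  const redirects = [];
  const pagePattern = /<page>([\s\S]*?)<\/page>/g;
  for (const [, body] of xml.matchAll(pagePattern)) {
    const title = /<title>([^<]*)<\/title>/.exec(body);
    const target = /<redirect\s+title="([^"]*)"\s*\/>/.exec(body);
    // Pages without a <redirect> element are ordinary pages and are skipped.
    if (title && target) {
      redirects.push({ from: decodeXml(title[1]), to: decodeXml(target[1]) });
    }
  }
  return redirects;
}

// Made-up sample input for the demo, not copied from the real export.
const sampleXml = `
<mediawiki>
  <page>
    <title>cpp/numeric/special_math/beta</title>
    <redirect title="cpp/numeric/special_functions/beta" />
  </page>
  <page>
    <title>cpp/numeric/special_functions/beta</title>
    <revision><text>page content</text></revision>
  </page>
</mediawiki>`;

console.log(findRedirects(sampleXml));
```

A regular expression is enough for a sketch, but the real export is large, so a streaming XML parser would probably be a better fit for actual use.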

@myfreeer
Owner

myfreeer commented Jan 14, 2024

After searching the wget mailing list, I found that --trust-server-names changes how wget handles redirected URLs.
I built an HTTP server in Node.js as a demo; see the code below for details:

Test code for server

const http = require('http');

// Routes:
// `/`: a page with links to `/301.html`, `/302.html`, `/subpage.html`, and imgs with src of `/301.gif`, `/302.gif`, `/image.gif`
// `/301.html`: HTTP 301 redirect to `/subpage.html`
// `/302.html`: HTTP 302 redirect to `/subpage.html`
// `/301.gif`: HTTP 301 redirect to `/image.gif`
// `/302.gif`: HTTP 302 redirect to `/image.gif`
// `/subpage.html`: an HTML page with arbitrary content
// `/image.gif`: a tiny image
const server = http.createServer((req, res) => {
  const { url } = req;

  if (url === '/') {
    // Route: /
    res.setHeader('Content-Type', 'text/html');
    res.write('<!DOCTYPE html><html>');
    res.write('<body>');
    res.write('<p>Click on the links below:</p>');
    res.write('<a href="/301.html">301 Redirect</a><br>');
    res.write('<a href="/302.html">302 Redirect</a><br>');
    res.write('<a href="/subpage.html">subpage.html</a><br>');
    res.write('<img src="/301.gif" style="width: 30px" alt="301 Redirect Image"><br>');
    res.write('<img src="/302.gif" style="width: 30px" alt="302 Redirect Image"><br>');
    res.write('<img src="/image.gif" style="width: 30px" alt="Image"><br>');
    res.write('</body>');
    res.write('</html>');
    res.end();
  } else if (url === '/301.html') {
    // Route: /301.html
    res.writeHead(301, { Location: '/subpage.html' });
    res.end();
  } else if (url === '/302.html') {
    // Route: /302.html
    res.writeHead(302, { Location: '/subpage.html' });
    res.end();
  } else if (url === '/301.gif') {
    // Route: /301.gif
    res.writeHead(301, { Location: '/image.gif' });
    res.end();
  } else if (url === '/302.gif') {
    // Route: /302.gif
    res.writeHead(302, { Location: '/image.gif' });
    res.end();
  } else if (url === '/subpage.html') {
    // Route: /subpage.html
    res.setHeader('Content-Type', 'text/html');
    res.write('<html>');
    res.write('<body>');
    res.write('<h1>This is subpage.html</h1>');
    res.write('<p>Whatever content you want goes here.</p>');
    res.write('</body>');
    res.write('</html>');
    res.end();
  } else if (url === '/image.gif') {
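    // Route: /image.gif
    // Note: the payload is actually a 1x1 PNG (the base64 starts with the PNG signature)
    // served with a GIF content type; this does not matter for the redirect test.
    res.setHeader('Content-Type', 'image/gif');
    res.write(Buffer.from('iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==', 'base64'));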
    res.end();
  } else {
    // Handle 404 Not Found
    res.writeHead(404, { 'Content-Type': 'text/plain' });
    res.end('404 Not Found');
  }
});

const port = 3000;
server.listen(port, () => {
  console.log(`Server is listening on port ${port}`);
});

Here are the results of testing with wget:

| --trust-server-names | Path before redirection | Path after redirection |
| --- | --- | --- |
| N | Y | N |
| Y | N | Y |
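The table can be restated as a tiny model. This is only a sketch of the behaviour observed in this test, not wget's actual implementation, and `savedPath` and its parameters are hypothetical names chosen for illustration.

```javascript
// Toy model of the observed results only; this is not how wget is implemented.
// `savedPath`, `requestedPath`, and `finalPath` are hypothetical names.
function savedPath(requestedPath, finalPath, trustServerNames) {
  // Without --trust-server-names the file is named after the requested URL;
  // with it, the file is named after the URL reached after redirection.
  return trustServerNames ? finalPath : requestedPath;
}

console.log(savedPath('/301.html', '/subpage.html', false)); // /301.html
console.log(savedPath('/301.html', '/subpage.html', true));  // /subpage.html
```

Either way only one of the two names ends up on disk for a redirected URL, which is why a single run cannot produce both paths.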

It seems that wget cannot keep both the path before redirection and the path after redirection in the same run. Making two runs would greatly increase download time and server load, so the better option is still the one mentioned above: parsing the XML export and fixing the redirected pages manually.
Edit: links inside the HTML seem to be fixed by --convert-links, so this should be worth a try before the next release.

Raw wget log

# wget -V
GNU Wget 1.21.4 built on cygwin.

-cares +digest -gpgme +https +ipv6 +iri +large-file -metalink +nls
+ntlm +opie +psl +ssl/gnutls

Wgetrc:
    /etc/wgetrc (system)
Locale:
    /usr/share/locale
Compile:
    gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc"
    -DLOCALEDIR="/usr/share/locale" -I. -I../lib -I../lib -DNDEBUG
    -march=nocona -msahf -mtune=generic -O2 -pipe
Link:
    gcc -DNDEBUG -march=nocona -msahf -mtune=generic -O2 -pipe -pipe
    -lpcre2-8 -luuid -lidn2 -lnettle -lgnutls -lz -lpsl ../lib/libgnu.a
    -liconv -lintl /usr/lib/libunistring.dll.a

Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://www.gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Originally written by Hrvoje Niksic <hniksic@xemacs.org>.
Please send bug reports and questions to <bug-wget@gnu.org>.

# wget  --adjust-extension --page-requisites --convert-links --force-directories --recursive --level=10 -e robots=off --span-hosts --timeout=5 --tries=180 --no-verbose --retry-connrefused --waitretry=2 --read-timeout=13  --page-requisites  http://127.0.0.1:3000/
2024-01-14 12:47:04 URL:http://127.0.0.1:3000/ [393] -> "127.0.0.1+3000/index.html" [1]
2024-01-14 12:47:04 URL:http://127.0.0.1:3000/subpage.html [98] -> "127.0.0.1+3000/301.html" [1]
2024-01-14 12:47:04 URL:http://127.0.0.1:3000/subpage.html [98] -> "127.0.0.1+3000/302.html" [1]
2024-01-14 12:47:04 URL:http://127.0.0.1:3000/image.gif [70] -> "127.0.0.1+3000/301.gif" [1]
2024-01-14 12:47:04 URL:http://127.0.0.1:3000/image.gif [70] -> "127.0.0.1+3000/302.gif" [1]
FINISHED --2024-01-14 12:47:04--
Total wall clock time: 0.02s
Downloaded: 5 files, 729 in 0.002s (286 KB/s)

# wget  --adjust-extension --page-requisites --convert-links --force-directories --recursive --level=10 -e robots=off --span-hosts --timeout=5 --tries=180 --no-verbose --retry-connrefused --waitretry=2 --read-timeout=13  --trust-server-names http://127.0.0.1:3000/
2024-01-14 12:47:09 URL:http://127.0.0.1:3000/ [393] -> "127.0.0.1+3000/index.html" [1]
2024-01-14 12:47:09 URL:http://127.0.0.1:3000/subpage.html [98] -> "127.0.0.1+3000/subpage.html" [1]
2024-01-14 12:47:09 URL:http://127.0.0.1:3000/subpage.html [98] -> "127.0.0.1+3000/subpage.html" [1]
2024-01-14 12:47:09 URL:http://127.0.0.1:3000/image.gif [70] -> "127.0.0.1+3000/image.gif" [1]
2024-01-14 12:47:09 URL:http://127.0.0.1:3000/image.gif [70] -> "127.0.0.1+3000/image.gif" [1]
FINISHED --2024-01-14 12:47:09--
Total wall clock time: 0.02s
Downloaded: 5 files, 729 in 0.002s (308 KB/s)
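
To compare runs like the two above without reading the log by eye, the --no-verbose lines can be parsed into records. The sketch below assumes every line has the shape shown in this log (timestamp, `URL:<final url>`, size, `-> "<saved file>"`, count). `parseWgetLog` and `filesByUrl` are hypothetical helper names, and the sample lines are copied from the first run.

```javascript
// Sketch: parse `wget --no-verbose` log lines into { url, file } records.
// Assumes the line shape seen in the log above; helper names are hypothetical.
const LOG_LINE = /^\S+ \S+ URL:(\S+) \[[^\]]*\] -> "([^"]+)" \[\d+\]$/;

function parseWgetLog(logText) {
  const records = [];
  for (const line of logText.split(/\r?\n/)) {
    const match = LOG_LINE.exec(line.trim());
    // Lines that do not match (prompts, summaries, blanks) are ignored.
    if (match) records.push({ url: match[1], file: match[2] });
  }
  return records;
}

// Group saved file names by the final URL that wget reports.
function filesByUrl(records) {
  const grouped = new Map();
  for (const { url, file } of records) {
    if (!grouped.has(url)) grouped.set(url, new Set());
    grouped.get(url).add(file);
  }
  return grouped;
}

// Three lines copied from the first run (without --trust-server-names).
const sampleLog = [
  '2024-01-14 12:47:04 URL:http://127.0.0.1:3000/ [393] -> "127.0.0.1+3000/index.html" [1]',
  '2024-01-14 12:47:04 URL:http://127.0.0.1:3000/subpage.html [98] -> "127.0.0.1+3000/301.html" [1]',
  '2024-01-14 12:47:04 URL:http://127.0.0.1:3000/subpage.html [98] -> "127.0.0.1+3000/302.html" [1]',
].join('\n');

for (const [url, files] of filesByUrl(parseWgetLog(sampleLog))) {
  console.log(url, '=>', [...files].join(', '));
}
```

A final URL that maps to file names different from its own path indicates a redirect whose content was stored under the pre-redirect name.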

@myfreeer
Owner

20240915-dev has been released; compared to 2024.09, it adds the --trust-server-names option.
