Add sitemap.xml (site map) parsing; it is read in much the same way as robots.txt #143

Open
sairson opened this issue Mar 17, 2023 · 1 comment
Labels
feature New feature or request

sairson commented Mar 17, 2023

Add two structs for parsing the contents of sitemap.xml:

type Sitemap struct {
	URLs    []LocUrl `xml:"url"`
	Sitemap []LocUrl `xml:"sitemap"`
}

type LocUrl struct {
	Loc string `xml:"loc"`
}
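
To illustrate how these two structs behave, here is a minimal, self-contained sketch that decodes both a `<urlset>` document and a `<sitemapindex>` document with the standard `encoding/xml` package. The helper name `parseSitemap` and the sample documents are assumptions made for this example and are not part of the project. Because the structs declare no `XMLName`, the same type can accept either root element.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Sitemap and LocUrl are copied from the proposal above.
type Sitemap struct {
	URLs    []LocUrl `xml:"url"`
	Sitemap []LocUrl `xml:"sitemap"`
}

type LocUrl struct {
	Loc string `xml:"loc"`
}

// parseSitemap is a hypothetical helper: it decodes a sitemap body
// into the Sitemap struct and returns any XML decoding error.
func parseSitemap(body string) (Sitemap, error) {
	var sm Sitemap
	err := xml.NewDecoder(strings.NewReader(body)).Decode(&sm)
	return sm, err
}

// Sample documents, invented for illustration.
const sampleURLSet = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>`

const sampleIndex = `<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
</sitemapindex>`

func main() {
	for _, body := range []string{sampleURLSet, sampleIndex} {
		sm, err := parseSitemap(body)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("urls=%d sitemaps=%d\n", len(sm.URLs), len(sm.Sitemap))
	}
}
```

A page URL list fills `URLs`, while a sitemap index fills `Sitemap`, which is why the proposal walks both slices.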

Then, once the response body has been fetched, decode it and emit a request for every entry:

sitemap := Sitemap{}
if err := xml.NewDecoder(strings.NewReader(resp.ToText())).Decode(&sitemap); err != nil {
	return result, errors.Wrap(err, "could not decode xml")
}

// Compile the pattern once instead of on every loop iteration.
locPattern := regexp.MustCompile(`(/.+)`)

// <url> and <sitemap> entries are handled identically, so walk both lists with one loop body.
for _, entries := range [][]LocUrl{sitemap.URLs, sitemap.Sitemap} {
	for _, v := range entries {
		url, err := urllib.GetURL(locPattern.FindString(strings.Trim(v.Loc, " \t\n")), *navRequest.URL)
		if err != nil {
			// Skip entries whose <loc> cannot be turned into a URL.
			continue
		}
		request := parse.GetRequest(enums.GET, url)
		request.Source = enums.FromSitemap
		_ = callback(request)
		result = append(result, request)
	}
}
return result, nil

sairson commented Mar 17, 2023

Qianlitp added the feature (New feature or request) label Jul 3, 2023