Commit: another try
metalwarrior665 committed Sep 18, 2024
1 parent 58f576f commit 9d531da
Showing 4 changed files with 3 additions and 5 deletions.
@@ -58,7 +58,7 @@ For most sitemaps, you can make a simple HTTP request and parse the downloaded XML

## [](#how-to-parse-urls-from-sitemaps) How to parse URLs from sitemaps

- The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `<loc>` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](/academy/tutorials/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps.
+ The easiest part is to parse the actual URLs from the sitemap. The URLs are usually listed under `<loc>` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. /about, /contact, or various special category sections). [This article](/academy/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps.
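
As an aside, a minimal sketch of that kind of parsing might look like the following. The sitemap URL and the filter regex are assumptions made up for this example, not taken from the linked article.

```js
import * as cheerio from 'cheerio';

// Hypothetical sitemap URL - replace with the sitemap of the site you are targeting.
const SITEMAP_URL = 'https://example.com/sitemap.xml';

const response = await fetch(SITEMAP_URL);
const xml = await response.text();

// Parse the XML and collect every URL listed under a <loc> tag.
const $ = cheerio.load(xml, { xmlMode: true });
const urls = $('loc')
    .map((_, el) => $(el).text().trim())
    .get()
    // Skip URLs we don't want to crawl, e.g. /about or /contact pages.
    .filter((url) => !/\/(about|contact)(\/|$)/.test(url));

console.log(`Extracted ${urls.length} URLs from the sitemap`);
```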

## [](#using-crawlee) Using Crawlee

@@ -282,6 +282,6 @@ await crawler.addRequests(requestsToEnqueue);

## Summary {#summary}

- And that's it. We have an elegant solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](/academy/platform/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had.
+ And that's it. We have an elegant solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data](/academy/expert_scraping_with_apify/saving_useful_stats.md). This will let you know what filters you went through and how many products each of them had.
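
For illustration only, tracking such stats could be sketched roughly like this; the `STATS` key and the state shape are assumptions, not the linked lesson's actual implementation.

```js
import { Actor } from 'apify';

await Actor.init();

// Hypothetical stats object - how many products each filter combination yielded.
const stats = { filters: {} };

// Call this from the request handler whenever a filter page has been processed.
const recordFilter = (filterName, productCount) => {
    stats.filters[filterName] = (stats.filters[filterName] ?? 0) + productCount;
};

// Periodically persist the stats to the default key-value store so they can be
// inspected during and after the run.
setInterval(() => Actor.setValue('STATS', stats), 10_000);
```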

Check out the [full code example](https://github.com/apify-projects/apify-extra-library/tree/master/examples/crawler-with-filters).
2 changes: 0 additions & 2 deletions sources/academy/webscraping/advanced_web_scraping/index.md
@@ -30,5 +30,3 @@ If you've managed to follow along with all of the courses prior to this one, the
## [](#first-up) First up

First, we will explore the [advanced crawling section](academy/webscraping/advanced-web-scraping/advanced-crawling), which will help us find all pages or products on the website.


@@ -16,7 +16,7 @@ import TabItem from '@theme/TabItem';

If you're trying to [collect data](../executing_scripts/extracting_data.md) on a website that has millions, thousands, or even hundreds of results, it is very likely that they are paginating their results to reduce strain on their back-end as well as on the users loading and rendering the content.

- ![Amazon pagination](/academy/webscraping/advanced_web_scraping/crawling/images/pagination.png)
+ ![Amazon pagination](/academy/advanced_web_scraping/crawling/images/pagination.png)

## Page number-based pagination {#page-number-based-pagination}
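
As a generic illustration (not this lesson's actual code), enqueueing page number-based pagination with Crawlee might be sketched like this; the `.pagination` selector, the `?page=` URL pattern, and the `FIRST_PAGE` label are assumptions for the example.

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, crawler }) {
        // Extract results from the current page here...

        // On the first page only, read the total page count and enqueue the rest.
        if (request.userData.label === 'FIRST_PAGE') {
            const totalPages = Number($('.pagination li').last().text().trim()) || 1;
            const requests = [];
            for (let page = 2; page <= totalPages; page++) {
                requests.push({ url: `${request.url}?page=${page}` });
            }
            await crawler.addRequests(requests);
        }
    },
});

await crawler.run([
    { url: 'https://example.com/products', userData: { label: 'FIRST_PAGE' } },
]);
```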

