feat(academy): add advanced crawling section with sitemaps and search #1217
base: master
Changes from all commits
fada6bf
ddd3a3b
a6f7150
58f576f
9d531da
@@ -0,0 +1,77 @@
---
title: Crawling sitemaps
description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
menuWeight: 2
> Review comment: Is this something that is custom created by Apify? I haven't seen this anywhere else
paths:
> Review comment: isn't this supposed to be
- advanced-web-scraping/crawling/crawling-sitemaps
---
In the previous lesson, we learned about the utility (and dangers) of crawling sitemaps. In this lesson, we will go in depth on how to crawl sitemaps.
We will look at the following topics:

- How to find sitemap URLs
- How to set up HTTP requests to download sitemaps
- How to parse URLs from sitemaps
- Using Crawlee to get all URLs in a few lines of code
## [](#how-to-find-sitemap-urls) How to find sitemap URLs
Sitemaps are commonly restricted to a maximum of 50,000 URLs, so there will usually be a whole list of them. There can be a master sitemap (a sitemap index) containing the URLs of all other sitemaps, or the sitemaps might simply be listed in robots.txt and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc.
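For illustration, a sitemap index referencing the individual sitemaps (placeholder URLs) typically looks something like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://example.com/sitemap1.xml</loc>
    </sitemap>
    <sitemap>
        <loc>https://example.com/sitemap2.xml</loc>
    </sitemap>
</sitemapindex>
```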
### [](#google) Google
You can try your luck on Google by searching for `site:example.com sitemap.xml` or `site:example.com sitemap.xml.gz` and see if you get any results. If you do, you can try to download the sitemap and see if it contains any useful URLs. The success of this approach depends on the website telling Google to index the sitemap file itself, which is rather uncommon.
### [](#robots-txt) robots.txt
If the website has a robots.txt file, it often contains sitemap URLs, usually listed under the `Sitemap:` directive.
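For example, a robots.txt file that advertises its sitemaps (placeholder domain and paths) might contain:

```text
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap_index.xml
Sitemap: https://example.com/media/sitemap.xml
```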
### [](#common-url-paths) Common URL paths
You can try to iterate over common URL paths like:
- /sitemap.xml
- /product_index.xml
- /product_template.xml
- /sitemap_index.xml
- /sitemaps/sitemap_index.xml
- /sitemap/product_index.xml
- /media/sitemap.xml
- /media/sitemap/sitemap.xml
- /media/sitemap/index.xml
Also make sure to test the list with `.gz`, `.tar.gz` and `.tgz` extensions and with capitalized words (e.g. `/Sitemap_index.xml.tar.gz`).
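If you want to automate this check, here is a minimal sketch (not part of the lesson's official code) that probes a few candidate paths and extensions with plain `fetch` (Node.js 18+) and logs the variations that respond successfully. The candidate lists are just examples:

```javascript
const BASE_URL = 'https://example.com'; // placeholder domain
const CANDIDATE_PATHS = ['/sitemap.xml', '/sitemap_index.xml', '/sitemaps/sitemap_index.xml'];
const EXTENSIONS = ['', '.gz', '.tar.gz', '.tgz'];

for (const path of CANDIDATE_PATHS) {
    for (const ext of EXTENSIONS) {
        const url = `${BASE_URL}${path}${ext}`;
        // HEAD keeps the requests cheap; some servers only answer GET.
        const response = await fetch(url, { method: 'HEAD' }).catch(() => null);
        if (response?.ok) console.log(`Possible sitemap: ${url}`);
    }
}
```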
Some websites also provide an HTML version to help indexing bots find new content. Those include:
- /sitemap
- /category-sitemap
- /sitemap.html
- /sitemap_index
Apify provides the [Sitemap Sniffer actor](https://apify.com/vaclavrut/sitemap-sniffer) (open-source code) that scans these URL variations automatically for you so that you don't have to check them manually.
## [](#how-to-set-up-http-requests-to-download-sitemaps) How to set up HTTP requests to download sitemaps
> Review comment: if anchors do not differ from headings then these are unnecessary from what I remember
For most sitemaps, you can make a simple HTTP request and parse the downloaded XML text with Cheerio (or just use `CheerioCrawler`). Some sitemaps are compressed and have to be streamed and decompressed. The code for that is fairly complicated, so we recommend just [using Crawlee](#using-crawlee), which handles streamed and compressed sitemaps by default.
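As a rough sketch (assuming a gzip-compressed sitemap at a placeholder URL and Node.js 18+ for `fetch`), downloading and decompressing one manually could look like this; Crawlee, described below, handles all of this for you:

```javascript
import { gunzipSync } from 'node:zlib';

const response = await fetch('https://example.com/sitemap.xml.gz'); // placeholder URL

// If the server sets Content-Encoding: gzip, fetch already decompresses the body.
// Here we assume a raw .gz file, so we decompress it ourselves.
const compressed = Buffer.from(await response.arrayBuffer());
const xml = gunzipSync(compressed).toString('utf-8');

console.log(xml.slice(0, 500)); // preview the first part of the XML
```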
## [](#how-to-parse-urls-from-sitemaps) How to parse URLs from sitemaps
The easiest part is parsing the actual URLs from the sitemap. The URLs are usually listed under `<loc>` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful: the sitemap might contain other URLs that you don't want to crawl (e.g. `/about`, `/contact`, or various special category sections). [This article](/academy/node-js/scraping-from-sitemaps.md) provides code examples for parsing sitemaps.
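As a minimal sketch (assuming a plain, uncompressed sitemap at a placeholder URL), extracting and filtering the `<loc>` values with Cheerio could look like this:

```javascript
import * as cheerio from 'cheerio';

const response = await fetch('https://example.com/sitemap.xml'); // placeholder URL
const xml = await response.text();

// xmlMode tells Cheerio to parse the document as XML rather than HTML.
const $ = cheerio.load(xml, { xmlMode: true });

const urls = $('loc')
    .map((_, el) => $(el).text().trim())
    .get()
    // Drop pages we don't want to crawl, e.g. /about or /contact.
    .filter((url) => !url.includes('/about') && !url.includes('/contact'));

console.log(`Found ${urls.length} URLs in the sitemap`);
```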
## [](#using-crawlee) Using Crawlee
Fortunately, you don't have to worry about any of the above steps if you use [Crawlee](https://crawlee.dev), which has rich traversing and parsing support for sitemaps. Crawlee can traverse nested sitemaps, download and parse compressed sitemaps, and extract URLs from them. You can get all URLs in a few lines of code:
> Review comment: could we switch it to ```js? Sometime back we changed this for consistency across Academy & Platform docs. I'll add this info to contributing guidelines.

```javascript
import { RobotsFile } from 'crawlee';

const robots = await RobotsFile.find('https://www.mysite.com');

const allWebsiteUrls = await robots.parseUrlsFromSitemaps();
```
## [](#next) Next up
That's all we need to know about sitemaps for now. Let's dive into a much more interesting topic - search, filters, and pagination.
@@ -0,0 +1,63 @@
---
title: Sitemaps vs search
description: Learn how to extract all of a website's listings even if they limit the number of results pages.
menuWeight: 1
paths:
- advanced-web-scraping/crawling/sitemaps-vs-search
---
The core crawling problem comes down to ensuring that we reliably find all detail pages on the target website or inside its categories. This is trivial for small sites: we just open the home page or category pages and paginate to the end, as we did in the Web Scraping for Beginners course.
> Review comment: I'm not entirely sure if

Unfortunately, **most modern websites restrict pagination** to only somewhere between 1,000 and 10,000 products. Solving this problem might seem relatively straightforward at first, but there are multiple hurdles that we will explore in this lesson.
There are two main approaches to solving this problem:
- Extracting all page URLs from the website's **sitemap**.
- Using **categories, search and filters** to split the website so we get under the pagination limit.

Both of these approaches have their pros and cons, so the best solution is to **use both and combine the results**. Here we will learn why.
## Pros and cons of sitemaps

A sitemap is usually a simple XML file that contains a list of all pages on the website. Sitemaps are created and maintained mainly for search engines like Google, to help ensure that the website gets fully indexed there. They are commonly located at URLs like `https://example.com/sitemap.xml` or `https://example.com/sitemap.xml.gz`. We will get to work with sitemaps in the next lesson.
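For illustration, a minimal sitemap (with placeholder product URLs) looks something like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://example.com/products/red-shoes</loc>
        <lastmod>2023-01-15</lastmod>
    </url>
    <url>
        <loc>https://example.com/products/blue-shoes</loc>
    </url>
</urlset>
```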
### Pros

- **Quick to set up** - The logic to find all sitemaps and extract all URLs is usually simple and can be done in a few lines of code.
- **Fast to run** - You only need to run a single request for each sitemap that contains up to 50,000 URLs. This means you can get all the URLs in a matter of seconds.
- **Usually complete** - Websites have an incentive to keep their sitemaps up to date as they are used by search engines. This means that they usually contain all pages on the website.

### Cons

- **Does not directly reflect the website** - There is no way to ensure that all pages on the website are in the sitemap. The sitemap can also contain pages that were already removed and will return 404s. This is a major downside of sitemaps and prevents us from using them as the only source of URLs.
- **Updated in intervals** - Sitemaps are usually not updated in real time. This means that you might miss some pages if you scrape them too soon after they were added to the website. Common update intervals are 1 day or 1 week.
- **Hard to find or unavailable** - Sitemaps are not always trivial to locate. They can be deployed on a CDN with unpredictable URLs. Sometimes they are not available at all.
- **Streamed, compressed, and archived** - Sitemaps are often streamed and archived with .tgz extensions and compressed with gzip. This means that you cannot use default HTTP client settings and must handle these cases with extra code. Fortunately, we will get to this in the next lesson.
## Pros and cons of categories, search, and filters

This approach means traversing the website like a normal user would, by going through categories and setting up different filters, ranges, and sorting options. The goal is to traverse it in a way that ensures we cover all categories/ranges where products can be located, while staying under the pagination limit for each of them.

The pros and cons of this approach are pretty much the opposite of the sitemaps approach.
### Pros

- **Directly reflects the website** - With most scraping use cases, we want to analyze the website as regular users see it. By going through the intended user flow, we ensure that we are getting the same pages as the users.
- **Updated in real time** - The website is updated in real time, so we can be sure that we are getting all pages.
- **Often contains detailed data** - While sitemaps are usually just a list of URLs, categories, searches, and filters often contain additional data like product names, prices, categories, etc., especially if available via a JSON API. This means that we can sometimes get all the data we need without going to the detail pages.

### Cons

- **Complex to set up** - The logic to traverse the website is usually more complex and can take a lot of time to get right. We will get to this in the next lessons.
- **Slow to run** - The traversing can require a lot of requests. Some filters or categories will have products we already found.
- **Not always complete** - Sometimes the combination of filters and categories will not allow us to ensure we have all products. This is especially painful for sites where we don't know the exact number of products we are looking for. The framework we will build in the next lessons will help us with this.
## Do we know how many products there are?

Fortunately, most websites list the total number of detail pages somewhere. It might be displayed on the home page or in search results, or be provided in the API response. We just need to make sure that this number really represents the whole site or category we are looking to scrape. By knowing the total number of products, we can tell whether our approach to scraping everything succeeded or whether we still need to refine it.

Unfortunately, some sites like Amazon do not provide exact numbers. In this case, we have to work with what they give us and put even more effort into making our scraping logic accurate. We will tackle this in the next lessons as well.
## [](#next) Next up

First, we will look into the easier approach: [sitemap crawling](./crawling-sitemaps.md). Then we will go through all the intricacies of category, search, and filter crawling, and build up a generic framework that we can use on any website. Finally, we will combine the results of both approaches and set up monitoring and persistence to ensure we can run this regularly without any manual controls.
@@ -1,21 +1,32 @@
---
title: Advanced web scraping
description: Take your scrapers to the next level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers.
sidebar_position: 6
description: Take your scrapers to a production-ready level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers.
menuWeight: 6
category: web scraping & automation
slug: /advanced-web-scraping
paths:
- advanced-web-scraping
---
# Advanced web scraping
> Review comment: If the title in frontmatter does not differ from the h1, the h1 is unnecessary; it will be automatically generated by Docusaurus.
**Take your scrapers to the next level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers.**
In the [**Web scraping for beginners**](/academy/web-scraping-for-beginners) course, we learned the basics required to create a scraper. In the following courses, we learned more about specific practices and techniques that will help us solve most of the problems we will face.

---
In this course, we will take all of that knowledge, add a few more advanced concepts, and apply them to learn how to build a production-ready web scraper.
## [](#what-does-production-ready-mean) What does production-ready mean?
> Review comment: If I remember correctly, headers should not use punctuation.

To scrape large and complex websites, we need to scale two essential aspects of the scraper: crawling and data extraction. Big websites can have millions of pages, and the data we want to extract requires more sophisticated parsing techniques than just selecting elements by CSS selectors or using APIs as they are.
<!--
The following sections will cover the core concepts that will ensure that your scraper is production-ready:
The advanced crawling section will cover how to ensure we find all pages or products on the website.
- The advanced data extraction will cover how to efficiently extract data from a particular page or API.
-->
In this course, we'll be tackling some of the most challenging and advanced web-scraping cases, such as mobile-app scraping, scraping sites with limited pagination, and handling large-scale cases where millions of items are scraped. Are **you** ready to take your scrapers to the next level?
We will also touch on monitoring, performance, anti-scraping protections, and debugging.

If you've managed to follow along with all of the courses prior to this one, then you're more than ready to take these upcoming lessons on 😎
## First up {#first-up}
## [](#first-up) First up
This course's [first lesson](./scraping_paginated_sites.md) dives head-first into one of the most valuable skills you can have as a scraper developer: **Scraping paginated sites**.
First, we will explore the [advanced crawling section](academy/webscraping/advanced-web-scraping/advanced-crawling), which will help us find all pages or products on the website.
> Review comment: This feels like unnecessary dating? When is "recently"? Also I think this could work better as admonitions instead of blockquote.