Frequently Asked Questions

How does Zimit work?

At its heart, Zimit is a web scraper (or web spider). It starts a real browser and explores a website mostly like a user would.

The only difference is that this is done in a more systematic manner, with what is called a crawler (for the curious minds, the crawler used by Zimit is developed by the Webrecorder team and named Browsertrix Crawler).

The crawler is based on a processing queue of URLs that are to be crawled. It starts with one URL (page) in the queue, named the seed page, and normally stops when the processing queue is empty.

For every page, the crawler will:

  • load the page in its browser;
  • capture all resources fetched when loading the page, i.e.
    • HTML, CSS and JS code, but also all assets like images and fonts;
    • these resources are then saved in a set of WARC files, a standard archive format for web content (see https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/);
  • search for links to other pages;
  • add every link that is both new (i.e. not already processed) and matching the user's exploration criteria to the processing queue.

The process repeats over and over until all pages have been explored and all content has been saved to WARC files.

Once the crawler completes its run, Zimit converts its collection of WARC files into a single ZIM archive. This is done with a tool developed by openZIM named warc2zim. If you ignore all the subtleties, this step really is just a file format conversion.

What is important to configure in Zimit?

Two things are essential for proper Zimit operation:

  1. the seed URL, i.e. where you want to start the website exploration
  • this is obviously mandatory, otherwise Zimit won't know what you want to archive;
  • this will be the homepage of your ZIM;
  2. the exploration criteria
  • while important, a default value is proposed, so it is not mandatory, but you will frequently have to customize it (see the example below).
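
For illustration, a minimal invocation with just a seed URL could look like the following; the domain is a placeholder and --name (the name of the produced ZIM) is shown because it is usually required, so adapt both to your needs:

# Minimal crawl: start from the seed URL and rely on the default exploration criteria
zimit --url https://www.example.com/ --name example_com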

How do I set exploration criteria?

The exploration criteria are very important since they tell the scraper which pages must be explored, and where to stop.

There are essentially three ways to configure the exploration criteria:

  1. the --include and --exclude regular expressions;
  2. the --scopeType value;
  3. rely on default settings.

All details are explained in the Browsertrix Crawler documentation.

The --include and --exclude regular expressions

These regular expressions are what the crawler really uses to decide which pages are to be crawled.

While very powerful, configuring them properly requires some familiarity with regular expressions (for an introduction, check out https://www.freecodecamp.org/news/practical-regex-guide-with-real-life-examples).

The crawler can use a list of regular expressions, but for now Zimit only supports one value for --include and one value for --exclude.

Every time the crawler finds a new page URL to crawl (not assets, see below for details), it performs the following checks (an example follows this list):

  • does the URL match at least one --include regular expression?
    • if no, the page is not considered for later exploration;
    • if yes, does the URL match at least one --exclude regular expression?
      • if no, the page is added to the processing queue;
      • if yes, the page is not considered for later crawling.
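
For instance, to crawl everything under a site's /en/ section while skipping its print views, an invocation could look like the following (the domain, paths and --name value are placeholders chosen for this example):

# Crawl pages under /en/ but skip any URL containing /print/
zimit --url https://www.example.com/en/ \
  --name example_en \
  --include "https://www\.example\.com/en/.*" \
  --exclude ".*/print/.*"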

The --scopeType

Scope is a simplification that avoids writing regular expressions for situations that are encountered on a regular basis. It is, in fact, transformed into an --include regular expression by the crawler.

Default setting: prefix

If you do not specify any --scopeType, --include or --exclude regular expression, then Zimit defaults to using a prefix scope.
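
As an illustration (the exact regular expression generated under the hood may differ, and all values are placeholders), a prefix scope keeps the crawl under the directory of the seed URL, roughly as if an equivalent --include had been written by hand:

# These two commands behave roughly the same way
zimit --url https://www.example.com/docs/index.html --name example_docs --scopeType prefix
zimit --url https://www.example.com/docs/index.html --name example_docs --include "https://www\.example\.com/docs/.*"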

Is it possible to use --scopeType at the same time as --include / --exclude?

The scope is just a simplification to spare users from having to write regular expressions; under the hood it is turned into an --include regular expression, as the crawler only uses include and exclude regular expressions. Since the crawler can support an unlimited number of --include and --exclude expressions, it is possible to simultaneously use the scope, include and exclude parameters (see the example below).
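
For example, one could keep the prefix behaviour through the scope and still exclude a specific area of the site (all values below are placeholders):

# Prefix scope plus an additional exclusion
zimit --url https://www.example.com/docs/ \
  --name example_docs \
  --scopeType prefix \
  --exclude ".*/changelog/.*"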

I have many assets hosted on external websites, do I adapt the exploration criteria?

You probably should not adapt the exploration criteria for images and other assets hosted on external websites / domain names. Remember that the exploration criteria only specify which links or pages have to be visited. Once a page has been crawled, every asset needed for its display is automatically loaded, no matter where it is located.

The only exception is assets which are not displayed (and hence not loaded) on the page, but only linked to. For instance, links to PDFs are usually not considered page resources and are hence not loaded. You will have to configure the crawler to explore them explicitly (see the example below).

A basic rule of thumb (a bit simplified, but mostly correct) is that every asset referenced in a src attribute will be automatically fetched, while assets referenced in a href attribute have to be explored and hence added to the exploration criteria.
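
As an illustration of the href case (the domain and paths are placeholders, and whether such a pattern fits your site depends on where the linked files live), a single --include regular expression can cover both the pages and the linked PDFs:

# Pages live under /docs/, linked PDFs under /files/ (placeholder values)
zimit --url https://www.example.com/docs/ \
  --name example_docs \
  --include "https://www\.example\.com/(docs/.*|files/.*\.pdf)"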

How may I hide some non-working functionality ?

It is quite common to have some parts of a website not working inside the ZIM because they rely on an external service. Search functions are the typical example: they usually do not work because they have to call an online service, which obviously cannot be crawled and pushed inside a ZIM. You may want to hide such input boxes and similar elements.

This is possible through the use of a custom CSS stylesheet which is added as an override to every HTML page. The CSS selectors of this file must hence be specific enough to modify only what you intend to modify. Usually the modification simply consists in adding display: none; for every block which has to be hidden.

The custom style sheet can then be passed to zimit / warc2zim with the --custom-css CLI flag, either as a path to a local file (present on the machine running the scraper) or a URL to an online location.

Sample CSS stylesheet:

div.searchbox {
    display: none;
}
div.flag {
    display: none;
}
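
Assuming the stylesheet above is saved as hide.css (a file name chosen for this example), it could be passed to the scraper like this; a URL to an online copy of the file works as well:

# Hide the unwanted blocks on every page of the ZIM
zimit --url https://www.example.com/ --name example_com --custom-css ./hide.css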

What is the best solution to create a custom Cascading Style Sheet (CSS)?

Use a dedicated Web browser extension like Stylus to customise the CSS live and, once it works as expected, save it locally.

Videos are not playing

If your video is not hosted on YouTube, this is a known limitation of the scraper.

If your video is hosted on YouTube and displays "Sign in to confirm you’re not a bot. This helps protect our community", this means the machine you ran zimit on has been blocked by YouTube, see https://github.com/openzim/zimit/issues/397

The ZIM is not created and logs say "Seed Page Load Error"

This is a very broad error meaning the URL you entered as "entrypoint" for the crawler process failed to load. Please double-check the URL for any typo. If the URL is correct, there could be many reasons:

  • website is running on a private network and the machine running zimit has no access to it => switch to another machine which has access to the website
  • website is protected by user/password => you need to create a login profile with the credentials stored within
  • website is protected by a WAF / anti-DDoS service / ... (Cloudflare, ...) => if you can, disable the protection for the IP of the machine running zimit
  • very slow server => consider configuring bigger delays (even if the defaults are already quite significant)
  • temporary DNS or webservice issue => retry later

Error LookupError: unknown encoding: xxxx

This error happens when the website uses, in its HTML files, an encoding which is not known to Python.

This encoding is detected either in the content headers of HTML documents or in the Content-Type HTTP header.

If you cannot fix the website (which is the recommended approach for compliance with HTML standards; Python knows mostly all standard encodings), you might want to use the advanced switches of the scraper (the best one depends on where the website has the issue):

  • Charsets to try / --charsets-to-try
  • Ignore Content Header Charsets / --ignore-content-header-charsets
  • Length of content header / --content-header-bytes-length
  • Ignore HTTP Header Charsets / --ignore-http-header-charsets

For instance, we regularly encounter websites declaring a unicode encoding, which means very little. Usually UTF-8 is actually used, but one never knows.
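
For instance, assuming the pages are in fact UTF-8 or Latin-1 (the charsets below are an assumption about the website's real encoding, and the value is shown as a comma-separated list; check the scraper help for the exact expected format), one could ignore the bogus declaration found in the HTML and let the scraper try known charsets instead:

# Which ignore switch applies depends on where the bad declaration lives (see the list above)
zimit --url https://www.example.com/ \
  --name example_com \
  --ignore-content-header-charsets \
  --charsets-to-try "UTF-8,ISO-8859-1"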

Some links are not pointing inside the ZIM but to the online website

There could be many reasons why a link points to the online website instead of inside the ZIM. Basically, the problem is that the corresponding page is not inside the ZIM, so the scraper kept the original URL; this way, readers with an internet connection (or the capability to transfer the link to an online machine) can still see what is there.

To diagnose the issue, first check whether the link pointing to the online website is supposed to be inside the ZIM at all. Quite often the scope is misconfigured (scopeType, include, exclude settings) and the crawler simply assumed this page did not have to be captured.

If the link is supposed to be there, look at the logs to check that Browsertrix Crawler properly captured the linked page without error, and that there is no specific error about this linked page in warc2zim. For this, just search for the linked page URL without its scheme in the logs (e.g. if the linked page is https://www.acme.com/folder/page.html, search for www.acme.com/folder/page.html).

If you find no specific error for this page in the logs, you might want to run a much smaller crawl again with just the specific page where you encounter the problem. The typical fool-proof configuration consists in using --url <url_of_current_page> --depth 1 --scopeType custom --include ".*", which basically says "create a ZIM of this page and all pages linked from this page". <url_of_current_page> is the URL of the page where you observe that some links are missing. Then check the ZIM and the logs again (a complete command example is given below).

Should the problem persist, open an issue with the command you just ran, details about the link which is not working, and the logs of that command. If the problem is solved, then it could have been a configuration issue on your initial crawl, a temporary issue on the upstream server, a temporary ban of your worker IP, or many other things.
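
For reference, such a diagnostic crawl could look like the following (the URL and --name value are placeholders):

# Crawl a single problematic page plus everything it links to
zimit --url https://www.example.com/folder/page.html \
  --name debug_page \
  --depth 1 \
  --scopeType custom \
  --include ".*"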