Benchmarking CLI for Scrapy

(The project is still in development.)

A command-line interface for benchmarking Scrapy that reflects real-world usage.

Why?

  • Currently, the built-in scrapy bench command just spawns a spider that aggressively crawls randomly generated links at high speed.
  • The speed thus obtained, which may be useful for comparisons, does not actually reflect a real-world scenario.
  • The actual speed also varies with the Python and Scrapy versions.

Current Features

  • Spawns a CPU-intensive spider that follows a fixed number of links on a static snapshot of the site Books to Scrape.
  • Follows a real-world scenario in which information about each book is extracted and stored in a .csv file.
  • A broad crawl benchmark that uses 1000 copies of the site Books to Scrape, dynamically generated using Twisted. The server file is present here.
  • A micro-benchmark that tests LinkExtractor() by extracting links from a collection of HTML pages (see the sketch after this list).
  • A micro-benchmark that tests extraction using CSS selectors from a collection of HTML pages.
  • A micro-benchmark that tests extraction using XPath from a collection of HTML pages.
  • Profiling of the benchmarkers with vmprof, with results uploaded to the vmprof website.
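
For illustration, here is a minimal sketch of the kind of call the linkextractor micro-benchmark exercises; the page content and URL below are made up, and the benchmark's actual code lives in this repository:

    # Extract links from an in-memory HTML response (illustrative only).
    from scrapy.http import HtmlResponse
    from scrapy.linkextractors import LinkExtractor

    html = b'<a href="http://example.com/page1">page 1</a>'
    response = HtmlResponse(url="http://example.com", body=html, encoding="utf-8")
    links = LinkExtractor().extract_links(response)
    print([link.url for link in links])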

Options

  • --n-runs option to run more than one iteration of the spider, improving precision.
  • --only_result option to display only the results.
  • --upload_result option to upload the results to a local Codespeed instance for better comparison.
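
These options can be combined in one invocation; for example, a sketch that runs the bookworm benchmark five times and prints only the results (assuming the server from the setup section below is running):

    scrapy-bench --n-runs 5 --only_result bookworm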

Spider settings

  • SCRAPY_BENCH_RANDOM_PAYLOAD_SIZE: Adds a random payload with the given size (in bytes).
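
Scrapy settings such as this one can be passed through the CLI's -s/--set option described in the Usage section below; the KEY=VALUE form here follows Scrapy's own -s convention and is an assumption, not taken from this README:

    scrapy-bench -s SCRAPY_BENCH_RANDOM_PAYLOAD_SIZE=1024 bookworm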

Setup

Server setup on Ubuntu

  • First, download a static snapshot of the website Books to Scrape. This can be done with wget:

      wget --mirror --convert-links --adjust-extension --page-requisites --no-parent \
          http://books.toscrape.com/index.html
    
  • Then link the downloaded site into /var/www/html:

      sudo ln -s `pwd`/books.toscrape.com/ /var/www/html/
    
  • nginx is required for serving the website, so it must be installed and configured. If it is, you should be able to see the site here.

  • If not, install it as follows:

      sudo apt-get update
      sudo apt-get install nginx
    
  • For the broad crawl, use the server.py file to serve copies of the local Books to Scrape snapshot, which should already be in /var/www/html (see the sketch below).
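
As a rough illustration, the broad-crawl server could be built on Twisted's static file serving; this is a minimal sketch based on this README (the path and port 8880 are taken from the client setup below), not the repository's actual server.py:

    # Sketch: serve the local snapshot over HTTP on port 8880 with Twisted.
    # server.py in this repository is the authoritative implementation.
    from twisted.internet import reactor
    from twisted.web.server import Site
    from twisted.web.static import File

    root = File("/var/www/html/books.toscrape.com")
    reactor.listenTCP(8880, Site(root))
    reactor.run()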

Server setup using Docker

  • Build the server image using Docker:

      docker build -t scrapy-bench-server -f docker/Dockerfile .
    
  • Run the Docker container:

      docker run --rm -ti --network=host scrapy-bench-server
    
  • Now you have nginx and server.py serving the sites.

Client setup

  • Add the following entries to the /etc/hosts file:

    127.0.0.1    domain1
    127.0.0.1    domain2
    127.0.0.1    domain3
    127.0.0.1    domain4
    127.0.0.1    domain5
    127.0.0.1    domain6
    127.0.0.1    domain7
    127.0.0.1    domain8
    ....................
    127.0.0.1    domain1000
    
  • This points sites such as http://domain1:8880/index.html to the original site served at http://localhost:8880/index.html.
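
Rather than writing a thousand entries by hand, a shell loop can append them; a sketch, assuming a bash-like shell and sudo access:

    for i in $(seq 1 1000); do echo "127.0.0.1    domain$i"; done | sudo tee -a /etc/hosts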

There are 130 HTML files present in sites.tar.gz, which were downloaded using download.py from the Alexa top sites list.

There are 200 HTML files present in bookfiles.tar.gz, which were downloaded using download.py from the website Books to Scrape.

The download.py spider dumps each response body to a file as Unicode. The list of top sites was taken from here.
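
A spider that dumps response bodies might look roughly like the sketch below; the class name, start URL, and file naming are illustrative assumptions, since the real download.py is not shown in this README:

    # Hypothetical body-dumping spider; not the actual download.py.
    import scrapy

    class DownloadSpider(scrapy.Spider):
        name = "download"
        start_urls = ["http://books.toscrape.com/index.html"]  # assumption

        def parse(self, response):
            # Write the response body as Unicode text, one file per page.
            path = "page-%s.html" % abs(hash(response.url))
            with open(path, "w", encoding="utf-8") as f:
                f.write(response.text)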

  • Do the following to complete the installation:

    git clone https://github.com/scrapy/scrapy-bench.git
    cd scrapy-bench/
    virtualenv env
    . env/bin/activate
    pip install --editable .
    

Usage

Usage: scrapy-bench [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...

  A benchmark suite for Scrapy.

Options:
  --n-runs INTEGER  Take multiple readings for the benchmark.
  --only_result     Display the results only.
  --upload_result   Upload the results to local Codespeed.
  --book_url TEXT   Use with the bookworm command. The URL of
                    books.toscrape.com on your local machine.
  --vmprof          Profile benchmarkers with vmprof and upload the result to
                    the web.
  -s, --set TEXT    Settings to be passed to the Scrapy command. Use with the
                    bookworm/broadworm commands.
  --help            Show this message and exit.

Commands:
  bookworm         Spider to scrape locally hosted site
  broadworm        Broad crawl spider to scrape locally hosted sites
  cssbench         Micro-benchmark for extraction using css
  csv              Visit URLs from a CSV file
  itemloader       Item loader benchmarker
  linkextractor    Micro-benchmark for LinkExtractor()
  urlparseprofile  Urlparse benchmarker
  xpathbench       Micro-benchmark for extraction using xpath
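
As the usage line shows, multiple commands can be chained in a single invocation; for example, a sketch that runs both extraction micro-benchmarks, taking two readings each:

    scrapy-bench --n-runs 2 cssbench xpathbench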