puppet-scraper


PuppetScraper is an opinionated wrapper library that makes it easy to scrape pages with Puppeteer, bootstrapped using Jared Palmer's tsdx.

Most people start a new scraping project by require-ing Puppeteer and writing their own logic to scrape pages, and that logic gets more complicated once multiple pages are involved.

PuppetScraper lets you pass just the URLs to scrape, the function to evaluate (the scraping logic), and how many pages (or tabs) to open at a time. In short, PuppetScraper removes the need to create multiple page instances and to retry the evaluation logic yourself.

Version 0.1.0 note: PuppetScraper was initially made as a project template rather than a wrapper library, but the core logic is still the same with various improvements and without extra libraries, so you can include PuppetScraper in your project easily using npm or yarn.

Brief example

Here's a basic example of scraping the entries on the first page of Hacker News:

// examples/hn.js

const { PuppetScraper } = require('puppet-scraper');

// CommonJS modules have no top-level await, so run inside an async function
(async () => {
  const ps = await PuppetScraper.launch();

  const data = await ps.scrapeFromUrl({
    url: 'https://news.ycombinator.com',
    // runs in the page context and collects every story title and link
    evaluateFn: () => {
      let items = [];

      document.querySelectorAll('.storylink').forEach((node) => {
        items.push({
          title: node.innerText,
          url: node.href,
        });
      });

      return items;
    },
  });

  console.log({ data });

  await ps.close();
})();

View more examples in the examples directory.

Usage

Installing dependency

Install puppet-scraper via npm or yarn:

$ npm install puppet-scraper
      --- or ---
$ yarn add puppet-scraper

Install the peer dependency puppeteer, or a Puppeteer equivalent (chrome-aws-lambda, untested):

$ npm install puppeteer
      --- or ---
$ yarn add puppeteer

Instantiation

Create a PuppetScraper instance by launching a new browser, connecting to a running browser, or reusing an existing browser instance:

const { PuppetScraper } = require('puppet-scraper');
const Puppeteer = require('puppeteer');

// launches a new browser instance
const instance = await PuppetScraper.launch();

// connect to an existing browser instance
const external = await PuppetScraper.connect({
  browserWSEndpoint: 'ws://127.0.0.1:9222/devtools/browser/...',
});

// use an existing browser instance
const browser = await Puppeteer.launch();
const existing = await PuppetScraper.use({ browser });

Customize options

launch and connect accept the same options as Puppeteer.launch and Puppeteer.connect, plus two extra properties, concurrentPages and maxEvaluationRetries:

const { PuppetScraper } = require('puppet-scraper');

const instance = await PuppetScraper.launch({
  concurrentPages: 3,
  maxEvaluationRetries: 10,
  headless: false,
});

concurrentPages sets how many pages/tabs are opened and used for scraping.
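
One way to picture concurrentPages is as a fixed number of workers pulling URLs from a shared queue. The sketch below illustrates that idea only: mapWithConcurrency is a hypothetical helper, not part of PuppetScraper, and the library's actual scheduling may differ.

```javascript
// Hypothetical helper: run `task` over `items` with at most `limit` tasks in flight.
// Illustrative only; this is not PuppetScraper's internal code.
async function mapWithConcurrency(items, limit, task) {
  const results = new Array(items.length);
  let next = 0;

  // Each worker stands in for one page/tab and keeps taking the next unprocessed item.
  async function worker() {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index], index);
    }
  }

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as `items`
}
```

With limit set to 3, at most three tasks run at any moment, and results come back in input order regardless of which task finishes first.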

maxEvaluationRetries sets how many times a page will try to evaluate the function given as evaluateFn (see below). If the evaluation throws an error, the page reloads and tries the evaluation again.
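
The reload-and-retry behavior described above can be sketched roughly as follows. evaluateWithRetries is a hypothetical name, and the sketch assumes maxRetries counts total attempts; the library may count differently. The page argument only needs evaluate and reload, both of which exist on a Puppeteer Page.

```javascript
// Hypothetical sketch of "evaluate, and on error reload and try again".
// Assumption: `maxRetries` is the total number of attempts.
async function evaluateWithRetries(page, evaluateFn, maxRetries) {
  let lastError;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      // Success: hand the evaluated value straight back to the caller.
      return await page.evaluate(evaluateFn);
    } catch (error) {
      // Failure: remember the error, reload the page, and loop again.
      lastError = error;
      await page.reload();
    }
  }

  // Every attempt failed, so surface the last error seen.
  throw lastError || new Error('no evaluation attempts were made');
}
```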

If concurrentPages and maxEvaluationRetries are not specified, the default values are used:

export const DEFAULT_CONCURRENT_PAGES = 3;
export const DEFAULT_EVALUATION_RETRIES = 10;
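
As a rough illustration of how such defaults could be applied, the sketch below separates the two PuppetScraper-specific options from everything else, which would be passed on to Puppeteer. resolveOptions is a hypothetical helper and not necessarily how the library does it.

```javascript
const DEFAULT_CONCURRENT_PAGES = 3;
const DEFAULT_EVALUATION_RETRIES = 10;

// Hypothetical helper: fill in missing PuppetScraper options with the defaults
// and keep the remaining keys (headless, args, ...) for Puppeteer.
function resolveOptions(options = {}) {
  const {
    concurrentPages = DEFAULT_CONCURRENT_PAGES,
    maxEvaluationRetries = DEFAULT_EVALUATION_RETRIES,
    ...puppeteerOptions
  } = options;

  return { concurrentPages, maxEvaluationRetries, puppeteerOptions };
}
```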

Scraping single page

As shown in the example above, use .scrapeFromUrl and pass an object with the following properties:

  • url: string, page URL to be opened
  • evaluateFn: function, function to evaluate (scraper method)
  • pageOptions: object, Puppeteer.DirectNavigationOptions props to override page behaviors

const data = await instance.scrapeFromUrl({
  url: 'https://news.ycombinator.com',
  evaluateFn: () => {
    let items = [];

    document.querySelectorAll('.storylink').forEach((node) => {
      items.push({
        title: node.innerText,
        url: node.href,
      });
    });

    return items;
  },
});

pageOptions defaults the waitUntil property to networkidle0, which you can read more about in the API documentation.
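
If waiting for the network to go idle is too slow for a given page, the default can be overridden. The object below is a sketch of such a request, assuming pageOptions is forwarded to Puppeteer's navigation call; the values shown are illustrative, and you would pass the object to instance.scrapeFromUrl.

```javascript
// Request object in the shape described above.
// Assumption: pageOptions is handed to Puppeteer's navigation as-is.
const request = {
  url: 'https://news.ycombinator.com',
  // Runs in the page context, so `document` is the page's document.
  evaluateFn: () => document.title,
  pageOptions: {
    waitUntil: 'domcontentloaded', // resolve once the DOM is parsed, not when the network is idle
    timeout: 30000, // navigation timeout in milliseconds
  },
};
```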

Scraping multiple pages

Same as .scrapeFromUrl, but use .scrapeFromUrls and pass a urls property containing an array of URL strings:

  • urls: string[], page URLs to be opened
  • evaluateFn: function, function to evaluate (scraper method)
  • pageOptions: object, Puppeteer.DirectNavigationOptions props to override page behaviors

const urls = Array.from({ length: 5 }).map(
  (_, i) => `https://news.ycombinator.com/news?p=${i + 1}`,
);

const data = await instance.scrapeFromUrls({
  urls,
  evaluateFn: () => {
    let items = [];

    document.querySelectorAll('.storylink').forEach((node) => {
      items.push({
        title: node.innerText,
        url: node.href,
      });
    });

    return items;
  },
});

Closing instance

When there's nothing left to do, don't forget to close the instance, which also closes the browser:

await instance.close();
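
Because a thrown error can skip the close call and leave a browser process running, it may help to wrap the work in try/finally. withScraper below is a hypothetical helper, not part of PuppetScraper; launch is any function that resolves to an object with a close method, for example () => PuppetScraper.launch().

```javascript
// Hypothetical helper: always close the instance, even if `work` throws.
async function withScraper(launch, work) {
  const instance = await launch();
  try {
    return await work(instance);
  } finally {
    // Runs on both success and failure.
    await instance.close();
  }
}
```

A call could then look like withScraper(() => PuppetScraper.launch(), (instance) => instance.scrapeFromUrl({ ... })).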

Access the browser instance

PuppetScraper also exposes the browser instance if you want to do things manually:

const browser = instance.___internal.browser;

Contributing

Thanks goes to these wonderful people (emoji key):


Griko Nibras

💻 🚧

This project follows the all-contributors specification. Contributions of any kind welcome!

License

MIT License, Copyright (c) 2020 Griko Nibras