Web crawler for Node.js; both HTTP and HTTPS are supported.

Installation:

```
npm install js-crawler
```
The crawler provides an intuitive interface for crawling links on web sites. Example:

```javascript
var Crawler = require("js-crawler");

new Crawler().configure({depth: 3})
  .crawl("http://www.google.com", function onSuccess(page) {
    console.log(page.url);
  });
```
The call to `configure` is optional; if it is omitted, the default option values will be used.

The `onSuccess` callback will be called for each page that the crawler has crawled. The `page` value passed to the callback contains the following fields:

- `url` - URL of the page
- `content` - body of the page (usually HTML)
- `status` - the HTTP status code
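For instance, a success callback can combine all three fields into a single log line. `describePage` below is a hypothetical helper written for this example, not part of the library:

```javascript
// Hypothetical helper (not part of js-crawler): summarizes a crawled
// page using the three documented fields of the `page` object.
function describePage(page) {
  return page.url + " -> HTTP " + page.status +
    ", " + page.content.length + " chars";
}

// A `page` object shaped like the one passed to the success callback:
console.log(describePage({
  url: "http://example.com",
  status: 200,
  content: "<html></html>"
})); // "http://example.com -> HTTP 200, 13 chars"
```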
It is possible to pass an extra callback to handle errors. Consider the modified example above:

```javascript
var Crawler = require("js-crawler");

new Crawler().configure({depth: 3})
  .crawl("http://www.google.com", function(page) {
    console.log(page.url);
  }, function(response) {
    console.log("ERROR occurred:");
    console.log(response.status);
    console.log(response.url);
  });
```
Here the second callback will be called for each page that could not be accessed (for example, because the corresponding site is down). `status` may be undefined.
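Since `status` can be missing, an error callback may want to guard against that. `describeFailure` is a hypothetical sketch of such a defensive handler, not part of the library:

```javascript
// Hypothetical defensive error handler (not part of js-crawler):
// falls back to a placeholder when no HTTP status was received.
function describeFailure(response) {
  var status = (response.status !== undefined) ? response.status : "no response";
  return "ERROR crawling " + response.url + ": " + status;
}

console.log(describeFailure({url: "http://example.com", status: 500}));
// "ERROR crawling http://example.com: 500"
console.log(describeFailure({url: "http://down.example.com"}));
// "ERROR crawling http://down.example.com: no response"
```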
- `depth` - the depth to which the links from the original page will be crawled. Example: if `site1.com` contains a link to `site2.com`, which contains a link to `site3.com`, `depth` is 2, and we crawl from `site1.com`, then we will crawl `site2.com` but will not crawl `site3.com`, as it would be too deep. The default value is 2.
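The cutoff can be pictured with a small sketch; this mirrors the rule described above, not js-crawler's internal code. With depth 2, only levels 0 and 1 of the link graph are visited:

```javascript
// Sketch of the depth rule, for illustration only (not the library's
// implementation): level 0 is the start page, level 1 its links, and
// so on; a level is visited only while it is below the configured depth.
function levelsCrawled(depth) {
  var levels = [];
  for (var level = 0; level < depth; level++) {
    levels.push(level);
  }
  return levels;
}

// depth 2 from site1.com: site1.com (level 0) and site2.com (level 1)
// are crawled; site3.com would be level 2, so it is skipped.
console.log(levelsCrawled(2)); // [ 0, 1 ]
```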
- `ignoreRelative` - ignore relative URLs: relative URLs on the same page will be skipped when crawling, so `/wiki/Quick-Start` will not be crawled, while `https://github.com/explore` will be crawled. This option can be useful when we are mainly interested in the sites to which the current site refers, and not in different sections of the original site. The default value is `false`.
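A rough way to see which URLs this option skips; this is a simplified check for illustration, not the library's actual detection logic:

```javascript
// Simplified relative-URL check, for illustration only:
// anything without an http(s) scheme counts as relative here.
function isRelative(url) {
  return !/^https?:\/\//i.test(url);
}

console.log(isRelative("/wiki/Quick-Start"));          // true  -> ignored
console.log(isRelative("https://github.com/explore")); // false -> crawled
```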
- `shouldCrawl` - function that specifies whether a URL should be crawled; returns `true` or `false`.
Example:

```javascript
var Crawler = require("js-crawler");

var crawler = new Crawler().configure({
  shouldCrawl: function(url) {
    return url.indexOf("reddit.com") < 0;
  }
});

crawler.crawl("http://www.reddit.com/r/javascript", function(page) {
  console.log(page.url);
});
```

The default value is a function that always returns `true`.
MIT License (c) Anton Ivanov
The crawler depends on the following Node.js modules: