scrpr is a lightweight scraper multitool. It can fetch data via HTTPS, detect changes, and parse the most common formats.
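const scrpr = require("scrpr");

// create a scraper function with shared settings
const scraper = scrpr({
	concurrency: 5,
	cachedir: '/tmp/scraper-cache',
});

// fetch and parse a resource; the callback receives data only when it changed
scraper("https://example.org/data.csv", {
	parse: "csv",
}, function(err, change, data){
	if (err) return console.error(err);
	if (change) console.log(data);
});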
Constructor, returns scraper function
Opts:
concurrency
— number of parallel requests; default: 1
cachedir
— directory to save cache data in; default: <root module>/.scrpr-cache
Scraper, delivers data
Opts:
method
— http method; default: get
url
— URL, alternative to the url parameter
headers
— additional http request headers, default: {}
data
— http data to be sent, default: null
cache
— use cache, default: true
cacheid
— override cache id, default: hash(url, opts)
parse
— format to parse, default: null (raw data)
successCodes
— array of http status codes considered successful, default: [ 200 ]
needle
— options passed on to needle, default: {}
xlsx
— options passed on to xlsx, default: {}
xsv
— options passed on to xsv, default: {}
pdf
— options passed on to pdf.js-extract, default: {}
preprocess(data, callback(err, data))
— modify data before parsing
postprocess(data, callback(err, data))
— modify data after parsing
stream
— deliver data as ReadableStream, without parsing or processing, default: false
metaredirects
— follow <meta http-equiv="refresh"> style redirects, default: false
iconv
— decode stream or data as this charset with iconv-lite before parsing, default: false
cooldown
— microseconds since last fetch before a resource is fetched again, default: false
sizechange
— treat unchanged content-length as same file, default: false
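The preprocess and postprocess hooks can be sketched as follows. This is a minimal illustration, not part of scrpr itself: the helper names stripBom and addTimestamp are hypothetical, and it assumes that preprocess receives the raw data as a Buffer or a string, which the option list above does not specify.

```javascript
// Sketch of preprocess/postprocess hooks (helper names are hypothetical).
// Assumption: preprocess receives the raw data as a Buffer or a string.

// Remove a UTF-8 byte order mark before parsing, keeping the input type.
function stripBom(data, callback) {
	if (Buffer.isBuffer(data)) {
		const hasBom = data.length >= 3 &&
			data[0] === 0xef && data[1] === 0xbb && data[2] === 0xbf;
		return callback(null, hasBom ? data.subarray(3) : data);
	}
	if (typeof data === "string") {
		return callback(null, data.charCodeAt(0) === 0xfeff ? data.slice(1) : data);
	}
	callback(null, data); // unknown type: pass through untouched
}

// Wrap the parsed result together with the time it was processed.
// Works for any parsed shape because it does not inspect the data.
function addTimestamp(data, callback) {
	callback(null, { fetched: new Date().toISOString(), records: data });
}

const hookOpts = {
	parse: "csv",
	preprocess: stripBom,
	postprocess: addTimestamp,
};

// Usage, with a scraper created by the constructor:
// scraper("https://example.org/data.csv", hookOpts, function(err, change, data){ ... });
```

Both hooks follow the callback(err, data) signature listed above, so a failure can be reported by passing an Error as the first argument.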
Callback:
err
— contains Error or null
change
— true if data changed
data
— raw or parsed data when changed, otherwise status string
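A callback that covers all three outcomes might look like the sketch below. The function name handleResult is hypothetical, and the status string "cached" used in the usage comment is only a placeholder, since the actual status strings are not listed here.

```javascript
// Sketch of a handler for the (err, change, data) callback contract.
// Returns a short summary so the three cases are easy to tell apart.
function handleResult(err, change, data) {
	if (err) return "error: " + err.message;
	// when nothing changed, data carries a status string instead of content
	if (!change) return "unchanged (status: " + data + ")";
	return "changed";
}

// Usage, with a scraper created by the constructor:
// scraper("https://example.org/data.csv", { parse: "csv" }, function(err, change, data){
// 	console.log(handleResult(err, change, data));
// });
```

Checking err first and returning early avoids touching data when the request failed.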
Formats (values for parse):
csv
— Comma Separated Values; data is an Object, parsed with xsv
tsv
— Tab Separated Values; data is an Object, parsed with xsv
ssv
— Semicolon Separated Values (data exported "as csv" by some localizations of Microsoft Excel); data is an Object, parsed with xsv
xml
— eXtensible Markup Language; data is an Object, parsed with xml2js
json
— JavaScript Object Notation; data is an Object, parsed natively
html
— HyperText Markup Language; data is an instance of cheerio
yaml
— YAML Ain't Markup Language; data is an Object, parsed with yaml
xlsx
— Office Open XML Workbook; data is an Object, parsed with xlsx; { "<sheetname>": [ [ cell, cell, cell, ... ], ... ] }
pdf
— Portable Document Format; data is an Object, parsed with pdf.js-extract
kdl
— KDL Document Language; data is an Object, parsed with kdljs
dw
— Datawrapper Visualisation; data is an Object, extracted with dataunwrapper
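The xlsx result shape, { "<sheetname>": [ [ cell, ... ], ... ] }, is often easier to work with as a list of records. The sketch below assumes the first row of a sheet holds the column names; sheetToRecords and the sample workbook are hypothetical and not part of scrpr.

```javascript
// Sketch: turn one sheet (array of rows) into an array of record objects.
// Assumption: the first row contains the column names.
function sheetToRecords(rows) {
	if (!Array.isArray(rows) || rows.length === 0) return [];
	const header = rows[0];
	return rows.slice(1).map(function(row){
		const record = {};
		header.forEach(function(name, i){
			// short rows get null for the missing cells
			record[String(name)] = row[i] === undefined ? null : row[i];
		});
		return record;
	});
}

// Sample data in the documented shape (made up for illustration).
const workbook = {
	"Sheet1": [
		[ "id", "name" ],
		[ 1, "Ada" ],
		[ 2, "Grace" ],
	],
};

const records = sheetToRecords(workbook["Sheet1"]);
console.log(records);
```

Inside a scraper callback the same helper would be applied to data["<sheetname>"] once change is true.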
Rudimentary handling for ftp URLs is available if the optional get-uri dependency is installed.
Rudimentary handling for local files is available with the file:/ pseudo-protocol.
xsv, xlsx, xml2js, yaml, cheerio, dataunwrapper, iconv-lite, kdljs, pdf.js-extract and get-uri are optional dependencies. They should only be installed if their use is required.
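To find out which optional module a given parse format needs, a small check along these lines may help. It is a sketch only: the mapping is taken from the format list in this README, the names FORMAT_DEPENDENCIES, isInstalled and missingDependency are hypothetical, and require.resolve looks modules up relative to the calling file, which may differ from where scrpr itself resolves them.

```javascript
// Sketch: map parse formats to the optional module named in this README.
const FORMAT_DEPENDENCIES = new Map([
	[ "csv", "xsv" ], [ "tsv", "xsv" ], [ "ssv", "xsv" ],
	[ "xml", "xml2js" ],
	[ "html", "cheerio" ],
	[ "yaml", "yaml" ],
	[ "xlsx", "xlsx" ],
	[ "pdf", "pdf.js-extract" ],
	[ "kdl", "kdljs" ],
	[ "dw", "dataunwrapper" ],
]);

// True if the module can be resolved from the current file.
function isInstalled(name) {
	try {
		require.resolve(name);
		return true;
	} catch (err) {
		return false;
	}
}

// Returns the name of the missing module, or null if nothing is missing.
// json and raw data (parse: null) need no optional module.
function missingDependency(format) {
	const dep = FORMAT_DEPENDENCIES.get(format);
	if (!dep) return null;
	return isInstalled(dep) ? null : dep;
}

const missing = missingDependency("csv");
if (missing) console.log("install the optional dependency first: npm install " + missing);
```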