Awesome Web Scraper

A collection of awesome web scrapers and crawlers.

Java

Apache Nutch - Highly extensible, highly scalable Web crawler. Pluggable parsing, protocols, storage and indexing.
websphinx - Website-Specific Processors for HTML INformation eXtraction.
Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
crawler4j - Open source web crawler for Java that provides a simple interface for crawling the Web. With it, you can set up a multi-threaded web crawler in a few minutes.

C/C++

HTTrack - Free offline browser utility that copies a website from the Internet to a local directory.

C#

ccrawler - Built with C# 3.5. It contains a simple extension, a web content categorizer, which can separate web pages depending on their content.

Erlang

ebot - Open source web crawler built on top of a NoSQL database (Apache CouchDB, Riak), an AMQP message broker (RabbitMQ), webmachine and mochiweb.

Python

scrapy - Scrapy, a fast high-level web crawling & scraping framework for Python.
gdom - gdom, DOM Traversing and Scraping using GraphQL.
trafilatura - Library and command-line tool to extract metadata, main text, and comments.
extractnet - Machine-learning-based content and metadata extraction framework for Python.
Scrapegraph-ai - Open source library for scraping with the help of AI.

PHP

Goutte - Goutte, a simple PHP Web Scraper.
DiDOM - Simple and fast HTML parser.
simple_html_dom - Just a Simple HTML DOM library fork.
PHPCrawl - PHPCrawl is a framework for crawling/spidering websites written in PHP.
Crawler - A library for Rapid Web Crawler and Scraper Development.

Nodejs

puppeteer - Headless Chrome Node API (https://pptr.dev).
Phantomjs - Scriptable Headless WebKit.
node-crawler - Web Crawler/Spider for NodeJS + server-side jQuery.
node-simplecrawler - Flexible event driven crawler for node.
spider - Programmable spidering of web sites with node.js and jQuery.
slimerjs - A PhantomJS-like tool running Gecko.
casperjs - Navigation scripting & testing utility for PhantomJS and SlimerJS.
zombie - Insanely fast, full-stack, headless browser testing using node.js.
nightmare - Nightmare is a high-level wrapper for PhantomJS that lets you automate browser tasks.
jsdom - A JavaScript implementation of the WHATWG DOM and HTML standards, for use with node.js.
xray - The next web scraper. See through the <html> noise.
lightcrawler - Crawl a website and run it through Google lighthouse.

Ruby

wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.

Go

gocrawl - Polite, slim and concurrent web crawler.
fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.

Rust

scraper - HTML parsing and querying with CSS selectors.
reqwest - An ergonomic, batteries-included HTTP Client for Rust.

License

MIT

Contributing

Please, read the Contribution Guidelines before submitting your suggestion.

Feel free to open an issue or create a pull request with your additions.
