web-scraper

A breadth-first search web scraper written in Java, with a GUI built on the JavaFX toolkit.

Features

Breadth-first search from a starting URL
Customizable parsing settings
- Number of parallel threads
- Maximum link traversal depth
- Crawler timeout (lifetime)
- Delay between requests
- Optionally clear the parsing queue before finishing
  - This can take a long time
Keep track of parsing status with simple statistics
- Total crawling time
- Number of unique pages saved
- Number of pages visited / number of pages queued
Output scraped data to a JSON file
View the base URL's HTML source to help determine selectors
Selector view
- Set the JSON output format by setting variable names and CSS selectors
- Interactively test your selectors before starting the crawl
Graph View
- Get a deep understanding of the path the crawler took in a visual format
- Click any node to see the URL and data scraped from it
- Entertaining to watch
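
The breadth-first traversal and the maximum link traversal depth setting can be illustrated with a minimal sketch. This is not the project's actual code: the class name BfsCrawlSketch, the crawl method, and the in-memory linkGraph map (which stands in for fetching a page and extracting its links) are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a depth-limited breadth-first crawl; not the project's classes. */
class BfsCrawlSketch {

    /** A queued page together with the depth at which it was discovered. */
    private static final class Entry {
        final String url;
        final int depth;

        Entry(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    /**
     * Returns the pages visited in breadth-first order, starting from startUrl.
     * linkGraph is an assumed stand-in for downloading a page and reading its links.
     */
    static List<String> crawl(String startUrl, Map<String, List<String>> linkGraph, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<Entry> queue = new ArrayDeque<>();
        queue.add(new Entry(startUrl, 0));
        seen.add(startUrl);

        while (!queue.isEmpty()) {
            Entry current = queue.poll();
            visited.add(current.url);
            if (current.depth >= maxDepth) {
                continue; // depth limit reached: visit the page but do not follow its links
            }
            for (String link : linkGraph.getOrDefault(current.url, List.of())) {
                if (seen.add(link)) { // add() returns false for pages already queued
                    queue.add(new Entry(link, current.depth + 1));
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d"),
                "c", List.of("d", "a"),
                "d", List.of("e"));
        System.out.println(crawl("a", graph, 2)); // prints [a, b, c, d]
    }
}
```

Using a FIFO queue means every page at depth n is visited before any page at depth n + 1, and the seen set keeps a page from being queued twice when several pages link to it.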

GUI

Settings View

HTML View

JSON Selectors View

Syntax

Special Selectors [type]
- title - get the page's title
- url - get the page's URL
Data Selectors [css selector]:[type]
- CSS Selectors
  - Use css syntax to select elements
  - div.class_name > h1 selects an h1 with a parent div of class class_name
- Types
  - text - get all text inside the given element, including the text of its child elements
    - <p>Hello <e>World!</e></p> -> Hello World!
  - owntext - get only the text that belongs directly to the given element, excluding its child elements
    - <p>Hello <e>World!</e></p> -> Hello
  - href - get the link contained in the element's href attribute
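
As a rough illustration of the [css selector]:[type] syntax, the sketch below splits a selector string into its CSS part and its type. It is an assumption, not the project's parser: the class SelectorSyntaxSketch and its parse method are hypothetical, and splitting on the last colon is a design choice made here so that CSS pseudo-classes such as :first-child stay inside the selector part.

```java
import java.util.Set;

/** Hypothetical parser for the selector syntax described above; not the project's code. */
class SelectorSyntaxSketch {

    // Special selectors take no CSS part.
    static final Set<String> SPECIAL = Set.of("title", "url");
    // Types that may follow a CSS selector.
    static final Set<String> TYPES = Set.of("text", "owntext", "href");

    /**
     * Returns a two-element array {cssSelector, type}.
     * cssSelector is empty for the special selectors title and url.
     */
    static String[] parse(String raw) {
        String s = raw.trim();
        if (SPECIAL.contains(s)) {
            return new String[] {"", s};
        }
        // Assumption: the type follows the LAST colon, so pseudo-classes are preserved.
        int idx = s.lastIndexOf(':');
        if (idx <= 0) {
            throw new IllegalArgumentException("Expected [css selector]:[type] but got: " + raw);
        }
        String css = s.substring(0, idx).trim();
        String type = s.substring(idx + 1).trim();
        if (!TYPES.contains(type)) {
            throw new IllegalArgumentException("Unknown type: " + type);
        }
        return new String[] {css, type};
    }

    public static void main(String[] args) {
        String[] parts = parse("div.class_name > h1:text");
        System.out.println(parts[0] + " | " + parts[1]); // prints div.class_name > h1 | text
    }
}
```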

Structure

The selector definition must be a valid JSON object containing both a data key and a links key.
Each data element you want to extract has a unique title to identify it.
The links key holds an array of selectors pointing to anchor tags.
- The crawler uses the href attribute of the matched links to traverse from page to page.
- If you do not care which links are followed, use the selector a to follow every anchor tag link.
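
The structure described above could be checked along the following lines once the JSON has been parsed into maps and lists. This is only a sketch under assumptions: the class SelectorStructureSketch is hypothetical, the data entries are assumed to be a flat mapping from title to selector string, and the titles pageTitle and heading are made-up examples.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical check of the data/links structure; not the project's validation code. */
class SelectorStructureSketch {

    /**
     * Returns true if the parsed definition has a "data" object whose values are
     * selector strings and a "links" array whose entries are selector strings.
     */
    static boolean isValid(Map<String, Object> definition) {
        Object data = definition.get("data");
        Object links = definition.get("links");
        if (!(data instanceof Map) || !(links instanceof List)) {
            return false;
        }
        for (Object value : ((Map<?, ?>) data).values()) {
            if (!(value instanceof String)) {
                return false;
            }
        }
        for (Object link : (List<?>) links) {
            if (!(link instanceof String)) {
                return false;
            }
        }
        return true;
    }

    /**
     * Builds the in-memory equivalent of this assumed JSON definition:
     * {"data": {"pageTitle": "title", "heading": "div.class_name > h1:text"}, "links": ["a"]}
     */
    static Map<String, Object> example() {
        Map<String, String> data = new LinkedHashMap<>();
        data.put("pageTitle", "title");                // special selector
        data.put("heading", "div.class_name > h1:text"); // data selector
        Map<String, Object> definition = new LinkedHashMap<>();
        definition.put("data", data);
        definition.put("links", List.of("a"));         // follow any anchor tag link
        return definition;
    }

    public static void main(String[] args) {
        System.out.println(isValid(example())); // prints true
    }
}
```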

Scraped Graph

Usage

Nodes are added to the graph in real time and in the order they are traversed
Select a node to view the URL and data associated with it in the dropdown
Nodes are colored according to depth. Nodes of the same color were found at the same depth

Libraries

Gradle
JSoup
Guava
Lombok
Gson
JavaFX
GraphStream
SLF4J / Logback

Development

The project uses the Gradle build system. Import the project into any IDE and run the "application -> run" task.

Usage

Download the prebuilt JAR, which has no additional dependencies, and run it on any platform with Java installed:
java -jar [jarfile].jar
