pyotruk/webcrawler

Web Crawler

Build

gradle build fatJar

Run

java -jar build/libs/webcrawler-all-1.0.jar startURL depth [poolSize=10]
Example: java -jar build/libs/webcrawler-all-1.0.jar http://ya.ru/ 3 100
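The usage line above implies two required arguments and one optional argument with a default of 10. A minimal sketch of how such arguments could be parsed is shown below; the class name `CrawlerArgs` and its fields are hypothetical and are not taken from the project's source.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical holder for the three command-line arguments (not the project's own class). */
final class CrawlerArgs {
    /** Default pool size, taken from the "[poolSize=10]" part of the usage line. */
    static final int DEFAULT_POOL_SIZE = 10;

    final URI startUrl;
    final int depth;
    final int poolSize;

    private CrawlerArgs(URI startUrl, int depth, int poolSize) {
        this.startUrl = startUrl;
        this.depth = depth;
        this.poolSize = poolSize;
    }

    /** Parses "startURL depth [poolSize]" and rejects anything else. */
    static CrawlerArgs parse(String[] args) {
        if (args.length < 2 || args.length > 3) {
            throw new IllegalArgumentException("Usage: startURL depth [poolSize=10]");
        }
        URI startUrl;
        try {
            startUrl = new URI(args[0]);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid startURL: " + args[0], e);
        }
        if (!startUrl.isAbsolute()) {
            throw new IllegalArgumentException("startURL must include a scheme: " + args[0]);
        }
        // Integer.parseInt throws NumberFormatException, a subclass of IllegalArgumentException.
        int depth = Integer.parseInt(args[1]);
        int poolSize = args.length == 3 ? Integer.parseInt(args[2]) : DEFAULT_POOL_SIZE;
        if (depth < 0 || poolSize < 1) {
            throw new IllegalArgumentException("depth must be >= 0 and poolSize >= 1");
        }
        return new CrawlerArgs(startUrl, depth, poolSize);
    }
}
```

With this sketch, `CrawlerArgs.parse(new String[] {"http://ya.ru/", "3"})` would yield a pool size of 10, matching the documented default.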

TODOs

  1. Add a parent_id column to Page for hierarchy building.
  2. Check the global uniqueness of a URL before the JPA transaction.
  3. Discard URLs that are not globally unique before they generate children.
  4. Fix 'GC overhead limit exceeded' when depth > 4.
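TODOs 2 and 3 could be approached with an in-memory set of already-seen URLs that is consulted before any database work or child generation. The sketch below is one possible approach, not the project's implementation; the class `SeenUrls` and its methods are hypothetical, and the normalization (dropping the fragment and resolving dot segments) is an assumption about what "globally unique" should mean.

```java
import java.net.URI;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical thread-safe registry of URLs that have already been scheduled. */
final class SeenUrls {
    // Concurrent set, so worker threads from the pool can call markIfNew without extra locking.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /**
     * Returns true only for the first caller that offers a given URL.
     * A false result means the URL is a duplicate and can be dropped
     * before a transaction is opened or its children are generated.
     */
    boolean markIfNew(String url) {
        return seen.add(normalize(url));
    }

    /**
     * Assumed normalization: strip the fragment and resolve "." and ".." segments.
     * URI.create throws IllegalArgumentException for malformed input.
     */
    static String normalize(String url) {
        String trimmed = url.trim();
        int hash = trimmed.indexOf('#');
        if (hash >= 0) {
            trimmed = trimmed.substring(0, hash);
        }
        return URI.create(trimmed).normalize().toString();
    }
}
```

A worker would call `markIfNew` on each discovered link and skip the link when it returns false. An unbounded set still grows with the crawl, so this does not by itself address the memory issue in TODO 4.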
