Gopa (狗爬) aims to be a high-performance, distributed, and lightweight spider written in Go.
- move repo to infinitbyte/gopa for better collaboration; namespace changed as well
- separate API and UI, listening on different ports
- add mysql as database option
- fetch and update tasks with a stepped delay
- add hash joint to crawler pipeline
- dispatch tasks and auto update tasks
- add proxy to fetch joint
- filter URLs before pushing them to the checker
- add rules config to url filter
- support elasticsearch as database store
- add task_deduplication in the check phase
- multi instance support on local machine
- streamline clustering on local machine
- dynamic config for modules and pipelines is ready
- pipeline and context refactored to support dynamic parameters
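The idea behind a context with dynamic parameters is that each joint reads what earlier joints wrote and adds its own output, without fixed struct fields. This is a minimal sketch of the pattern; Gopa's actual Context API differs:

```go
package main

import "fmt"

// Context carries dynamic, per-run parameters between pipeline joints.
type Context struct {
	params map[string]interface{}
}

func NewContext() *Context {
	return &Context{params: map[string]interface{}{}}
}

func (c *Context) Set(key string, v interface{}) { c.params[key] = v }

func (c *Context) GetString(key string) (string, bool) {
	v, ok := c.params[key]
	if !ok {
		return "", false
	}
	s, ok := v.(string)
	return s, ok
}

// fetchJoint is a hypothetical joint: it reads the task URL set by an
// earlier stage and stores the fetched body for later stages.
func fetchJoint(c *Context) {
	url, _ := c.GetString("task_url")
	c.Set("page_body", "<html>fetched from "+url+"</html>")
}

func main() {
	ctx := NewContext()
	ctx.Set("task_url", "http://example.com/")
	fetchJoint(ctx)
	body, _ := ctx.GetString("page_body")
	fmt.Println(body)
}
```

Because parameters are looked up by name at runtime, new joints can be added or reordered via config without changing the pipeline's type signatures.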
- save snapshot to KV store and update task management
- optimize shutdown logic, cutting the number of goroutines in half
- add a wiki about how to build gopa on windows
- remove timeout in queue by default
- improve statsd performance with buffered client
- refine log levels; make the pprof listen address configurable
- update task ui, limit length of name
- detect dead processes and replace stale lock files
- persist auto-incremented id sequence to disk
- simplify joint registration
- disable simhash due to poor performance and memory leak
- fix wrong relative URLs by using unicode (rune) indexes
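The class of bug fixed above comes from slicing a URL string at a byte index when the string contains multi-byte characters, which can land mid-character and corrupt the result; converting to runes avoids it. An illustrative sketch, not Gopa's actual parser code:

```go
package main

import (
	"fmt"
	"strings"
)

// lastSlashRuneIndex finds the last '/' by rune position, not byte
// position, so multi-byte characters earlier in the string cannot
// shift the cut point into the middle of a character.
func lastSlashRuneIndex(s string) int {
	runes := []rune(s)
	for i := len(runes) - 1; i >= 0; i-- {
		if runes[i] == '/' {
			return i
		}
	}
	return -1
}

// resolveRelative handles only same-directory relative links, enough
// to show the rune-index idea.
func resolveRelative(pageURL, rel string) string {
	if strings.HasPrefix(rel, "http://") || strings.HasPrefix(rel, "https://") {
		return rel
	}
	runes := []rune(pageURL)
	i := lastSlashRuneIndex(pageURL)
	return string(runes[:i+1]) + rel
}

func main() {
	// A URL path containing multi-byte characters.
	page := "http://example.com/目录/index.html"
	fmt.Println(resolveRelative(page, "next.html"))
}
```

For production code the standard library's `net/url` (`URL.ResolveReference`) handles relative resolution robustly, including `../` segments and query strings.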
- fix statsd sending no data
- fix poor string-merge performance
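The usual Go fix for this kind of performance problem: repeated `s += part` reallocates and copies the whole string on every append, which is quadratic, while a single growing buffer is linear. A generic illustration of the pattern, not the exact code that was changed:

```go
package main

import (
	"fmt"
	"strings"
)

// mergeSlow concatenates with +=, copying the accumulated string on
// every iteration.
func mergeSlow(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// mergeFast appends into one growing buffer (bytes.Buffer works the
// same way on older Go versions that predate strings.Builder).
func mergeFast(parts []string) string {
	var b strings.Builder
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	parts := []string{"a", "b", "c"}
	fmt.Println(mergeSlow(parts) == mergeFast(parts)) // true
}
```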
- raft clustering
- dynamically change logging settings from the console; logs can be filtered by level, message, file, and function name
- dynamically create pipelines
- add TLS to secure the API and websocket
- add proxy to crawler pipeline
- use a template engine; refactor the UI
- add a logo
- fix incorrect stats numbers and incorrect task filter
- fix incorrect redirect handler that ignored URLs
- add stats api to expose the task info, http://localhost:8001/stats
- add websocket and simple ui to interact with Gopa, http://localhost:8001/ui/
- add task api to accept seed
- dynamically change the seelog config via API, [GET/POST] http://localhost:8001/setting/seelog/
- follow 301/302 redirects and continue fetching
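In Go, following 301/302 redirects needs no custom handler: `http.Client` follows them by default (up to 10 hops), so the crawler only has to record the final URL and keep going. A self-contained sketch against a local test server, not Gopa's actual fetch joint:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// demo starts a test server whose /old page 302-redirects to /new,
// fetches /old, and reports where the client actually landed.
func demo() (finalPath, body string) {
	mux := http.NewServeMux()
	mux.HandleFunc("/old", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/new", http.StatusFound) // 302
	})
	mux.HandleFunc("/new", func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "final page")
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/old")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	b, _ := io.ReadAll(resp.Body)
	// resp.Request holds the request that produced this response,
	// i.e. the one after redirects were followed.
	return resp.Request.URL.Path, string(b)
}

func main() {
	path, body := demo()
	fmt.Println(path, body) // /new final page
}
```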
- add boltdb status page, http://localhost:8001/ui/boltdb
- add pipeline framework to create crawler
- add commands to dynamically change the logging level and to add seed URLs
- export metrics to statsD
- support daemon mode in linux and darwin
- add task management api
- add update_ui target to the Makefile to build the static UI
- embed the git commit and build_date in the gopa binary
- console UI supports websocket reconnect
- remove bloom filter; use leveldb to store URLs
- crawling speed control
- cookies supported
- brief logging format
- fix nil exception on shutdown
- fix wrong relative links in the parse phase
- ruled fetch
- fetch/parse offsets can be persisted and reloaded
- http console
- refactor storage interface; data paths are now configurable
- disable pprof by default
- use local storage instead of kafka; kafka will be removed later
- check whether the local file exists before fetching the remote page
- resolve memory leak caused by sbloom filter
- download by url template
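Downloading by URL template means expanding a pattern over a numeric range to enumerate pages. The `{page}` placeholder syntax below is an assumption for illustration; Gopa's actual template format may differ:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandTemplate turns a URL pattern into concrete list-page URLs by
// substituting the hypothetical "{page}" placeholder for each number
// in [from, to].
func expandTemplate(tmpl string, from, to int) []string {
	var urls []string
	for i := from; i <= to; i++ {
		urls = append(urls, strings.Replace(tmpl, "{page}", strconv.Itoa(i), 1))
	}
	return urls
}

func main() {
	for _, u := range expandTemplate("http://example.com/list?page={page}", 1, 3) {
		fmt.Println(u)
	}
}
```

The expanded URLs would then be pushed into the normal fetch queue, which is what makes list-page download (the next item) work.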
- list page download
- add golang pprof, http://localhost:6060/debug/pprof/
- `go tool pprof http://localhost:6060/debug/pprof/heap`
- `go tool pprof http://localhost:6060/debug/pprof/profile`
- `go tool pprof http://localhost:6060/debug/pprof/block`
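Wiring in pprof is a one-line import for its side effect, which registers the /debug/pprof/ handlers on the default mux. Gopa serves them on :6060; the sketch below hits a local test server instead so it is self-contained:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // side effect: registers /debug/pprof/ handlers on DefaultServeMux
)

// pprofStatus serves http.DefaultServeMux on a test server and checks
// that the pprof index page responds.
func pprofStatus() int {
	srv := httptest.NewServer(http.DefaultServeMux)
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/debug/pprof/")
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	return resp.StatusCode
}

func main() {
	fmt.Println(pprofStatus()) // 200
}
```

In a real deployment the equivalent is `go http.ListenAndServe("localhost:6060", nil)`, after which the `go tool pprof` commands above work against the live process.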
- integrate with kafka to make task controllable and recoverable
- parameters are configurable
- goroutines can be controlled now
- bloom-filter persistence
- build script works
- just up and run.