Gopa (狗爬) aims to be a high-performance, distributed, and lightweight spider written in Go.
- move repo to infinitbyte/gopa for better collaboration; namespace changed as well
- separate API and UI, listening on different ports
- add mysql as database option
- fetch and update tasks with a stepped delay
- add hash joint to crawler pipeline
- dispatch tasks and auto update tasks
- add proxy to fetch joint
- filter URLs before pushing them to the checker
- add rules config to url filter
- support elasticsearch as database store
- add task_deduplication in the check phase
- multi instance support on local machine
- streamline clustering on local machine
- dynamic config for modules and pipelines is ready
- pipeline and context refactored to support dynamic parameters
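The idea behind a context with dynamic parameters is that each joint reads what earlier joints wrote and adds its own output, without fixed struct fields. This is a minimal sketch of the pattern; Gopa's actual Context API differs:

```go
package main

import "fmt"

// Context carries dynamic, per-run parameters between pipeline joints.
type Context struct {
	params map[string]interface{}
}

func NewContext() *Context {
	return &Context{params: map[string]interface{}{}}
}

func (c *Context) Set(key string, v interface{}) { c.params[key] = v }

func (c *Context) GetString(key string) (string, bool) {
	v, ok := c.params[key]
	if !ok {
		return "", false
	}
	s, ok := v.(string)
	return s, ok
}

// fetchJoint is a hypothetical joint: it reads the task URL set by an
// earlier stage and stores the fetched body for later stages.
func fetchJoint(c *Context) {
	url, _ := c.GetString("task_url")
	c.Set("page_body", "<html>fetched from "+url+"</html>")
}

func main() {
	ctx := NewContext()
	ctx.Set("task_url", "http://example.com/")
	fetchJoint(ctx)
	body, _ := ctx.GetString("page_body")
	fmt.Println(body)
}
```

Because parameters are looked up by name at runtime, new joints can be added or reordered via config without changing the pipeline's type signatures.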
- save snapshot to KV store and update task management
- optimize shutdown logic, cutting the number of goroutines in half
- add a wiki about how to build gopa on windows
- remove timeout in queue by default
- improve statsd performance with buffered client
- refine log levels; make the pprof listen address configurable
- update task ui, limit length of name
- detect dead processes and replace stale lock files
- persist auto-incremented id sequence to disk
- simplify joint registration
- disable simhash due to poor performance and memory leak
- fix wrong relative URLs by using unicode (rune) indexes
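The class of bug fixed above comes from slicing a URL string at a byte index when the string contains multi-byte characters, which can land mid-character and corrupt the result; converting to runes avoids it. An illustrative sketch, not Gopa's actual parser code:

```go
package main

import (
	"fmt"
	"strings"
)

// lastSlashRuneIndex finds the last '/' by rune position, not byte
// position, so multi-byte characters earlier in the string cannot
// shift the cut point into the middle of a character.
func lastSlashRuneIndex(s string) int {
	runes := []rune(s)
	for i := len(runes) - 1; i >= 0; i-- {
		if runes[i] == '/' {
			return i
		}
	}
	return -1
}

// resolveRelative handles only same-directory relative links, enough
// to show the rune-index idea.
func resolveRelative(pageURL, rel string) string {
	if strings.HasPrefix(rel, "http://") || strings.HasPrefix(rel, "https://") {
		return rel
	}
	runes := []rune(pageURL)
	i := lastSlashRuneIndex(pageURL)
	return string(runes[:i+1]) + rel
}

func main() {
	// A URL path containing multi-byte characters.
	page := "http://example.com/目录/index.html"
	fmt.Println(resolveRelative(page, "next.html"))
}
```

For production code the standard library's `net/url` (`URL.ResolveReference`) handles relative resolution robustly, including `../` segments and query strings.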
- fix statsd sending no data
- fix poor string-merge performance
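The usual Go fix for this kind of performance problem: repeated `s += part` reallocates and copies the whole string on every append, which is quadratic, while a single growing buffer is linear. A generic illustration of the pattern, not the exact code that was changed:

```go
package main

import (
	"fmt"
	"strings"
)

// mergeSlow concatenates with +=, copying the accumulated string on
// every iteration.
func mergeSlow(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// mergeFast appends into one growing buffer (bytes.Buffer works the
// same way on older Go versions that predate strings.Builder).
func mergeFast(parts []string) string {
	var b strings.Builder
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	parts := []string{"a", "b", "c"}
	fmt.Println(mergeSlow(parts) == mergeFast(parts)) // true
}
```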
- raft clustering
- dynamically change logging settings from the console; logs can be filtered by level, message, file, and function name
- dynamically create pipelines
- add TLS to secure the API and websocket
- add proxy to crawler pipeline
- use a template engine; refactor the UI
- add a logo
- fix incorrect stats numbers and incorrect task filter
- fix incorrect redirect handler that ignored URLs
- add stats api to expose the task info, http://localhost:8001/stats
- add websocket and simple ui to interact with Gopa, http://localhost:8001/ui/
- add task api to accept seed
- dynamically change the seelog config via API, [GET/POST] http://localhost:8001/setting/seelog/
- follow 301/302 redirects and continue fetching
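In Go, following 301/302 redirects needs no custom handler: `http.Client` follows them by default (up to 10 hops), so the crawler only has to record the final URL and keep going. A self-contained sketch against a local test server, not Gopa's actual fetch joint:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// demo starts a test server whose /old page 302-redirects to /new,
// fetches /old, and reports where the client actually landed.
func demo() (finalPath, body string) {
	mux := http.NewServeMux()
	mux.HandleFunc("/old", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/new", http.StatusFound) // 302
	})
	mux.HandleFunc("/new", func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "final page")
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/old")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	b, _ := io.ReadAll(resp.Body)
	// resp.Request holds the request that produced this response,
	// i.e. the one after redirects were followed.
	return resp.Request.URL.Path, string(b)
}

func main() {
	path, body := demo()
	fmt.Println(path, body) // /new final page
}
```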
- add boltdb status page, http://localhost:8001/ui/boltdb
- add pipeline framework to create crawler
- add commands to dynamically change the logging level and to add seed URLs
- export metrics to statsD
- support daemon mode in linux and darwin
- add task management api
- add update_ui target to the Makefile to build the static UI
- embed the git commit and build_date in the gopa binary
- console UI supports websocket reconnect
- remove bloom filter; use leveldb to store URLs
- crawling speed control
- cookies supported
- brief logging format
- fix nil exception on shutdown
- fix wrong relative links in the parse phase
- ruled fetch
- fetch/parse offsets can be persisted and reloaded
- http console
- refactor storage interface; data paths are now configurable
- disable pprof by default
- use local storage instead of kafka; kafka will be removed later
- check whether the local file exists before fetching the remote page
- resolve memory leak caused by sbloom filter
- download by url template
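Downloading by URL template means expanding a pattern over a numeric range to enumerate pages. The `{page}` placeholder syntax below is an assumption for illustration; Gopa's actual template format may differ:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandTemplate turns a URL pattern into concrete list-page URLs by
// substituting the hypothetical "{page}" placeholder for each number
// in [from, to].
func expandTemplate(tmpl string, from, to int) []string {
	var urls []string
	for i := from; i <= to; i++ {
		urls = append(urls, strings.Replace(tmpl, "{page}", strconv.Itoa(i), 1))
	}
	return urls
}

func main() {
	for _, u := range expandTemplate("http://example.com/list?page={page}", 1, 3) {
		fmt.Println(u)
	}
}
```

The expanded URLs would then be pushed into the normal fetch queue, which is what makes list-page download (the next item) work.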
- list page download
- add golang pprof, http://localhost:6060/debug/pprof/
- `go tool pprof http://localhost:6060/debug/pprof/heap`
- `go tool pprof http://localhost:6060/debug/pprof/profile`
- `go tool pprof http://localhost:6060/debug/pprof/block`
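Wiring in pprof is a one-line import for its side effect, which registers the /debug/pprof/ handlers on the default mux. Gopa serves them on :6060; the sketch below hits a local test server instead so it is self-contained:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // side effect: registers /debug/pprof/ handlers on DefaultServeMux
)

// pprofStatus serves http.DefaultServeMux on a test server and checks
// that the pprof index page responds.
func pprofStatus() int {
	srv := httptest.NewServer(http.DefaultServeMux)
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/debug/pprof/")
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	return resp.StatusCode
}

func main() {
	fmt.Println(pprofStatus()) // 200
}
```

In a real deployment the equivalent is `go http.ListenAndServe("localhost:6060", nil)`, after which the `go tool pprof` commands above work against the live process.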
- integrate with kafka to make task controllable and recoverable
- parameters are configurable
- goroutines can be controlled now
- bloom-filter persistence
- build script works
- just up and run.