A collection of scraper pipelines built for different purposes
- Architecture idea
- Asynchronous tasks
  - Celery client : Celery client <---> Celery worker. Connects Flask to the Celery tasks and issues the commands for the tasks.
  - Celery worker : A process that runs tasks in the background. A task can be periodic (scheduled) or asynchronous (triggered by an API call).
  - Message broker : Celery client <--Message broker--> Celery worker. The Celery client communicates with the Celery worker through the message broker. Here I use Redis as the message broker.
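The client / broker / worker flow can be mimicked in a few lines of standard-library Python. This is only a toy, in-process analogue: a `queue.Queue` stands in for Redis and a thread stands in for the Celery worker. The names `broker`, `results`, `TASKS`, `send_task`, `worker`, and `run_demo` are hypothetical and are not part of Celery or of this project.

```python
import queue
import threading

# Stand-in for the message broker (Redis in this project): a thread-safe FIFO.
broker = queue.Queue()
# Stand-in for the result backend: task id -> return value.
results = {}

# Registry of task functions the worker knows how to run.
TASKS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}


def send_task(task_id, name, args):
    """Client side: publish a task message to the broker and return at once."""
    broker.put({"id": task_id, "name": name, "args": args})


def worker():
    """Worker side: consume messages until a None sentinel arrives."""
    while True:
        message = broker.get()
        if message is None:
            break
        func = TASKS[message["name"]]
        results[message["id"]] = func(*message["args"])


def run_demo():
    """Start one worker, submit two tasks, wait for them, return the results."""
    thread = threading.Thread(target=worker)
    thread.start()
    send_task("t1", "add", [1, 2])
    send_task("t2", "multiply", [3, 5])
    broker.put(None)  # tell the worker to stop after the queued tasks
    thread.join()
    return dict(results)


if __name__ == "__main__":
    print(run_demo())
```

The point of the sketch is the decoupling: the client only ever talks to the broker, so it never blocks on the task itself. In the real setup the worker runs in a separate process and the broker is Redis.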
Quick start via docker
# Run via docker
$ cd ~ && git clone
$ cd ~ && cd web_scraping && docker-compose -f docker-compose.yml up
- Visit the services via:
- flower UI : http://localhost:5555/
- Run "add" task : http://localhost:5001/add/1/2
- Run "web scrape" task : http://localhost:5001/scrap_task
- Run "indeed scrape" task : http://localhost:5001/indeed_scrap_task
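The task endpoints can also be called from Python instead of a browser. The sketch below assumes the docker-compose stack is up and the Flask API is reachable on port 5001 as listed; the helper names `add_url` and `fetch` are hypothetical, and nothing is assumed about the format of the response body.

```python
from urllib.request import urlopen

API_BASE = "http://localhost:5001"  # Flask API address from the list above


def add_url(a, b, base=API_BASE):
    """Build the URL that triggers the "add" task for two numbers."""
    return f"{base}/add/{a}/{b}"


def fetch(url, timeout=10):
    """Send a GET request and return the raw response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call fetch(...) once the services are running.
    print(add_url(1, 2))
```

With the services running, `fetch(add_url(1, 2))` triggers the "add" task and `fetch(API_BASE + "/scrap_task")` triggers the web scrape task.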
Quick start manually
# Run manually
# STEP 1) open one terminal and run celery server locally
$ cd ~ && cd web_scraping/celery_queue
# run task from API call
$ celery -A tasks worker --loglevel=info
# run cron (periodic) task
$ celery -A tasks beat
# STEP 2) Run the Redis server locally (in another terminal)
# make sure you have already installed Redis
$ redis-server
# STEP 3) Run flower (in another terminal)
$ cd ~ && cd web_scraping/celery_queue
$ celery flower -A tasks --address= --port=5555
# STEP 4) Add a sample task
# "add" task
$ curl -X POST -d '{"args":[1,2]}' http://localhost:5555/api/task/async-apply/tasks.add
# "multiply" task
$ curl -X POST -d '{"args":[3,5]}' http://localhost:5555/api/task/async-apply/tasks.multiply
# "scrape_task" task
$ curl -X POST http://localhost:5555/api/task/async-apply/tasks.scrape_task
# "scrape_task_api" task
$ curl -X POST -d '{"args":["mlflow","mlflow"]}' http://localhost:5555/api/task/async-apply/tasks.scrape_task_api
# "indeed_scrap_task" task
$ curl -X POST http://localhost:5555/api/task/async-apply/tasks.indeed_scrap_task
# "indeed_scrap_api_V1" task
$ curl -X POST -d '{"args":["New+York"]}' http://localhost:5555/api/task/async-apply/tasks.indeed_scrap_api_V1
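The same Flower calls can be issued from Python. This is a minimal sketch that mirrors the curl commands using only the standard library, assuming Flower is listening on localhost:5555 as in the steps above; `build_async_apply` and `send` are hypothetical helper names, and the shape of Flower's JSON reply is not assumed here.

```python
import json
from urllib.request import Request, urlopen

FLOWER_BASE = "http://localhost:5555"  # Flower address used in the curl calls


def build_async_apply(task_name, args=None, base=FLOWER_BASE):
    """Build (but do not send) a POST request for Flower's async-apply endpoint."""
    # Like curl without -d, no body is sent when the task takes no arguments.
    data = None
    if args is not None:
        data = json.dumps({"args": args}).encode("utf-8")
    return Request(
        url=f"{base}/api/task/async-apply/{task_name}",
        data=data,
        method="POST",
    )


def send(request, timeout=10):
    """Send the request to Flower and decode the JSON reply."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the request; call send(req) with Flower running.
    req = build_async_apply("tasks.add", [1, 2])
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to inspect the payload before any task is queued. For example, `send(build_async_apply("tasks.indeed_scrap_api_V1", ["New+York"]))` corresponds to the last curl command.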
├── Dockerfile
├── api. : Celery api (broker, job accepter(flask))
│ ├── Dockerfile : Dockerfile build celery api
│ ├── : Flask server accept job request(api)
│ ├── requirements.txt
│ └── : Celery broker, celery backend(redis)
├── celery-queue : Run main web scrapping jobs (via celery)
│ ├── Dockerfile : Dockerfile build celery-queue
│ ├── IndeedScrapper : Scrapper scrape
│ ├── requirements.txt
│ └── : Celery run scrapping tasks
├── docker-compose.yml : docker-compose build whole system : api, celery-queue, redis, and flower(celery job monitor)
├── legacy_project
├── logs : Save running logs
├── output : Save scraped data
├── requirements.txt
└── : Script auto push output to github via Travis
# Run Unit test # 1
$ pytest -v tests/
# ================================== test session starts ==================================
# platform darwin -- Python 3.6.4, pytest-5.0.1, py-1.5.2, pluggy-0.12.0 -- /Users/jerryliu/anaconda3/envs/yen_dev/bin/python
# cachedir: .pytest_cache
# rootdir: /Users/jerryliu/web_scraping
# plugins: cov-2.7.1, celery-4.3.0
# collected 10 items
# tests/ PASSED [ 10%]
# tests/ PASSED [ 20%]
# tests/ PASSED [ 30%]
# tests/ PASSED [ 40%]
# tests/ PASSED [ 50%]
# tests/ PASSED [ 60%]
# tests/ PASSED [ 70%]
# tests/ PASSED [ 80%]
# tests/ PASSED [ 90%]
# tests/ PASSED [100%]
# Run Unit test # 2
$ python tests/ -v
# test_addition (__main__.TestAddTask) ... ok
# test_task_state (__main__.TestAddTask) ... ok
# test_multiplication (__main__.TestMultiplyTask) ... ok
# test_task_state (__main__.TestMultiplyTask) ... ok
# ----------------------------------------------------------------------
# Ran 4 tests in 0.131s
# OK
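For orientation, a test module producing output shaped like the second run might look like the sketch below. It is an assumption, not the project's actual test file: `add` and `multiply` are plain stand-in functions rather than the real Celery tasks, `run_suite` is a hypothetical helper, and the `test_task_state` cases seen in the output are left out because they need a Celery app.

```python
import unittest


def add(x, y):
    """Stand-in for the project's "add" task (hypothetical plain function)."""
    return x + y


def multiply(x, y):
    """Stand-in for the project's "multiply" task (hypothetical plain function)."""
    return x * y


class TestAddTask(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(add(1, 2), 3)


class TestMultiplyTask(unittest.TestCase):
    def test_multiplication(self):
        self.assertEqual(multiply(3, 5), 15)


def run_suite():
    """Load both test cases and run them, returning the unittest result."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite([
        loader.loadTestsFromTestCase(TestAddTask),
        loader.loadTestsFromTestCase(TestMultiplyTask),
    ])
    return unittest.TextTestRunner(verbosity=2).run(suite)


if __name__ == "__main__":
    run_suite()
```

Testing the task logic as plain functions keeps these cases fast and independent of Redis; checks on task state need a running or eagerly configured Celery app and belong in separate tests.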
- Celery : Python task management tool that runs tasks in parallel or in a single thread (Celery broker/worker)
- Redis : key-value DB that stores task data
- Flower : UI for monitoring Celery tasks
- Flask : lightweight Python web framework, used here as the project's backend server
- Docker : builds the app environment
### Project level
0. Deploy to Heroku and expose the scraper as an API service
1. Dockerize the project
2. Run the scraping (cron/parallel) jobs via Celery
4. Add tests (unit/integration tests)
5. Design a DB model that saves scraped data systematically
### Programming level
1. Add utility scripts that can get the XPath of all objects in an HTML page
2. Workflow that automates the whole process
3. Job management
- Multiprocessing
- Asynchronous
- Queue
4. Scraping tutorial
5. Scrapy, PhantomJS
### Others
1. Web scraping 101 tutorial
Scraping via Celery
Travis push to GitHub
Indeed scraping
Distributed scraping
Unit test Celery