Crawler for www.lagou.com
Indexes the downloaded pages periodically.
The website only provides up to 30 pages of results (15 items per page) at a time, so the script needs to be executed periodically. The script itself runs fast (under 1 second, excluding URL requests). This project was built to deepen my understanding of job-market trends, and to prepare for the next crawler project.
Time consumption in different modes (seconds per 100 requests)
- Coroutine (asynchronous I/O): 0.8s
- Multi-thread: 3.0s
- Synchronous: 22.0s
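The gap between the three modes can be reproduced with a stdlib-only sketch that simulates each request as a short I/O wait (time.sleep / asyncio.sleep stand in for the real HTTP calls here; the numbers above came from actual requests):

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.01   # simulated network latency per request
N = 100        # requests per benchmark, matching the table above

def fetch_sync():
    time.sleep(DELAY)  # stands in for a blocking HTTP request

async def fetch_async():
    await asyncio.sleep(DELAY)  # stands in for an aiohttp request

def bench_sync():
    start = time.perf_counter()
    for _ in range(N):
        fetch_sync()
    return time.perf_counter() - start

def bench_threads():
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=20) as pool:
        list(pool.map(lambda _: fetch_sync(), range(N)))
    return time.perf_counter() - start

def bench_coroutines():
    async def main():
        # all N simulated requests run concurrently on one thread
        await asyncio.gather(*(fetch_async() for _ in range(N)))
    start = time.perf_counter()
    asyncio.run(main())
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"sync:       {bench_sync():.2f}s")
    print(f"threads:    {bench_threads():.2f}s")
    print(f"coroutines: {bench_coroutines():.2f}s")
```

The synchronous version pays the latency N times in a row, the thread pool pays it in batches, and the coroutine version overlaps almost all of it, which is the same shape as the measured 22.0s / 3.0s / 0.8s split.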
- Uses Python's asyncio and aiohttp libraries to implement the coroutine mode
- Python's multi-threading mode performs reasonably well, but cannot fully utilize a multi-core processor when deployed to the server
- Uses a Python decorator to log runtime performance
- Low system overhead
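The project's own decorator is not shown in this README; a minimal sketch of a performance-logging decorator (the name log_runtime is my own) could look like:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def log_runtime(func):
    """Log how long the wrapped function takes each time it runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        logging.info("%s finished in %.3fs", func.__name__, elapsed)
        return result
    return wrapper

@log_runtime
def parse_page(html):
    # placeholder for the real page-parsing work
    return len(html)
```

functools.wraps keeps the wrapped function's name and docstring intact, so the log lines and any stack traces still refer to the real function.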
- Run init.py to initialize the project
- Set up crontab
- Run crawlerXXX.py
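As an illustration, a crontab entry that runs the crawler every 2 hours might look like the following (the project path and interval are placeholders; crawlerXXX.py stands for whichever crawler script you use):

```shell
# m h dom mon dow  command
0 */2 * * * cd /path/to/project && python3 crawlerXXX.py >> crawler.log 2>&1
```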
- Python 3
- aiohttp (pip install aiohttp)
- SQLite
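The README does not show how crawled items reach SQLite; a minimal sketch using Python's stdlib sqlite3 module (the table and column names are assumptions, not the project's actual schema) might be:

```python
import sqlite3

def save_jobs(db_path, jobs):
    """Insert crawled job postings; duplicates (same position id) are ignored.

    jobs is an iterable of (position_id, title, company, salary) tuples.
    Returns the number of rows actually inserted.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS jobs (
                   position_id INTEGER PRIMARY KEY,
                   title       TEXT,
                   company     TEXT,
                   salary      TEXT,
                   crawled_at  TEXT DEFAULT CURRENT_TIMESTAMP
               )"""
        )
        conn.executemany(
            "INSERT OR IGNORE INTO jobs (position_id, title, company, salary) "
            "VALUES (?, ?, ?, ?)",
            jobs,
        )
        conn.commit()
        return conn.total_changes
    finally:
        conn.close()
```

INSERT OR IGNORE makes repeated runs of the crawler idempotent: re-crawling the same 30 result pages only adds postings not seen before.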