Crawler for www.lagou.com
Indexes the downloaded pages periodically.
The website only provides up to 30 pages of results (15 items per page) at a time, so the script needs to be executed periodically. The script itself runs fast (under 1 second, excluding URL requests). This project was built to deepen my understanding of job-market trends, and to prepare for the next crawler project.
Time consumption in different modes (seconds per 100 requests)
- Coroutine (asynchronous I/O): 0.8s
- Multi-thread: 3.0s
- Synchronous: 22.0s
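The gap between the three modes can be reproduced with a stdlib-only sketch that simulates each request as a short I/O wait (time.sleep / asyncio.sleep stand in for the real HTTP calls here; the numbers above came from actual requests):

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.01   # simulated network latency per request
N = 100        # requests per benchmark, matching the table above

def fetch_sync():
    time.sleep(DELAY)  # stands in for a blocking HTTP request

async def fetch_async():
    await asyncio.sleep(DELAY)  # stands in for an aiohttp request

def bench_sync():
    start = time.perf_counter()
    for _ in range(N):
        fetch_sync()
    return time.perf_counter() - start

def bench_threads():
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=20) as pool:
        list(pool.map(lambda _: fetch_sync(), range(N)))
    return time.perf_counter() - start

def bench_coroutines():
    async def main():
        # all N simulated requests run concurrently on one thread
        await asyncio.gather(*(fetch_async() for _ in range(N)))
    start = time.perf_counter()
    asyncio.run(main())
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"sync:       {bench_sync():.2f}s")
    print(f"threads:    {bench_threads():.2f}s")
    print(f"coroutines: {bench_coroutines():.2f}s")
```

The synchronous version pays the latency N times in a row, the thread pool pays it in batches, and the coroutine version overlaps almost all of it, which is the same shape as the measured 22.0s / 3.0s / 0.8s split.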
- Uses Python's asyncio and aiohttp libraries to implement the coroutine mode
- Python's multi-threading mode performs reasonably well, but cannot fully utilize a multi-core processor when deployed to the server
- Uses a Python decorator to log runtime performance
- Low system overhead
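The project's own decorator is not shown in this README; a minimal sketch of a performance-logging decorator (the name log_runtime is my own) could look like:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)

def log_runtime(func):
    """Log how long the wrapped function takes each time it runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        logging.info("%s finished in %.3fs", func.__name__, elapsed)
        return result
    return wrapper

@log_runtime
def parse_page(html):
    # placeholder for the real page-parsing work
    return len(html)
```

functools.wraps keeps the wrapped function's name and docstring intact, so the log lines and any stack traces still refer to the real function.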
- Run init.py to initialize the project
- Set up crontab
- Run crawlerXXX.py
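As an illustration, a crontab entry that runs the crawler every 2 hours might look like the following (the project path and interval are placeholders; crawlerXXX.py stands for whichever crawler script you use):

```shell
# m h dom mon dow  command
0 */2 * * * cd /path/to/project && python3 crawlerXXX.py >> crawler.log 2>&1
```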
- Python 3
- aiohttp (pip install aiohttp)
- SQLite
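The README does not show how crawled items reach SQLite; a minimal sketch using Python's stdlib sqlite3 module (the table and column names are assumptions, not the project's actual schema) might be:

```python
import sqlite3

def save_jobs(db_path, jobs):
    """Insert crawled job postings; duplicates (same position id) are ignored.

    jobs is an iterable of (position_id, title, company, salary) tuples.
    Returns the number of rows actually inserted.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS jobs (
                   position_id INTEGER PRIMARY KEY,
                   title       TEXT,
                   company     TEXT,
                   salary      TEXT,
                   crawled_at  TEXT DEFAULT CURRENT_TIMESTAMP
               )"""
        )
        conn.executemany(
            "INSERT OR IGNORE INTO jobs (position_id, title, company, salary) "
            "VALUES (?, ?, ?, ?)",
            jobs,
        )
        conn.commit()
        return conn.total_changes
    finally:
        conn.close()
```

INSERT OR IGNORE makes repeated runs of the crawler idempotent: re-crawling the same 30 result pages only adds postings not seen before.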