A set of scripts for crawling newspaper websites. The available scripts are listed below.
Site | URL | script |
---|---|---|
Nakkheeran | http://nakkheeran.in/ | tamil/crawler-nakkheeran.py |
Dailythanthi | http://dailythanthi.com/ | tamil/crawler-dailythanthi.py |
Tamil The Hindu | http://tamil.thehindu.com/ | tamil/crawler-tamil-hindu.py |
Puthiyathalaimurai | http://puthiyathalaimurai.com/ | tamil/crawler-puthiyathalaimurai.py |
Dinamani | http://dinamani.com/ | tamil/crawler-dinamani.py |
Site | URL | script |
---|---|---|
Manorama | http://www.manoramaonline.com/ | malayalam/crawler-manorama.py |
Scripts for more news websites are welcome. Save the scraped text in UTF-8 encoding. To contribute, refer to the newspapers list file and pick a site to scrape.
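As a minimal sketch of the UTF-8 requirement above, a new crawler could write its output through a small helper like the one below. The name `save_text` is hypothetical and is not taken from the existing scripts.

```python
from pathlib import Path


def save_text(path, text):
    """Write scraped text to `path` as UTF-8, creating parent directories.

    Hypothetical helper for illustration; not part of the existing crawlers.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Pass the encoding explicitly: the platform default is not always UTF-8.
    path.write_text(text, encoding="utf-8")
    return path
```

Passing `encoding="utf-8"` explicitly keeps Tamil and Malayalam text intact on systems whose default encoding is something else.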
- [ ] Extract common code into a decorator
pip3 install -r requirements.txt
crawler-viduthalai4.py, under tamil, uses the latest MultiThreadedCrawler2.
<newspaper_name>
title.list --> acts as an index for the other directories.
articles
-- 2018
---- Dec
---- May
-- 2017
---- Jun
---- Aug
-- 2016
---- Oct
---- Jan
abstracts
-- 2018
---- Dec
---- May
-- 2017
---- Jun
---- Aug
-- 2016
---- Oct
---- Jan
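The layout above could be produced by a helper along these lines. This is a sketch under stated assumptions: the function name `store_article`, the per-article file name, and the tab-separated line format of title.list are all hypothetical, since the layout only shows the directory names.

```python
from datetime import date
from pathlib import Path

# Fixed English abbreviations, so directory names do not depend on the locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def store_article(root, newspaper, published, filename, title, article, abstract):
    """Save one article and its abstract under <newspaper>/<kind>/<year>/<Mon>/.

    Hypothetical helper; the file name and index format are assumptions.
    """
    base = Path(root) / newspaper
    subdir = Path(str(published.year)) / MONTHS[published.month - 1]
    for kind, text in (("articles", article), ("abstracts", abstract)):
        target = base / kind / subdir / filename
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
    # Assumed index format: "<relative path><TAB><title>", one line per article.
    with open(base / "title.list", "a", encoding="utf-8") as index:
        index.write(f"{(subdir / filename).as_posix()}\t{title}\n")
    return base / "articles" / subdir / filename
```

For example, `store_article("data", "dinamani", date(2018, 12, 3), "0001.txt", title, body, summary)` would write `data/dinamani/articles/2018/Dec/0001.txt` and `data/dinamani/abstracts/2018/Dec/0001.txt`, then append one line to `data/dinamani/title.list`.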