Newspaper Crawler Scripts

A set of scripts for crawling newspaper websites. The available scripts are listed below.

Available scripts

Tamil

Site	URL	Script
Nakkheeran	http://nakkheeran.in/	tamil/crawler-nakkheeran.py
Dailythanthi	http://dailythanthi.com/	tamil/crawler-dailythanthi.py
Tamil The Hindu	http://tamil.thehindu.com/	tamil/crawler-tamil-hindu.py
Puthiyathalaimurai	http://puthiyathalaimurai.com/	tamil/crawler-puthiyathalaimurai.py
Dinamani	http://dinamani.com/	tamil/crawler-dinamani.py

Malayalam

Site	URL	Script
Manorama	http://www.manoramaonline.com/	malayalam/crawler-manorama.py

Contribute

Scripts for more news websites are welcome. Save the scraped text in UTF-8 encoding. To choose a site, refer to the newspapers list file and pick one to scrape.
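As a minimal sketch of the UTF-8 requirement, a new script could write its output as shown here. The helper name `save_text` and its arguments are hypothetical and are not taken from the existing crawlers.

```python
from pathlib import Path


def save_text(path, text):
    """Write scraped text to `path` as UTF-8 (hypothetical helper)."""
    target = Path(path)
    # Create the parent directories if they do not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Pass the encoding explicitly so the result does not depend on
    # the platform's default encoding.
    target.write_text(text, encoding="utf-8")
    return target
```

Passing `encoding="utf-8"` explicitly matters mostly on systems whose default encoding is not UTF-8, where Tamil or Malayalam text could otherwise fail to write.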

Todo

[ ] Extract common code into a decorator

Setup

pip3 install -r requirements.txt

Latest Script

tamil/crawler-viduthalai4.py uses MultiThreadedCrawler2, the latest version of the crawler.

Directory structure

<newspaper_name>
  title.list --> acts as an index for the other directories.
  articles
  -- 2018
  ---- Dec
  ---- May
  -- 2017
  ---- Jun
  ---- Aug
  -- 2016
  ---- Oct
  ---- Jan
  abstracts
  -- 2018
  ---- Dec
  ---- May
  -- 2017
  ---- Jun
  ---- Aug
  -- 2016
  ---- Oct
  ---- Jan
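The layout above can be sketched as a path-building helper. This is an illustration only: the function name `output_dir`, the `root` argument, and the assumption that month folders always use three-letter English abbreviations are not confirmed by the existing scripts.

```python
from datetime import date
from pathlib import Path

# Three-letter month names, matching folders such as "Dec" and "May" above.
# A fixed list avoids locale-dependent output from strftime("%b").
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def output_dir(root, newspaper, published, kind="articles"):
    """Return <root>/<newspaper>/<kind>/<year>/<month> (hypothetical helper)."""
    if kind not in ("articles", "abstracts"):
        raise ValueError("kind must be 'articles' or 'abstracts'")
    month = MONTHS[published.month - 1]
    return Path(root) / newspaper / kind / str(published.year) / month
```

For example, `output_dir("data", "dinamani", date(2018, 12, 5))` points at the `articles/2018/Dec` folder under `data/dinamani`, and passing `kind="abstracts"` gives the matching abstracts folder.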
