This repository has been archived by the owner on May 11, 2023. It is now read-only.
A news site crawler for extracting article information

digitalstain/news-crawler

news-crawler

A news site crawler for extracting article information

This is mostly an attempt to get the hang of Ruby and to see how it fits into the data journalism approach I am trying to take.

The script parses the RSS feeds of 3 major newsrooms in Greece, fetches the articles, and extracts information such as the title, the publication date and, most importantly, the provided source. The result is stored in a sqlite3 database with a very simple schema; see the source for details.
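As a rough illustration of the first step only (reading items out of an RSS feed), here is a minimal sketch using Ruby's `rss` library. It is not the code in crawler.rb: the sample feed, the example.com URLs and the `parse_feed_items` helper are all made up for this example, and fetching the article pages, extracting the provided source and writing to sqlite3 are left out.

```ruby
require 'rss'

# Hypothetical sample feed, standing in for a real newsroom feed.
SAMPLE_FEED = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Example Newsroom</title>
      <link>https://example.com/</link>
      <description>Sample feed for illustration</description>
      <item>
        <title>First article</title>
        <link>https://example.com/articles/1</link>
        <pubDate>Mon, 01 May 2023 10:00:00 +0000</pubDate>
      </item>
      <item>
        <title>Second article</title>
        <link>https://example.com/articles/2</link>
        <pubDate>Tue, 02 May 2023 12:30:00 +0000</pubDate>
      </item>
    </channel>
  </rss>
XML

# Hypothetical helper: turn RSS XML into an array of hashes, one per item.
def parse_feed_items(xml)
  # The second argument disables strict validation, so slightly
  # malformed real-world feeds are less likely to raise.
  feed = RSS::Parser.parse(xml, false)
  feed.items.map do |item|
    { title: item.title, link: item.link, published_at: item.pubDate }
  end
end

# Print one line per article found in the sample feed.
parse_feed_items(SAMPLE_FEED).each do |article|
  puts "#{article[:published_at]} #{article[:title]} (#{article[:link]})"
end
```

In a real run the XML would come from the feed URL instead of a constant, and each link would then be fetched to extract the article details.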

To run this, you need a Ruby runtime environment and bundler installed. After cloning, execute

bundle install

followed by

ruby crawler.rb

This fetches the latest news from the 3 sites and stores it in a database in the same directory; a file named news.db is created to hold it.

That's about it for now. If there is interest in this work, I'll share more information.
