Crawls, on a daily basis, news articles that are indexed by the GDelt project (http://gdeltproject.org)

afel-project/gdelt_crawler

GDelt News Crawler

The program runs daily and crawls the news published the day before. The articles are taken from the events database that the GDelt project publishes every day (http://gdeltproject.org).

Operations

There are four main steps:

  • Download of the daily event file from http://data.gdeltproject.org/events/index.html
  • Crawling of all indexed HTML news articles
  • Boilerpipe execution on the extracted HTML documents
  • Indexing of the cleaned news articles (stripped of HTML tags) into Solr. Each document has the fields: {"url", "date", "title", "content"}.
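
As a rough illustration of the first step, the sketch below builds the download URL for the previous day's event file. It assumes the GDELT 1.0 daily naming scheme (YYYYMMDD.export.CSV.zip under the events index listed above). The class and method names are hypothetical and are not taken from this repository.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of gdelt_crawler: builds the URL of a daily event file.
class DailyExportUrl {
    // Assumption: daily files live directly under the events index.
    static final String BASE = "http://data.gdeltproject.org/events/";

    // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, e.g. 20160501.
    static final DateTimeFormatter FMT = DateTimeFormatter.BASIC_ISO_DATE;

    // URL of the event file for a given day (assumed naming scheme).
    static String exportUrlFor(LocalDate day) {
        return BASE + day.format(FMT) + ".export.CSV.zip";
    }

    // The crawler processes the news published one day before the run date.
    static String urlForPreviousDay(LocalDate today) {
        return exportUrlFor(today.minusDays(1));
    }

    public static void main(String[] args) {
        System.out.println(urlForPreviousDay(LocalDate.now()));
    }
}
```

The sketch only computes the URL. Downloading, unzipping, and reading the article URLs from the event records are left out.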

Running the crawler

Running the GDelt crawler requires only a few parameters, listed below.

  • -output_dir: Specifies the base directory where all the extracted content is stored.
  • -threads: Specifies the number of threads to run the program with. The threads are used mainly to crawl the news articles in parallel. The right value depends on your infrastructure; a reasonable value is below 100.
  • -filter: Specifies the path to a newline-delimited (```\n```) file listing the URL suffixes of documents that should be skipped during crawling (e.g. PDF, MOV, MP4).
  • -min_year: The minimum year from which to start crawling the GDelt data (the minimum possible year is 2013).
  • -min_month: The minimum month from which to start crawling the GDelt data.
  • -min_day: The minimum day from which to start crawling the GDelt data.
  • -server: The Solr server URL where you want to index the crawled news articles.
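
To make the -filter parameter more concrete, here is a minimal sketch of suffix-based filtering. How the crawler actually applies the file is not documented here, so the matching rules below (case-insensitive, query string and fragment ignored) are assumptions, and the class name SuffixFilter is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper, not part of gdelt_crawler: decides whether a URL should be skipped.
class SuffixFilter {
    private final List<String> suffixes;

    // Each line of the filter file is one suffix; blank lines are dropped.
    SuffixFilter(List<String> lines) {
        this.suffixes = lines.stream()
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(s -> s.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    // Assumption: match on the URL without its query string or fragment, ignoring case.
    boolean shouldIgnore(String url) {
        String path = url.toLowerCase(Locale.ROOT);
        int cut = path.indexOf('?');
        if (cut >= 0) path = path.substring(0, cut);
        cut = path.indexOf('#');
        if (cut >= 0) path = path.substring(0, cut);
        for (String suffix : suffixes) {
            if (path.endsWith(suffix)) return true;
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // With an argument, read the suffixes from that file; otherwise use a small sample.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList("pdf", "mov", "mp4");
        SuffixFilter filter = new SuffixFilter(lines);
        System.out.println(filter.shouldIgnore("http://example.com/report.PDF"));
        System.out.println(filter.shouldIgnore("http://example.com/story.html"));
    }
}
```

Under these assumptions, a filter file would contain one suffix per line, for example pdf, mov, and mp4 on three separate lines.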

An example run of the program: ```java -cp gdelt_crawler.jar:lib/* GdeltCrawler -output_dir . -threads 10 -filter ignore_suffixes.txt -min_year 2016 -min_month 05 -min_day 01 -server http://YOUR_SOLR_SERVER```
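
The -min_year, -min_month and -min_day values in the example above define a start date. The sketch below shows one plausible reading of them: every day from the start date up to and including yesterday is a day to crawl. Whether the crawler catches up in exactly this way is an assumption, and the names CrawlDates and daysToCrawl are hypothetical.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of gdelt_crawler: lists the days between the start date and yesterday.
class CrawlDates {
    // Returns every day from the minimum date up to the day before `today`, inclusive.
    static List<LocalDate> daysToCrawl(int minYear, int minMonth, int minDay, LocalDate today) {
        LocalDate start = LocalDate.of(minYear, minMonth, minDay);
        LocalDate end = today.minusDays(1); // news published one day before the run
        List<LocalDate> days = new ArrayList<>();
        for (LocalDate d = start; !d.isAfter(end); d = d.plusDays(1)) {
            days.add(d);
        }
        return days;
    }

    public static void main(String[] args) {
        // Same start date as the example invocation: 2016-05-01.
        List<LocalDate> days = daysToCrawl(2016, 5, 1, LocalDate.now());
        System.out.println(days.size() + " day(s) to crawl");
    }
}
```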
