Crawls, on a daily basis, news articles indexed by the GDelt project (http://gdeltproject.org)
The program runs daily and crawls the news articles published the day before, as listed in the events database provided by the GDelt project (http://gdeltproject.org).
There are four main steps:
- Daily event download from http://data.gdeltproject.org/events/index.html
- Crawling of all indexed HTML news articles
- Boilerpipe execution on the crawled HTML documents
- Indexing of the cleaned news articles (stripped of HTML tags) into Solr. Each document has the fields {"url", "date", "title", "content"} (the whole pipeline is sketched below).
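A minimal sketch of these four steps, assuming the Boilerpipe and SolrJ libraries are on the classpath. The daily event files are named by date (e.g. `20160501.export.CSV.zip`); everything else here (the class name, the example article URL, the title placeholder) is illustrative rather than the crawler's actual code:

```java
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Scanner;

// Illustrative sketch of the pipeline, not the crawler's actual internals.
public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: the daily event file is named after the previous day.
        String day = LocalDate.now().minusDays(1)
                .format(DateTimeFormatter.ofPattern("yyyyMMdd"));
        String eventsUrl = "http://data.gdeltproject.org/events/" + day + ".export.CSV.zip";
        System.out.println("Daily event file: " + eventsUrl);

        // Step 2: fetch one article URL taken from the events file.
        String articleUrl = "http://example.com/some-news-article"; // hypothetical
        String html;
        try (InputStream in = new URL(articleUrl).openStream();
             Scanner s = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
            html = s.hasNext() ? s.next() : "";
        }

        // Step 3: strip boilerplate and HTML tags with Boilerpipe.
        String content = ArticleExtractor.INSTANCE.getText(html);

        // Step 4: index the cleaned article with the four fields listed above.
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://YOUR_SOLR_SERVER").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("url", articleUrl);
            doc.addField("date", day);
            doc.addField("title", "...");      // e.g. taken from the page's <title> tag
            doc.addField("content", content);
            solr.add(doc);
            solr.commit();
        }
    }
}
```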
Only a few parameters need to be specified in order to run the GDelt crawler. We list them below.
- -output_dir: Specifies the base directory where all the extracted content is stored.
- -threads: Specifies the number of threads with which to run the program; this is mainly used when crawling the news articles in parallel (see the sketch after this list). The value depends on the infrastructure used; a reasonable value is below 100.
- -filter: Specifies the path to a ```\n```-delimited file listing the document suffixes you want to exclude from the crawl (e.g. PDF, MOV, MP4, etc.).
- -min_year: The minimum year from which to start crawling the GDelt data (the minimum possible year is 2013).
- -min_month: The minimum month from which to start crawling the GDelt data.
- -min_day: The minimum day from which to start crawling the GDelt data.
- -server: The Solr server URL where you want to index the crawled news articles.
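As a rough illustration of how -threads and -filter might combine, here is a sketch using a fixed-size thread pool; `fetchArticle` and `urls.txt` are hypothetical stand-ins, not part of the crawler:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative only: fetchArticle() stands in for the crawler's download logic.
public class ParallelCrawlSketch {
    public static void main(String[] args) throws Exception {
        int threads = 10;                                                  // value of -threads
        Set<String> ignored = new HashSet<>(
                Files.readAllLines(Paths.get("ignore_suffixes.txt")));     // the -filter file
        List<String> urls = Files.readAllLines(Paths.get("urls.txt"));     // from the events file

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            String suffix = url.substring(url.lastIndexOf('.') + 1).toLowerCase();
            if (ignored.contains(suffix)) continue;                        // skip e.g. pdf, mov, mp4
            pool.submit(() -> fetchArticle(url));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void fetchArticle(String url) { /* download and store the HTML */ }
}
```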
An example run of the program:

```
java -cp gdelt_crawler.jar:lib/* GdeltCrawler -output_dir . -threads 10 -filter ignore_suffixes.txt -min_year 2016 -min_month 05 -min_day 01 -server http://YOUR_SOLR_SERVER
```
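For reference, a hypothetical `ignore_suffixes.txt` might look like the following, one suffix per line (whether each entry needs a leading dot depends on how the crawler matches URLs):

```
pdf
mov
mp4
jpg
```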