
How to crawl articles from CC-NEWS

This tutorial explains how to crawl articles from the CC-NEWS dataset using Fundus.

The crawler

To crawl articles from CC-NEWS, simply import the CCNewsCrawler and use it in the same way as the main Fundus crawler. Let's crawl a bunch of news articles from CC-NEWS using all publishers available in the Fundus PublisherCollection.

from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection)
for article in crawler.crawl(max_articles=100):
    print(article)

OS start method

Depending on the process start method used by your OS, you may have to wrap this call in an if __name__ == "__main__" block.

from fundus import CCNewsCrawler, PublisherCollection

if __name__ == "__main__":
    crawler = CCNewsCrawler(*PublisherCollection)
    for article in crawler.crawl(max_articles=100):
        print(article)

This code will crawl 100 random articles from the entire date range of the CC-NEWS dataset.

Date range

Date range, you may ask? Yes, you can specify a date range corresponding to the date an article was added to CC-NEWS. Let's crawl some articles that were crawled between 2020/01/01 and 2020/03/01.

from datetime import datetime

from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection, start=datetime(2020, 1, 1), end=datetime(2020, 3, 1))
for article in crawler.crawl(max_articles=100):
    print(article)
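
If you don't want to crawl the entire collection, you can also pass individual publishers instead of the full PublisherCollection. The following sketch assumes the collection exposes country groups such as PublisherCollection.us, as with the main Fundus crawler.

from datetime import datetime

from fundus import CCNewsCrawler, PublisherCollection

# restrict the crawl to US publishers within the same date range
crawler = CCNewsCrawler(*PublisherCollection.us, start=datetime(2020, 1, 1), end=datetime(2020, 3, 1))
for article in crawler.crawl(max_articles=100):
    print(article)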

Multiprocessing

The CC-NEWS dataset consists of multiple terabytes of articles. Due to this sheer amount of data, the crawler utilizes multiple processes. By default, it uses all CPUs available on your system. You can alter the number of additional processes used for crawling with the processes parameter of CCNewsCrawler. For optimal performance, we recommend setting the number of processes manually. A good rule of thumb is to allocate one process per 200 Mbps of bandwidth, though this can vary depending on the actual speed of your CPU cores.

from fundus import CCNewsCrawler, PublisherCollection

# with a bandwidth of 950 Mbps, you should set processes to 5
crawler = CCNewsCrawler(*PublisherCollection, processes=5)

To disable multiprocessing, pass -1 to the processes parameter.
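
For example, here is a minimal sketch of a crawl that runs without additional processes, based on the processes parameter described above.

from fundus import CCNewsCrawler, PublisherCollection

# processes=-1 disables multiprocessing, so the crawl runs in the current process
crawler = CCNewsCrawler(*PublisherCollection, processes=-1)
for article in crawler.crawl(max_articles=10):
    print(article)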

In the next section we will introduce you to the Article class.