Top News!

Two types of scraping and parsing jobs:

  1. Current (continuous, till 2016 election day)
  2. Past (Internet Archive)

The final data is posted here

Back

Current (2016)

Organization

  • Homepage

    • current-homepage-html.tar.gz
    • current-output-homepage.csv
  • Politics Homepage

    • current-politics-homepage-html.tar.gz
    • current-output-politics-homepage.csv
  • Top 10

    • current-top10-html.tar.gz (split by news_org/harvested pages from links to top10)
    • current-output-top10.csv
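Because the Top 10 HTML archive is described as split by news organization, a quick inventory of an archive can confirm what it contains before parsing. The following is a minimal sketch using only the Python standard library. It assumes that each news organization is a top-level directory inside the archive, which is an inference from the description above and not something the file listing confirms. The function name `count_files_by_top_dir` is hypothetical.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_files_by_top_dir(archive_path):
    """Count regular files in a .tar.gz archive, grouped by top-level directory.

    Assumption: the archive stores one top-level directory per news
    organization (e.g. "nyt/..."). Files stored at the archive root are
    grouped under "(top level)".
    """
    counts = Counter()
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            # PurePosixPath drops a leading "./" so both name styles match.
            parts = PurePosixPath(member.name).parts
            key = parts[0] if len(parts) > 1 else "(top level)"
            counts[key] += 1
    return counts
```

For example, `count_files_by_top_dir("current-top10-html.tar.gz")` would return a per-directory file count, streaming through the archive without extracting it to disk.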

What is being collected?

Three kinds of pages are being scraped:

  • Homepages
  • Politics Homepages
  • Top 10

Internet Archive (till 2016)

Organization

Homepage

  • HTML files:

    • 2012: ia-homepage-html-2012.tar.gz
  • CSV files (compressed in gzip)

    • 2012: ia-output-homepage-2012-text.csv.gz
    • 2016: ia-output-homepage-2016-text.csv.gz

Politics Homepage

  • HTML file: ia-politics-html.tar.gz
  • CSV output: ia-output-politics-homepage-2012-2016-notext.csv.gz (without text data)

Top 10

  • HTML files:
    • ia-news-top10-html.tar.gz (split by news_org/harvested pages from links to top10)
    • ia-politics-top10-html.tar.gz
    • ia-top10-html.tar.gz
  • CSV files
    • ia-output-politics-top10-text-all.csv
      • The Top 10 politics news stories, scraped and parsed from the Internet Archive for July 1 to November 30 of 2012 and 2016.
    • ia-output-top10-text-all.csv
      • The Top 10 stories across all news for July 1 to November 30 of 2012 and 2016. The linked articles have not yet been scraped and parsed; about 56k links remain.
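The CSV outputs listed above are a mix of plain `.csv` and gzip-compressed `.csv.gz` files. The sketch below shows one way to inspect either kind with the Python standard library, returning the column names and row count without loading the whole file into memory. The UTF-8 encoding and the raised field size limit (the text columns may hold whole articles) are assumptions, not documented properties of these files. The helper names `open_csv` and `summarize_csv` are hypothetical.

```python
import csv
import gzip
from pathlib import Path

# Assumption: article text columns can exceed the csv module's default
# per-field limit, so raise it to a large fixed value.
csv.field_size_limit(10**9)


def open_csv(path):
    """Open a plain or gzip-compressed CSV file in text mode.

    Assumption: the files are UTF-8 encoded.
    """
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "r", encoding="utf-8", newline="")


def summarize_csv(path):
    """Return (column_names, row_count) by streaming through the file."""
    with open_csv(path) as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty file -> empty header
        row_count = sum(1 for _ in reader)
    return header, row_count
```

For example, `summarize_csv("ia-output-homepage-2012-text.csv.gz")` would report the columns and number of rows in the 2012 homepage output.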

Scraped Homepage Data Summary

The frequency with which the Internet Archive takes snapshots of different websites varies for unknown reasons. Here is the total number of snapshots for each source:

  • nyt: 31129 snapshots
  • wsj: 15573 snapshots
  • fox: 16838 snapshots
  • hpmg: 26667 snapshots
  • usat: 26545 snapshots
  • google: 0 snapshots (page cannot be displayed due to robots.txt)
  • yahoo: 13991 snapshots
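To make the uneven coverage easier to compare, the snapshot counts above can be converted into each source's share of the total. This is a small illustrative sketch; the counts are copied from the summary, and the function name `snapshot_shares` is hypothetical.

```python
# Snapshot counts per source, copied from the summary above.
snapshot_counts = {
    "nyt": 31129,
    "wsj": 15573,
    "fox": 16838,
    "hpmg": 26667,
    "usat": 26545,
    "google": 0,  # blocked by robots.txt
    "yahoo": 13991,
}


def snapshot_shares(counts):
    """Return each source's fraction of all snapshots."""
    total = sum(counts.values())
    if total == 0:
        return {source: 0.0 for source in counts}
    return {source: n / total for source, n in counts.items()}


total_snapshots = sum(snapshot_counts.values())
shares = snapshot_shares(snapshot_counts)

# Print sources from most to least covered.
for source, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{source:<7}{snapshot_counts[source]:>7}{share:>8.1%}")
```

The counts sum to 130743 snapshots, so any analysis that pools sources is weighted toward the more frequently captured ones unless it normalizes per source.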


What was collected?

  1. Yahoo:
  2. Google:
  3. USA Today:
  4. WSJ:
  5. Fox News:
  6. HuffPo:
  7. NYT:
  8. WaPo: http://web.archive.org/web/*/washingtonpost.com appears to be blocked by robots.txt