Due: In 3 weeks
Important: Q1 requires obtaining a Twitter Developer Account (see Twitter Setup). This may take a day or two, so if you haven't already done this, get started today.
This assignment is going to take time. Read through the entire assignment before starting. Do not wait until the last minute to start working on it!
Write a report that contains your answers to the following questions and explains how you arrived at them.
Note about Programming Tasks: For several of the programming tasks this semester, you will be asked to write code to operate on 100s or 1000s of data elements. If you have not done this type of development before, I strongly encourage you to start small and work your way up. Especially when you are using new tools or APIs, start on a small test dataset to make sure you understand how to use the tool and that your processing scripts are working before ramping up to the full set. This will save you an enormous amount of time.
Extract 1000 unique links from tweets on Twitter.
Setup for this task:
- Obtain a Twitter Developer Account (see Twitter Setup)
- Use the example scripts (`collect-tweets.py`, `process-tweets.py`) referenced in EC 0.7 (and copied into this repo) as starter code for your assignment. Modify them as needed.
Main steps:
- Write a Python program that collects English-language tweets that contain links. See Collecting Tweets.
- Write a Python program that extracts the links shared in tweets. See Extracting Links from Tweets.
- Resolve all URIs to their final target URI (i.e., the one that responds with a 200). See Resolve URIs to Final Target URI.
- Save only unique final URIs (no repeats). See Save Only Unique URIs.
- If, after this step, you don't have 1000 unique URIs, go back and gather more until you have at least 1000 unique URIs.
- Save this collection of 1000 unique links in a file and upload it to your repo in GitHub -- we'll use it again in HW3
You'll likely need to collect more than 1000 tweets initially to get 1000 unique links.
There are rate limits (number of API calls per amount of time) associated with different types of API calls to Twitter, but twarc will handle the rate limits for you. This means, though, that if you ask twarc to deliver more tweets than it is allowed in a certain amount of time, it will pause until it's able to complete your request.
To deal with the rate limits, my suggestion is to choose a few keywords and collect a subset of the target number of tweets for each of those keywords. For example, you could collect 250 tweets each about 5 different keywords. Use keywords that you might actually search for (ex: covid, olympics, vaccine) rather than "stopwords" (ex: test, the, tweet).
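The starter `collect-tweets.py` from EC 0.7 is the intended starting point for this step. Purely as an illustration, a minimal sketch using the twarc2 Python library might look like the following; the bearer token, keywords, output filename, and per-keyword target are all placeholders, and the `lang:en` and `has:links` operators restrict the search to English-language tweets that contain links.

```python
import json

from twarc import Twarc2

# Placeholders -- substitute your own credentials, keywords, and targets
BEARER_TOKEN = "YOUR-BEARER-TOKEN"
KEYWORDS = ["covid", "olympics", "vaccine", "nasa", "music"]
TWEETS_PER_KEYWORD = 250

client = Twarc2(bearer_token=BEARER_TOKEN)

with open("tweets.jsonl", "w") as outfile:
    for keyword in KEYWORDS:
        count = 0
        # lang:en and has:links limit results to English tweets containing links
        query = f"{keyword} lang:en has:links"
        for page in client.search_recent(query):
            for tweet in page.get("data", []):
                outfile.write(json.dumps(tweet) + "\n")
                count += 1
                if count >= TWEETS_PER_KEYWORD:
                    break
            if count >= TWEETS_PER_KEYWORD:
                break
```

However you collect, save the raw tweets to a file so you can re-run the link extraction later without collecting again.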
Links in tweets are stored in the `['entities']['urls']` part of the tweet data structure. This has several components:
- `'url'` - The shortened URI (usually starting with https://t.co/)
- `'expanded_url'` - The actual URI that was input by the user (i.e., not shortened)
- `'display_url'` - The text of the URI that is displayed in the tweet (counted as part of the 280-character limit in the tweet)

Since we want the actual URIs, you want to extract the `'expanded_url'` version of the link. There's an example in `process-tweets.py`.
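As a minimal sketch of that extraction, assuming the tweets were saved one JSON object per line in a file named `tweets.jsonl` (the filename is a placeholder):

```python
import json

# Print the expanded_url of every link entity in a file of tweets,
# one JSON object per line
with open("tweets.jsonl") as infile:
    for line in infile:
        tweet = json.loads(line)
        for url_entity in tweet.get("entities", {}).get("urls", []):
            expanded = url_entity.get("expanded_url")
            if expanded:
                print(expanded)
```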
We will be analyzing the content in these links in a later assignment, so you want links that will likely contain some text.
- Exclude links from the Twitter domain (twitter.com) -- these will likely be references to other tweets or images
- Exclude links that will likely point to a video/audio-only page (youtube.com, twitch.tv, soundcloud.com, etc.)
If you find a link you consider to be inappropriate for any reason, just discard it and get some more links.
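One way to apply these exclusions programmatically is to check each link's hostname with `urllib.parse`; the helper below is just a sketch, and the exact domain list (and how aggressively you filter) is up to you.

```python
from urllib.parse import urlparse

# Domains to skip, per the guidelines above (adjust the list to your needs)
EXCLUDED_DOMAINS = {"twitter.com", "youtube.com", "youtu.be",
                    "twitch.tv", "soundcloud.com"}

def keep_link(uri):
    """Return True if the URI is not on an excluded domain."""
    host = urlparse(uri).netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.youtube.com the same as youtube.com
    return host not in EXCLUDED_DOMAINS
```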
Many of the links that you collect will be shortened links (dlvr.it, bit.ly, buff.ly, etc.). We want the final URI that resolves to an HTTP 200 (not a redirection). For example:
```
$ curl -IL --silent http://bit.ly/wc-wail | egrep -i "(HTTP/1.1|HTTP/2|^location:)"
HTTP/1.1 301 Moved Permanently
Location: http://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html
HTTP/1.1 301 Moved Permanently
Location: https://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html
HTTP/2 200
```
We want https://ws-dl.blogspot.com/2013/07/2013-07-10-warcreate-and-wail-warc.html, not http://bit.ly/wc-wail.
You can either write a Unix shell script that uses `curl` to do this, or write a Python program using the `requests` library. If you use the `requests` library, make sure to include the `timeout` parameter in your call to `get()`.
- Example: `requests.get(url, timeout=5)  # 5 second timeout`
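If you take the `requests` route, a sketch of the resolution step might look like the following. The HEAD-first approach mirrors the `curl -IL` example above, but it is an assumption that it will work everywhere; some servers handle HEAD poorly, so you may need to fall back to `requests.get()`.

```python
import requests

def resolve_final_uri(uri):
    """Follow redirects and return the final URI that answers with a 200, or None."""
    try:
        # allow_redirects must be set explicitly for HEAD requests
        response = requests.head(uri, allow_redirects=True, timeout=5)
        if response.status_code == 200:
            return response.url
    except requests.RequestException:
        pass  # timeouts, connection errors, malformed URIs, etc.
    return None
```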
You can write Python code for this part, but I'd recommend using the Unix tools `sort` and `uniq`. Back to Basics: Sort and Uniq is a nice introduction to this.
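If you go the Python route instead, a set does the de-duplication; the filenames below are placeholders.

```python
# A Python stand-in for sort | uniq on the resolved URIs
with open("final-uris.txt") as infile:
    unique_uris = sorted({line.strip() for line in infile if line.strip()})

with open("unique-uris.txt", "w") as outfile:
    outfile.write("\n".join(unique_uris) + "\n")
```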
Obtain the TimeMaps for each of the unique URIs from Q1 using the ODU Memento Aggregator, MemGator.
You may use https://memgator.cs.odu.edu for limited testing, but do not request all of your 1000 TimeMaps from memgator.cs.odu.edu.
There are two options for running MemGator locally:
- Install a stand-alone version of MemGator on your own machine, see https://github.com/oduwsdl/MemGator/releases
- This was described in EC 0.8
- Install Docker Desktop and run MemGator as a Docker container, see notes at https://github.com/oduwsdl/MemGator/blob/master/README.md
Important: Obtaining TimeMaps requires contacting several different web archives for each URI-R. This process will take time. Look at the MemGator options and figure out how to process the output before running the entire process. You might want to get JSON output, or you might want to limit to the top k archives (especially if there's one that's currently taking a long time to return).
Note that if there are no mementos for a URI-R, MemGator will return nothing. Don't be surprised if many of your URI-Rs return 0 mementos. Remember the "How Much of the Web is Archived" slides -- there are lots of things on the web that are not archived. If you want to do a sanity check on a few, you can manually use the Wayback Machine and see what you get from the Internet Archive. (Remember though that MemGator is going to query several web archives, not just Internet Archive.)
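Purely as an illustration, a download loop against a locally running MemGator might look like the sketch below. The port (1208), the /timemap/link/ URL pattern, the timeout, and the filenames are assumptions; check the MemGator README for the options that match your setup.

```python
import os

import requests

# Assumption: MemGator is running locally in server mode on its default port;
# change the format segment (link/json/cdxj) if you prefer another output
MEMGATOR = "http://localhost:1208/timemap/link/"

os.makedirs("timemaps", exist_ok=True)

with open("unique-uris.txt") as infile:
    for i, uri in enumerate(line.strip() for line in infile):
        try:
            response = requests.get(MEMGATOR + uri, timeout=120)
        except requests.RequestException:
            continue
        # An empty or non-200 response means no mementos were found for this URI-R
        if response.status_code == 200 and response.text.strip():
            # You will also want to record which file maps to which URI-R
            with open(f"timemaps/{i:04d}.link", "w") as outfile:
                outfile.write(response.text)
```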
If you uncover TimeMaps that are very large (e.g., for popular sites like https://www.cnn.com/) and swamp your filesystem, you have two options:
- Manually remove those URI-Rs from your dataset (but note this in your report), or
- Compress each TimeMap file individually (using a pipe to `gzip` in the same command when downloading, or after the download is completed). These compressed files can be used for further analysis by decompressing on the fly using commands like `zcat` or `zless` (or using gzip libraries in Python).
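For example, Python's gzip module can read a compressed TimeMap directly; the filename below is a placeholder.

```python
import gzip

# Read a gzip-compressed TimeMap without decompressing it on disk
with gzip.open("timemaps/0042.link.gz", "rt") as infile:
    timemap = infile.read()
```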
Use the TimeMaps you saved in Q2 to analyze how well the URIs you collected in Q1 are archived.
Create a table showing how many URI-Rs have a certain number of mementos. For example:
Mementos | URI-Rs |
---|---|
0 | 750 |
1 | 100 |
7 | 50 |
12 | 25 |
19 | 25 |
24 | 20 |
30 | 27 |
57 | 3 |
If you end up with a very large table of memento counts, you can bin the number of mementos. Just make sure that the bin sizes are reasonable and that you specify how many had 0 mementos individually. The target is no more than 15-20 rows so that your table fits on a single page. For example:
Mementos | URI-Rs |
---|---|
0 | 750 |
1-10 | 150 |
11-20 | 50 |
21-30 | 47 |
57 | 3 |
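One way to produce these counts is sketched below; it assumes link-format TimeMaps saved in a timemaps/ directory with .link filenames (as in the earlier download sketch), and it relies on the fact that memento entries in a TimeMap carry rel values like "memento", "first memento", and "last memento".

```python
import re
from collections import Counter
from pathlib import Path

def count_mementos(timemap_text):
    """Count entries whose rel attribute marks them as mementos."""
    rels = re.findall(r'rel="([^"]+)"', timemap_text)
    return sum(1 for rel in rels if "memento" in rel.split())

# Mementos per URI-R, keyed by TimeMap filename
memento_counts = {tm.name: count_mementos(tm.read_text())
                  for tm in Path("timemaps").glob("*.link")}

# Tally how many URI-Rs have each memento count (the rows of the table)
histogram = Counter(memento_counts.values())
for count, num_uris in sorted(histogram.items()):
    print(f"{count}\t{num_uris}")
```

Note that URI-Rs with 0 mementos may have no saved TimeMap at all, so count those separately for the 0 row.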
Q: What URI-Rs had the most mementos? Did that surprise you?
For each of the URI-Rs from Q3 that had > 0 mementos, use the saved TimeMap to determine the datetime of the earliest memento.
Create a scatterplot with the age of each URI-R (days between collection date and earliest memento datetime) on the x-axis and number of mementos for that URI-R on the y-axis. For this graph, the item is the URI-R and the attributes are the estimated age of the URI-R (channel is horizontal position) and the number of mementos for that URI-R (channel is vertical position).
An example is shown below:
This scatterplot should be created using either R or Python, not Excel.
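If you choose Python, a minimal matplotlib sketch is below; it assumes link-format TimeMaps in a timemaps/ directory and that COLLECTION_DATE is the date you gathered your links (both are placeholders). Memento datetimes in link-format TimeMaps appear as attributes like datetime="Wed, 10 Jul 2013 19:34:01 GMT", and only memento entries carry them.

```python
import re
from datetime import datetime
from pathlib import Path

import matplotlib.pyplot as plt

COLLECTION_DATE = datetime(2022, 2, 1)  # placeholder: when you collected your links

ages, counts = [], []
for timemap_file in Path("timemaps").glob("*.link"):
    text = timemap_file.read_text()
    # Only memento entries have datetime attributes, so this list gives both
    # the memento count and the earliest memento datetime
    datetimes = [datetime.strptime(dt, "%a, %d %b %Y %H:%M:%S GMT")
                 for dt in re.findall(r'datetime="([^"]+)"', text)]
    if not datetimes:
        continue
    ages.append((COLLECTION_DATE - min(datetimes)).days)
    counts.append(len(datetimes))

plt.scatter(ages, counts)
plt.xlabel("Age of URI-R (days)")
plt.ylabel("Number of mementos")
plt.savefig("age-vs-mementos.png")
```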
Q: What can you say about the relationship between the age of a URI-R and the number of its mementos?
Q: What URI-R had the oldest memento? Did that surprise you?
Q: How many URI-Rs had an age of < 1 week, meaning that their first memento was captured the same week you collected the data?
Create an account at Conifer and create a collection. Archive at least 10 webpages related to a common topic that you find interesting. Make the collection public and include the link to your collection in your report.
Q: Why did you choose this particular topic?
Q: Did you have any issues in archiving the webpages?
Q: Do the archived webpages look like the original webpages?
After creating your collection at Conifer, download the collection as a WARC file (see Exporting or Downloading Content).
Then load this WARC file into ReplayWeb.page, a tool from the Webrecorder Project (folks who developed Conifer). From https://webrecorder.net/tools:
ReplayWeb.page provides a web archive replay system as a single web site (which also works offline), allowing users to view web archives from anywhere, including local computer or even Google Drive. See the User guide for more info.
Once the WARC file has loaded, click on the "Pages" tab. Take a screenshot that includes the list of pages and the browser address bar (showing `replayweb.page/?source=file%3A%2F%2F...`, which indicates that the WARC file is being loaded from your local computer).
Then click on the "URLs" tab and choose "All URLs" from the dropdown menu.
Q: How many URLs were archived in the WARC file? How does this compare to the number of Pages?
Create a bar chart showing the number of URLs in the WARC file for each of the file types in the dropdown menu.
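If you create the bar chart in Python, a matplotlib sketch follows; the file-type names and counts below are placeholders -- substitute the categories and numbers you read from the ReplayWeb.page dropdown.

```python
import matplotlib.pyplot as plt

# Placeholder counts -- replace with the values from your own WARC file
file_type_counts = {
    "HTML": 25,
    "Images": 120,
    "CSS": 30,
    "JavaScript": 45,
}

plt.bar(list(file_type_counts.keys()), list(file_type_counts.values()))
plt.xlabel("File type")
plt.ylabel("Number of URLs")
plt.savefig("warc-file-types.png")
```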
Q: Which file type had the most URLs? Were you surprised by this?