Seeding and Sorting Overview

What do Seeders/Sorters do?

Seeders and Sorters canvass the resources of a given government agency and identify important URLs. For each URL, they determine whether it can be crawled by the Internet Archive's webcrawler. If a URL is crawlable, the Seeders/Sorters nominate it to the End-of-Term (EOT) project; otherwise, they add it to the Uncrawlable spreadsheet using the project's Chrome Extension.

Choosing the website

The Seeders/Sorters team will use the EDGI subprimer systems (found here), or a similar set of resources, to identify important or at-risk data. Talk to the DataRescue organizers to learn more.

Canvassing the website and evaluating content

Crawlable URLs

Wherever possible, add in the Agency Office Code. Talk to the DataRescue organizers to learn more.

Uncrawlable URLs

  • If a URL is judged not crawlable, add it to the "Uncrawlable" spreadsheet through the Chrome Extension.
  • In the spreadsheet, the URL is automatically associated with a universally unique identifier (UUID) that was generated in advance.
  • You can check whether the page or some of its files are rendered using the Internet Archive's Wayback Machine Chrome Extension.
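
To illustrate how an identifier generated in advance can be attached to a URL, here is a minimal Python sketch. It is not the project's actual tooling: the function names `pregenerate_uuids` and `assign_uuid`, the dictionary standing in for the spreadsheet, and the example URL are all hypothetical, and the use of random (version 4) UUIDs is an assumption.

```python
import uuid


def pregenerate_uuids(count):
    """Return `count` random UUID strings, created before any URL is known."""
    return [str(uuid.uuid4()) for _ in range(count)]


def assign_uuid(pool, spreadsheet, url):
    """Take the next unused UUID from the pool and record the URL under it."""
    if not pool:
        raise RuntimeError("UUID pool is empty; generate more identifiers")
    identifier = pool.pop(0)
    spreadsheet[identifier] = url
    return identifier


# Hypothetical usage: a dict stands in for the "Uncrawlable" spreadsheet.
pool = pregenerate_uuids(3)
sheet = {}
example_url = "https://agency.example.gov/data/query-tool"
first_id = assign_uuid(pool, sheet, example_url)
```

Generating the identifiers up front means each row already has a stable key by the time a URL is added to it.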

Not sure?

  • This sorting is only provisional: when in doubt, seeders nominate the URL and mark it as possibly not crawlable.
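
The sorting rules in this guide can be summarized as a small decision function. The sketch below is only an illustration in Python: the `Crawlable` enum, the `sort_url` function, and the action labels are hypothetical names, not part of the Chrome Extension or the EOT nomination tool.

```python
from enum import Enum


class Crawlable(Enum):
    """A seeder's judgement about one URL (hypothetical labels)."""
    YES = "yes"
    NO = "no"
    UNSURE = "unsure"


def sort_url(url, judgement):
    """Return the provisional action for a URL based on the seeder's judgement."""
    if judgement is Crawlable.NO:
        # Judged not crawlable: goes to the "Uncrawlable" spreadsheet.
        return {"url": url, "action": "add_to_uncrawlable_spreadsheet",
                "possibly_uncrawlable": False}
    # Crawlable or unsure: nominate to EOT, flagging the doubtful cases.
    return {"url": url, "action": "nominate_to_eot",
            "possibly_uncrawlable": judgement is Crawlable.UNSURE}


# Hypothetical usage with a placeholder URL.
decision = sort_url("https://agency.example.gov/reports", Crawlable.UNSURE)
```

In this sketch an unsure judgement is still nominated, with a flag so that the URL can be revisited later.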