Seeders and Sorters canvass the resources of a given government agency, identify important URLs, and determine whether those URLs can be crawled by the Internet Archive's webcrawler. If a URL is crawlable, the Seeders/Sorters nominate it to the End-of-Term (EOT) project; otherwise, they add it to the Uncrawlable spreadsheet using the project's Chrome Extension.
The Seeders/Sorters team uses the EDGI subprimer systems (found here), or a similar set of resources, to identify important or at-risk data. Talk to the DataRescue organizers to learn more.
- Start exploring the assigned website, identifying important URLs.
- Decide whether the data on a page or website subsection can be automatically captured by the Internet Archive webcrawler.
- The best source of information about the seeding and sorting process is https://envirodatagov.org/; see:
- Understanding What the Internet Archive Webcrawler Does
- Seeding the Internet Archive’s Webcrawler
- URLs judged to be possibly crawlable are "nominated" (equivalently, "seeded") to the End-of-Term (EOT) project, using the EDGI Nomination Chrome extension or bookmarklet.
Wherever possible, add in the Agency Office Code. Talk to the DataRescue organizers to learn more.
- If a URL is judged not crawlable, add it to the "Uncrawlable" spreadsheet through the Chrome Extension.
- In the spreadsheet, the URL is automatically associated with a universally unique identifier (UUID) that was generated in advance.
- You can check whether the page or some of its files are rendered using the Internet Archive's Wayback Machine Chrome Extension.
- This sorting is only provisional: when in doubt, seeders nominate the URL and mark it as possibly not crawlable.
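
The sorting rule above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the Chrome Extension's actual code: the names `Decision`, `sort_url`, and `record_uuid`, and the idea of holding the pre-generated UUIDs in a simple list, are all assumptions made for this example.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Decision:
    """Hypothetical record of what a seeder decided for one URL."""
    url: str
    action: str                               # "nominate" or "uncrawlable"
    possibly_not_crawlable: bool = False      # set when the seeder is in doubt
    agency_office_code: Optional[str] = None  # add wherever possible
    record_uuid: Optional[str] = None         # only for uncrawlable entries


def sort_url(url: str,
             crawlable: Optional[bool],
             pregenerated_uuids: List[str],
             agency_office_code: Optional[str] = None) -> Decision:
    """Route a URL: crawlable is True, False, or None when in doubt."""
    if crawlable is False:
        # Judged not crawlable: goes to the Uncrawlable spreadsheet and
        # takes the next UUID from the pool that was generated in advance.
        return Decision(url, "uncrawlable",
                        agency_office_code=agency_office_code,
                        record_uuid=pregenerated_uuids.pop(0))
    # Crawlable or unsure: nominate; flag the URL if the seeder was in doubt.
    return Decision(url, "nominate",
                    possibly_not_crawlable=(crawlable is None),
                    agency_office_code=agency_office_code)


if __name__ == "__main__":
    # Assumed stand-in for the UUIDs generated in advance.
    pool = [str(uuid.uuid4()) for _ in range(3)]
    print(sort_url("https://example.gov/reports/", True, pool))
    print(sort_url("https://example.gov/query-tool", None, pool))
    print(sort_url("https://example.gov/data.zip", False, pool))
```

Treating "in doubt" as its own value (`None`) rather than forcing a yes/no answer mirrors the provisional nature of the sorting: uncertain URLs are still nominated, but carry a flag for later review.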