operational character rendition (OCR)
live (WIP) https://oversightmachin.es/ocr
A collection of public domain documents published by the Senate Select Committee on Intelligence (SSCI) is computationally "read" and compared to two authoritative lexicons: a standard spelling dictionary and a list of words the Department of Homeland Security flags when monitoring social media. The comparison uses the Levenshtein distance, a metric that measures the 'distance' between two strings by counting the minimum number of single-character edits (insertion, deletion, or substitution of a letter) needed to turn one into the other. For instance, the distance from "cat" to "bat" is Levenshtein: 1. From "intelligence" to "negligence", Levenshtein: 3. If no exact match is found, all proximate candidates are displayed.
These are poor images: unsearchable, low-quality scans released with no text layer, despite having been composed on word processors. These undocuments range from questionnaires completed by nominees for positions under the Director of National Intelligence to briefings about interrogation techniques and extraordinary rendition.
The SSCI is dedicated to overseeing the activities of the sixteen agencies and bureaus that compose the US Intelligence Community. Its scope includes the surveillance of US and foreign citizens by the NSA, and the acts of torture and extraordinary rendition performed by the CIA that were detailed in a report the Committee released in December 2014.
OCR's rendition (an exaggerated performance) of reading happens in the viewer's browser, which reports its findings back to a server. Apart from the images and metadata produced by this process, no user data or logs are kept.
This work is part of a series of browser-based "oversight machines" I've been developing that considers oversight (which denotes both supervision and a failure to notice) in a world in which media is increasingly made for and by machines. These unreliable interrogators operate not only on media, but also upon the algorithms and apparatuses of sense-making involved in making this work public.
Selected Materials:
- SSCI Documents (PDF) obtained from http://www.intelligence.senate.gov
- aspell dictionary: http://wordlist.aspell.net/dicts/
- DHS flagged word list: https://gist.github.com/jm3/2815378
- OCRAD.js: https://github.com/antimatter15/ocrad.js