Python library and supporting utilities for text analysis, reporting, and redaction workflows.
This repository includes interactive Jupyter notebooks and sample data that demonstrate how blacktape can be used in various workflows. The easiest way to try out these notebooks is to click the MyBinder button above, which launches a remotely hosted JupyterLab session in your browser.
Several of these notebooks are described below:
end-to-end.ipynb: Reads in a sample English-language file, chunks it into 10,000-character blocks, and sets up a SQLite3 database for output. It then sets up a pipelined workflow in which each chunk is analyzed to identify named entities and matches for a selection of regular expressions. The notebook also includes a selection of database queries and produces a sample redacted output file in which redaction targets are overwritten with a fixed-length block sequence.
matching.ipynb: Simple demonstration of named entity recognition (NER) and pattern matching for a given input.
pii_patterns.ipynb: Matches common PII patterns against synthesized target data in a sample file.
chunking_pipeline.ipynb: Large file chunking and job submission within a processing pipeline, illustrated by matching entities identified as PERSON or ORGANIZATION.
Additional notebooks are provided that demonstrate subsets of the functionality illustrated in the above examples.
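To give a feel for the kind of workflow the notebooks walk through, here is a minimal standard-library sketch of chunking a text, matching regular expressions in each chunk, recording matches in SQLite, and overwriting matches with a fixed-length block. It does not use blacktape's API and omits NER entirely. The function names, the table layout, the two patterns, and the 8-character block are all assumptions made for illustration, not taken from the notebooks.

```python
import re
import sqlite3

CHUNK_SIZE = 10_000         # characters per chunk, per the end-to-end description
REDACTION_BLOCK = "█" * 8   # hypothetical fixed-length replacement

# Illustrative patterns only; not blacktape's actual pattern set.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def chunk_text(text, size=CHUNK_SIZE):
    """Yield (offset, chunk) pairs covering the text in fixed-size blocks."""
    for start in range(0, len(text), size):
        yield start, text[start:start + size]


def find_matches(text, size=CHUNK_SIZE):
    """Return (label, start, end, matched_text) rows, offsets relative to the full text.

    Simplification: a match that straddles a chunk boundary is missed.
    """
    found = []
    for offset, chunk in chunk_text(text, size):
        for label, pattern in PATTERNS.items():
            for m in pattern.finditer(chunk):
                found.append((label, offset + m.start(), offset + m.end(), m.group()))
    return sorted(found, key=lambda row: row[1])


def store_matches(conn, matches):
    """Write match rows to a hypothetical `matches` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS matches "
        "(label TEXT, start_pos INTEGER, end_pos INTEGER, matched TEXT)"
    )
    conn.executemany("INSERT INTO matches VALUES (?, ?, ?, ?)", matches)
    conn.commit()


def redact(text, matches):
    """Overwrite each match with the fixed-length block.

    Works right to left so earlier offsets stay valid; assumes matches do not overlap.
    """
    for _, start, end, _ in sorted(matches, key=lambda row: row[1], reverse=True):
        text = text[:start] + REDACTION_BLOCK + text[end:]
    return text


if __name__ == "__main__":
    sample = "Contact jane.doe@example.org. SSN on file: 123-45-6789."
    rows = find_matches(sample)
    connection = sqlite3.connect(":memory:")
    store_matches(connection, rows)
    print(connection.execute("SELECT label, COUNT(*) FROM matches GROUP BY label").fetchall())
    print(redact(sample, rows))
```

Because every match is replaced by a block of the same length, the redacted output does not reveal how long the original value was. A fuller pipeline would also need to handle matches that span chunk boundaries, for example by overlapping adjacent chunks.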
Logos, documentation, and other non-software products of the CARASCAP team are distributed under the terms of the Creative Commons Attribution 4.0 license. Software items in CARASCAP repositories are distributed under the terms of the MIT License. See the LICENSE file for additional details.
© 2022, The University of North Carolina at Chapel Hill.
Developed by the CARASCAP team at the University of North Carolina at Chapel Hill.