PyGAAP is the Python port of JGAAP, the Java Graphical Authorship Attribution Program, by Patrick Juola et al.
See https://evllabs.github.io/JGAAP/
Updated: 2022.07.14
- Unlike JGAAP, PyGAAP does not (currently) use dedicated classes for module parameters. See Class variables.
name description (Tkinter module used)
topwindow name of main window (Tk)
├── menubar name of top menu bar (Menu)
├── workspace main Frame under topwindow that contains notebook Tabs (Frame)
│ ├── tabs This sets up the tabs (Notebook)
│ │ ├── Tab_Documents Holds widgets in Documents tab (Frame)
│ │ │ ├── Tab_Documents_UnknownAuthors_Frame Contains Listbox for unknown authors (Frame)
│ │ │ ├── Tab_Documents_doc_buttons Buttons for unknown authors (Frame)
│ │ │ ├── Tab_Documents_KnownAuthors_Frame Contains Listbox for known authors (Frame)
│ │ │ ├── Tab_Documents_knownauth_buttons Buttons for known authors (Frame)
│ │ |
│ │ │ The 4 tabs below are generated by create_feature_tab(). The widgets in these tabs are saved in "objects".
│ │ ├── Tab_Canonicizers Holds widgets in Canonicizers tab (Frame) widgets in generated_widgets['Canonicizers']
│ │ ├── Tab_EventDrivers Holds widgets in Event Drivers tab (Frame) widgets in generated_widgets['EventDrivers']
│ │ ├── Tab_EventCulling Same Setup as Event Drivers Tab widgets in generated_widgets['EventCulling']
│ │ ├── Tab_AnalysisMethods Holds widgets in Analysis Methods tab (Frame) widgets in generated_widgets['AnalysisMethods']
│ │ |
│ │ │ Within each dictionary entry listed above (written as % below), the structure is as follows:
│ │ ├── %
│ │ │ ├── %["top_frame"]
│ │ │ │ ├── %["available_frame"] Contains the listboxes where available features are displayed.
│ │ │ │ │ └── %["available_listboxes"] = [
│ │ │ │ │ [Frame, Label, Listbox, Scrollbar],
│ │ │ │ │ [Frame, Label, Listbox, Scrollbar] # for analysis methods (two listboxes to choose from)
│ │ │ │ │ ]
│ │ │ │ ├── %["buttons_frame"] Contains the add/remove/clear buttons.
│ │ │ │ ├── %["selected_frame"] Contains the listbox where selected features are displayed.
│ │ │ │ │ └── %["selected_listboxes"] = [Frame, Label, Listbox/Treeview, Scrollbar]
│ │ │ │ └── %["parameters_frame"] Contains the frame where parameters of a feature are displayed.
│ │ │ |
│ │ │ ├── %["description_frame"] Contains the text box where the description of a feature is displayed.
│ │ |
│ │ ├── Tab_ReviewProcess Holds widgets in Review & Process tab (Frame)
│ │ │ ├── Tab_ReviewProcess_Canonicizers Contains corresponding listbox
│ │ │ ├── Tab_ReviewProcess_EventDrivers
│ │ │ ├── Tab_ReviewProcess_EventCulling
│ │ │ ├── Tab_ReviewProcess_AnalysisMethods
│ │ |
├── bottomframe Holds the buttons at the bottom: Notes, Next, and Finish.
└── status_bar Contains the label (text) for status.
In the GUI code, set GUI_debug to 3 to see function calls printed to the terminal.
Add new modules to ./generics/modules for the API to pick up while loading. Always add a line to import the corresponding abstract type from ~/generics. Here are the imports for each module type.
from generics.Canonicizer import Canonicizer
from generics.EventDriver import EventDriver
from generics.EventCulling import EventCulling
from generics.Embedding import Embedding
from generics.AnalysisMethod import AnalysisMethod
from generics.DistanceFunction import DistanceFunction
Add package dependencies and their version numbers to ~/requirements.txt, if applicable.
As a readability consideration, it's recommended that the files in ~/generics/modules be prefixed with the following:
- cc_ for canonicizers
- ed_ for event drivers
- ec_ for event cullers
- nc_ for embedders
- am_ for analysis methods
- df_ for distance functions
Class variables are declared within the class definition, outside of the __init__ function.
Each user parameter is a class variable exposed to the GUI. These variables must also have corresponding entries in _variable_options, and their names cannot begin with a "_".
To hide a class variable from the GUI, prefix the name with a "_".
- _global_parameters (dictionary): API parameters to be passed to all modules, like language.
- _variable_options (dictionary): lists the options, GUI type, and default values of variables. The variables' names are the keys and their attributes are dicts. Each dict for a variable must have "options" for the range of available choices, "type" for the GUI widget type (currently supports OptionMenu and Slider), and "default" for the default value as an index into the "options" list (in the example below, the default is 0, which picks the item at index 0 of the "options" list, i.e. the default value of the variable is 3). Optionally, add a display name if it differs from the variable name.
  Example: {"variable_1": {"options": range(3, 10), "type": "OptionMenu", "default": 0, "displayed_name": "The First Variable"}}
  Widget types and required keys other than options: Slider: resolution.
- _NoDistanceFunction_ (AnalysisMethod only, boolean): if an analysis method does not allow a distance function to be set, add this and set it to True. It's False if omitted.
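A minimal sketch of how these class variables fit together. `ExampleModule`, `_cache`, `visible_parameters`, and `default_value` are hypothetical names used only for illustration; how PyGAAP's GUI actually reads these attributes is not shown here.

```python
class ExampleModule:
    """Hypothetical module; a real one would subclass a type from generics."""

    # User parameter: exposed to the GUI, so it needs a _variable_options entry.
    variable_1 = 3

    # Leading underscore: hidden from the GUI.
    _cache = None

    _variable_options = {
        "variable_1": {
            "options": range(3, 10),
            "type": "OptionMenu",
            "default": 0,  # index into "options", not the value itself
            "displayed_name": "The First Variable",
        }
    }


def visible_parameters(module_class):
    """Names listed in _variable_options that do not start with an underscore."""
    return [name for name in module_class._variable_options
            if not name.startswith("_")]


def default_value(module_class, name):
    """Resolve the "default" index into the actual default value."""
    entry = module_class._variable_options[name]
    return list(entry["options"])[entry["default"]]


print(visible_parameters(ExampleModule))           # ['variable_1']
print(default_value(ExampleModule, "variable_1"))  # 3
```

Note that the class variable's initial value (3) matches the value that the "default" index resolves to.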
The __init__() method for module classes contains initialization for required parameters. These are handled in the abstract (base) class at the top of the generic module files (~/generics/...). Use an after_init(**options) function if there are extra steps for a module right after initialization. It takes the keyword arguments passed into __init__().
- All modules are required to have displayName() and displayDescription().
  - displayName() -> str returns the name of the module. Note that the name of a distance function cannot be NA, which is reserved as a placeholder for analysis methods that don't use distance functions.
  - displayDescription() -> str returns a description of the module.
❗ Make sure to return and not print the names and descriptions.
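A minimal sketch of the required methods together with after_init(). `ModuleBase` is a stand-in for the abstract types in generics, and the assumption that its __init__() forwards its keyword arguments to after_init() is taken from the description above, not from PyGAAP's code. `LowercaseText` is hypothetical, and declaring the two display methods as static methods is also an assumption.

```python
class ModuleBase:
    """Stand-in for an abstract type from generics (an assumption)."""

    def __init__(self, **options):
        # Required-parameter setup would happen here in the real base class.
        self.after_init(**options)

    def after_init(self, **options):
        """Hook for extra steps right after initialization; no-op by default."""


class LowercaseText(ModuleBase):
    """Hypothetical module used only to illustrate the required methods."""

    def after_init(self, **options):
        # Receives the same keyword arguments that were passed to __init__().
        self.language = options.get("language", "unknown")

    @staticmethod
    def displayName() -> str:
        return "Lowercase Text"  # returned, not printed

    @staticmethod
    def displayDescription() -> str:
        return "Converts all characters to lowercase."


module = LowercaseText(language="English")
print(module.displayName(), "|", module.language)
```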
Functions by type of module: these can all be overridden, but the input/output types must match the originals.
- Canonicizers:
  - process(document: backend.Document, pipe: Multiprocessing.pipe*=None) -> None
    - Built-in. If not overridden, processes all documents by calling process_single in a loop.
  - process_single(text: str)

  Canonicizers are expected to write str to Document.canonicized, either in process_single() or process().
- Event drivers:
  - process(document: backend.Document, pipe: Multiprocessing.pipe=None) -> None
    - (see canonicizers above)
  - process_single(text: str)
  - setParams(list) -> None

  Event drivers are expected to call Document.setEventSet(eventSet, **options) to append events, either in process_single() or process(). Overwriting events (using keyword append=False) is not recommended.
- Event cullers:
  - process(document: backend.Document, pipe: Multiprocessing.pipe=None) -> None
    - (see canonicizers above)
  - process_single(eventSet: list)

  Event cullers are expected to call Document.setEventSet(eventSet, append=False) to overwrite document event sets in process_single() or process().
- Embedders:
  - convert(docs: list[backend.Document]) -> np.ndarray of shape (len(docs), *, ...)

  Text embedders are expected to both write to Document.numbers for each document and return an np.ndarray (or compatible type) where the first dimension is the number of all documents, known or unknown. For example, in the roberta module, each document is embedded in a 768-long vector. If roberta receives 23 documents in total, it returns an ndarray of shape (23, 768).
- Analysis methods:
  - train(self, train: list[backend.Document], train_data: np.ndarray=None, **options) -> None
  - analyze(self, test, test_data=None, **options) -> dict
  - setDistanceFunction() (optional)

  Analysis methods are expected to return a list of dicts, one for each unknown category, whose keys are authors and whose values are scores, where a lower score ranks higher.
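As a rough illustration of the analysis-method interface, here is a dependency-free sketch of a nearest-centroid method. `NearestCentroid` is hypothetical, the `Document` class is a stand-in for backend.Document (its `author` field is an assumption), and plain lists stand in for the np.ndarray arguments. The return value follows the prose description: a list of dicts, one per unknown document.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Document:
    """Stand-in for backend.Document; the `author` field is an assumption."""
    author: str
    text: str = ""


class NearestCentroid:
    """Hypothetical analysis method: distance to each author's mean vector."""

    def train(self, train, train_data=None, **options):
        grouped = defaultdict(list)
        for doc, vector in zip(train, train_data):
            grouped[doc.author].append(vector)
        # One centroid (column-wise mean) per known author.
        self._centroids = {
            author: [sum(col) / len(col) for col in zip(*vectors)]
            for author, vectors in grouped.items()
        }

    def analyze(self, test, test_data=None, **options):
        results = []
        for vector in test_data:
            # Manhattan distance as the score: lower means a closer match.
            results.append({
                author: sum(abs(a - b) for a, b in zip(vector, centroid))
                for author, centroid in self._centroids.items()
            })
        return results  # one dict per unknown document


known = [Document("Alice"), Document("Alice"), Document("Bob")]
known_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 4.0]]
method = NearestCentroid()
method.train(known, known_vectors)
scores = method.analyze([Document("")], [[2.0, 1.0]])
best = min(scores[0], key=scores[0].get)  # lowest score ranks highest
print(scores, best)
```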
Canonicizers, event drivers, and event cullers use multiprocessing in process() if not overridden, with each process handling one file.
Embedders and analysis methods don't use multiprocessing by default because their processing is commonly vectorized.
To disable the built-in multiprocessing, set _default_multiprocessing to False.
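The relationship between process() and process_single() can be sketched as follows. Both `Document` and `Canonicizer` here are simplified stand-ins, not PyGAAP's classes: the `text` attribute, the idea that process_single() returns the string that process() then writes to `canonicized`, and the sequential loop in place of the built-in multiprocessing are all assumptions made for illustration. `CollapseWhitespace` is a hypothetical module.

```python
class Document:
    """Stand-in for backend.Document with only the fields used here."""

    def __init__(self, text):
        self.text = text
        self.canonicized = None


class Canonicizer:
    """Stand-in base class; the real one lives in generics and may differ."""

    _default_multiprocessing = False  # this sketch always runs sequentially

    def process(self, documents, pipe=None):
        # Simplified default: call process_single() once per document.
        for doc in documents:
            doc.canonicized = self.process_single(doc.text)

    def process_single(self, text):
        raise NotImplementedError


class CollapseWhitespace(Canonicizer):
    """Hypothetical canonicizer that only overrides process_single()."""

    def process_single(self, text):
        return " ".join(text.split())


docs = [Document("a  b\n c"), Document("  d ")]
CollapseWhitespace().process(docs)
print([doc.canonicized for doc in docs])  # ['a b c', 'd']
```

A module that overrides only process_single() leaves iteration over the documents to the base class, which is why the input and output types have to match the originals.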
* pipe is an end of a multiprocessing Pipe used to send str/int/float updates to the GUI while the module is running. The pipe connects the experiment runner and the GUI. If a module takes a long time to run, it's recommended that the author use Pipe.send() to regularly send updates to the GUI to be shown to the user so the app doesn't appear frozen.
How to send updates:
- send str types to change the displayed text, e.g. if pipe is not None: pipe.send("tokenizing...")
- send float or int to change the progress bar, where 0 is empty and 100 is full, e.g. if pipe is not None: pipe.send(doc_index*100/n_docs).
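A small self-contained sketch of both kinds of update, using the standard library's multiprocessing.Pipe. `process_all` is a hypothetical stand-in for a module's long-running step, and the receiving end is read in the same process here rather than by PyGAAP's GUI.

```python
from multiprocessing import Pipe


def process_all(texts, pipe=None):
    """Hypothetical long-running step that reports progress through `pipe`."""
    if pipe is not None:
        pipe.send("lowercasing...")  # str: changes the displayed text
    results = []
    n_docs = len(texts)
    for doc_index, text in enumerate(texts, start=1):
        results.append(text.lower())
        if pipe is not None:
            pipe.send(doc_index * 100 / n_docs)  # number: progress bar, 0 to 100
    return results


# The receiving end stands in for the GUI; here it is read in the same process.
gui_end, module_end = Pipe()
lowered = process_all(["One", "TWO"], module_end)

updates = []
while gui_end.poll():
    updates.append(gui_end.recv())
print(updates)  # ['lowercasing...', 50.0, 100.0]
```

Checking `pipe is not None` before every send keeps the function usable when no GUI is attached.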
To reload all modules while PyGAAP is running, go to the top menu bar: "Developer"
There will be a confirmation in the status bar on success or an error message window on failure.
❗ Reloading will remove all selected modules. It does not remove documents.
❗ This does not reload libraries that the modules may import, e.g. SpaCy.