PyGAAP is the Python port of JGAAP, Java Graphical Authorship Attribution Program by Patrick Juola et al.
See https://evllabs.github.io/JGAAP/

Updated: 2022.07.14

PyGAAP Developer Manual

Table of contents

  1. Differences from JGAAP
  2. Widget structures
    1. Outline of tkinter widgets
    2. Debug modes
  3. Adding a new module
    1. Class variables
    2. Class initialization
    3. Class functions
    4. Reload modules while PyGAAP is running

Differences from JGAAP

Module parameters

  1. Unlike JGAAP, PyGAAP does not (currently) use dedicated classes for module parameters. See Class variables.

Widget structures

Outline of tkinter widgets

name                                         description (Tkinter module used)

topwindow                                             name of main window (Tk)
├── menubar                                           name of top menu bar (Menu)
├── workspace                                         main Frame under topwindow that contains notebook Tabs (Frame)
│   ├── tabs                                          This sets up the tabs (Notebook)
│   │   ├── Tab_Documents                             Holds widgets in Documents tab (Frame)
│   │   │   ├── Tab_Documents_UnknownAuthors_Frame   Contains Listbox for unknown authors (Frame)
│   │   │   ├── Tab_Documents_doc_buttons            Buttons for unknown authors (Frame)
│   │   │   ├── Tab_Documents_KnownAuthors_Frame     Contains Listbox for known authors (Frame)
│   │   │   ├── Tab_Documents_knownauth_buttons      Buttons for known authors (Frame)
│   │   │
│   │   │      The 4 tabs below are generated by create_feature_tab(). The widgets in these tabs are saved in "objects".
│   │   ├── Tab_Canonicizers           Holds widgets in Canonicizers tab (Frame)           widgets in generated_widgets['Canonicizers']
│   │   ├── Tab_EventDrivers           Holds widgets in Event Drivers tab (Frame)          widgets in generated_widgets['EventDrivers']
│   │   ├── Tab_EventCulling           Same Setup as Event Drivers Tab                     widgets in generated_widgets['EventCulling']
│   │   ├── Tab_AnalysisMethods        Holds widgets in Analysis Methods tab (Frame)       widgets in generated_widgets['AnalysisMethods']
│   │   │
│   │   │      Within each dictionary entry listed above (written as % below), the structure is as follows:
│   │   ├── %
│   │   │   ├── %["top_frame"]
│   │   │   │   ├── %["available_frame"]                Contains the listboxes where available features are displayed.
│   │   │   │   │   └── %["available_listboxes"]      = [
│   │   │   │   │                                           [Frame, Label, Listbox, Scrollbar],
│   │   │   │   │                                           [Frame, Label, Listbox, Scrollbar] # for analysis methods (two listboxes to choose from)
│   │   │   │   │                                       ]
│   │   │   │   ├── %["buttons_frame"]                  Contains the add/remove/clear buttons.
│   │   │   │   ├── %["selected_frame"]                 Contains the listbox where selected features are displayed.
│   │   │   │   │   └── %["selected_listboxes"]      = [Frame, Label, Listbox/Treeview, Scrollbar]
│   │   │   │   └── %["parameters_frame"]               Contains the frame where parameters of a feature are displayed.
│   │   │   │
│   │   │   └── %["description_frame"]               Contains the text box where the description of a feature is displayed.
│   │   │
│   │   ├── Tab_ReviewProcess                      Holds widgets in Review & Process tab (Frame)
│   │   │   ├── Tab_ReviewProcess_Canonicizers       Contains corresponding listbox
│   │   │   ├── Tab_ReviewProcess_EventDrivers
│   │   │   ├── Tab_ReviewProcess_EventCulling
│   │   │   ├── Tab_ReviewProcess_AnalysisMethods
│   │   │
├── bottomframe                                Holds buttons at the bottom: Notes, Next, and Finish.
└── status_bar                                 Contains the label (text) for status.

Debug modes

In the GUI code, set GUI_debug to 3 to see function calls printed to the terminal.

Adding a new module

Naming Files

Add new modules to ~/generics/modules so that the API picks them up while loading. Each module file must import the corresponding abstract type from ~/generics. The imports for each module type are:

from generics.Canonicizer import Canonicizer
from generics.EventDriver import EventDriver
from generics.EventCulling import EventCulling
from generics.Embedding import Embedding
from generics.AnalysisMethod import AnalysisMethod
from generics.DistanceFunction import DistanceFunction

Add package dependencies and their version numbers to ~/requirements.txt, if applicable.
For readability, it's recommended that files in ~/generics/modules be prefixed according to module type:
cc_ for canonicizers
ed_ for event drivers
ec_ for event cullers
nc_ for embedders
am_ for analysis methods
df_ for distance functions.

Class variables

Class variables are declared within the class definition, outside of the __init__ function.

User parameters

Each user parameter is a class variable exposed to the GUI. These variables must also have corresponding entries in _variable_options, and their names cannot begin with a "_". To hide a class variable from the GUI, prefix the name with a "_".

System parameters

  • _global_parameters (dictionary) API parameters to be passed to all modules, like language.

  • _variable_options (dictionary) lists the options, GUI widget type, and default value of each user parameter. The keys are the variable names and each value is a dict of that variable's attributes. Each dict must have "options" for the range of available choices, "type" for the GUI widget type (currently OptionMenu and Slider are supported), and "default" for the default value, given as an index into "options". In the example below, the default is 0, which selects the item at index 0 of "options", so the variable's default value is 3. Optionally, add "displayed_name" if the display name differs from the variable name.
    Example: {"variable_1": {"options": range(3, 10), "type": "OptionMenu", "default": 0, "displayed_name": "The First Variable"}}
    Widget types and required keys other than options:

    • Slider: resolution.
  • _NoDistanceFunction_ (AnalysisMethod only, boolean) if an analysis method does not allow a distance function to be set, add this and set it to True. It defaults to False if omitted.
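
As a minimal sketch of how these conventions fit together, the snippet below declares one user parameter, one hidden class variable, and a matching _variable_options entry, then resolves the default value from its index. The class ExampleEventDriver and the helpers default_values() and gui_parameters() are hypothetical names used only for illustration and are not part of PyGAAP; a real module would subclass the abstract type from ~/generics.

```python
class ExampleEventDriver:
    """Hypothetical stand-in; a real module subclasses generics.EventDriver."""

    n = 3                # user parameter: no leading underscore, shown in the GUI
    _cache_size = 128    # leading underscore: hidden from the GUI

    _variable_options = {
        "n": {
            "options": list(range(3, 10)),  # available choices: 3 through 9
            "type": "OptionMenu",
            "default": 0,                   # index into "options", i.e. the value 3
            "displayed_name": "N-gram length",
        }
    }


def default_values(module_cls):
    """Resolve each parameter's default from its index into "options"."""
    return {
        name: list(spec["options"])[spec["default"]]
        for name, spec in module_cls._variable_options.items()
    }


def gui_parameters(module_cls):
    """List the class variables that the naming rule would expose."""
    return [
        name
        for name, value in vars(module_cls).items()
        if not name.startswith("_") and not callable(value)
    ]


print(default_values(ExampleEventDriver))
print(gui_parameters(ExampleEventDriver))
```

Keeping "default" as an index rather than a value means the default is always one of the listed options.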

Class initialization

The __init__() method of module classes initializes the required parameters. This is handled in the abstract (base) class at the top of each generic module file (~/generics/...). If a module needs extra steps right after initialization, define an after_init(**options) function. It receives the keyword arguments passed into __init__().
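
The hook can be pictured with the sketch below. BaseModule is a hypothetical stand-in for an abstract class in ~/generics, written only to make the example self-contained; the real base classes may initialize parameters differently. VowelCounter and its vowels keyword are likewise invented for illustration.

```python
class BaseModule:
    """Hypothetical stand-in for an abstract class in generics (not PyGAAP's code)."""

    _variable_options = {}

    def __init__(self, **options):
        # Set every declared parameter to its default value.
        for name, spec in self._variable_options.items():
            setattr(self, name, list(spec["options"])[spec["default"]])
        # Hand the same keyword arguments to the module's hook.
        self.after_init(**options)

    def after_init(self, **options):
        """Optional hook; does nothing unless a module defines its own."""


class VowelCounter(BaseModule):
    case_sensitive = "No"
    _variable_options = {
        "case_sensitive": {"options": ["No", "Yes"], "type": "OptionMenu", "default": 0}
    }

    def after_init(self, **options):
        # Extra setup that depends on the arguments given to __init__().
        self._vowels = set(options.get("vowels", "aeiou"))


module = VowelCounter(vowels="aeiouy")
print(module.case_sensitive, sorted(module._vowels))
```

Putting the extra steps in after_init() leaves the shared __init__() untouched, so a module does not need to repeat the base-class setup.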

Class functions

  • All modules are required to have displayName() and displayDescription().
    • displayName() -> str returns the name of the module. Note that the name of a distance function cannot be NA, which is reserved as a placeholder for analysis methods that don't use distance functions.
    • displayDescription() -> str returns a description of the module.

❗ Make sure to return and not print the names and descriptions.
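
A minimal sketch of the two required functions follows. NormalizeCase is a hypothetical module that would normally subclass generics.Canonicizer. The functions are declared with @staticmethod so that they work whether they are called on the class or on an instance; which of the two PyGAAP does is an assumption not confirmed here.

```python
class NormalizeCase:
    """Hypothetical module; a real one would subclass generics.Canonicizer."""

    @staticmethod
    def displayName() -> str:
        # Returned, not printed, so the caller can use the value.
        return "Normalize Case"

    @staticmethod
    def displayDescription() -> str:
        return "Converts all characters in a document to lowercase."


# Both call styles give the same string.
print(NormalizeCase.displayName())
print(NormalizeCase().displayDescription())
```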

Functions by type of module: all of these can be overridden, but the input/output types must match the originals.

  • Canonicizers:

    • process(document: backend.Document, pipe: Multiprocessing.pipe*=None) -> None

      • Built-in. If not overridden, processes all documents by calling process_single() in a loop.
    • process_single(text: str)

      Canonicizers are expected to write str to Document.canonicized, either in process_single() or process().

  • Event drivers:

    • process(document: backend.Document, pipe: Multiprocessing.pipe=None) -> None

      • (see canonicizers above)
    • process_single(text: str)

    • setParams(list) -> None

      Event drivers are expected to call Document.setEventSet(eventSet, **options) to append events, either in process_single() or process(). Overwriting events (using keyword append=False) is not recommended.

  • Event cullers:

    • process(document: backend.Document, pipe: Multiprocessing.pipe=None) -> None
      • (see canonicizers above)
    • process_single(eventSet: list)

    Event cullers are expected to call Document.setEventSet(eventSet, append=False) to overwrite document event sets in process_single() or process().

  • Embedders:

    • convert(docs: list[backend.Document]) -> np.ndarray of shape (len(docs), *, ...)

      Text embedders are expected to both write to Document.numbers for each document and return an np.ndarray (or compatible type) where the first dimension is the number of all documents, known or unknown. For example, in the roberta module, each document is embedded in a 768-long vector. If roberta receives 23 documents in total, it returns an ndarray of shape (23, 768).

  • Analysis methods:

    • train(self, train: list[backend.Document], train_data: np.ndarray=None, **options) -> None

    • analyze(self, test, test_data=None, **options) -> dict

    • setDistanceFunction() (optional)

      Analysis methods are expected to return a list of dicts, one for each unknown category, whose keys are authors and whose values are scores. A lower score ranks higher.
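
The train()/analyze() contract can be sketched as below, assuming one result dict per test document. CentroidAnalysis is a hypothetical method, not one shipped with PyGAAP. Document is a stand-in for backend.Document with an assumed author attribute, and plain lists of shape (len(docs), 2) stand in for the np.ndarray an embedder would return, to keep the sketch free of dependencies.

```python
import math
from collections import defaultdict


class Document:
    """Hypothetical stand-in for backend.Document; only `author` is modelled."""

    def __init__(self, author=None):
        self.author = author


class CentroidAnalysis:
    """Hypothetical analysis method: distance to each author's mean vector."""

    def train(self, train, train_data=None, **options):
        grouped = defaultdict(list)
        for doc, vector in zip(train, train_data):
            grouped[doc.author].append(vector)
        # Average the vectors of each author, dimension by dimension.
        self._centroids = {
            author: [sum(column) / len(vectors) for column in zip(*vectors)]
            for author, vectors in grouped.items()
        }

    def analyze(self, test, test_data=None, **options):
        # One dict per test vector: author -> distance (a lower score ranks higher).
        return [
            {
                author: math.dist(vector, centroid)
                for author, centroid in self._centroids.items()
            }
            for vector in test_data
        ]


known = [Document("Austen"), Document("Austen"), Document("Dickens")]
known_vectors = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]]

method = CentroidAnalysis()
method.train(known, known_vectors)
results = method.analyze([Document()], [[1.0, 0.0]])
best = min(results[0], key=results[0].get)
print(results, best)
```

Because lower scores rank higher, the top candidate for each result is the key with the minimum value.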

Canonicizers, event drivers, and event cullers use multi-processing in process() unless it is overridden, with each process handling one file. Embedders and analysis methods don't use multi-processing by default because their processing is commonly vectorized.
To disable the built-in multi-processing, set _default_multiprocessing to False.
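
The sketch below walks one document through a canonicizer, an event driver, and an event culler, each with its own sequential process(). It rests on several assumptions: Document is a stand-in for backend.Document, its text and eventSet attributes are invented names, process() is assumed to receive the list of documents, and process_single() is assumed to return its result for process() to store. The three module classes are hypothetical and do not subclass the real abstract types.

```python
class Document:
    """Hypothetical stand-in for backend.Document."""

    def __init__(self, text):
        self.text = text          # assumed attribute name
        self.canonicized = ""
        self.eventSet = []        # assumed attribute name

    def setEventSet(self, eventSet, append=True):
        # append=True adds to the existing events; append=False overwrites them.
        if append:
            self.eventSet = self.eventSet + list(eventSet)
        else:
            self.eventSet = list(eventSet)


class Lowercase:
    """Canonicizer sketch: writes a str to Document.canonicized."""

    _default_multiprocessing = False  # opt out of the built-in multi-processing

    def process_single(self, text: str) -> str:
        return text.lower()

    def process(self, docs, pipe=None):
        for doc in docs:
            doc.canonicized = self.process_single(doc.text)


class WordEvents:
    """Event driver sketch: appends events to the document."""

    def process_single(self, text: str) -> list:
        return text.split()

    def process(self, docs, pipe=None):
        for doc in docs:
            doc.setEventSet(self.process_single(doc.canonicized))


class DropShortEvents:
    """Event culler sketch: overwrites the event set with the kept events."""

    _min_length = 3  # hidden from the GUI because of the leading underscore

    def process_single(self, eventSet: list) -> list:
        return [event for event in eventSet if len(event) >= self._min_length]

    def process(self, docs, pipe=None):
        for doc in docs:
            doc.setEventSet(self.process_single(doc.eventSet), append=False)


docs = [Document("The Cat sat on a mat")]
for module in (Lowercase(), WordEvents(), DropShortEvents()):
    module.process(docs)
print(docs[0].canonicized, docs[0].eventSet)
```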

* pipe is an end of a multiprocessing Pipe to send str/int/float updates to the GUI while the module is running. The pipe connects between the experiment runner and the GUI. If a module takes a long time to run, it's recommended that the author use Pipe.send() to regularly send updates to the GUI to be shown to the user so the app doesn't appear frozen.
How to send updates:

  • send str types to change the displayed text, e.g.
    if pipe is not None: pipe.send("tokenizing...")
  • send float or int to change the progress bar, where 0 is empty and 100 is full. e.g.
    if pipe is not None: pipe.send(doc_index*100/n_docs)
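
Both kinds of update can be tried outside PyGAAP with the sketch below. It uses the standard multiprocessing.Pipe and reads the messages back in the same process. The function process_with_updates() and the helper drain() are hypothetical; in PyGAAP the receiving end belongs to the GUI, which this example only imitates.

```python
from multiprocessing import Pipe


def process_with_updates(texts, pipe=None):
    """Lowercase each text, reporting status and progress if a pipe is given."""
    results = []
    n_docs = len(texts)
    if pipe is not None:
        pipe.send("lowercasing...")  # str: changes the displayed text
    for doc_index, text in enumerate(texts, start=1):
        results.append(text.lower())
        if pipe is not None:
            pipe.send(doc_index * 100 / n_docs)  # number: progress from 0 to 100
    return results


def drain(connection):
    """Collect every message currently waiting on a connection."""
    messages = []
    while connection.poll():
        messages.append(connection.recv())
    return messages


gui_end, module_end = Pipe()
results = process_with_updates(["A", "B", "C", "D"], pipe=module_end)
updates = drain(gui_end)
print(results)
print(updates)
```

Checking pipe is not None before every send keeps the module usable when it runs without a GUI.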

Reload modules while PyGAAP is running

To reload all modules while PyGAAP is running, go to the top menu bar: "Developer" → "Reload all modules".
There will be a confirmation in the status bar on success or an error message window on failure.

❗ Reloading will remove all selected modules. It does not remove documents.
❗ This does not reload libraries that the modules may import, e.g. spaCy.