Pack records and aggregates your computer use — screenshots plus input events (click, keypress, scroll, cursor move). It groups activity into event bursts and uses a VLM pipeline to generate human-readable captions describing what happened.
- Python 3.11+ (3.12.7 recommended)
- ffmpeg (for video generation)
- uv
```bash
git clone https://github.com/GeneralUserModels/pack.git # Clone repo
cd pack
cp .env.example .env # Optionally add your Gemini API key here
```

Two main entry points:
- Record a session
  ```bash
  uv run -m record --monitor # start and monitor recording. Use --accessibility to capture accessibility info.
  CTRL+C # stop recording
  ```

- Label a session

  ```bash
  uv run -m label \
    --sessions-root logs/ `# label all sessions in logs dir` \
    --skip-existing `# skip sessions already processed` \
    --client gemini `# or vllm, bigquery` \
    --model gemini-2.5-pro \
    --annotate `# visualize mouse movement and click positions` \
    --visualize `# final video creation`
  ```
What it does: Records screen activity and user input events into a session folder.
| Flag | Type | Default | Description |
|---|---|---|---|
| `-f, --fps` | int | 30 | Frames per second to capture |
| `-s, --buffer-seconds` | int | 12 | Seconds to keep in buffer |
| `-b, --buffer-all-images` | flag | off | Save all buffer images to disk |
| `-m, --monitor` | flag | off | Enable real-time monitoring of the last session |
| `-r, --max-res <width> <height>` | int int | none | Maximum resolution for screenshots |
| `-p, --precision` | `accurate` / `rough` | `accurate` | Precision level for event aggregation (presets) |
| `-c, --compression-quality` | int | 70 | JPEG compression quality |
```
logs/session_name
├── aggregations.jsonl
├── events.jsonl
├── screenshots
│   └── 1760971355.978042_reason_key_start.jpg
├── screenshots.jsonl
└── summary.png
```

What it does: Loads recorded sessions or raw screenshots, chunks and formats them, runs VLM labeling, and optionally renders annotated videos.
- `--session <PATH>` — single session folder
- `--sessions-root <PATH>` — process all sessions under this root
| Flag | Type | Default | Description |
|---|---|---|---|
| `--chunk-duration` | int | 60 | Chunk duration in seconds |
| `--fps` | int | 1 | Frame sampling rate |
| `--skip-existing` | flag | off | Skip already processed sessions |
| Flag | Description |
|---|---|
| `--screenshots-only` | Process raw screenshots without aggregations or input event annotations |
| `--video-extensions .mp4 .avi ...` | Recognized video extensions |
| `--prompt-file` | Custom prompt file (defaults to `prompts/screenshots_only.txt` in screenshots-only mode or `prompts/default.txt`) |
Note: Screenshots-only mode requires your session folder to contain a `screenshots/` subdirectory with image files (`.jpg`, `.jpeg`, or `.png`).
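A quick way to check that layout before starting a long labeling run; this is a minimal sketch with an illustrative folder name, not part of pack itself:

```python
from pathlib import Path

session = Path("path_with_screenshots")        # illustrative session folder
shots_dir = session / "screenshots"
images = sorted(
    p for p in shots_dir.glob("*")
    if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
)
if not images:
    raise SystemExit(f"No .jpg/.jpeg/.png files found in {shots_dir}")
print(f"Found {len(images)} screenshots, e.g. {images[0].name}")
```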
| Flag | Description |
|---|---|
| `--annotate` | Overlay cursor and click markers (only for standard processing) |
| `--visualize` | Create annotated video visualizations |
| Flag | Default |
|---|---|
| `--client` (`gemini` / `vllm` / `bigquery`) | `gemini` |
| `--model` | auto-selects: `gemini-2.5-flash`, `Qwen/Qwen3-VL-8B-Thinking-FP8`, or `gemini-2.0-flash-exp` |
| `--num-workers` | 4 |
| Flag | Description |
|---|---|
| `--vllm-url` | vLLM server URL (e.g., `http://localhost:8000`) |
| Flag | Description |
|---|---|
| `--bq-bucket-name` | GCS bucket name for uploading videos |
| `--bq-gcs-prefix` | Prefix/folder path in GCS bucket (default: `video_chunks`) |
| `--bq-object-table-location` | Object table location (default: `us`) |
Note: For BigQuery, `--model` should be the full model reference (e.g., `dataset.model` or `project.dataset.model`).
```
logs/session_name
├── aggregations
│   └── 000.json        # chunked aggregations used for LLM prompting
├── annotated.mp4       # final video showing captions and input events
├── captions
│   └── 000.json        # generated captions of chunk
├── captions.jsonl      # summarized generated captions
├── chunks
│   ├── 000.mp4         # video chunk used for LLM prompting
│   ├── master.mp4
│   └── prompt_000.txt  # prompt used for LLM prompting
└── data.jsonl          # final data containing raw input events and LLM-generated captions
```

Gemini + annotated video:
```bash
uv run -m label \
  --session logs/session_xyz \
  --client gemini \
  --model gemini-2.5-flash \
  --annotate \
  --visualize
```

Process all sessions in a folder:
```bash
uv run -m label \
  --sessions-root logs/ \
  --client gemini \
  --annotate
```

Screenshots-only labeling (with vLLM):
```bash
uv run -m label \
  --session path_with_screenshots \
  --screenshots-only \
  --client vllm \
  --model Qwen/Qwen3-VL-8B-Thinking-FP8 \
  --vllm-url http://localhost:8000/v1
```

Note: For deploying a vLLM server, see the vLLM documentation. For example, you can run:
```bash
vllm serve Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 --host 127.0.0.1 --port 8000 --tensor-parallel-size 8 --gpu-memory-utilization 0.9 --guided-decoding-backend outlines --enable-expert-parallel --enforce-eager
vllm serve Qwen/Qwen3-VL-8B-Thinking-FP8 --host 127.0.0.1 --port 8000 --tensor-parallel-size 4 --gpu-memory-utilization 0.9 --guided-decoding-backend outlines
```

BigQuery batch processing:
```bash
uv run -m label \
  --sessions-root logs/ \
  --client bigquery \
  --model my_dataset.gemini_flash_remote \
  --bq-bucket-name my-bucket \
  --bq-gcs-prefix my_folder \
  --bq-object-table-location us.my-connection \
  --annotate \
  --visualize
```

Note: To use BigQuery ML with Gemini models:
- Create a remote model in BigQuery that connects to Vertex AI:

  ```sql
  CREATE OR REPLACE MODEL `my-project.my_dataset.gemini_flash_remote`
  REMOTE WITH CONNECTION `my-project.us.my-connection`
  OPTIONS (endpoint = 'gemini-2.0-flash-exp');
  ```

- Set up a Cloud Storage bucket and ensure your service account has permissions to:
  - Write to the GCS bucket
  - Execute BigQuery ML queries
  - Access the remote model connection
- The `--model` parameter should be the full BigQuery model reference:
  - 2-part format: `my_dataset.gemini_flash_remote` (uses the default project)
  - 3-part format: `my-project.my_dataset.gemini_flash_remote` (explicit project)
The record module captures screenshots and user input events (`mouse_move`, `mouse_scroll`, `mouse_up`, `mouse_down`, `key_press`, `key_release`) and organizes them into per-category buffers and a global chronological buffer.
Key behavior:
- Screenshots are captured every `1 / fps` seconds (default `fps=30` → ~0.033 s between frames). The recorder maintains a cyclic buffer retaining the last `buffer_seconds` of frames (default 12 s → 360 frames at 30 fps), as sketched below. The `--buffer-all-images` (`-b`) flag exports all buffer frames to disk (off by default).
- Input events are categorized into click, move, scroll, and key buffers and are also appended to a shared chronological buffer (used later for alignment).
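To make the buffer arithmetic concrete, here is a minimal sketch of such a cyclic frame buffer. It is not pack's implementation: the `mss`/Pillow capture calls and the loop are illustrative assumptions; only the `fps` and `buffer_seconds` defaults come from the flag table above.

```python
import time
from collections import deque

import mss                  # assumed screen-capture library for this sketch
from PIL import Image

FPS = 30                    # documented --fps default
BUFFER_SECONDS = 12         # documented --buffer-seconds default

# Cyclic buffer: keeps only the newest 12 s x 30 fps = 360 frames.
frames = deque(maxlen=FPS * BUFFER_SECONDS)

with mss.mss() as sct:
    for _ in range(20 * FPS):                      # capture ~20 s for the example
        shot = sct.grab(sct.monitors[1])           # first physical monitor
        img = Image.frombytes("RGB", shot.size, shot.bgra, "raw", "BGRX")
        frames.append((time.time(), img))          # oldest frames roll out automatically
        time.sleep(1 / FPS)                        # ~0.033 s between captures
```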
Burst detection and aggregation:
- New events are appended to the category buffer and the shared buffer.
- If a new event occurs within `gap_threshold` seconds of the previous event in that category, it is considered part of the current burst and appended. If the category buffer then reaches `total_threshold` events, it is split: the first half is sent to aggregation and the remainder stays buffered (see the sketch after this list).
- If the new event falls outside `gap_threshold`, the existing burst is closed (aggregated) and the new event starts a fresh burst.
- A background worker runs every second to close bursts whose most recent event is older than `gap_threshold`, ensuring no events are lost when screenshots roll out of the cyclic buffer.
- If the cursor has moved to a different monitor since the last event, a new burst is started automatically.
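A minimal sketch of this burst logic, assuming a per-category buffer of timestamped event dicts; the threshold values and the `on_aggregate` callback are placeholders, not pack's actual defaults or API:

```python
import time

class BurstBuffer:
    """Per-category event buffer implementing the burst rules described above."""

    def __init__(self, gap_threshold=1.0, total_threshold=200, on_aggregate=print):
        self.gap_threshold = gap_threshold      # max silence (s) inside one burst (placeholder value)
        self.total_threshold = total_threshold  # max buffered events before splitting (placeholder value)
        self.on_aggregate = on_aggregate        # called with a closed burst (list of events)
        self.events = []                        # events of the current burst

    def add(self, event):
        ts = event["timestamp"]
        if self.events and ts - self.events[-1]["timestamp"] > self.gap_threshold:
            # Gap too large: close the current burst, start a fresh one with this event.
            self.on_aggregate(self.events)
            self.events = [event]
            return
        self.events.append(event)
        if len(self.events) >= self.total_threshold:
            # Burst grew too long: aggregate the first half, keep the rest buffered.
            half = len(self.events) // 2
            self.on_aggregate(self.events[:half])
            self.events = self.events[half:]

    def flush_stale(self, now=None):
        """Run periodically (e.g. every second) to close bursts that went quiet."""
        now = time.time() if now is None else now
        if self.events and now - self.events[-1]["timestamp"] > self.gap_threshold:
            self.on_aggregate(self.events)
            self.events = []
```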
Aggregation flow (high level):
- Aggregation requests are queued for the screenshots immediately before and after a burst (the recorder picks screenshots at ±75 ms around the burst edges).
- A worker ensures no intervening requests could alter the burst end time (bounded by `total_threshold`); when safe, the following request’s start time can be used to set the current burst end.
- All events between burst start and end are pulled from the shared buffer and saved alongside the before/after screenshots into `aggregations.jsonl`. All disk writes are performed asynchronously so the recorder loop stays responsive.
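To illustrate the ±75 ms selection described above, here is a small sketch that picks the buffered frames nearest to the burst edges. The helper name and exact selection rule are assumptions; `frames` is a buffer of `(timestamp, image)` pairs like the one sketched earlier.

```python
def pick_edge_frames(frames, burst_start, burst_end, margin=0.075):
    """Return the frames closest to burst_start - margin and burst_end + margin."""
    snapshot = list(frames)                     # freeze the cyclic buffer
    if not snapshot:
        return None, None
    before = min(snapshot, key=lambda f: abs(f[0] - (burst_start - margin)))
    after = min(snapshot, key=lambda f: abs(f[0] - (burst_end + margin)))
    return before, after
```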
The label module:
- Loads sessions or raw screenshots, chunks them and their logs, and prepares inputs for the VLM.
- Uses prompts (in `label/prompts`) to instruct the VLM to generate captions that describe the user's actions and context.
- Produces `captions.jsonl` and `data.jsonl` (captions aligned to screenshots and events); see the loading example after this list.
- Optionally renders an annotated video (`annotated.mp4`) showing captions and event visualizations overlaid on frames.
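Both outputs use the `.jsonl` extension, so each line should hold one JSON record; a rough sketch with an illustrative session path (field names are left to inspection rather than guessed):

```python
import json
from pathlib import Path

def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

session = Path("logs/session_name")                 # illustrative session folder
captions = read_jsonl(session / "captions.jsonl")   # summarized generated captions
data = read_jsonl(session / "data.jsonl")           # raw input events + captions

print(f"{len(captions)} caption records, {len(data)} data records")
if data:
    print("fields of the first record:", sorted(data[0]))
```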
The label step performs a second layer of aggregation: it uses the bursts detected at recording time and further refines and annotates them with VLM outputs to create final human-readable summaries.