Pack records and aggregates your computer use — screenshots plus input events (click, keypress, scroll, cursor move). It groups activity into event bursts and uses a VLM pipeline to generate human-readable captions describing what happened.
- Python 3.11+ (3.12.7 recommended)
- ffmpeg (for video generation)
- uv
```bash
git clone https://github.com/GeneralUserModels/pack.git # Clone repo
cd pack
cp .env.example .env # Optionally add your Gemini API key here
```

Two main entry points:
- Record a session
  ```bash
  uv run -m record --monitor # start and monitor recording. Use --accessibility to capture accessibility info.
  CTRL+C # stop recording
  ```

- Label a session

  ```bash
  uv run -m label \
    --sessions-root logs/ `# label all sessions in logs dir` \
    --skip-existing `# skip sessions already processed` \
    --client gemini `# or vllm, bigquery` \
    --model gemini-2.5-pro \
    --annotate `# visualize mouse movement and click positions` \
    --visualize `# final video creation`
  ```
What it does: Records screen activity and user input events into a session folder.
| Flag | Type | Default | Description |
|---|---|---|---|
| `-f, --fps` | int | 30 | Frames per second to capture |
| `-s, --buffer-seconds` | int | 12 | Seconds to keep in buffer |
| `-b, --buffer-all-images` | flag | off | Save all buffer images to disk |
| `-m, --monitor` | flag | off | Enable real-time monitoring of the last session |
| `-r, --max-res <width> <height>` | int int | none | Maximum resolution for screenshots |
| `-p, --precision` | `accurate` / `rough` | `accurate` | Precision level for event aggregation (presets) |
| `-c, --compression-quality` | int | 70 | JPEG compression quality |
```
logs/session_name
├── aggregations.jsonl
├── events.jsonl
├── screenshots
│   └── 1760971355.978042_reason_key_start.jpg
├── screenshots.jsonl
└── summary.png
```

What it does: Loads recorded sessions or raw screenshots, chunks and formats them, runs VLM labeling, and optionally renders annotated videos.
- `--session <PATH>` — single session folder
- `--sessions-root <PATH>` — process all sessions under this root
| Flag | Type | Default | Description |
|---|---|---|---|
| `--chunk-duration` | int | 60 | Chunk duration in seconds |
| `--fps` | int | 1 | Frame sampling rate |
| `--skip-existing` | flag | off | Skip already processed sessions |
| Flag | Description |
|---|---|
| `--screenshots-only` | Process raw screenshots without aggregations or input event annotations |
| `--video-extensions .mp4 .avi ...` | Recognized video extensions |
| `--prompt-file` | Custom prompt file (defaults to `prompts/screenshots_only.txt` in screenshots-only mode or `prompts/default.txt`) |
Note: Screenshots-only mode requires your session folder to contain a `screenshots/` subdirectory with image files (`.jpg`, `.jpeg`, or `.png`).
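A quick way to check that layout before starting a long labeling run; this is a minimal sketch with an illustrative folder name, not part of pack itself:

```python
from pathlib import Path

session = Path("path_with_screenshots")        # illustrative session folder
shots_dir = session / "screenshots"
images = sorted(
    p for p in shots_dir.glob("*")
    if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
)
if not images:
    raise SystemExit(f"No .jpg/.jpeg/.png files found in {shots_dir}")
print(f"Found {len(images)} screenshots, e.g. {images[0].name}")
```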
| Flag | Description |
|---|---|
| `--annotate` | Overlay cursor and click markers (only for standard processing) |
| `--visualize` | Create annotated video visualizations |
| Flag | Default |
|---|---|
| `--client` (`gemini` / `vllm` / `bigquery`) | `gemini` |
| `--model` | auto-selects: `gemini-2.5-flash`, `Qwen/Qwen3-VL-8B-Thinking-FP8`, or `gemini-2.0-flash-exp` |
| `--num-workers` | 4 |
| Flag | Description |
|---|---|
| `--vllm-url` | vLLM server URL (e.g., `http://localhost:8000`) |
| Flag | Description |
|---|---|
| `--bq-bucket-name` | GCS bucket name for uploading videos |
| `--bq-gcs-prefix` | Prefix/folder path in GCS bucket (default: `video_chunks`) |
| `--bq-object-table-location` | Object table location (default: `us`) |
Note: For BigQuery, `--model` should be the full model reference (e.g., `dataset.model` or `project.dataset.model`).
```
logs/session_name
├── aggregations
│   └── 000.json        # chunked aggregations used for LLM prompting
├── annotated.mp4       # final video showing captions and input events
├── captions
│   └── 000.json        # generated captions of chunk
├── captions.jsonl      # summarized generated captions
├── chunks
│   ├── 000.mp4         # video chunk used for LLM prompting
│   ├── master.mp4
│   └── prompt_000.txt  # prompt used for LLM prompting
└── data.jsonl          # final data containing raw input events and LLM-generated captions
```

Gemini + annotated video:
```bash
uv run -m label \
  --session logs/session_xyz \
  --client gemini \
  --model gemini-2.5-flash \
  --annotate \
  --visualize
```

Process all sessions in a folder:
```bash
uv run -m label \
  --sessions-root logs/ \
  --client gemini \
  --annotate
```

Screenshots-only labeling (with vLLM):
```bash
uv run -m label \
  --session path_with_screenshots \
  --screenshots-only \
  --client vllm \
  --model Qwen/Qwen3-VL-8B-Thinking-FP8 \
  --vllm-url http://localhost:8000/v1
```

Note: For deploying a vLLM server, see the vLLM documentation. For example, you can run:
```bash
vllm serve Qwen/Qwen3-VL-30B-A3B-Thinking-FP8 --host 127.0.0.1 --port 8000 --tensor-parallel-size 8 --gpu-memory-utilization 0.9 --guided-decoding-backend outlines --enable-expert-parallel --enforce-eager
vllm serve Qwen/Qwen3-VL-8B-Thinking-FP8 --host 127.0.0.1 --port 8000 --tensor-parallel-size 4 --gpu-memory-utilization 0.9 --guided-decoding-backend outlines
```

BigQuery batch processing:
```bash
uv run -m label \
  --sessions-root logs/ \
  --client bigquery \
  --model my_dataset.gemini_flash_remote \
  --bq-bucket-name my-bucket \
  --bq-gcs-prefix my_folder \
  --bq-object-table-location us.my-connection \
  --annotate \
  --visualize
```

Note: To use BigQuery ML with Gemini models:
- Create a remote model in BigQuery that connects to Vertex AI:

  ```sql
  CREATE OR REPLACE MODEL `my-project.my_dataset.gemini_flash_remote`
  REMOTE WITH CONNECTION `my-project.us.my-connection`
  OPTIONS (endpoint = 'gemini-2.0-flash-exp');
  ```

- Set up a Cloud Storage bucket and ensure your service account has permissions to:
  - Write to the GCS bucket
  - Execute BigQuery ML queries
  - Access the remote model connection
- The `--model` parameter should be the full BigQuery model reference:
  - 2-part format: `my_dataset.gemini_flash_remote` (uses the default project)
  - 3-part format: `my-project.my_dataset.gemini_flash_remote` (explicit project)
The record module captures screenshots and user input events (`mouse_move`, `mouse_scroll`, `mouse_up`, `mouse_down`, `key_press`, `key_release`) and organizes them into per-category buffers and a global chronological buffer.
Key behavior:
- Screenshots are captured every `1 / fps` seconds (default `fps=30` → ~0.033 s between frames). The recorder maintains a cyclic buffer retaining the last `buffer_seconds` of frames (default 12 s → 360 frames at 30 fps), as sketched below. The `--buffer-all-images` (`-b`) flag exports all buffer frames to disk (off by default).
- Input events are categorized into click, move, scroll, and key buffers and are also appended to a shared chronological buffer (used later for alignment).
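To make the buffer arithmetic concrete, here is a minimal sketch of such a cyclic frame buffer. It is not pack's implementation: the `mss`/Pillow capture calls and the loop are illustrative assumptions; only the `fps` and `buffer_seconds` defaults come from the flag table above.

```python
import time
from collections import deque

import mss                  # assumed screen-capture library for this sketch
from PIL import Image

FPS = 30                    # documented --fps default
BUFFER_SECONDS = 12         # documented --buffer-seconds default

# Cyclic buffer: keeps only the newest 12 s x 30 fps = 360 frames.
frames = deque(maxlen=FPS * BUFFER_SECONDS)

with mss.mss() as sct:
    for _ in range(20 * FPS):                      # capture ~20 s for the example
        shot = sct.grab(sct.monitors[1])           # first physical monitor
        img = Image.frombytes("RGB", shot.size, shot.bgra, "raw", "BGRX")
        frames.append((time.time(), img))          # oldest frames roll out automatically
        time.sleep(1 / FPS)                        # ~0.033 s between captures
```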
Burst detection and aggregation:
- New events are appended to the category buffer and the shared buffer.
- If a new event occurs within `gap_threshold` seconds of the previous event in that category, it is considered part of the current burst and appended. If the category buffer then reaches `total_threshold` events, it is split: the first half is sent to aggregation and the remainder stays buffered (see the sketch after this list).
- If the new event falls outside `gap_threshold`, the existing burst is closed (aggregated) and the new event starts a fresh burst.
- A background worker runs every second to close bursts whose most recent event is older than `gap_threshold`, ensuring no events are lost when screenshots roll out of the cyclic buffer.
- If the cursor has moved to a different monitor since the last event, a new burst is started automatically.
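A minimal sketch of this burst logic, assuming a per-category buffer of timestamped event dicts; the threshold values and the `on_aggregate` callback are placeholders, not pack's actual defaults or API:

```python
import time

class BurstBuffer:
    """Per-category event buffer implementing the burst rules described above."""

    def __init__(self, gap_threshold=1.0, total_threshold=200, on_aggregate=print):
        self.gap_threshold = gap_threshold      # max silence (s) inside one burst (placeholder value)
        self.total_threshold = total_threshold  # max buffered events before splitting (placeholder value)
        self.on_aggregate = on_aggregate        # called with a closed burst (list of events)
        self.events = []                        # events of the current burst

    def add(self, event):
        ts = event["timestamp"]
        if self.events and ts - self.events[-1]["timestamp"] > self.gap_threshold:
            # Gap too large: close the current burst, start a fresh one with this event.
            self.on_aggregate(self.events)
            self.events = [event]
            return
        self.events.append(event)
        if len(self.events) >= self.total_threshold:
            # Burst grew too long: aggregate the first half, keep the rest buffered.
            half = len(self.events) // 2
            self.on_aggregate(self.events[:half])
            self.events = self.events[half:]

    def flush_stale(self, now=None):
        """Run periodically (e.g. every second) to close bursts that went quiet."""
        now = time.time() if now is None else now
        if self.events and now - self.events[-1]["timestamp"] > self.gap_threshold:
            self.on_aggregate(self.events)
            self.events = []
```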
Aggregation flow (high level):
- Aggregation requests are queued for the screenshots immediately before and after a burst (the recorder picks screenshots at ±75 ms around the burst edges).
- A worker ensures no intervening requests could alter the burst end time (bounded by `total_threshold`); when safe, the following request’s start time can be used to set the current burst end.
- All events between burst start and end are pulled from the shared buffer and saved alongside the before/after screenshots into `aggregations.jsonl`. All disk writes are performed asynchronously so the recorder loop stays responsive.
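To illustrate the ±75 ms selection described above, here is a small sketch that picks the buffered frames nearest to the burst edges. The helper name and exact selection rule are assumptions; `frames` is a buffer of `(timestamp, image)` pairs like the one sketched earlier.

```python
def pick_edge_frames(frames, burst_start, burst_end, margin=0.075):
    """Return the frames closest to burst_start - margin and burst_end + margin."""
    snapshot = list(frames)                     # freeze the cyclic buffer
    if not snapshot:
        return None, None
    before = min(snapshot, key=lambda f: abs(f[0] - (burst_start - margin)))
    after = min(snapshot, key=lambda f: abs(f[0] - (burst_end + margin)))
    return before, after
```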
The label module:
- Loads sessions or raw screenshots, chunks them and their logs, and prepares inputs for the VLM.
- Uses prompts (in `label/prompts`) to instruct the VLM to generate captions that describe the user's actions and context.
- Produces `captions.jsonl` and `data.jsonl` (captions aligned to screenshots and events); see the loading example after this list.
- Optionally renders an annotated video (`annotated.mp4`) showing captions and event visualizations overlaid on frames.
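Both outputs use the `.jsonl` extension, so each line should hold one JSON record; a rough sketch with an illustrative session path (field names are left to inspection rather than guessed):

```python
import json
from pathlib import Path

def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

session = Path("logs/session_name")                 # illustrative session folder
captions = read_jsonl(session / "captions.jsonl")   # summarized generated captions
data = read_jsonl(session / "data.jsonl")           # raw input events + captions

print(f"{len(captions)} caption records, {len(data)} data records")
if data:
    print("fields of the first record:", sorted(data[0]))
```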
The label step performs a second layer of aggregation: it uses the bursts detected at recording time and further refines and annotates them with VLM outputs to create final human-readable summaries.