A desktop application for generating and managing captions for images and videos using AI vision models. This is a Rust rewrite of spacecat sage, built with Tauri, React, and TypeScript.
- AI-Powered Captions: Generate descriptive captions for images and videos using OpenAI compatible vision models
- Batch Processing: Select multiple files to generate captions in bulk
- File Management: Organize and manage your media files with an intuitive interface
- Caption Editing: Manually edit AI-generated captions with a built-in editor
- Image Cropping: Crop images directly within the application
- Video Processing:
- Trim videos to specific time ranges
- Crop videos to specific dimensions
- Extract frames from videos for captioning
- Project Management: Create, save, and manage multiple captioning projects
- Export Options: Export your captioned media as a directory or ZIP file
This is a simplified implementation compared to spacecat sage:
- Direct filesystem operations instead of SQLite database (pros: simpler, cons: potentially slower for large projects)
- Simplified API configuration focused primarily on OpenAI's API (the original prioritized JoyCaption via vLLM and other OpenAI-compatible APIs with built-in prompts)
- More lightweight overall with a smaller codebase
However, using the older gpt-4o-2024-05-13 provides good enough support for human content, and you can still use JoyCaption if you want. The API implementation is the same interface (OpenAI-compatible), just requiring your own base URL when using your own models.
- FFmpeg - Required for video processing functionality
While the application is built with Tauri and should work cross-platform, official builds are only signed for:
- macOS (Intel and Apple Silicon)
For other platforms, you can run the development server instead of building the application.
- Download the latest release for macOS from the Releases page
- Install the application using the provided installer or by extracting the archive
-
Clone the repository
git clone https://github.com/markuryy/spacecat-caption.git -
Install dependencies
curl -fsSL https://bun.sh/install | bash # bun curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh # rust
cd spacecat-caption bun install cd src-tauri cargo fetch -
Run the development server
cd .. bun run tauri dev
- Open the application and click the gear icon to open Settings
- Enter your OpenAI API key and configure other settings
- Click the "Select Folder" button to choose a directory containing images and videos
- The application will create a working copy of your files for safe editing
- Select one or more files in the sidebar
- Click "Generate Captions" to process all selected files
- For individual files, select the file and use Shift+G or click the "Generate Caption" button
For videos, only the first or (rough) current frame is sent when captioning via API. Be sure to adjust your prompts accordingly, as the LLM can typically infer what occurs in a video from the first frame.
- Select an image or video in the sidebar
- Use the crop or trim tools in the editor panel
- Save your changes to update the file
- Captions are automatically saved when modified
- Use the export button to save your project as a directory or ZIP file
- API URL: The URL for the OpenAI API endpoint (default: https://api.openai.com/v1/chat/completions)
- API Key: Your OpenAI API key
- Model: The AI model to use (recommended: gpt-4o-2024-05-13 or gpt-4o)
- Image Detail Level:
- Auto: Let the model decide based on image size
- Low: Uses a 512px x 512px version (85 tokens)
- High: First uses low-res, then creates detailed crops (255 tokens)
- Caption Prompt: The prompt text to use when generating captions
- Shift + ←: Navigate to previous image/video
- Shift + →: Navigate to next image/video
- Shift + G: Generate caption for current image/video
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License.
- Built with Tauri, React, and TypeScript
- Uses OpenAI's API for AI-powered captioning
- FFmpeg for video processing functionality


