Models
Every AI model in OpenCut AI is open source and runs locally on your machine. Cloud APIs available for Indian and multilingual content.
Lite
4-8 GBLlama 3.2 1B / Kimi K2 Q3 + Whisper Base + noisereduce
Runs on any machine. Basic AI commands and transcription. Kimi K2 Q3 brings strong reasoning even in the Lite tier.
Standard
8-16 GBKimi K2 Q4 / Llama 3.2 3B / Mistral 7B + Whisper Small + XTTS v2 + rembg
Recommended for most users. Kimi K2 Q4 delivers frontier reasoning with TurboQuant compression — the new sweet spot for video scripts.
Pro
16-32+ GBKimi K2 Q5 / Llama 3.1 8B + Whisper Medium + SDXL + XTTS v2 + full stack
Best quality across the board. Kimi K2 Q5 near-lossless at 5-bit with TurboQuant. GPU recommended for image generation and fast Kimi inference.
TurboQuant Memory Compression
All tiers benefit from TurboQuant KV cache compression. Full AI stack memory dropped from 35 GB to 15 GB. One-click setup in Settings auto-detects your hardware and recommends the best configuration.
4-bit
0.9986 cosine sim
Near-lossless
3-bit
0.9953 cosine sim
5x compression
2-bit
0.9874 cosine sim
7.3x compression
All models
Kimi K2
by MoonshotAI
MoonshotAI's flagship open-source model — 1T total / 32B active MoE architecture. Frontier-class reasoning for complex video scripts, agentic editing commands, and long-context analysis. Runs via Ollama (Q3/Q4/Q5 GGUF) or TurboQuant (HuggingFace, 4-bit NF4). Context: 128K tokens.
Kimi VL A3B
by MoonshotAI
Kimi Vision-Language model with 3B active parameters. Understands images and text for scene analysis, video frame description, and multimodal editing commands. Two variants: Instruct (general) and Thinking (chain-of-thought reasoning). Runs via TurboQuant with 4-bit NF4 quantization.
Whisper
by OpenAI
Automatic speech recognition with word-level timestamps. Powers transcription, subtitles, and text-based editing. Available from tiny (39M) to large-v3 (1.5B). TurboQuant compression lets Whisper Medium fit where only Base could before.
Llama 3.2 / 3.1
by Meta
Powers AI commands, script writing, chapter analysis, clip finding, smart suggestions, and prompt enhancement. Runs via Ollama. Available at 1B (Lite), 3B (Standard), and 8B (Pro) with Q3/Q4/Q5 quantization.
Mistral 7B
by Mistral AI
High quality 7B instruction-tuned model. Available at Q4 (Standard tier) and Q5 (Pro tier). Strong at structured JSON generation for editor commands and content analysis.
Qwen2.5
by Alibaba
TurboQuant-validated model family from 0.5B to 14B. The 3B Instruct variant is benchmarked with full compression data (0.9986 cosine similarity at 4-bit). Also includes Coder variants for technical content.
Stable Diffusion XL
by Stability AI
Generate images from text prompts for video overlays, thumbnails, and B-roll. Supports SDXL Turbo (4-step fast generation) and full SDXL (20-step quality).
FLUX.1
by Black Forest Labs
Next-generation text-to-image with superior prompt adherence and image quality. Alternative to SDXL for higher quality results.
XTTS v2
by Coqui
Local voice synthesis with voice cloning from a 6-second sample. Supports 17 languages. TurboQuant at 3-bit shrinks it from 1.8 GB to 0.6 GB while preserving voice quality.
Sarvam AI (Saaras / Bulbul)
by Sarvam AI
Purpose-built for 22 Indian regional languages. Saaras v3 for transcription, Bulbul v3 for text-to-speech with 37+ natural voices, plus translation and transliteration. Hindi, Tamil, Telugu, Kannada, Bengali, Malayalam, and more.
Smallest AI (Waves)
by Smallest AI
Ultra-fast cloud TTS and STT. Lightning TTS with 80+ voices across 15 languages at ~100ms latency. Pulse STT covers 39 languages with speaker diarization and emotion detection.
Pyannote
by Herve Bredin
Neural speaker diarization that identifies who is talking when. Powers multi-speaker detection and automatic speaker-boundary cuts. Falls back to FFmpeg silence-based detection when unavailable.
MediaPipe Face Detection
by Google
Real-time face detection for auto-reframe. Tracks faces across video frames to generate 9:16 crops for TikTok, Reels, and Shorts.
SpeechBrain
by SpeechBrain
Detects emotional peaks in audio for finding the most impactful moments. Used for clip scoring and highlight detection. Falls back to FFmpeg energy-based analysis locally.
TurboQuant
by OpenCut AI
KV cache compression using Google Research's PolarQuant + QJL. Achieves 6x memory reduction at 3-bit with 0.9953 cosine similarity. Makes 7B models run on 8 GB RAM. One-click setup in Settings.
U-Net (rembg)
by danielgatis
Remove backgrounds from images to create transparent overlays. Useful for speaker cutouts, product shots, and compositing.
noisereduce
by Tim Sainburg
Spectral gating noise reduction for cleaning up audio recordings. Removes background noise, hiss, and hum with adjustable strength.