Models

Every AI model in OpenCut AI is open source and runs locally on your machine. Cloud APIs are also available for Indian-language and multilingual content.

Language Model
Speech-to-Text
Image Generation
Text-to-Speech
Indian Languages
Multilingual TTS/STT
Speaker Detection
Face Detection
Emotion Detection
Model Optimization
Background Removal
Audio Processing

Lite

4-8 GB

Llama 3.2 1B / Kimi K2 Q3 + Whisper Base + noisereduce

Runs on any machine. Basic AI commands and transcription. Kimi K2 Q3 brings strong reasoning even in the Lite tier.

Recommended

Standard

8-16 GB

Kimi K2 Q4 / Llama 3.2 3B / Mistral 7B + Whisper Small + XTTS v2 + rembg

Recommended for most users. Kimi K2 Q4 delivers frontier-class reasoning with TurboQuant compression, making it the sweet spot for video scripts.

Pro

16-32+ GB

Kimi K2 Q5 / Llama 3.1 8B + Whisper Medium + SDXL + XTTS v2 + full stack

Best quality across the board. Kimi K2 Q5 is near-lossless at 5-bit with TurboQuant. A GPU is recommended for image generation and fast Kimi inference.
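The three tiers map onto memory ranges, so picking one reduces to a threshold lookup. The following is a minimal sketch, assuming the boundaries sit at 8 GB and 16 GB. The function name `recommend_tier` and the decision to place exactly 8 GB in Standard are illustrative assumptions, not OpenCut AI's actual hardware detection.

```python
def recommend_tier(ram_gb: float) -> str:
    """Map available memory (in GB) to one of the tier names listed above.

    Hypothetical helper: the boundary handling is an assumption, because the
    published ranges (4-8, 8-16, 16-32+) overlap at 8 GB and 16 GB.
    """
    if ram_gb < 4:
        raise ValueError("below the 4 GB minimum listed for the Lite tier")
    if ram_gb < 8:
        return "Lite"
    if ram_gb < 16:
        return "Standard"
    return "Pro"
```

A real detector would probably also weigh GPU memory, which the Pro tier recommends for image generation.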

TurboQuant Memory Compression

Built-in

All tiers benefit from TurboQuant KV cache compression, which cuts memory use for the full AI stack from 35 GB to 15 GB. One-click setup in Settings auto-detects your hardware and recommends the best configuration.

4-bit: 0.9986 cosine similarity (near-lossless)
3-bit: 0.9953 cosine similarity (5x compression)
2-bit: 0.9874 cosine similarity (7.3x compression)
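The cosine-similarity figures measure how closely a compressed vector points in the same direction as the original, with 1.0 meaning identical direction. The sketch below shows how such a number is computed. It uses plain uniform quantization on random data as a stand-in, not TurboQuant's own method, so its output will not match the figures quoted here. It only illustrates the trend that fewer bits give lower similarity.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def uniform_quantize(vec, bits):
    """Round each value to one of 2**bits evenly spaced levels, then map back.

    Stand-in quantizer for illustration only; not the TurboQuant algorithm.
    """
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + round((x - lo) / step) * step for x in vec]


rng = random.Random(0)  # fixed seed so runs are repeatable
original = [rng.gauss(0.0, 1.0) for _ in range(4096)]
similarity = {
    bits: cosine_similarity(original, uniform_quantize(original, bits))
    for bits in (2, 3, 4)
}
for bits, sim in similarity.items():
    print(f"{bits}-bit: cosine similarity {sim:.4f}")
```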

All models

Kimi K2

by MoonshotAI

Language Model

MoonshotAI's flagship open-source model — 1T total / 32B active MoE architecture. Frontier-class reasoning for complex video scripts, agentic editing commands, and long-context analysis. Runs via Ollama (Q3/Q4/Q5 GGUF) or TurboQuant (HuggingFace, 4-bit NF4). Context: 128K tokens.

commands, scripts, reasoning, long-context, turboquant
1T (32B active) MoE | Kimi K2 Open

Kimi VL A3B

by MoonshotAI

Language Model

Kimi Vision-Language model with 3B active parameters. Understands images and text for scene analysis, video frame description, and multimodal editing commands. Two variants: Instruct (general) and Thinking (chain-of-thought reasoning). Runs via TurboQuant with 4-bit NF4 quantization.

multimodal, vision, reasoning, turboquant
3B active params | Apache 2.0

Whisper

by OpenAI

Speech-to-Text

Automatic speech recognition with word-level timestamps. Powers transcription, subtitles, and text-based editing. Available from tiny (39M) to large-v3 (1.5B). TurboQuant compression lets Whisper Medium fit where only Base could before.

transcription, subtitles, timestamps
39M - 1.5B params | MIT
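Whisper's word-level timestamps are what make subtitle generation possible: timed words are grouped into cues and written out with SRT-style timestamps. The sketch below assumes each word arrives as a dict with `word`, `start`, and `end` keys, and that cues are split every `max_words` words. Both the input shape and the grouping rule are illustrative assumptions, not OpenCut AI's actual subtitle logic.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group timed words into numbered SRT cues of at most max_words each.

    `words` is a list of dicts with "word", "start", and "end" keys; this
    input shape is an assumption made for the sketch.
    """
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"].strip() for w in chunk)
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

Splitting on a fixed word count keeps the example short. A production splitter would more likely break on pauses or punctuation.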

Llama 3.2 / 3.1

by Meta

Language Model

Powers AI commands, script writing, chapter analysis, clip finding, smart suggestions, and prompt enhancement. Runs via Ollama. Available at 1B (Lite), 3B (Standard), and 8B (Pro) with Q3/Q4/Q5 quantization.

commands, scripts, analysis, chapters
1B - 8B params | Llama 3 Community

Mistral 7B

by Mistral AI

Language Model

High-quality 7B instruction-tuned model. Available at Q4 (Standard tier) and Q5 (Pro tier). Strong at structured JSON generation for editor commands and content analysis.

commands, scripts, json, analysis
7B params | Apache 2.0

Qwen2.5

by Alibaba

Language Model

TurboQuant-validated model family from 0.5B to 14B. The 3B Instruct variant is benchmarked with full compression data (0.9986 cosine similarity at 4-bit). Also includes Coder variants for technical content.

turboquant, commands, coding, analysis
0.5B - 14B params | Apache 2.0

Stable Diffusion XL

by Stability AI

Image Generation

Generate images from text prompts for video overlays, thumbnails, and B-roll. Supports SDXL Turbo (4-step fast generation) and full SDXL (20-step quality).

images, overlays, thumbnails
3.5B - 6.6B params | OpenRAIL-M

FLUX.1

by Black Forest Labs

Image Generation

Next-generation text-to-image model with stronger prompt adherence and image quality. An alternative to SDXL when higher-quality results matter more than speed.

images, overlays, high-quality
12B params | Apache 2.0

XTTS v2

by Coqui

Text-to-Speech

Local voice synthesis with voice cloning from a 6-second sample. Supports 17 languages. TurboQuant at 3-bit shrinks it from 1.8 GB to 0.6 GB while preserving voice quality.

voiceover, voice-cloning, local
467M params | MPL 2.0

Sarvam AI (Saaras / Bulbul)

by Sarvam AI

Indian Languages

Purpose-built for 22 Indian regional languages. Saaras v3 for transcription, Bulbul v3 for text-to-speech with 37+ natural voices, plus translation and transliteration. Hindi, Tamil, Telugu, Kannada, Bengali, Malayalam, and more.

indian-languages, transcription, tts, translation
Cloud API | API Key Required

Smallest AI (Waves)

by Smallest AI

Multilingual TTS/STT

Low-latency cloud TTS and STT. Lightning TTS offers 80+ voices across 15 languages at ~100 ms latency. Pulse STT covers 39 languages with speaker diarization and emotion detection.

tts, stt, multilingual, fast
Cloud API | API Key Required

Pyannote

by Hervé Bredin

Speaker Detection

Neural speaker diarization that identifies who is speaking and when. Powers multi-speaker detection and automatic speaker-boundary cuts. Falls back to FFmpeg silence-based detection when Pyannote is unavailable.

speakers, diarization, podcast
~90M params | MIT
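The silence-based fallback can be pictured as parsing the log of FFmpeg's `silencedetect` filter and placing cuts inside the pauses it reports. The sketch below assumes the log came from a command such as `ffmpeg -i input.wav -af silencedetect=noise=-30dB:d=0.5 -f null -`. The midpoint rule and the `min_silence` threshold are assumptions made for illustration, not OpenCut AI's actual fallback. Silence gaps mark pauses only, so this approach cannot say which speaker is talking.

```python
import re

SILENCE_START = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
SILENCE_END = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")


def parse_silences(log_text):
    """Return (start, end) pairs, in seconds, from silencedetect log text."""
    silences = []
    start = None
    for line in log_text.splitlines():
        match = SILENCE_START.search(line)
        if match:
            start = float(match.group(1))
            continue
        match = SILENCE_END.search(line)
        if match and start is not None:
            silences.append((start, float(match.group(1))))
            start = None
    return silences


def cut_points(silences, min_silence=0.5):
    """Place one candidate cut at the midpoint of each long-enough silence."""
    return [(s + e) / 2 for s, e in silences if e - s >= min_silence]
```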

MediaPipe Face Detection

by Google

Face Detection

Real-time face detection for auto-reframe. Tracks faces across video frames to generate 9:16 crops for TikTok, Reels, and Shorts.

face-tracking, auto-reframe, 9:16
~1M params | Apache 2.0
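Once a face has been located, auto-reframe comes down to geometry: take a full-height 9:16 window, center it on the face, and clamp it to the frame. The sketch below assumes the detector has already produced the horizontal center of the face in pixels. The function name `vertical_crop` is hypothetical, and no MediaPipe call is shown.

```python
def vertical_crop(frame_w, frame_h, face_cx, aspect_w=9, aspect_h=16):
    """Return (x, y, width, height) of a full-height crop centered on a face.

    face_cx is the horizontal center of the detected face in pixels. The crop
    is clamped so that it never extends past the left or right frame edge.
    """
    crop_h = frame_h
    crop_w = frame_h * aspect_w // aspect_h
    if crop_w > frame_w:
        raise ValueError("frame is narrower than the target aspect ratio")
    x = int(face_cx - crop_w / 2)
    x = max(0, min(x, frame_w - crop_w))
    return x, 0, crop_w, crop_h
```

For a 1920x1080 frame this gives a 607-pixel-wide window. A real pipeline would likely also smooth `x` across frames so the crop does not jitter with every small face movement.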

SpeechBrain

by SpeechBrain

Emotion Detection

Detects emotional peaks in audio for finding the most impactful moments. Used for clip scoring and highlight detection. Falls back to FFmpeg energy-based analysis locally.

emotions, highlights, intensity
~300M params | Apache 2.0
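The energy-based fallback treats loudness as a rough proxy for emotional intensity. A minimal sketch of that idea follows, assuming audio samples are floats in the range -1 to 1 and that windows do not overlap. The helper names are hypothetical, and this shows neither how FFmpeg measures energy nor how SpeechBrain scores emotion.

```python
import math


def rms_windows(samples, window):
    """Root-mean-square energy of each consecutive, non-overlapping window."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + window]) / window)
        for i in range(0, len(samples) - window + 1, window)
    ]


def loudest_windows(samples, window, top_n=3):
    """Indices of the top_n highest-energy windows, loudest first."""
    energy = rms_windows(samples, window)
    ranked = sorted(range(len(energy)), key=lambda i: energy[i], reverse=True)
    return ranked[:top_n]
```

Multiplying a window index by `window` and dividing by the sample rate gives the start time of that candidate highlight in seconds.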

TurboQuant

by OpenCut AI

Model Optimization

KV cache compression based on Google Research's PolarQuant and QJL. Achieves 6x memory reduction at 3-bit with 0.9953 cosine similarity. Lets 7B models run in 8 GB of RAM. One-click setup in Settings.

compression, memory, optimization, kv-cache
Compression layer | MIT
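To see why KV cache compression matters, it helps to work out how large the cache gets. It stores one key and one value vector per attention head, per layer, per token, so it grows linearly with context length. The sketch below uses an assumed model shape (32 layers, 8 KV heads, head dimension 128) and a 32K-token context purely for illustration. It counts raw bits only and ignores any scale or codebook metadata a real quantizer stores, so the ratio it prints is a rough guide and not a reproduction of the figures quoted above.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits):
    """Bytes needed to hold keys and values for `tokens` positions.

    The factor of 2 counts one key and one value vector per head, per layer.
    Ignores any scale or codebook metadata a real quantizer would store.
    """
    values = 2 * layers * kv_heads * head_dim * tokens
    return values * bits / 8


# Illustrative shape only: 32 layers, 8 KV heads, head dimension 128 (assumed).
shape = dict(layers=32, kv_heads=8, head_dim=128, tokens=32_768)
fp16 = kv_cache_bytes(bits=16, **shape)
q3 = kv_cache_bytes(bits=3, **shape)
print(f"16-bit cache: {fp16 / 2**30:.2f} GiB")
print(f"3-bit cache:  {q3 / 2**30:.2f} GiB")
print(f"ratio: {fp16 / q3:.2f}x")
```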

U2-Net (rembg)

by danielgatis

Background Removal

Remove backgrounds from images to create transparent overlays. Useful for speaker cutouts, product shots, and compositing.

background-removal, compositing
176M params | MIT

noisereduce

by Tim Sainburg

Audio Processing

Spectral gating noise reduction for cleaning up audio recordings. Removes background noise, hiss, and hum with adjustable strength.

audio, denoising, cleanup
< 1M params | MIT