Create a Video Intro Editor with Python on Mac with Silero VAD

Learn how to build an AI-powered video intro generator using Python, MoviePy, and Silero VAD that automatically detects key speaking moments and creates 3D-styled intros.

Video intros take time to create manually. This guide shows you how to build a Python script that analyzes your footage, detects when people are speaking using Voice Activity Detection (VAD), and generates a 3D-styled intro montage.

The script pulls speaking moments from your existing footage and turns them into fast-paced clips with perspective effects.

What You'll Learn

  • Using Silero VAD for speech detection
  • Working with MoviePy for video composition and effects
  • Creating 3D perspective effects with FFmpeg
  • Running Python scripts with uv

Prerequisites


Before we dive into the script, you’ll need to have the following installed on your Mac:

FFmpeg

FFmpeg is essential for video processing. Install it using Homebrew:

brew install ffmpeg

uv Package Manager

We’ll use uv to run our script with all dependencies automatically managed. If you don’t have uv installed:

curl -LsSf https://astral.sh/uv/install.sh | sh

New to uv?

If you’re new to uv, check out our comprehensive guide Getting Started with uv: Setting Up Your Python Project to understand how it simplifies Python project management.

How the Video Intro Editor Works


The script works through these steps:

  1. Extracts audio from your video using FFmpeg
  2. Uses Silero VAD to identify when someone is speaking
  3. Selects the best speaking moments for the intro
  4. Applies perspective warping and blur effects
  5. Combines blurred background with 3D-warped foreground clips
  6. Renders the intro with crossfade transitions

The Complete Script


Here’s the full Python script that creates video intros. Save this as intro_generator.py:

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "openai-whisper",
#     "torch",
#     "torchaudio<2.6",
#     "soundfile",
#     "numpy<2.0.0",
#     "moviepy==1.0.3",
#     "packaging",
#     "Pillow<10.0.0",
# ]
# ///

import os
import sys
import warnings

# 1. Suppress warnings BEFORE importing moviepy
# This hides the "invalid escape sequence" text
warnings.filterwarnings("ignore")

import random
import subprocess
from pathlib import Path

import torch
import whisper
from moviepy.editor import (
    CompositeVideoClip,
    VideoFileClip,
    concatenate_videoclips,
    vfx,
)

# --- Configuration ---
MIN_SILENCE = 0.5
MIN_SPEECH = 0.25
PADDING = 0.1

# Intro Style Settings
INTRO_CLIP_COUNT = 6  # Number of fast cuts
INTRO_SPEED = 3.0  # Speed multiplier (3x fast)
CLIP_DURATION = 1.5  # Duration of each clip in seconds
OUTPUT_FILENAME = "intro_only.mp4"


def check_ffmpeg():
    try:
        subprocess.run(["ffmpeg", "-version"], capture_output=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("❌ Error: FFmpeg not found. Run: brew install ffmpeg")
        sys.exit(1)


def extract_audio(video_path, audio_path):
    subprocess.run(
        [
            "ffmpeg",
            "-y",
            "-i",
            video_path,
            "-vn",
            "-acodec",
            "pcm_s16le",
            "-ar",
            "16000",
            "-ac",
            "1",
            "-loglevel",
            "error",
            audio_path,
        ],
        check=True,
    )


def get_good_segments(audio_path):
    """Finds segments where people are actually speaking (Key Moments)."""
    print("🧠 Scanning video for key moments...")

    # Load Silero VAD
    # trust_repo=True fixes the "cache" warning/error
    model, utils = torch.hub.load(
        repo_or_dir="snakers4/silero-vad", model="silero_vad", trust_repo=True
    )
    (get_speech_timestamps, _, read_audio, _, _) = utils

    wav = read_audio(audio_path)
    vad_stamps = get_speech_timestamps(
        wav,
        model,
        threshold=0.5,
        min_speech_duration_ms=int(MIN_SPEECH * 1000),
        min_silence_duration_ms=int(MIN_SILENCE * 1000),
    )

    segments = []
    for v in vad_stamps:
        segments.append((v["start"] / 16000, v["end"] / 16000))

    return segments


def apply_blur_background(input_path, output_path):
    subprocess.run(
        [
            "ffmpeg",
            "-y",
            "-i",
            input_path,
            "-vf",
            "boxblur=40:5,eq=brightness=-0.4",
            "-c:v",
            "libx264",
            "-preset",
            "ultrafast",
            "-an",
            "-loglevel",
            "error",
            output_path,
        ],
        check=True,
    )


def apply_3d_warp(input_path, output_path, direction="left"):
    if direction == "left":
        vf = "perspective=x0=0:y0=0:x1=W:y1=H/5:x2=0:y2=H:x3=W:y3=4*H/5:sense=destination"
    else:
        vf = "perspective=x0=0:y0=H/5:x1=W:y1=0:x2=0:y2=4*H/5:x3=W:y3=H:sense=destination"

    vf += ",pad=w=iw+100:h=ih+100:x=50:y=50:color=black@0"

    subprocess.run(
        [
            "ffmpeg",
            "-y",
            "-i",
            input_path,
            "-vf",
            vf,
            "-c:v",
            "libx264",
            "-preset",
            "ultrafast",
            "-an",
            "-loglevel",
            "error",
            output_path,
        ],
        check=True,
    )


def generate_intro(video_path, segments):
    print("✨ Rendering 3D Intro (No Text)...")

    long_segments = [s for s in segments if (s[1] - s[0]) > 2.0]

    if len(long_segments) < INTRO_CLIP_COUNT:
        print(f"⚠️ Not enough footage found. Need {INTRO_CLIP_COUNT} distinct moments.")
        if not long_segments:
            return
        picks = sorted(random.choices(long_segments, k=INTRO_CLIP_COUNT))
    else:
        picks = sorted(random.sample(long_segments, INTRO_CLIP_COUNT))

    intro_clips = []
    temp_files = []

    for i, (start, end) in enumerate(picks):
        raw_dur = CLIP_DURATION * INTRO_SPEED
        mid = start + (end - start) / 2 - (raw_dur / 2)

        raw_clip = f"temp_raw_{i}.mp4"
        bg_clip = f"temp_bg_{i}.mp4"
        fg_clip = f"temp_fg_{i}.mp4"

        # 1. Extract
        subprocess.run(
            [
                "ffmpeg",
                "-y",
                "-ss",
                str(mid),
                "-t",
                str(raw_dur),
                "-i",
                video_path,
                "-c:v",
                "libx264",
                "-an",
                "-loglevel",
                "error",
                raw_clip,
            ],
            check=True,
        )

        # 2. Process
        apply_blur_background(raw_clip, bg_clip)

        direction = "left" if i % 2 == 0 else "right"
        apply_3d_warp(raw_clip, fg_clip, direction)

        # 3. Composite
        try:
            bg = VideoFileClip(bg_clip).fx(vfx.speedx, INTRO_SPEED)
            fg = VideoFileClip(fg_clip).fx(vfx.speedx, INTRO_SPEED)

            if direction == "left":
                fg = fg.set_position(
                    lambda t: (int(-50 + 50 * (t / CLIP_DURATION)), "center")
                )
            else:
                fg = fg.set_position(
                    lambda t: (int(50 - 50 * (t / CLIP_DURATION)), "center")
                )

            comp = CompositeVideoClip([bg, fg]).set_duration(CLIP_DURATION)
            if i > 0:
                comp = comp.crossfadein(0.2)

            intro_clips.append(comp)

        except Exception as e:
            print(f"   ⚠️ Error processing clip {i}: {e}")

        # Track temp files for cleanup even when compositing fails
        temp_files.extend([raw_clip, bg_clip, fg_clip])

    if not intro_clips:
        print("❌ Failed to generate intro clips.")
        return

    # Concatenate
    full_montage = concatenate_videoclips(intro_clips, method="compose")

    print("   💾 Saving video file...")
    full_montage.write_videofile(
        "temp_visual_intro.mp4", fps=24, codec="libx264", logger=None
    )

    # Add silent audio
    print("   🔊 Adding silent audio track...")
    subprocess.run(
        [
            "ffmpeg",
            "-y",
            "-i",
            "temp_visual_intro.mp4",
            "-f",
            "lavfi",
            "-i",
            "anullsrc=channel_layout=mono:sample_rate=44100",
            "-c:v",
            "copy",
            "-c:a",
            "aac",
            "-shortest",
            "-loglevel",
            "error",
            OUTPUT_FILENAME,
        ],
        check=True,
    )

    # Cleanup
    for f in temp_files:
        if os.path.exists(f):
            os.remove(f)
    if os.path.exists("temp_visual_intro.mp4"):
        os.remove("temp_visual_intro.mp4")

    print(f"✅ Success! Intro saved as: {OUTPUT_FILENAME}")


def main():
    if len(sys.argv) < 2:
        print("Usage: uv run intro_generator.py <video.mp4>")
        sys.exit(1)

    input_video = sys.argv[1]
    check_ffmpeg()

    temp_wav = "temp_analysis.wav"

    try:
        extract_audio(input_video, temp_wav)

        good_parts = get_good_segments(temp_wav)

        if not good_parts:
            print("⚠️ No speech detected. Picking random segments...")
            duration = float(
                subprocess.check_output(
                    [
                        "ffprobe",
                        "-v",
                        "error",
                        "-show_entries",
                        "format=duration",
                        "-of",
                        "default=noprint_wrappers=1:nokey=1",
                        input_video,
                    ]
                )
            )
            good_parts = [(t, t + 5) for t in range(0, int(duration), 10)]

        generate_intro(input_video, good_parts)

    finally:
        if os.path.exists(temp_wav):
            os.remove(temp_wav)


if __name__ == "__main__":
    main()

Understanding the Script


The script uses PEP 723 inline metadata to tell uv which dependencies to install:

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "openai-whisper",
#     "torch",
#     "torchaudio<2.6",
#     "soundfile",
#     "numpy<2.0.0",
#     "moviepy==1.0.3",
#     "packaging",
#     "Pillow<10.0.0",
# ]
# ///

This format lets uv install the required packages automatically. See our guide on Running Test Scripts with uv for more on this pattern.

Configuration Variables

Variable           Default           Description
MIN_SILENCE        0.5               Minimum silence duration in seconds
MIN_SPEECH         0.25              Minimum speech duration in seconds
INTRO_CLIP_COUNT   6                 Number of clips in the intro
INTRO_SPEED        3.0               Speed multiplier for clips
CLIP_DURATION      1.5               Duration of each clip in seconds
OUTPUT_FILENAME    "intro_only.mp4"  Output file name

Voice Activity Detection (VAD)

The get_good_segments() function detects speech using Silero VAD:

def get_good_segments(audio_path):
    """Finds segments where people are actually speaking (Key Moments)."""
    model, utils = torch.hub.load(
        repo_or_dir="snakers4/silero-vad", model="silero_vad", trust_repo=True
    )
    (get_speech_timestamps, _, read_audio, _, _) = utils

    wav = read_audio(audio_path)
    vad_stamps = get_speech_timestamps(
        wav,
        model,
        threshold=0.5,
        min_speech_duration_ms=int(MIN_SPEECH * 1000),
        min_silence_duration_ms=int(MIN_SILENCE * 1000),
    )
    
    return [(v["start"] / 16000, v["end"] / 16000) for v in vad_stamps]

Silero VAD runs locally and doesn’t need an API key or internet connection after the initial download.
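
Silero returns timestamps in samples at 16 kHz, which is why the script divides by 16000 to get seconds. If you want breathing room around each detected segment, a helper like the following could pad and merge them (a sketch; the script defines a PADDING constant but does not currently apply it):

```python
# Hypothetical helper: widen VAD segments and merge any that now overlap.
def pad_and_merge(segments, padding=0.1):
    """Expand each (start, end) pair by `padding` seconds, then merge overlaps."""
    padded = [(max(0.0, start - padding), end + padding) for start, end in segments]
    merged = []
    for start, end in sorted(padded):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous segment: extend it instead of appending
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Two close segments merge into one once padded
print(pad_and_merge([(1.0, 2.0), (2.1, 3.0)], padding=0.1))
```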

3D Perspective Effects

The script creates a 3D look using FFmpeg’s perspective filter:

def apply_3d_warp(input_path, output_path, direction="left"):
    if direction == "left":
        vf = "perspective=x0=0:y0=0:x1=W:y1=H/5:x2=0:y2=H:x3=W:y3=4*H/5:sense=destination"
    else:
        vf = "perspective=x0=0:y0=H/5:x1=W:y1=0:x2=0:y2=4*H/5:x3=W:y3=H:sense=destination"

This alternates between left and right perspective warps.
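
The four corner pairs (x0..x3, y0..y3) are the destination positions of the frame's corners. For the "left" warp, the left edge keeps its full height while the right edge is squeezed between H/5 and 4*H/5, which reads as the frame tilting away from the viewer. A small helper (illustrative, not part of the script) makes the geometry explicit:

```python
# Illustrative helper: build the perspective filter string for an
# arbitrary vertical squeeze of the far edge.
def perspective_filter(direction="left", squeeze=5):
    """Pull the far edge's corners inward by H/squeeze to fake a 3D tilt."""
    if direction == "left":
        # Left edge keeps full height; right edge is squeezed vertically
        return (
            f"perspective=x0=0:y0=0:x1=W:y1=H/{squeeze}:"
            f"x2=0:y2=H:x3=W:y3={squeeze - 1}*H/{squeeze}:sense=destination"
        )
    # Mirror image: the right edge keeps full height instead
    return (
        f"perspective=x0=0:y0=H/{squeeze}:x1=W:y1=0:"
        f"x2=0:y2={squeeze - 1}*H/{squeeze}:x3=W:y3=H:sense=destination"
    )

print(perspective_filter("left"))
```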

Video Compositing

The script layers a blurred background with a 3D-warped foreground:

bg = VideoFileClip(bg_clip).fx(vfx.speedx, INTRO_SPEED)
fg = VideoFileClip(fg_clip).fx(vfx.speedx, INTRO_SPEED)

if direction == "left":
    fg = fg.set_position(
        lambda t: (int(-50 + 50 * (t / CLIP_DURATION)), "center")
    )

comp = CompositeVideoClip([bg, fg]).set_duration(CLIP_DURATION)
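
The set_position lambdas animate the warped foreground: for the "left" direction, the x offset runs from -50 px at t = 0 to 0 at t = CLIP_DURATION, so each clip slides toward center as it plays. The same math in isolation:

```python
# Mirrors the set_position lambdas: a sketch of the slide-in offset math.
CLIP_DURATION = 1.5  # same value as the script's configuration

def slide_x(t, direction="left"):
    """Horizontal offset (in pixels) of the foreground clip at time t."""
    if direction == "left":
        return int(-50 + 50 * (t / CLIP_DURATION))  # -50 px -> 0 px
    return int(50 - 50 * (t / CLIP_DURATION))       # +50 px -> 0 px

print(slide_x(0.0))            # starts 50 px left of center
print(slide_x(CLIP_DURATION))  # ends centered
```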

Running the Script


With uv installed, run:

uv run intro_generator.py your_video.mp4

The first run takes longer as uv downloads dependencies. Subsequent runs use cached packages.

Expected Output

🧠 Scanning video for key moments...
✨ Rendering 3D Intro (No Text)...
   💾 Saving video file...
   🔊 Adding silent audio track...
✅ Success! Intro saved as: intro_only.mp4

Customizing the Output


Edit the script to change the intro style.

Change Number of Clips

Edit INTRO_CLIP_COUNT:

INTRO_CLIP_COUNT = 8  # More clips

Adjust Speed

Modify INTRO_SPEED:

INTRO_SPEED = 2.0  # Slower
INTRO_SPEED = 4.0  # Faster

Change Clip Duration

Adjust how long each clip appears:

CLIP_DURATION = 2.0  # Longer clips
CLIP_DURATION = 1.0  # Shorter clips
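
Note that CLIP_DURATION controls on-screen time, while CLIP_DURATION * INTRO_SPEED determines how much source footage each clip consumes. With the defaults, the finished intro runs roughly:

```python
# Back-of-the-envelope intro length using the script's default settings
INTRO_CLIP_COUNT = 6
INTRO_SPEED = 3.0
CLIP_DURATION = 1.5

intro_length = INTRO_CLIP_COUNT * CLIP_DURATION  # seconds on screen
source_per_clip = CLIP_DURATION * INTRO_SPEED    # source seconds per clip

print(intro_length)     # 9.0
print(source_per_clip)  # 4.5
```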

Modify Blur Intensity

Edit the apply_blur_background() function:

# Stronger blur
"-vf", "boxblur=60:10,eq=brightness=-0.5"

# Lighter blur
"-vf", "boxblur=20:3,eq=brightness=-0.2"

Dependencies Explained


Package          Purpose
openai-whisper   Speech recognition (used for loading audio utilities)
torch            PyTorch for running the VAD model
torchaudio       Audio processing with PyTorch
soundfile        Reading audio files
numpy            Numerical operations
moviepy          Video editing and compositing
Pillow           Image processing (required by MoviePy)
packaging        Version handling utilities

Version Constraints

The version constraints ensure compatibility between packages. torchaudio<2.6 and numpy<2.0.0 prevent breaking changes.

Troubleshooting


FFmpeg not found error

Install FFmpeg using Homebrew:

brew install ffmpeg

Verify the installation:

ffmpeg -version

Not enough footage found warning

This happens when the video doesn’t have enough speaking segments longer than 2 seconds. The script reuses segments in this case. Try:

  • Using a longer source video
  • Reducing INTRO_CLIP_COUNT
  • Lowering the minimum segment duration in the code
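
For the third option, the 2-second floor is hard-coded in generate_intro(). Lowering it (a hypothetical tweak, shown here with sample data) lets shorter speaking moments qualify:

```python
# Hypothetical (start, end) segments in seconds, as returned by the VAD step
segments = [(0.0, 1.4), (5.0, 8.0), (12.0, 12.8)]

MIN_SEGMENT = 1.0  # lowered from the script's hard-coded 2.0
long_segments = [s for s in segments if (s[1] - s[0]) > MIN_SEGMENT]

print(long_segments)  # the 1.4 s and 3.0 s segments now qualify
```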

Memory errors with large videos

For large videos, try:

  • Processing shorter clips
  • Reducing the source video resolution
  • Closing other applications

First run is slow

The first run downloads the Silero VAD model and Python dependencies. Subsequent runs are faster because uv and PyTorch cache these files.

Conclusion


This script combines Silero VAD for speech detection, FFmpeg for video processing, and MoviePy for compositing. The uv inline dependency management handles package installation automatically.

You can adjust the number of clips, speed, duration, and visual effects to match your style. The tool saves time on YouTube intros, social media content, or presentation openers.

For more Python scripting with uv, see: