---
title: "Create a Video Intro Editor with Python on Mac with Silero VAD"
description: "Learn how to build an AI-powered video intro generator using Python, MoviePy, and Silero VAD that automatically detects key speaking moments and creates 3D-styled intros."
date: 2025-12-09
categories: ["vps"]
tags: ["Python","uv"]
---

Video intros take time to create manually. This guide shows you how to build a Python script that analyzes your footage, detects when people are speaking using Voice Activity Detection (VAD), and generates a 3D-styled intro montage.

The script pulls speaking moments from your existing footage and turns them into fast-paced clips with perspective effects.

The guide covers:

- Using Silero VAD for speech detection
- Working with MoviePy for video composition and effects
- Creating 3D perspective effects with FFmpeg
- Running Python scripts with uv

## Prerequisites

---

Before we dive into the script, you'll need the following installed on your Mac:

### FFmpeg

FFmpeg is essential for video processing. Install it using Homebrew:

```bash
brew install ffmpeg
```

### uv Package Manager

We'll use `uv` to run our script with all dependencies automatically managed. If you don't have uv installed:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

If you're new to uv, check out our guide [Getting Started with uv: Setting Up Your Python Project](https://www.bitdoze.com/uv-get-start/) to understand how it simplifies Python project management.

## How the Video Intro Editor Works

---

The script works through these steps:

1. Extracts audio from your video using FFmpeg
2. Uses Silero VAD to identify when someone is speaking
3. Selects the best speaking moments for the intro
4. Applies perspective warping and blur effects
5. Combines blurred background with 3D-warped foreground clips
6. Renders the intro with crossfade transitions

## The Complete Script

---

Here's the full Python script that creates video intros.
Save this as `intro_generator.py`:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "openai-whisper",
#     "torch",
#     "torchaudio<2.6",
#     "soundfile",
#     "numpy<2.0.0",
#     "moviepy==1.0.3",
#     "packaging",
#     "Pillow<10.0.0",
# ]
# ///

import os
import sys
import warnings

# 1. Suppress warnings BEFORE importing moviepy
# This hides the "invalid escape sequence" text
warnings.filterwarnings("ignore")

import random
import subprocess
from pathlib import Path

import torch
import whisper  # imported, but not called by the functions below
from moviepy.editor import (
    CompositeVideoClip,
    VideoFileClip,
    concatenate_videoclips,
    vfx,
)

# --- Configuration ---
MIN_SILENCE = 0.5
MIN_SPEECH = 0.25
PADDING = 0.1

# Intro Style Settings
INTRO_CLIP_COUNT = 6  # Number of fast cuts
INTRO_SPEED = 3.0  # Speed multiplier (3x fast)
CLIP_DURATION = 1.5  # Duration of each clip in seconds
OUTPUT_FILENAME = "intro_only.mp4"


def check_ffmpeg():
    try:
        subprocess.run(["ffmpeg", "-version"], capture_output=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        # Missing binary or a non-zero exit code both mean FFmpeg is unusable
        print("❌ Error: FFmpeg not found. Run: brew install ffmpeg")
        sys.exit(1)


def extract_audio(video_path, audio_path):
    # 16 kHz mono 16-bit PCM is the format the VAD step reads
    subprocess.run(
        [
            "ffmpeg",
            "-y",
            "-i",
            video_path,
            "-vn",
            "-acodec",
            "pcm_s16le",
            "-ar",
            "16000",
            "-ac",
            "1",
            "-loglevel",
            "error",
            audio_path,
        ],
        check=True,
    )


def get_good_segments(audio_path):
    """Finds segments where people are actually speaking (Key Moments)."""
    print("🧠 Scanning video for key moments...")

    # Load Silero VAD
    # trust_repo=True skips torch.hub's repository trust prompt
    model, utils = torch.hub.load(
        repo_or_dir="snakers4/silero-vad", model="silero_vad", trust_repo=True
    )
    (get_speech_timestamps, _, read_audio, _, _) = utils

    wav = read_audio(audio_path)
    vad_stamps = get_speech_timestamps(
        wav,
        model,
        threshold=0.5,
        min_speech_duration_ms=int(MIN_SPEECH * 1000),
        min_silence_duration_ms=int(MIN_SILENCE * 1000),
    )

    # Timestamps are sample offsets at 16 kHz; convert them to seconds
    segments = []
    for v in vad_stamps:
        segments.append((v["start"] / 16000, v["end"] / 16000))
    return segments


def apply_blur_background(input_path, output_path):
    subprocess.run(
        [
            "ffmpeg",
            "-y",
            "-i",
            input_path,
            "-vf",
            "boxblur=40:5,eq=brightness=-0.4",
            "-c:v",
            "libx264",
            "-preset",
            "ultrafast",
            "-an",
            "-loglevel",
            "error",
            output_path,
        ],
        check=True,
    )


def apply_3d_warp(input_path, output_path, direction="left"):
    if direction == "left":
        vf = "perspective=x0=0:y0=0:x1=W:y1=H/5:x2=0:y2=H:x3=W:y3=4*H/5:sense=destination"
    else:
        vf = "perspective=x0=0:y0=H/5:x1=W:y1=0:x2=0:y2=4*H/5:x3=W:y3=H:sense=destination"

    vf += ",pad=w=iw+100:h=ih+100:x=50:y=50:color=black@0"

    subprocess.run(
        [
            "ffmpeg",
            "-y",
            "-i",
            input_path,
            "-vf",
            vf,
            "-c:v",
            "libx264",
            "-preset",
            "ultrafast",
            "-an",
            "-loglevel",
            "error",
            output_path,
        ],
        check=True,
    )


def generate_intro(video_path, segments):
    print("✨ Rendering 3D Intro (No Text)...")

    # Only segments longer than 2 seconds are candidates for the montage
    long_segments = [s for s in segments if (s[1] - s[0]) > 2.0]
    if len(long_segments) < INTRO_CLIP_COUNT:
        print(f"⚠️ Not enough footage found. Need {INTRO_CLIP_COUNT} distinct moments.")
        if not long_segments:
            return
        # Too few candidates: sample with replacement, so segments may repeat
        picks = random.choices(long_segments, k=INTRO_CLIP_COUNT)
    else:
        picks = sorted(random.sample(long_segments, INTRO_CLIP_COUNT))

    intro_clips = []
    temp_files = []

    for i, (start, end) in enumerate(picks):
        # Source footage needed per clip before the speed-up
        raw_dur = CLIP_DURATION * INTRO_SPEED
        # Center the extraction window on the segment; clamp so -ss is never negative
        clip_start = max(0.0, start + (end - start) / 2 - (raw_dur / 2))

        raw_clip = f"temp_raw_{i}.mp4"
        bg_clip = f"temp_bg_{i}.mp4"
        fg_clip = f"temp_fg_{i}.mp4"
        # Register temp files up front so they are removed even if this clip fails
        temp_files.extend([raw_clip, bg_clip, fg_clip])

        # 1. Extract
        subprocess.run(
            [
                "ffmpeg",
                "-y",
                "-ss",
                str(clip_start),
                "-t",
                str(raw_dur),
                "-i",
                video_path,
                "-c:v",
                "libx264",
                "-an",
                "-loglevel",
                "error",
                raw_clip,
            ],
            check=True,
        )

        # 2. Process
        apply_blur_background(raw_clip, bg_clip)
        direction = "left" if i % 2 == 0 else "right"
        apply_3d_warp(raw_clip, fg_clip, direction)

        # 3. Composite
        try:
            bg = VideoFileClip(bg_clip).fx(vfx.speedx, INTRO_SPEED)
            fg = VideoFileClip(fg_clip).fx(vfx.speedx, INTRO_SPEED)

            # Slide the foreground 50 px toward center over the clip
            if direction == "left":
                fg = fg.set_position(
                    lambda t: (int(-50 + 50 * (t / CLIP_DURATION)), "center")
                )
            else:
                fg = fg.set_position(
                    lambda t: (int(50 - 50 * (t / CLIP_DURATION)), "center")
                )

            comp = CompositeVideoClip([bg, fg]).set_duration(CLIP_DURATION)
            if i > 0:
                comp = comp.crossfadein(0.2)
            intro_clips.append(comp)
        except Exception as e:
            print(f" ⚠️ Error processing clip {i}: {e}")

    if not intro_clips:
        print("❌ Failed to generate intro clips.")
        return

    # Concatenate
    full_montage = concatenate_videoclips(intro_clips, method="compose")

    print(" 💾 Saving video file...")
    full_montage.write_videofile(
        "temp_visual_intro.mp4", fps=24, codec="libx264", logger=None
    )

    # Add silent audio
    print(" 🔊 Adding silent audio track...")
    subprocess.run(
        [
            "ffmpeg",
            "-y",
            "-i",
            "temp_visual_intro.mp4",
            "-f",
            "lavfi",
            "-i",
            "anullsrc=channel_layout=mono:sample_rate=44100",
            "-c:v",
            "copy",
            "-c:a",
            "aac",
            "-shortest",
            "-loglevel",
            "error",
            OUTPUT_FILENAME,
        ],
        check=True,
    )

    # Cleanup
    for f in temp_files:
        if os.path.exists(f):
            os.remove(f)
    if os.path.exists("temp_visual_intro.mp4"):
        os.remove("temp_visual_intro.mp4")

    print(f"✅ Success! Intro saved as: {OUTPUT_FILENAME}")


def main():
    if len(sys.argv) < 2:
        print("Usage: uv run intro_generator.py <video_file>")
        sys.exit(1)

    input_video = sys.argv[1]
    check_ffmpeg()

    temp_wav = "temp_analysis.wav"
    try:
        extract_audio(input_video, temp_wav)
        good_parts = get_good_segments(temp_wav)

        if not good_parts:
            print("⚠️ No speech detected. Picking random segments...")
            duration = float(
                subprocess.check_output(
                    [
                        "ffprobe",
                        "-v",
                        "error",
                        "-show_entries",
                        "format=duration",
                        "-of",
                        "default=noprint_wrappers=1:nokey=1",
                        input_video,
                    ]
                )
            )
            # Fallback: a 5-second window every 10 seconds
            good_parts = [(t, t + 5) for t in range(0, int(duration), 10)]

        generate_intro(input_video, good_parts)
    finally:
        if os.path.exists(temp_wav):
            os.remove(temp_wav)


if __name__ == "__main__":
    main()
```

## Understanding the Script

---

The script uses PEP 723 inline metadata to tell `uv` which dependencies to install:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "openai-whisper",
#     "torch",
#     "torchaudio<2.6",
#     "soundfile",
#     "numpy<2.0.0",
#     "moviepy==1.0.3",
#     "packaging",
#     "Pillow<10.0.0",
# ]
# ///
```

This format lets `uv` install the required packages automatically. See our guide on [Running Test Scripts with uv](https://www.bitdoze.com/uv-run-scripts-guide/) for more on this pattern.
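
To make the inline-metadata idea concrete, here is a minimal sketch of how a tool could locate that comment block and recover the TOML inside it. It uses the regular expression from the PEP 723 reference implementation; the helper names `extract_script_metadata` and `parse_dependencies` and the `EXAMPLE` text are hypothetical and only for illustration, and this is not a description of how `uv` itself is implemented.

```python
import re

# Regex from the PEP 723 reference implementation:
# matches a "# /// TYPE" line, commented content lines, then a closing "# ///".
BLOCK_RE = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def extract_script_metadata(source):
    """Return the TOML text of the first 'script' block, or None if absent."""
    for match in BLOCK_RE.finditer(source):
        if match.group("type") == "script":
            lines = match.group("content").splitlines(keepends=True)
            # Drop the leading "# " (or a bare "#") from every content line
            return "".join(
                line[2:] if line.startswith("# ") else line[1:] for line in lines
            )
    return None


def parse_dependencies(toml_text):
    """Parse the TOML text and return its dependency list (Python 3.11+ only)."""
    import tomllib  # standard library from Python 3.11 onward

    return tomllib.loads(toml_text).get("dependencies", [])


# Hypothetical script text, built line by line so this file does not itself
# contain a metadata block that a PEP 723 tool would pick up.
EXAMPLE = "\n".join(
    [
        "# /// script",
        '# requires-python = ">=3.10"',
        "# dependencies = [",
        '#     "moviepy==1.0.3",',
        '#     "numpy<2.0.0",',
        "# ]",
        "# ///",
        'print("hello")',
        "",
    ]
)

if __name__ == "__main__":
    print(extract_script_metadata(EXAMPLE))
```

The block is ordinary comments to Python, which is why the same file runs unchanged with plain `python` once the packages are installed some other way.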

### Configuration Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MIN_SILENCE` | 0.5 | Silence, in seconds, required before one speech segment is split from the next |
| `MIN_SPEECH` | 0.25 | Minimum speech duration in seconds; shorter bursts are ignored |
| `PADDING` | 0.1 | Defined in the script but not used by the current code |
| `INTRO_CLIP_COUNT` | 6 | Number of clips in the intro |
| `INTRO_SPEED` | 3.0 | Speed multiplier for clips |
| `CLIP_DURATION` | 1.5 | Duration of each clip in seconds, after the speed-up |
| `OUTPUT_FILENAME` | "intro_only.mp4" | Output file name |

### Voice Activity Detection (VAD)

The `get_good_segments()` function detects speech using Silero VAD:

```python
def get_good_segments(audio_path):
    """Finds segments where people are actually speaking (Key Moments)."""
    model, utils = torch.hub.load(
        repo_or_dir="snakers4/silero-vad", model="silero_vad", trust_repo=True
    )
    (get_speech_timestamps, _, read_audio, _, _) = utils

    wav = read_audio(audio_path)
    vad_stamps = get_speech_timestamps(
        wav,
        model,
        threshold=0.5,
        min_speech_duration_ms=int(MIN_SPEECH * 1000),
        min_silence_duration_ms=int(MIN_SILENCE * 1000),
    )
    # Sample offsets at 16 kHz -> seconds
    return [(v["start"] / 16000, v["end"] / 16000) for v in vad_stamps]
```

`get_speech_timestamps` reports positions as sample offsets, and the audio was extracted at 16 kHz, so dividing by 16000 converts each boundary to seconds. Silero VAD runs locally and doesn't need an API key or internet connection after the initial download.

### 3D Perspective Effects

The script creates a 3D look using FFmpeg's perspective filter:

```python
def apply_3d_warp(input_path, output_path, direction="left"):
    if direction == "left":
        vf = "perspective=x0=0:y0=0:x1=W:y1=H/5:x2=0:y2=H:x3=W:y3=4*H/5:sense=destination"
    else:
        vf = "perspective=x0=0:y0=H/5:x1=W:y1=0:x2=0:y2=4*H/5:x3=W:y3=H:sense=destination"
    # (padding and the ffmpeg call follow in the full script)
```

The four coordinate pairs place the top-left, top-right, bottom-left, and bottom-right corners of the frame. Squeezing one vertical edge to the middle three fifths of the height makes that side appear to recede. `generate_intro()` alternates `direction` between `"left"` and `"right"` on successive clips, so the tilt flips from one cut to the next.
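
If you want to experiment with a stronger or weaker tilt, one option is to generate the filter string instead of hard-coding it. The sketch below is an illustration, not part of the original script: `perspective_filter`, `alternating_directions`, and the `divisor` parameter are hypothetical names, and the only FFmpeg syntax used is the `perspective` filter expression already shown above. With the default `divisor=5` it reproduces the script's two strings exactly.

```python
def perspective_filter(direction="left", divisor=5):
    """Build an FFmpeg perspective filter string for a left or right tilt.

    The receding edge is squeezed to run from H/divisor to (divisor-1)*H/divisor,
    so a larger divisor gives a gentler tilt.
    """
    if divisor < 3:
        # divisor=2 would collapse the receding edge to a single point
        raise ValueError("divisor must be 3 or greater")

    near_top, near_bottom = "0", "H"
    far_top, far_bottom = f"H/{divisor}", f"{divisor - 1}*H/{divisor}"

    # Corner order: top-left, top-right, bottom-left, bottom-right
    if direction == "left":
        y0, y1, y2, y3 = near_top, far_top, near_bottom, far_bottom
    elif direction == "right":
        y0, y1, y2, y3 = far_top, near_top, far_bottom, near_bottom
    else:
        raise ValueError("direction must be 'left' or 'right'")

    return (
        f"perspective=x0=0:y0={y0}:x1=W:y1={y1}"
        f":x2=0:y2={y2}:x3=W:y3={y3}:sense=destination"
    )


def alternating_directions(count):
    """Mirror the script's pattern: even clips tilt left, odd clips tilt right."""
    return ["left" if i % 2 == 0 else "right" for i in range(count)]


if __name__ == "__main__":
    for direction in alternating_directions(4):
        print(direction, perspective_filter(direction, divisor=8))
```

Inside `apply_3d_warp()` you could assign `vf = perspective_filter(direction, divisor=8)` before the `pad` filter is appended; whether a given divisor looks good depends on your footage.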

### Video Compositing

The script layers a blurred background with a 3D-warped foreground:

```python
bg = VideoFileClip(bg_clip).fx(vfx.speedx, INTRO_SPEED)
fg = VideoFileClip(fg_clip).fx(vfx.speedx, INTRO_SPEED)

if direction == "left":
    fg = fg.set_position(
        lambda t: (int(-50 + 50 * (t / CLIP_DURATION)), "center")
    )

comp = CompositeVideoClip([bg, fg]).set_duration(CLIP_DURATION)
```

Both layers are sped up by `INTRO_SPEED`, and the foreground slides 50 pixels toward the center over the length of the clip. The right-tilted clips use the mirrored offset.

## Running the Script

---

With `uv` installed, run:

```bash
uv run intro_generator.py your_video.mp4
```

The first run takes longer as `uv` downloads dependencies. Subsequent runs use cached packages.

### Expected Output

```
🧠 Scanning video for key moments...
✨ Rendering 3D Intro (No Text)...
 💾 Saving video file...
 🔊 Adding silent audio track...
✅ Success! Intro saved as: intro_only.mp4
```

## Customizing the Output

---

Edit the script to change the intro style. With the defaults, the finished intro runs roughly `INTRO_CLIP_COUNT` × `CLIP_DURATION` = 6 × 1.5 = 9 seconds, and each clip consumes `CLIP_DURATION` × `INTRO_SPEED` = 4.5 seconds of source footage.

### Change Number of Clips

Edit `INTRO_CLIP_COUNT`:

```python
INTRO_CLIP_COUNT = 8  # More clips
```

### Adjust Speed

Modify `INTRO_SPEED`:

```python
INTRO_SPEED = 2.0  # Slower
INTRO_SPEED = 4.0  # Faster
```

### Change Clip Duration

Adjust how long each clip appears:

```python
CLIP_DURATION = 2.0  # Longer clips
CLIP_DURATION = 1.0  # Shorter clips
```

### Modify Blur Intensity

Edit the `-vf` argument in the `apply_blur_background()` function:

```python
# Stronger blur
"-vf", "boxblur=60:10,eq=brightness=-0.5"

# Lighter blur
"-vf", "boxblur=20:3,eq=brightness=-0.2"
```

## Dependencies Explained

---

| Package | Purpose |
|---------|---------|
| **openai-whisper** | Speech recognition; the script imports it but does not call it |
| **torch** | PyTorch for running the VAD model |
| **torchaudio** | Audio processing with PyTorch |
| **soundfile** | Reading audio files |
| **numpy** | Numerical operations |
| **moviepy** | Video editing and compositing |
| **Pillow** | Image processing (required by MoviePy) |
| **packaging** | Version handling utilities |

The version constraints ensure compatibility between packages. `torchaudio<2.6` and `numpy<2.0.0` prevent breaking changes. `moviepy==1.0.3` pins the 1.x API the script relies on (`moviepy.editor`, `.fx`, `set_position`), which MoviePy 2.0 changed, and `Pillow<10.0.0` keeps `Image.ANTIALIAS`, which MoviePy 1.0.3 references and Pillow 10 removed.

## Troubleshooting

---

If the script exits with `FFmpeg not found`, install FFmpeg using Homebrew:

```bash
brew install ffmpeg
```

Verify the installation:

```bash
ffmpeg -version
```

The `Not enough footage found` warning appears when the video doesn't have enough speaking segments longer than 2 seconds. The script reuses segments in this case. Try:

- Using a longer source video
- Reducing `INTRO_CLIP_COUNT`
- Lowering the minimum segment duration in the code (the `> 2.0` check in `generate_intro()`)

If processing is slow or memory runs short on large videos, try:

- Processing shorter clips
- Reducing the source video resolution
- Closing other applications

A slow first run is expected: it downloads the Silero VAD model and the Python dependencies. Subsequent runs are faster because `uv` and PyTorch cache these files.

## Conclusion

---

This script combines Silero VAD for speech detection, FFmpeg for video processing, and MoviePy for compositing. The `uv` inline dependency management handles package installation automatically.

You can adjust the number of clips, speed, duration, and visual effects to match your style. The tool saves time on YouTube intros, social media content, or presentation openers.

For more Python scripting with `uv`, see:

- [Getting Started with uv: Setting Up Your Python Project](https://www.bitdoze.com/uv-get-start/)
- [Running Test Scripts with uv: No Dependencies Management Required](https://www.bitdoze.com/uv-run-scripts-guide/)