Introduction

Imagine scrubbing through hours of video content, painstakingly trying to locate that one golden moment when the CEO mentioned the company's new strategy. You don't remember exactly what they said, only that they were enthusiastic. Your wrist aches from constant rewinding, and you're growing frustrated. Now, what if I told you there's a technology that could pinpoint that exact moment in seconds?

Semantic video search is an emerging technology that has the potential to revolutionize how we interact with videos. By allowing users to search for conceptual content within videos, it offers a way to bypass unnecessary fluff and get straight to the point.

Semantic video search extracts the content and internal features within a video to determine the presence of specific words, objects, themes, topics, sentiments, characters, or entities, rendering media clips easy to query, discover, and retrieve.

With applications across industries, from social media listening and brand insights to podcasts and user-generated content, semantic video search is a powerful tool for analyzing and understanding video content in context.

How it Works

Semantic video search leverages machine learning and natural language processing models trained on extensive datasets of video content. These models analyze the content and internal features within a video to identify specific elements, enabling accurate search and retrieval of video content. In the walkthrough below, frames are sampled from a video and embedded alongside the text query in a shared semantic space, so a query can be matched directly against visual content.

First, we need to extract frames from the video. Here's a simple way to do it using OpenCV in Python:

# Import the OpenCV library
import cv2

# Define a function to extract frames from a video, keeping every `frame_rate`-th frame
def extract_frames(video_path, frame_rate):

    # Create a VideoCapture object
    vidcap = cv2.VideoCapture(video_path)
    
    # Initialize the frame counter
    count = 0

    # Read the first frame from the video
    success, image = vidcap.read()

    # Initialize an empty list to hold the extracted frames
    frames = []

    # Loop over each frame in the video
    while success:
        # Keep the frame if its index is a multiple of the sampling interval
        if count % frame_rate == 0:
            frames.append(image)

        # Read the next frame from the video
        success, image = vidcap.read()

        # Increment the frame counter
        count += 1

    # Release the capture handle and return the list of extracted frames
    vidcap.release()
    return frames
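
Note that frame_rate here is really a sampling interval: with a 30 fps video, passing 30 keeps roughly one frame per second. As a quick sanity check (the file path below is just a placeholder):

# Sample roughly one frame per second from a hypothetical 30 fps video
sampled_frames = extract_frames("talk.mp4", frame_rate=30)
print(f"Extracted {len(sampled_frames)} frames")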

Then, we need to use CLIP to embed both the text queries and video frames into a shared semantic space. Here's a simple way to do it using PyTorch:

# Import the necessary libraries
import cv2
import torch
from PIL import Image
import clip  # OpenAI's CLIP package: pip install git+https://github.com/openai/CLIP.git

# Load the model
# Use CUDA if available, else use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained CLIP model and the associated preprocessing function
model, preprocess = clip.load("ViT-B/32", device=device)

# Define a preprocessing function to convert video frames into a format suitable for the model
def preprocess_frame(frame):
    # Convert the frame to a PIL Image and apply the preprocessing
    return preprocess(Image.fromarray(frame))

# Define a function to convert text into a format suitable for the model, and get its embedding
def embed_text(text):
    # Run the model in no_grad mode to prevent the computation graph from being built (saves memory)
    with torch.no_grad():
        # Tokenize the text and move it to the appropriate device
        text_encoded = clip.tokenize([text]).to(device)
        
        # Pass the tokenized text through the model to get the text embeddings
        text_embedding = model.encode_text(text_encoded)
    return text_embedding

# Define a function to get the embeddings of a list of frames
def embed_frames(frames):
    # Initialize an empty list to store the frame embeddings
    frame_embeddings = []
    
    for frame in frames:
        # Run the model in no_grad mode to prevent the computation graph from being built (saves memory)
        with torch.no_grad():
            # Preprocess the frame and add a batch dimension
            frame_preprocessed = preprocess_frame(frame).unsqueeze(0).to(device)
            
            # Pass the preprocessed frame through the model to get the frame embeddings
            frame_embedding = model.encode_image(frame_preprocessed)
            
        # Add the frame embeddings to the list
        frame_embeddings.append(frame_embedding)
        
    # Concatenate the per-frame embeddings into a single (num_frames, embedding_dim) tensor
    return torch.cat(frame_embeddings)
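
To make the tensor shapes concrete, here's a quick check using the frames sampled earlier (the query text is just an example; 512 is the embedding size of ViT-B/32):

# Embed an example query and the sampled frames, then inspect the shapes
query_embedding = embed_text("a person speaking on stage")  # shape: (1, 512)
frame_matrix = embed_frames(sampled_frames)                 # shape: (num_frames, 512)
print(query_embedding.shape, frame_matrix.shape)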

Finally, we need to compute the similarity between the text query and each frame, and return the frame with the highest similarity. Here's how we can do it:

def semantic_search(text, video_path):
    # Extract frames from the video, keeping every 30th frame
    frames = extract_frames(video_path, frame_rate=30)

    # Embed the text and the frames
    text_embedding = embed_text(text)
    frame_embeddings = embed_frames(frames)

    # Normalize the embeddings so the dot product below is a cosine similarity
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
    frame_embeddings = frame_embeddings / frame_embeddings.norm(dim=-1, keepdim=True)

    # Compute the similarity between the text and each frame
    similarities = (text_embedding @ frame_embeddings.T).squeeze(0)

    # Return the frame with the highest similarity
    most_similar_frame_index = similarities.argmax().item()
    return frames[most_similar_frame_index]
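
A minimal way to try it end to end, assuming a local video file (both paths below are placeholders):

# Find the frame that best matches the query and save it to disk
best_frame = semantic_search("CEO announcing the new strategy", "all_hands.mp4")
cv2.imwrite("best_match.jpg", best_frame)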

In practice, you'd also want to run this on a GPU, store the embeddings in a vector database, periodically re-train the model, batch-update the embeddings, and do all of it at scale.
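
As one possible sketch of the vector-database step (FAISS is an assumption here, not something the snippets above require), the frame embeddings could be indexed for fast nearest-neighbor lookup:

# A minimal sketch of indexing frame embeddings with FAISS (assumed dependency: pip install faiss-cpu)
import faiss

def build_frame_index(frame_embeddings):
    # Convert the (num_frames, dim) tensor to a contiguous float32 NumPy array
    vectors = frame_embeddings.cpu().numpy().astype("float32")

    # Normalize so inner-product search behaves like cosine similarity
    faiss.normalize_L2(vectors)

    # Build a flat inner-product index and add the vectors
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    return index

def search_frame_index(index, text_embedding, k=5):
    # Prepare the query the same way as the indexed vectors
    query = text_embedding.cpu().numpy().astype("float32")
    faiss.normalize_L2(query)

    # Return the similarity scores and frame indices of the top-k matches
    scores, frame_ids = index.search(query, k)
    return scores[0], frame_ids[0]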

All this in Two Lines of Code

In addition to contextual search across audio, image, and text, Mixpeek encapsulates everything above, plus model fine-tuning at scale, in a simple-to-use API:

mixpeek.index(["file_1.mp4", "file_2.mp4"])

mixpeek.search("good sportsmanship")

Sample response:

[
  {
    "content_id": "6452f04d4c0c0888bdc6b97c",
    "metadata": {
      "file_ext": "mp4",
      "file_id": "ebc289d7-44e1-4672-bf3c-ccfa490b7k2d",
      "file_url": "https://mixpeek.s3.amazonaws.com/<user>/<file>.mp4",
      "filename": "CR-9146f0.mp4",
    },
    "score": 0.636489987373352,
    "timestamps": [
      2.5035398230088495,
      1.2517699115044247,
      3.755309734513274
    ]
  }
]
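
The timestamps field is in seconds, so the matching moments can be pulled straight back out of the source video with OpenCV (this assumes the file has been downloaded locally):

# Grab the frames at the returned timestamps from a locally downloaded copy of the video
import cv2

def frames_at_timestamps(video_path, timestamps):
    vidcap = cv2.VideoCapture(video_path)
    frames = []
    for t in timestamps:
        # Seek to the timestamp (in milliseconds) and read the frame at that position
        vidcap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
        success, image = vidcap.read()
        if success:
            frames.append(image)
    vidcap.release()
    return frames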

So Many Use Cases

Educational Content Search: A student studying for an exam could use the semantic video search to locate specific content within a series of lengthy lecture videos. For instance, they could search for the term "mitochondria functions" to find all relevant discussions within a biology course.

Journalism and Documentary Research: Journalists or documentary filmmakers could use the tool to locate specific quotes or moments within hours of interview footage. For example, a journalist could search for "witness account of the event" to locate and verify statements made by interviewees.

Media Monitoring and Analysis: Businesses could use the tool for media monitoring, analyzing how often their brand or products are mentioned in videos. For example, a company could search for "Brand X reviews" to find relevant content within a vast array of product review videos.

Legal Case Preparation: Lawyers could use the tool to find specific statements or discussions within hours of deposition or courtroom footage. For instance, they might search for "defendant's testimony about the incident" to quickly locate relevant segments.

Film and TV Production: Film and TV producers could use the tool to find specific scenes or lines within a massive database of raw footage or scripts. For example, a director might search for "scenes in the rain" when compiling a mood board or reference reel.

Content Creation and Editing: YouTube content creators or video editors could use the tool to locate specific clips within their video archives. For instance, they could search for "laughing moments" or "funny fails" when compiling a highlight reel.

Law Enforcement and Security: Law enforcement agencies could use the tool to locate specific content within surveillance footage. For example, they might search for "red car entering the parking lot" to find relevant footage quickly.

Healthcare and Medical Research: Medical professionals could use the tool to find specific content within educational or research videos. For example, a surgeon could search for "laparoscopic appendectomy procedure" within a database of surgical procedure videos to review the process before surgery.

About the author
Ethan Steininger

Former GTM Lead of MongoDB's NLP platform, Atlas Search. Occasionally off the grid in his self-converted camper van.
