Introducing VUSE: Video Understanding and Semantic Embedding

State-of-the-art video understanding model that converts videos into embeddings.

We are thrilled to announce the launch of our first multimodal embedding model: vuse-generic-v1. It enables text and contextual frames (videos) to share the same embedding space, coupled with:

  • a managed API that is fully scalable, secure by default, and accompanied by a native SDK
  • deep integration with your source object store and destination KNN store, including custom preprocessing pipeline logic

Best of all, you own your embeddings, which lets you run custom post-processing for KNN, enrichment, reproducibility, and more. We just ensure every new object in your bucket gets indexed 😄

This enables contextual video search, a capability that has been limited by single-modality image/frame and audio embeddings.

VUSE is now available for production workloads through the Mixpeek API, with 1GB of free video embeddings. For enterprise needs, we offer a fully secure and compliant Mixpeek Enterprise package.

Understanding Video Embeddings

Video embeddings play a crucial role in modern AI applications, enabling tasks such as video retrieval, action recognition, and scene understanding. These embeddings encode semantic information about videos into high-dimensional vectors used in downstream applications.

We also wrote a tutorial that walks you through building your own from scratch. It uses CLIP mean pooling to achieve similar results.
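Below is a minimal sketch of that approach (not VUSE itself) using Hugging Face's CLIP and OpenCV; the checkpoint name, frame count, and uniform sampling strategy are assumptions for illustration.

import cv2
import torch
import numpy as np
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint for illustration; any CLIP image encoder works the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_mean_pool_embedding(video_path: str, num_frames: int = 16) -> np.ndarray:
    """Sample frames uniformly, embed each with CLIP, and mean-pool into one vector."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, max(total - 1, 0), num_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()

    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        frame_embeddings = model.get_image_features(**inputs)  # (num_frames, 512)
    video_embedding = frame_embeddings.mean(dim=0)              # mean pooling over frames
    return torch.nn.functional.normalize(video_embedding, dim=0).numpy()

A text query can then be embedded with model.get_text_features and compared against the pooled video vectors by cosine similarity.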

Leading models, while effective, are limited to a single modality. VUSE changes this by offering a high-performing, managed alternative that spans contextual frames (image and audio) as well as text modalities.

Training Video Encoders with VUSE

VUSE is inspired by the recent advancements in video foundation models. Unlike existing vision foundation models that focus on image-level pretraining and adaptation, VUSE addresses dynamic and complex video-level understanding tasks by leveraging both generative and discriminative self-supervised video learning.

It was trained on over 250 million video clips paired with over 5 billion words.

Key components of this training:

  1. Masked Video Modeling: VUSE uses masked video modeling to learn comprehensive video representations by predicting masked parts of the video.
  2. Video-Language Contrastive Learning: This involves aligning video representations with corresponding language descriptions, enhancing the model's ability to understand and interpret video content (a minimal sketch of this objective follows after this list).
  3. Learnable Coordination: VUSE selectively coordinates video representations from the two complementary frameworks (masked video modeling and video-language contrastive learning) to boost performance across various video applications.
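As referenced in component 2 above, here is a minimal sketch of a generic video-language contrastive (InfoNCE) objective in PyTorch; this is an illustrative formulation, not VUSE's actual training code.

import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched (video, text) pairs attract, all other pairs repel."""
    video_emb = F.normalize(video_emb, dim=-1)        # (batch, dim)
    text_emb = F.normalize(text_emb, dim=-1)          # (batch, dim)
    logits = video_emb @ text_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)       # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)   # text -> video direction
    return (loss_v2t + loss_t2v) / 2

Minimizing this loss pulls each video embedding toward its paired caption and away from the other captions in the batch, which is what aligns the two modalities in a shared space.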

Building VUSE

To build VUSE, we adopted a multi-stage contrastive learning pipeline, starting from a CLIP initialization. Because CLIP handles only short sequences, we extended the context to 30 frames and trained our custom model, vuse-generic-v1.

We made several modifications:

  • Rotary Position Embeddings: For context length extrapolation.
  • SwiGLU Activations: To enhance model performance (a minimal sketch follows after this list).
  • Zero Dropout: To maintain model stability.
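As noted in the SwiGLU bullet above, a SwiGLU feed-forward block can be sketched in PyTorch as follows; the hidden size and bias-free projections are illustrative choices, not VUSE's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Feed-forward block with a SwiGLU gate: SiLU(x W_gate) * (x W_value), projected back to dim."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_value = nn.Linear(dim, hidden_dim, bias=False)
        self.w_out = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))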

Additional training optimizations included:

  • DeepSpeed and FlashAttention: For efficient training (see the sketch after this list).
  • BF16 Precision: To enhance computation speed.
  • Increased Vocab Size: To a multiple of 64.
  • Larger Batch Size: Of 4096 frames.
  • Higher Masking Rate: At 30% during masked language modeling.
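As a rough illustration of the first two optimizations above, BF16 precision and FlashAttention-style kernels are typically enabled in PyTorch along these lines; this is a generic snippet that assumes a CUDA device, not Mixpeek's training code.

import torch
import torch.nn.functional as F

# Toy tensors standing in for one attention call; real training wraps the whole model.
q = torch.randn(4, 8, 1024, 64, device="cuda")  # (batch, heads, seq_len, head_dim)
k = torch.randn(4, 8, 1024, 64, device="cuda")
v = torch.randn(4, 8, 1024, 64, device="cuda")

# BF16 autocast runs matmuls in bfloat16 while keeping master weights in FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    # PyTorch dispatches this to a FlashAttention kernel when one is available.
    out = F.scaled_dot_product_attention(q, k, v)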

We evaluated vuse-generic-v1 using standard benchmarks and found it performs comparably to other end-to-end video search services, with the added advantage of owning your embeddings.

Integrate VUSE-GENERIC-V1 into Your S3 & Vector DB Architecture

Unlock the full potential of your video data with the vuse-generic-v1 model. Our managed API makes it effortless to integrate state-of-the-art video understanding and semantic embedding capabilities directly into your existing S3 and vector database architecture. Here's why VUSE-GENERIC-V1 is the ultimate solution for your needs:

  • Unmatched Performance: Achieve superior accuracy and efficiency in video action recognition, detection, and video-language alignment tasks.
  • Scalability: Handle massive volumes of video data with ease, thanks to our robust managed API and support for long-context sequences.
  • Ease of Integration: Our API is designed for seamless integration with your current infrastructure, ensuring minimal disruption and maximum impact.

Get Started Today

Integrating VUSE-GENERIC-V1 is straightforward. Follow these steps to transform your video data workflows:

Optimize Your Data Pipeline:
Leverage our API to embed, store, and query video data efficiently in your S3 and vector DB setup.

Use the /embed HTTP endpoint:

curl --location 'https://api.mixpeek.com/embed' \
--header 'Authorization: Bearer API-KEY' \
--header 'Content-Type: application/json' \
--data '{
    "input": "https://video.mp4",
    "modality": "video",
    "model": "mixpeek/vuse-generic-v1"
}'

This returns a 768-dimensional embedding of the video, grouped by timestamp:

{
    "elapsed_time": 169.642822265625,
    "response": {
        "embedding": [
            -0.027562133967876434,
            -0.14670711755752563,
            ...
        ]
    }
}

You can also use the Python SDK:

# Illustrative SDK usage; the Pinecone calls below are pseudocode standing in for any vector DB client.

# generate a video embedding and insert it into your vector DB
video_embedding = mixpeek.embed.video("http://video.mp4")
pinecone.collection.insert(video_embedding)

# generate a query embedding (text and video share the same embedding space)
query_embedding = mixpeek.embed.video("human connection")

# run your KNN on a text query or on another video
pinecone.collection.vector_search(query_embedding, k=10)

With the SDK, you can run a text-to-video search or a reverse (video-to-video) search over your existing embeddings.
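For teams that prefer to own the whole retrieval path, the sketch below uses only the documented /embed endpoint plus numpy for a local cosine-similarity KNN; the video URLs are placeholders, and passing "text" as the modality for the query string is an assumption based on the shared embedding space, so check the API docs for the exact parameter.

import numpy as np
import requests

EMBED_URL = "https://api.mixpeek.com/embed"
HEADERS = {"Authorization": "Bearer API-KEY"}  # replace with your key

def vuse_embed(content: str, modality: str) -> np.ndarray:
    """Call the /embed endpoint and return the 768-dim embedding as a numpy vector."""
    resp = requests.post(EMBED_URL, headers=HEADERS, json={
        "input": content,
        "modality": modality,  # "text" for query strings is an assumption; see API docs
        "model": "mixpeek/vuse-generic-v1",
    })
    resp.raise_for_status()
    return np.array(resp.json()["response"]["embedding"], dtype=np.float32)

# Index a handful of videos you own (placeholder URLs).
video_urls = ["https://example.com/clip1.mp4", "https://example.com/clip2.mp4"]
index = np.stack([vuse_embed(url, "video") for url in video_urls])

# Embed a text query into the same space and rank videos by cosine similarity.
query = vuse_embed("human connection", "text")
scores = index @ query / (np.linalg.norm(index, axis=1) * np.linalg.norm(query))
for url, score in sorted(zip(video_urls, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {url}")

Swapping the numpy KNN for your vector DB's query call gives the same flow at scale.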

Sync Your S3 Videos into Your Vector DB with Mixpeek's Multimodal Pipeline

If you’d like a completely hands-off extraction, embedding, and generation pipeline, you can integrate Mixpeek with your S3 bucket and vector database of choice.

Contact us today to learn more about integrating vuse-generic-v1 and Mixpeek's multimodal pipeline into your architecture.

About the author
Ethan Steininger

Probably outside.
