Release Notes

0.1.0

  • New visual feature extraction models for the /index pipeline, applicable to any image or to any video interval_sec. Each is also accessible standalone via /understand/<method>:
    • read: Extracts the on-screen text from the visual asset
    • embed: Creates a multimodal embedding of each interval_sec (for video) or of the image
    • detect.faces: Saves results under two keys:
      • detected_face_ids: returns the face_ids of each face previously registered via /index/face
      • face_details: returns various characteristics of each face found in the visual asset
    • json_output: A catch-all for structured data generation; supports all standard JSON types
{
    "url": "video.mp4",
    "collection_id": "name",
    "should_save": false,
    "video_settings": [
        {
            "interval_sec": 1,
            "read": {
                "model_id": "video-descriptor-v1"
            },
            "embed": {
                "model_id": "multimodal-v1"
            },
            "detect": {
                "faces": {
                    "model_id": "face-detector-v1"
                }
            },
            "json_output": {
                "response_shape": {
                    "emotions": [
                        "str",
                        "str"
                    ]
                },
                "prompt": "This is a list of emotion labels, each one should be a string representing the scene."
            }
        },
        {
            "interval_sec": 30,
            "transcribe": {
                "model_id": "polyglot-v1"
            }
        },
        {
            "interval_sec": 120,
            "describe": {
                "model_id": "video-descriptor-v1",
                "prompt": "Create a holistic description of the video, include sounds and screenplay"
            }
        }
    ]
}
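The request body above maps directly onto plain Python dicts. A minimal sketch of assembling it programmatically (the helper name video_setting is our own; field names and model IDs mirror the example):

```python
def video_setting(interval_sec, **methods):
    """One video_settings entry: an interval plus the methods to run on it."""
    return {"interval_sec": interval_sec, **methods}

index_request = {
    "url": "video.mp4",
    "collection_id": "name",
    "should_save": False,
    "video_settings": [
        video_setting(
            1,
            read={"model_id": "video-descriptor-v1"},
            embed={"model_id": "multimodal-v1"},
            detect={"faces": {"model_id": "face-detector-v1"}},
            json_output={
                "response_shape": {"emotions": ["str", "str"]},
                "prompt": "This is a list of emotion labels, each one "
                          "should be a string representing the scene.",
            },
        ),
        video_setting(30, transcribe={"model_id": "polyglot-v1"}),
        video_setting(120, describe={
            "model_id": "video-descriptor-v1",
            "prompt": "Create a holistic description of the video, "
                      "include sounds and screenplay",
        }),
    ],
}
```

This body would then be POSTed to /index; each interval runs independently, so cheap methods (read, embed) can fire every second while expensive ones (describe) run every two minutes.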
  • Image and video embeddings now share the same 1408-dimensional embedding space, making multimodal search both faster and more accurate. Uses the multimodal-v1 model_id
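Because image and video embeddings share one space, a single similarity function works across modalities. A minimal sketch with stand-in vectors (real multimodal-v1 embeddings have 1408 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stand-in 3-d vectors; in practice these would be embeddings of an
# image and of a video interval_sec returned by /understand/embed.
image_vec = [0.1, 0.3, 0.5]
video_vec = [0.2, 0.3, 0.4]
score = cosine_similarity(image_vec, video_vec)  # close to 1.0: similar content
```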
  • /collections/file/<file_id>/full now returns the full set of features extracted by the indexing pipeline
{
    "index_id": "ix-...",
    "file_id": "...",
    "collection_id": "name",
    "status": "DONE",
    "url": "video.mp4",
    "metadata": {
        "preview_url": "thumbnail.jpg"
    },
    "created_at": "2024-09-30T15:48:35.219000",
    "video_segments": [
        {
            "interval_sec": 1,
            "segments": [
                {
                    "start_time": 1.0,
                    "end_time": 2.0,
                    "transcription": null,
                    "description": null,
                    "text": "\n\n© copyright \n",
                    "detect": {
                        "faces": {
                            "detected_face_ids": ["123"],
                            "face_details": [{}]
                        }
                    }
                }
            ]
        },
        {
            "interval_sec": 30,
            "segments": [
                {
                    "start_time": 0.0,
                    "end_time": 2.8333333333333335,
                    "transcription": "Something "
                }
            ]
        },
        {
            "interval_sec": 120,
            "segments": [
                {
                    "start_time": 0.0,
                    "end_time": 2.8333333333333335,
                    "description": "The video starts ..."
                }
            ]
        }
    ]
}
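Walking video_segments in that response is straightforward. A hedged sketch, using an abbreviated copy of the example above in place of an actual GET to /collections/file/<file_id>/full (the helper name faces_by_interval is our own):

```python
# Abbreviated /full response, as shown in the example above.
response = {
    "status": "DONE",
    "video_segments": [
        {"interval_sec": 1, "segments": [
            {"start_time": 1.0, "end_time": 2.0,
             "text": "\n\n© copyright \n",
             "detect": {"faces": {"detected_face_ids": ["123"],
                                  "face_details": [{}]}}},
        ]},
        {"interval_sec": 30, "segments": [
            {"start_time": 0.0, "end_time": 2.83,
             "transcription": "Something "},
        ]},
    ],
}

def faces_by_interval(resp):
    """Collect detected_face_ids per interval_sec from a /full response."""
    out = {}
    for group in resp["video_segments"]:
        ids = []
        for seg in group["segments"]:
            # Segments only carry the keys their interval's methods produced,
            # so fall back to empty dicts/lists when a key is absent.
            faces = seg.get("detect", {}).get("faces", {})
            ids.extend(faces.get("detected_face_ids", []))
        out[group["interval_sec"]] = ids
    return out

print(faces_by_interval(response))  # {1: ['123'], 30: []}
```

Note that each interval group only contains the fields its configured methods produced; absent methods appear as null or are omitted, so defensive .get() access keeps the traversal robust.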

Multimodal Makers | Mixpeek
