0.1.0
- New visual feature extraction models (applicable to any image or video `interval_sec`) for the `/index` pipeline. All are accessible standalone with `/understand/<method>`:
  - `read`: Grab the text from the screen
  - `embed`: Create a multimodal embedding of the `interval_sec` (if video) or image
  - `detect.faces`: Saves to keys:
    - `detected_face_ids`: returns face_ids for each face that has been registered via `/index/face`
    - `face_details`: returns various characteristics about faces that were found in the visual asset
  - `json_output`: More of a catch-all for structured data generation; supports all standard JSON types
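Since every method is also exposed standalone, the `/understand/<method>` URLs can be derived mechanically. A minimal sketch; the base URL is an assumption, since the changelog does not name the host:

```python
from urllib.parse import urljoin

BASE_URL = "https://api.example.com/"  # assumption: actual host not given in the changelog

def understand_url(method: str) -> str:
    """Build the standalone /understand/<method> URL for one extraction method."""
    return urljoin(BASE_URL, f"understand/{method}")

# Each method from the /index pipeline is also callable on its own:
for method in ("read", "embed", "detect.faces", "json_output"):
    print(understand_url(method))
```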
For example, an `/index` request:

```json
{
  "url": "video.mp4",
  "collection_id": "name",
  "should_save": false,
  "video_settings": [
    {
      "interval_sec": 1,
      "read": {
        "model_id": "video-descriptor-v1"
      },
      "embed": {
        "model_id": "multimodal-v1"
      },
      "detect": {
        "faces": {
          "model_id": "face-detector-v1"
        }
      },
      "json_output": {
        "response_shape": {
          "emotions": [
            "str",
            "str"
          ]
        },
        "prompt": "This is a list of emotion labels, each one should be a string representing the scene."
      }
    },
    {
      "interval_sec": 30,
      "transcribe": {
        "model_id": "polyglot-v1"
      }
    },
    {
      "interval_sec": 120,
      "describe": {
        "model_id": "video-descriptor-v1",
        "prompt": "Create a holistic description of the video, include sounds and screenplay"
      }
    }
  ]
}
```
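Each entry in `video_settings` pairs an `interval_sec` with the methods to run at that sampling rate. A minimal client-side sanity check before sending such a body; the validation rules are assumptions inferred from the example above, not a documented schema:

```python
# Known top-level method keys, inferred from the changelog example
# (this is an assumption, not a documented schema).
METHODS = {"read", "embed", "detect", "transcribe", "describe", "json_output"}

def validate_settings(video_settings: list[dict]) -> list[int]:
    """Return the interval_sec of each entry, raising if an entry is malformed."""
    intervals = []
    for entry in video_settings:
        if "interval_sec" not in entry:
            raise ValueError("every entry needs an interval_sec")
        unknown = set(entry) - METHODS - {"interval_sec"}
        if unknown:
            raise ValueError(f"unknown keys: {unknown}")
        intervals.append(entry["interval_sec"])
    return intervals

settings = [
    {"interval_sec": 1, "read": {"model_id": "video-descriptor-v1"}},
    {"interval_sec": 30, "transcribe": {"model_id": "polyglot-v1"}},
]
print(validate_settings(settings))  # -> [1, 30]
```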
- Image and video embeddings now share the same embedding space (1408 dimensions), making multimodal search more accurate and faster. Uses the `multimodal-v1` `model_id`.
- `/collections/file/<file_id>/full` now returns the full set of features extracted by the indexing pipeline:
```json
{
  "index_id": "ix-...",
  "file_id": "...",
  "collection_id": "name",
  "status": "DONE",
  "url": "video.mp4",
  "metadata": {
    "preview_url": "thumbnail.jpg"
  },
  "created_at": "2024-09-30T15:48:35.219000",
  "video_segments": [
    {
      "interval_sec": 1,
      "segments": [
        {
          "start_time": 1.0,
          "end_time": 2.0,
          "transcription": null,
          "description": null,
          "text": "\n\n© copyright \n",
          "detect": {
            "faces": {
              "detected_face_ids": ["123"],
              "face_details": [{}]
            }
          }
        }
      ]
    },
    {
      "interval_sec": 30,
      "segments": [
        {
          "start_time": 0.0,
          "end_time": 2.8333333333333335,
          "transcription": "Something "
        }
      ]
    },
    {
      "interval_sec": 120,
      "segments": [
        {
          "start_time": 0.0,
          "end_time": 2.8333333333333335,
          "description": "The video starts ..."
        }
      ]
    }
  ]
}
```
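Since the response nests one segment list per `interval_sec`, a common first step is to index those lists by interval. A minimal sketch; the `response` dict below is abridged from the example above:

```python
# Abridged /collections/file/<file_id>/full response, taken from the example.
response = {
    "video_segments": [
        {"interval_sec": 1, "segments": [
            {"start_time": 1.0, "end_time": 2.0,
             "transcription": None, "description": None,
             "text": "\n\n© copyright \n",
             "detect": {"faces": {"detected_face_ids": ["123"],
                                  "face_details": [{}]}}}]},
        {"interval_sec": 30, "segments": [
            {"start_time": 0.0, "end_time": 2.8333333333333335,
             "transcription": "Something "}]},
        {"interval_sec": 120, "segments": [
            {"start_time": 0.0, "end_time": 2.8333333333333335,
             "description": "The video starts ..."}]},
    ]
}

def segments_by_interval(resp: dict) -> dict[int, list[dict]]:
    """Index each per-interval segment list by its interval_sec."""
    return {block["interval_sec"]: block["segments"]
            for block in resp["video_segments"]}

by_interval = segments_by_interval(response)
print(by_interval[30][0]["transcription"])                        # -> Something 
print(by_interval[1][0]["detect"]["faces"]["detected_face_ids"])  # -> ['123']
```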