Advanced Visual Search Pipelines


Let's take a deep dive into the Scene and Frame objects.

Scene

A Scene object describes a unique event in the video. From a timeline perspective, it's a timestamp range.
info
video_id : ID of the parent video object
start : start time in seconds
end : end time in seconds
description : string description
Each Scene object has a frames attribute, which holds a list of Frame objects.

Frame

Each Scene can be described by a list of frames. Each Frame object primarily carries the URL of the image and its description.
info
id : ID of the frame object
url : URL of the image
frame_time : timestamp of the frame in the video
description : string description
video_id : ID of the parent video object
scene_id : ID of the parent scene object
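To make the hierarchy concrete, here is a minimal sketch of the Scene/Frame relationship using plain Python dataclasses. The class names mirror the SDK objects above, but these local stand-ins (and the sample IDs and URLs) are illustrative only, not the actual videodb SDK classes.

```python
from dataclasses import dataclass, field
from typing import List

# Local stand-ins mirroring the attributes listed above
# (illustration only -- the real classes come from the videodb SDK).

@dataclass
class Frame:
    id: str
    url: str
    frame_time: float   # seconds into the video
    description: str
    video_id: str
    scene_id: str

@dataclass
class Scene:
    video_id: str
    start: float        # seconds
    end: float          # seconds
    description: str
    frames: List[Frame] = field(default_factory=list)

scene = Scene(video_id="v-1", start=0.0, end=30.0, description="intro shot")
scene.frames.append(
    Frame(id="f-1", url="https://example.com/f1.jpg", frame_time=15.0,
          description="presenter on stage", video_id="v-1", scene_id="s-1")
)
# Each frame sits inside its parent scene's timestamp range
print(scene.frames[0].frame_time)
```

Note how a frame knows both its video_id and its scene_id, so frame-level descriptions can always be traced back to the scene they belong to.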


We provide easy-to-use objects and functions to bring flexibility to designing your visual understanding pipeline. With these tools, you have the freedom to:
Extract scenes according to your use case.
Go down to frame-level abstraction.
Assign labels and custom model descriptions to each frame.
Use multiple models and prompts for each scene or frame to convert visual information into text.
Send multiple frames to a vision model for better temporal activity understanding.

extract_scenes()

This function accepts an extraction_type and extraction_config and returns a SceneCollection object, which keeps information about all the extracted scenes. Check out the Scene Extraction Algorithms guide for more details.
from videodb import SceneExtractionType

scene_collection = video.extract_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 30, "select_frames": ["middle"]},
)

Capture Temporal Change

Vision models excel at describing images, but videos add complexity because the information changes over time. With our pipeline, you can maintain image-level understanding in frames and combine them using LLMs at the scene level to capture temporal or activity-related understanding.
You are free to iterate through each scene and frame to describe the information for indexing purposes.
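As one way to picture that combination step, here is a hedged sketch: summarize_frames and its llm parameter are hypothetical names, standing in for whatever text model you call to merge per-frame descriptions into one scene-level description.

```python
def summarize_frames(frame_descriptions, llm=None):
    """Combine per-frame descriptions into one scene-level description.

    `llm` is a placeholder for any text-model callable (hypothetical, not
    part of the videodb SDK); when it is None we fall back to simple
    ordered concatenation so the flow is still visible and runnable.
    """
    joined = " -> ".join(frame_descriptions)
    prompt = f"Describe the activity across these frames: {joined}"
    if llm is not None:
        return llm(prompt)   # e.g. a call into your LLM of choice
    return joined            # naive fallback: ordered concatenation

frames = ["a person picks up a ball",
          "the person throws the ball",
          "a dog catches the ball"]
print(summarize_frames(frames))
```

Because the frame descriptions are ordered by frame_time, even the naive fallback preserves the temporal sequence; a real LLM call would compress it into an activity-level summary.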

Get scene collection
scene_collection = video.get_scene_collection("scene_collection_id")

Iterate through each scene and frame

Iterate over scenes and frames and attach descriptions coming from an external pipeline, be it a custom CV pipeline or custom model output.
print("This is scene collection id", scene_collection.id)
print("This is scene collection config", scene_collection.config)

# Get scenes from the collection
scenes = scene_collection.scenes

# Iterate through each scene
for scene in scenes:
    print(f"Scene Duration {scene.start}-{scene.end}")
    # Iterate through each frame in the scene
    for frame in scene.frames:
        print(f"Frame at {frame.frame_time} {frame.url}")
        frame.description = "bring text from external sources/pipeline"

Create Scene by custom annotation

These annotations can come from your application or from an external vision model, if you extract the description using any vision LLM.
for scene in scenes:
    scene.description = "summary of frame-level descriptions"

Using this pipeline, you have the freedom to design your own flow. In the example above, we've described each frame in the scene independently, but some vision models accept multiple images in one go as well. Feel free to customise your flow to your needs:
Experiment with sending multiple frames to a vision model.
Use prompts to describe multiple frames, then assign the resulting descriptions to the scene.
Integrate your own vision model into the pipeline.
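For instance, batching frames before sending them to a vision model could look like the following sketch. describe_scene, vision_model, and the stub below are hypothetical stand-ins for your own code, not videodb SDK functions.

```python
def chunked(items, size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def describe_scene(frame_urls, vision_model, batch_size=4):
    """Send frames to a vision model in batches and merge the answers.

    `vision_model` is a placeholder callable (an assumption, not an SDK
    API) that takes a list of image URLs and returns one text
    description covering the whole group.
    """
    parts = [vision_model(batch) for batch in chunked(frame_urls, batch_size)]
    return " ".join(parts)

# Stub model so the flow is runnable without any external service
stub_model = lambda batch: f"{len(batch)} frames described"
print(describe_scene([f"frame_{i}.jpg" for i in range(10)], stub_model))
```

Batching lets the model see adjacent frames together, which is what gives it a chance at temporal understanding rather than a series of disconnected single-image captions.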

We'll soon be adding more details and strategies for effective and advanced multimodal search. We welcome your input on which strategies have worked best in your specific use cases.
Here's our 🎙️ channel where we brainstorm about such ideas.

Once you have a description of each scene in place, you can index and search the information using the following functions.
from videodb import IndexType

# Create a new index and assign a name to it
index_id = video.index_scenes(scenes=scenes, name="My Custom Model")

# Search using the index_id
res = video.search(
    query="first 29 sec",
    index_type=IndexType.scene,
    index_id=index_id,
)

res.play()

 