Main Advantages of VideoDB:
- It’s a complete video infrastructure: you don’t have to maintain anything else or work with video files directly.
- You can organize videos into collections to segregate and manage them.
- You can also upload audio and images to VideoDB.
- It provides search across videos and collections of videos: semantic, scene, and keyword search, with tunable parameters.
- Better control over keyframe detection algorithms.
- Much more sophisticated vision understanding using cutting-edge vision models.
- Ability to bring your own LLM for vision understanding.
- Ability to create multimodal search queries.
- Programmable video streams for creating clips, compilations of videos, etc.
Can Azure Media Indexer search across all indexed video files?
No. You’ll have to create and maintain your own search infrastructure. It only provides information in JSON format with standard labels. VideoDB has semantic search built in, with parameters to tune accuracy and recall.
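As a rough illustration of what tuning accuracy against recall can mean, the sketch below filters scored search hits with a similarity threshold. The hit layout and the names `filter_hits`, `score_threshold`, and `limit` are hypothetical and are not taken from VideoDB's API; a higher threshold returns fewer, more confident hits, while a lower one returns more hits at the cost of more false positives.

```python
from typing import List, Tuple

# Hypothetical search hit: (video_id, start_sec, end_sec, similarity_score).
Hit = Tuple[str, float, float, float]


def filter_hits(hits: List[Hit], score_threshold: float, limit: int) -> List[Hit]:
    """Keep hits scoring at or above the threshold, best first, capped at `limit`."""
    kept = [h for h in hits if h[3] >= score_threshold]
    kept.sort(key=lambda h: h[3], reverse=True)
    return kept[:limit]


# Made-up hits for illustration only.
hits = [
    ("v1", 10.0, 18.0, 0.91),
    ("v2", 42.0, 50.0, 0.63),
    ("v1", 95.0, 101.0, 0.38),
]

strict = filter_hits(hits, score_threshold=0.8, limit=5)  # favors accuracy
loose = filter_hits(hits, score_threshold=0.3, limit=5)   # favors recall
```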
Does Azure offer any kind of multimodal search?
No. It only extracts insights as JSON. VideoDB can provide a multimodal late-fusion search API.
Does it provide video answers?
Azure Media Indexer doesn’t offer video answers or video stream answers. VideoDB’s search results include the parts of the video stream that contain the exact moments, and they can easily be embedded into any application.
How does VideoDB compare against the standard and advanced spoken and vision indexes of Azure Media Indexer?
Spoken
Standard: Timestamps are not word-level. It provides keywords and topics, which are generic and prone to false positives, as observed in our analysis.
Advanced: Uses a different model but provides the same analysis as Standard.
VideoDB: Word-level transcript with semantic and keyword-based search. It is easy to build pipelines for NLP analysis (keywords and topics), for example beeping out curse words.
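To show why word-level timestamps matter for a pipeline like beeping out curse words, here is a minimal sketch. It assumes a transcript given as a list of words with `text`, `start`, and `end` fields in seconds; that layout, the `beep_intervals` helper, and the `pad` parameter are assumptions made for illustration, not VideoDB's actual transcript format or API.

```python
from typing import Dict, List, Set, Tuple


def beep_intervals(
    words: List[Dict], blocklist: Set[str], pad: float = 0.05
) -> List[Tuple[float, float]]:
    """Return merged (start, end) ranges, in seconds, that should be beeped."""
    intervals = []
    for w in words:
        # Normalize the token so "Heck," matches "heck".
        token = w["text"].lower().strip(".,!?\"'")
        if token in blocklist:
            intervals.append((max(0.0, w["start"] - pad), w["end"] + pad))

    # Merge ranges that overlap after padding so each beep is continuous.
    merged: List[Tuple[float, float]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Made-up word-level transcript for illustration only.
words = [
    {"text": "What", "start": 0.0, "end": 0.3},
    {"text": "the", "start": 0.3, "end": 0.45},
    {"text": "heck,", "start": 0.5, "end": 0.9},
    {"text": "darn", "start": 0.92, "end": 1.2},
    {"text": "it", "start": 1.25, "end": 1.4},
]
ranges = beep_intervals(words, {"heck", "darn"})
```

With sentence-level timestamps only, the same pipeline would have to mute the whole sentence rather than the individual words.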
Vision
Azure’s vision indexer runs on object- and label-detection models, whereas VideoDB uses vision models to describe the frames. Vision-model-based understanding of frames is more advanced than label-based understanding. VideoDB gives you the freedom to choose the keyframe extraction algorithm. Users can also choose any vision model to describe the frames and set up the prompts used for describing them.
Multimodal
Using late fusion, multimodal search queries such as “show me where the therapist asked to raise hands and the kid raised their hands” are possible. Azure Media Indexer can’t solve such queries out of the box.
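Late fusion generally means each modality is searched on its own and the results are combined afterwards. The sketch below is one possible way to do that for the query above: spoken-index hits (the therapist's request) are paired with scene-index hits (the kid raising hands) that begin during or shortly after them, and the scores are blended. The segment layout, the `late_fusion` function, and the `max_gap` and `w_spoken` parameters are hypothetical and do not describe VideoDB's implementation.

```python
from typing import List, Tuple

# Hypothetical per-modality hit: (start_sec, end_sec, score).
Segment = Tuple[float, float, float]


def late_fusion(
    spoken: List[Segment],
    visual: List[Segment],
    max_gap: float = 5.0,
    w_spoken: float = 0.5,
) -> List[Segment]:
    """Pair spoken hits with visual hits that start during, or within
    `max_gap` seconds after, the spoken hit, and blend their scores."""
    fused = []
    for s_start, s_end, s_score in spoken:
        for v_start, v_end, v_score in visual:
            if s_start <= v_start <= s_end + max_gap:
                score = w_spoken * s_score + (1 - w_spoken) * v_score
                # The fused moment spans both the request and the action.
                fused.append((s_start, max(s_end, v_end), score))
    fused.sort(key=lambda seg: seg[2], reverse=True)
    return fused


# Made-up hits for illustration only.
spoken_hits = [(12.0, 15.0, 0.9), (80.0, 83.0, 0.7)]    # "raise your hands"
visual_hits = [(16.0, 19.0, 0.8), (140.0, 143.0, 0.95)]  # kid raising hands
moments = late_fusion(spoken_hits, visual_hits)
```

Only the spoken hit at 12 s is followed closely enough by a visual hit, so a single fused moment covering both events is returned; the isolated hits in either modality are dropped.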
Example chat app built on the spoken index