
Implement semantic video search using open source large vision models on Amazon SageMaker and Amazon OpenSearch Serverless


As companies and individual users deal with constantly growing amounts of video content, the ability to perform low-effort searches that retrieve videos or video segments using natural language becomes increasingly valuable. Semantic video search offers a powerful solution to this problem, allowing users to find relevant video content based on textual queries or descriptions. This approach can be used in a wide range of applications, from personal photo and video libraries to professional video editing and enterprise-level content discovery and moderation, where it can significantly improve the way we interact with and manage video content.

Large-scale pre-training of computer vision models with self-supervision directly from natural language descriptions of images has made it possible to capture a wide set of visual concepts, while also bypassing the need for labor-intensive manual annotation of training data. After pre-training, natural language can be used to either reference the learned visual concepts or describe new ones, effectively enabling zero-shot transfer to a diverse set of computer vision tasks, such as image classification, retrieval, and semantic analysis.

In this post, we demonstrate how to use large vision models (LVMs) for semantic video search using natural language and image queries. We introduce some use case-specific methods, such as temporal frame smoothing and clustering, to enhance the video search performance. Furthermore, we demonstrate the end-to-end functionality of this approach by using both asynchronous and real-time hosting options on Amazon SageMaker AI to perform video, image, and text processing using publicly available LVMs on the Hugging Face Model Hub. Finally, we use Amazon OpenSearch Serverless with its vector engine for low-latency semantic video search.

About large vision models

In this post, we implement video search capabilities using multimodal LVMs, which integrate textual and visual modalities during the pre-training phase, using techniques such as contrastive multimodal representation learning, Transformer-based multimodal fusion, or multimodal prefix language modeling (for more details, see Review of Large Vision Models and Visual Prompt Engineering by J. Wang et al.). Such LVMs have recently emerged as foundational building blocks for various computer vision tasks. Owing to their capability to learn a wide variety of visual concepts from massive datasets, these models can effectively solve diverse downstream computer vision tasks across different image distributions without the need for fine-tuning. In this section, we briefly introduce some of the most popular publicly available LVMs (which we also use in the accompanying code sample).

The CLIP (Contrastive Language-Image Pre-training) model, introduced in 2021, represents a significant milestone in the field of computer vision. Trained on a collection of 400 million image-text pairs harvested from the internet, CLIP showcased the remarkable potential of using large-scale natural language supervision for learning rich visual representations. Through extensive evaluations across over 30 computer vision benchmarks, CLIP demonstrated impressive zero-shot transfer capabilities, often matching or even surpassing the performance of fully supervised, task-specific models. For instance, a notable achievement of CLIP is its ability to match the top accuracy of a ResNet-50 model trained on the 1.28 million images of the ImageNet dataset, despite operating in a true zero-shot setting, without fine-tuning or any other access to labeled examples.
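To make the zero-shot idea concrete, the following minimal sketch classifies a single image against a handful of free-text labels using a publicly available CLIP checkpoint on the Hugging Face Model Hub; the image file and candidate labels are illustrative.

```python
# A minimal sketch of zero-shot image classification with CLIP via Hugging Face
# transformers; the image file and candidate labels are illustrative.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example_frame.jpg")  # any RGB image
labels = ["a football match", "a fashion runway", "a cargo aircraft"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each text prompt
probs = outputs.logits_per_image.softmax(dim=-1)
for label, prob in zip(labels, probs[0].tolist()):
    print(f"{label}: {prob:.3f}")
```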

Following the success of CLIP, the open source initiative OpenCLIP further advanced the state of the art by releasing an open implementation pre-trained on the massive LAION-2B dataset, comprising 2.3 billion English image-text pairs. This substantial increase in the scale of training data enabled OpenCLIP to achieve even better zero-shot performance across a wide range of computer vision benchmarks, demonstrating further potential of scaling up natural language supervision for learning more expressive and generalizable visual representations.

Finally, the set of SigLIP (Sigmoid Loss for Language-Image Pre-training) models, including one trained on a 10 billion multilingual image-text dataset spanning over 100 languages, further pushed the boundaries of large-scale multimodal learning. These models replace the softmax-based contrastive loss employed in CLIP with a pairwise sigmoid loss and have shown superior performance in language-image pre-training, outperforming both CLIP and OpenCLIP baselines on a variety of computer vision tasks.
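As a rough illustration of the difference (a sketch, not the authors' implementation), the pairwise sigmoid loss scores every image-text pair in a batch independently, instead of normalizing over the whole batch with a softmax as CLIP does:

```python
# A rough sketch of a SigLIP-style pairwise sigmoid loss; assumes L2-normalized
# embeddings and learnable scalar temperature t and bias b.
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, t, b):
    """img_emb, txt_emb: tensors of shape [batch, dim]; t, b: scalars."""
    logits = t * img_emb @ txt_emb.T + b  # all pairwise image-text similarities
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # Each pair is scored with an independent sigmoid, so no batch-wide
    # softmax normalization is required.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```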

Solution overview

Our approach uses a multimodal LVM to enable efficient video search and retrieval based on both textual and visual queries. The approach can be logically split into an indexing pipeline, which can be carried out offline, and an online video search logic. The following diagram illustrates the pipeline workflows.

The indexing pipeline is responsible for ingesting video files and preprocessing them to construct a searchable index. The process begins by extracting individual frames from the video files. These extracted frames are then passed through an embedding module, which uses the LVM to map each frame into a high-dimensional vector representation containing its semantic information. To account for temporal dynamics and motion information present in the video, a temporal smoothing technique is applied to the frame embeddings. This step makes sure the resulting representations capture the semantic continuity across multiple subsequent video frames, rather than treating each frame independently (also see the results discussed later in this post, or consult the following paper for more details). The temporally smoothed frame embeddings are then ingested into a vector index data structure, which is designed for efficient storage, retrieval, and similarity search operations. This indexed representation of the video frames serves as the foundation for the subsequent search pipeline.
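The following sketch shows one possible implementation of the first two indexing steps (frame extraction and frame embedding) using OpenCV and a SigLIP checkpoint from the Hugging Face Model Hub; the sampling rate, model ID, and helper names are illustrative, and temporal smoothing is sketched separately later in this post.

```python
# A simplified sketch of frame extraction and frame embedding for indexing;
# the model ID, sampling rate, and function names are illustrative.
import cv2
import torch
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

def extract_frames(video_path, frames_per_second=1):
    """Decode the video and sample frames at the requested rate."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25
    step = max(int(round(native_fps / frames_per_second)), 1)
    frames, timestamps, idx = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            timestamps.append(idx / native_fps)
        idx += 1
    cap.release()
    return frames, timestamps

def embed_frames(frames, batch_size=16):
    """Map each frame into the LVM's shared image-text embedding space."""
    embeddings = []
    for i in range(0, len(frames), batch_size):
        inputs = processor(images=frames[i:i + batch_size], return_tensors="pt")
        with torch.no_grad():
            emb = model.get_image_features(**inputs)
        embeddings.append(emb / emb.norm(dim=-1, keepdim=True))  # L2-normalize
    return torch.cat(embeddings)
```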

The search pipeline facilitates content-based video retrieval by accepting textual queries or visual queries (images) from users. Textual queries are first embedded into the shared multimodal representation space using the LVM’s text encoding capabilities. Similarly, visual queries (images) are processed through the LVM’s visual encoding branch to obtain their corresponding embeddings.

After the textual or visual queries are embedded, we can build a hybrid query to account for keywords or filter constraints provided by the user (for example, to search only across certain video categories, or to search within a particular video). This hybrid query is then used to retrieve the most relevant frame embeddings based on their conceptual similarity to the query, while adhering to any supplementary keyword constraints.
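A hedged sketch of what such a hybrid query body could look like in OpenSearch is shown below: an approximate k-NN clause on the query embedding, optionally combined with a keyword filter on a video identifier field. The field names (embedding, video_id) are assumptions and depend on your index mapping.

```python
# A sketch of an OpenSearch hybrid query: k-NN on the query embedding plus an
# optional keyword filter; field names are illustrative.
def build_hybrid_query(query_embedding, k=20, video_id=None):
    knn_clause = {
        "knn": {
            "embedding": {
                "vector": query_embedding,
                "k": k,
            }
        }
    }
    if video_id is not None:
        # Restrict the nearest-neighbor search to frames from one video.
        knn_clause["knn"]["embedding"]["filter"] = {"term": {"video_id": video_id}}
    return {"size": k, "query": knn_clause}
```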

The retrieved frame embeddings are then subjected to temporal clustering (also see the results later in this post for more details), which aims to group contiguous frames into semantically coherent video segments, thereby returning an entire video sequence (rather than disjointed individual frames).
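One simple way to implement this grouping (a sketch under the assumption that each hit carries a timestamp in seconds and a relevance score) is to sort the hits by timestamp and merge neighbors that fall within a configurable time gap:

```python
# A minimal sketch of temporal clustering of retrieved frame hits; the hit
# schema and gap threshold are illustrative.
def cluster_hits(hits, max_gap_seconds=1.0):
    """hits: list of dicts with 'timestamp' (seconds) and 'score' fields."""
    hits = sorted(hits, key=lambda h: h["timestamp"])
    clusters = []
    for hit in hits:
        if clusters and hit["timestamp"] - clusters[-1][-1]["timestamp"] <= max_gap_seconds:
            clusters[-1].append(hit)  # extend the current segment
        else:
            clusters.append([hit])    # start a new segment
    # Represent each cluster as a playable segment with its best-scoring frame.
    return [
        {
            "start": cluster[0]["timestamp"],
            "end": cluster[-1]["timestamp"],
            "key_frame": max(cluster, key=lambda h: h["score"]),
        }
        for cluster in clusters
    ]
```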

Furthermore, maintaining search diversity and quality is crucial when retrieving content from videos. As mentioned previously, our approach incorporates various methods to enhance search results. For example, during the video indexing phase, the following techniques are employed to control the search results (the parameters of which might need to be tuned to get the best results):

  • Adjusting the sampling rate, which determines the number of frames embedded from each second of video. Less frequent frame sampling might make sense when working with longer videos, whereas more frequent frame sampling might be needed to catch fast-occurring events.
  • Modifying the temporal smoothing parameters to, for example, remove inconsistent search hits based on just a single frame hit, or merge repeated frame hits from the same scene (see the smoothing sketch after this list).
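As an illustration of the smoothing item above (one possible implementation, not necessarily the one used in the sample repository), a moving-average kernel can be applied to each embedding dimension across consecutive frames, damping single-frame spikes and pulling adjacent frames from the same scene closer together:

```python
# A sketch of temporal smoothing via a per-dimension moving average; the
# kernel size is a tunable parameter (11 is used in the results below).
import numpy as np

def smooth_embeddings(frame_embeddings, kernel_size=11):
    """frame_embeddings: array of shape [num_frames, dim], in temporal order."""
    kernel = np.ones(kernel_size) / kernel_size
    smoothed = np.stack(
        [np.convolve(frame_embeddings[:, d], kernel, mode="same")
         for d in range(frame_embeddings.shape[1])],
        axis=1,
    )
    # Re-normalize so cosine similarity remains well defined after smoothing.
    return smoothed / np.linalg.norm(smoothed, axis=1, keepdims=True)
```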

During the semantic video search phase, you can use the following methods:

  • Applying temporal clustering as a post-filtering step on the retrieved timestamps to group contiguous frames into semantically coherent video clips (that can be, in principle, directly played back by the end-users). This makes sure the search results maintain temporal context and continuity, avoiding disjointed individual frames.
  • Setting the search size, which can be effectively combined with temporal clustering. Increasing the search size makes sure the relevant frames are included in the final results, albeit at the cost of higher computational load (see, for example, this guide for more details).

Our approach aims to strike a balance between retrieval quality, diversity, and computational efficiency by employing these techniques during both the indexing and search phases, ultimately enhancing the user experience in semantic video search.

The proposed solution architecture provides efficient semantic video search by using open source LVMs and AWS services. The architecture can be logically divided into two components: an asynchronous video indexing pipeline and online content search logic. The accompanying sample code on GitHub showcases how to build and experiment with both parts of the workflow locally, and how to host and invoke them using several open source LVMs available on the Hugging Face Model Hub (CLIP, OpenCLIP, and SigLIP). The following diagram illustrates this architecture.

The pipeline for asynchronous video indexing comprises the following steps:

  1. The user uploads a video file to an Amazon Simple Storage Service (Amazon S3) bucket, which initiates the indexing process.
  2. The video is sent to a SageMaker asynchronous endpoint for processing. The processing steps involve:
    • Decoding of frames from the uploaded video file.
    • Generation of frame embeddings by the LVM.
    • Application of temporal smoothing, accounting for temporal dynamics and motion information present in the video.
  3. The frame embeddings are ingested into an OpenSearch Serverless vector index, designed for efficient storage, retrieval, and similarity search operations.

SageMaker asynchronous inference endpoints are well-suited for handling requests with large payloads, extended processing times, and near real-time latency requirements. This SageMaker capability queues incoming requests and processes them asynchronously, accommodating large payloads and long processing times. Asynchronous inference enables cost optimization by automatically scaling the instance count to zero when there are no requests to process, so computational resources are used only when actively handling requests. This flexibility makes it an ideal choice for applications involving large data volumes, such as video processing, while maintaining responsiveness and efficient resource utilization.
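For illustration, invoking an asynchronous endpoint could look like the following sketch, where the request payload is first staged in Amazon S3 and referenced by location; the endpoint name and S3 URIs are placeholders.

```python
# A sketch of invoking a SageMaker asynchronous inference endpoint; the
# endpoint name and S3 locations are placeholders.
import boto3

sm_runtime = boto3.client("sagemaker-runtime")

response = sm_runtime.invoke_endpoint_async(
    EndpointName="video-indexing-endpoint",                      # placeholder name
    InputLocation="s3://my-bucket/requests/video-request.json",  # staged payload
    ContentType="application/json",
)
# The response points to the S3 location where the result will be written
# once the potentially long-running indexing job completes.
print(response["OutputLocation"])
```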

OpenSearch Serverless is an on-demand serverless configuration of Amazon OpenSearch Service. We use OpenSearch Serverless as a vector database for storing embeddings generated by the LVM. The index created in the OpenSearch Serverless collection serves as the vector store, enabling efficient storage and rapid similarity-based retrieval of relevant video segments.
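A hedged sketch of creating such a vector index with the opensearch-py client is shown below; the collection endpoint, index name, field names, k-NN method settings, and embedding dimension (for example, 1152 for the SigLIP so400m model used later in this post) are assumptions that must match your own setup.

```python
# A sketch of creating a k-NN vector index in an OpenSearch Serverless
# collection; endpoint, index name, fields, and dimension are illustrative.
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

region = "us-east-1"
credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region, "aoss")  # "aoss" = OpenSearch Serverless

client = OpenSearch(
    hosts=[{"host": "your-collection-id.us-east-1.aoss.amazonaws.com", "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1152,  # must match the chosen embeddings model
                "method": {"name": "hnsw", "engine": "faiss", "space_type": "innerproduct"},
            },
            "video_id": {"type": "keyword"},
            "timestamp": {"type": "float"},
        }
    },
}
client.indices.create(index="video-frames", body=index_body)
```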

The online content search can then be broken down into the following steps:

  1. The user provides a textual prompt or an image (or both) representing the desired content to be searched.
  2. The user prompt is sent to a real-time SageMaker endpoint, which results in the following actions:
    • An embedding is generated for the text or image query.
    • The query with embeddings is sent to the OpenSearch vector index, which performs a k-nearest neighbors (k-NN) search to retrieve relevant frame embeddings.
    • The retrieved frame embeddings undergo temporal clustering.
  3. The final search results, comprising relevant video segments, are returned to the user.

SageMaker real-time inference suits workloads that need interactive, low-latency responses. Deploying models to SageMaker hosting services provides fully managed inference endpoints with automatic scaling, delivering optimal performance for real-time requirements.
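As a sketch (the request and response schema shown here are assumptions rather than the exact contract used in the sample repository), calling the real-time search endpoint could look as follows:

```python
# A sketch of invoking the real-time search endpoint; the endpoint name and
# payload/response fields are placeholders.
import json
import boto3

sm_runtime = boto3.client("sagemaker-runtime")

payload = {
    "text_query": "F1 crews change tyres",
    "k": 20,                     # search size
    "cluster_gap_seconds": 1.0,  # temporal clustering threshold
}

response = sm_runtime.invoke_endpoint(
    EndpointName="video-search-endpoint",  # placeholder name
    ContentType="application/json",
    Body=json.dumps(payload),
)
results = json.loads(response["Body"].read())
for segment in results.get("segments", []):
    print(segment)
```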

Code and environment

This post is accompanied by a sample code on GitHub that provides comprehensive annotations and code to set up the necessary AWS resources, experiment locally with sample video files, and then deploy and run the indexing and search pipelines. The code sample is designed to exemplify best practices when developing ML solutions on SageMaker, such as using configuration files to define flexible inference stack parameters and conducting local tests of the inference artifacts before deploying them to SageMaker endpoints. It also contains guided implementation steps with explanations and a reference for the configuration parameters. Additionally, the notebook automates the cleanup of all provisioned resources.

Prerequisites

The prerequisite to run the provided code is to have an active AWS account and set up Amazon SageMaker Studio. Refer to Use quick setup for Amazon SageMaker AI to set up SageMaker if you’re a first-time user and then follow the steps to open SageMaker Studio.

Deploy the solution

To start the implementation, clone the repository, open the notebook semantic_video_search_demo.ipynb, and follow the steps in the notebook.

In Section 2 of the notebook, install the required packages and dependencies, define global variables, set up Boto3 clients, and attach required permissions to the SageMaker AWS Identity and Access Management (IAM) role to interact with Amazon S3 and OpenSearch Service from the notebook.

In Section 3, create security components for OpenSearch Serverless (encryption policy, network policy, and data access policy) and then create an OpenSearch Serverless collection. For simplicity, in this proof of concept implementation, we allow public internet access to the OpenSearch Serverless collection resource. However, for production environments, we strongly suggest using private connections between your Virtual Private Cloud (VPC) and OpenSearch Serverless resources through a VPC endpoint. For more details, see Access Amazon OpenSearch Serverless using an interface endpoint (AWS PrivateLink).

In Section 4, import and inspect the config file, and choose an embeddings model for video indexing and corresponding embeddings dimension. In Section 5, create a vector index within the OpenSearch collection you created earlier.

To demonstrate the search results, we also provide references to a few sample videos that you can experiment with in Section 6. In Section 7, you can experiment with the proposed semantic video search approach locally in the notebook, before deploying the inference stacks.

In Sections 8, 9, and 10, we provide code to deploy two SageMaker endpoints: an asynchronous endpoint for video embedding and indexing, and a real-time inference endpoint for video search. After these steps, we also test our deployed semantic video search solution with a few example queries.

Finally, Section 11 contains the code to clean up the created resources to avoid recurring costs.

Results

The solution was evaluated across a diverse range of use cases, including the identification of key moments in sports games and of specific outfit pieces or color patterns on fashion runways and in full-length films about the fashion industry. Additionally, the solution was tested for detecting action-packed moments like explosions in action movies, identifying when individuals entered video surveillance areas, and extracting specific events such as sports award ceremonies.

For our demonstration, we created a video catalog consisting of the following videos: A Look Back at New York Fashion Week: Men’s, F1 Insights powered by AWS, Amazon Air’s newest aircraft, the A330, is here, and Now Go Build with Werner Vogels – Autonomous Trucking.

To demonstrate the search capability for identifying specific objects across this video catalog, we employed four text prompts and four images. The presented results were obtained using the google/siglip-so400m-patch14-384 model, with temporal clustering enabled and a timestamp filter set to 1 second. Additionally, smoothing was enabled with a kernel size of 11, and the search size was set to 20 (which were found to be good default values for shorter videos). The left column in the subsequent figures specifies the search type, either by image or text, along with the corresponding image name or text prompt used.
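For reference, these settings can be summarized as follows (a hypothetical consolidation; the key names are illustrative and may differ from the configuration file in the sample repository):

```python
# A hypothetical summary of the demonstration's search settings; key names
# are illustrative and may not match the repository's config file.
search_settings = {
    "model_id": "google/siglip-so400m-patch14-384",
    "temporal_clustering": True,
    "cluster_gap_seconds": 1.0,   # timestamp filter between grouped frames
    "smoothing": True,
    "smoothing_kernel_size": 11,
    "search_size": 20,            # number of frame hits retrieved per query
}
```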

The following figure shows the text prompts we used and the corresponding results.

The following figure shows the images we used to perform reverse image search and the corresponding search results for each image.

As mentioned, we implemented temporal clustering in the lookup code, allowing for the grouping of frames based on their ordered timestamps. The accompanying notebook with sample code showcases the temporal clustering functionality by displaying (a few frames from) the returned video clip and highlighting the key frame with the highest search score within each group, as illustrated in the following figure. This approach facilitates a convenient presentation of the search results, enabling users to return entire playable video clips (even if not all frames were actually indexed in a vector store).

To showcase the hybrid search capabilities with OpenSearch Service, we present results for the textual prompt “sky,” with all other search parameters set identically to the previous configurations. We demonstrate two distinct cases: an unconstrained semantic search across the entire indexed video catalog, and a search confined to a specific video. The following figure illustrates the results obtained from an unconstrained semantic search query.

We conducted the same search for “sky,” but now confined to trucking videos.

To illustrate the effects of temporal smoothing, we generated search signal score charts (based on cosine similarity) for the prompt “F1 crews change tyres” in the Formula 1 video, both with and without temporal smoothing. We set a threshold of 0.315 for illustration purposes and highlighted video segments with scores exceeding this threshold. Without temporal smoothing (see the following figure), we observed two adjacent episodes around t=35 seconds and two additional episodes after t=65 seconds. Notably, the third and fourth episodes were significantly shorter than the first two, despite exhibiting higher scores. However, we can do better if our objective is to prioritize longer, semantically cohesive video episodes in the search results.

To address this, we apply temporal smoothing. As shown in the following figure, now the first two episodes appear to be merged into a single, extended episode with the highest score. The third episode experienced a slight score reduction, and the fourth episode became irrelevant due to its brevity. Temporal smoothing facilitated the prioritization of longer and more coherent video moments associated with the search query by consolidating adjacent high-scoring segments and suppressing isolated, transient occurrences.
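To make the thresholding step concrete, the following sketch (an illustration, not the repository's exact code) extracts contiguous runs of scores above a threshold from a temporally ordered score series, which is how the highlighted episodes in the figures can be derived:

```python
# A minimal sketch of extracting above-threshold episodes from a score series;
# the threshold value and input format are illustrative.
def segments_above_threshold(timestamps, scores, threshold=0.315):
    """timestamps and scores: equal-length, temporally ordered sequences."""
    segments, start = [], None
    for t, s in zip(timestamps, scores):
        if s >= threshold and start is None:
            start = t                    # a new episode begins
        elif s < threshold and start is not None:
            segments.append((start, t))  # the episode ends
            start = None
    if start is not None:
        segments.append((start, timestamps[-1]))
    return segments
```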

Clean up

To clean up the resources created as part of this solution, refer to the cleanup section in the provided notebook and execute the cells in this section. This will delete the created IAM policies, OpenSearch Serverless resources, and SageMaker endpoints to avoid recurring charges.

Limitations

Throughout our work on this project, we also identified several potential limitations that could be addressed through future work:

  • Video quality and resolution might impact search performance, because blurred or low-resolution videos can make it challenging for the model to accurately identify objects and intricate details.
  • Small objects within videos, such as a hockey puck or a football, might be difficult for LVMs to consistently recognize due to their diminutive size and visibility constraints.
  • LVMs might struggle to comprehend scenes that represent a temporally prolonged contextual situation, such as detecting a point-winning shot in tennis or a car overtaking another vehicle.
  • Accurate automatic measurement of solution performance is hindered without the availability of manually labeled ground truth data for comparison and evaluation.

Summary

In this post, we demonstrated the advantages of the zero-shot approach to implementing semantic video search using either text prompts or images as input. This approach readily adapts to diverse use cases without the need for retraining or fine-tuning models specifically for video search tasks. Additionally, we introduced techniques such as temporal smoothing and temporal clustering, which significantly enhance the quality and coherence of video search results.

The proposed architecture is designed to facilitate a cost-effective production environment with minimal effort, eliminating the requirement for extensive expertise in machine learning. Furthermore, the current architecture seamlessly accommodates the integration of open source LVMs, enabling the implementation of custom preprocessing or postprocessing logic during both the indexing and search phases. This flexibility is made possible by using SageMaker asynchronous and real-time deployment options, providing a powerful and versatile solution.

You can implement semantic video search using different approaches or AWS services. For related content, refer to the following AWS blog posts as examples of semantic search using proprietary ML models: Implement serverless semantic search of image and live video with Amazon Titan Multimodal Embeddings or Build multimodal search with Amazon OpenSearch Service.


About the Authors

Dr. Alexander Arzhanov is an AI/ML Specialist Solutions Architect based in Frankfurt, Germany. He helps AWS customers design and deploy their ML solutions across the EMEA region. Prior to joining AWS, Alexander researched the origins of heavy elements in our universe and grew passionate about ML after using it in his large-scale scientific calculations.

Dr. Ivan Sosnovik is an Applied Scientist in the AWS Machine Learning Solutions Lab. He develops ML solutions to help customers achieve their business goals.

Nikita Bubentsov is a Cloud Sales Representative based in Munich, Germany, and part of the Technical Field Community (TFC) in computer vision and machine learning. He helps enterprise customers drive business value by adopting cloud solutions and supports AWS EMEA organizations in the computer vision area. Nikita is passionate about computer vision and the future potential that it holds.