Critical labor shortages are constraining growth across manufacturing, logistics, construction, and agriculture. The problem is particularly acute in construction: nearly 500,000 positions remain unfilled in the United States, with 40% of the current workforce approaching retirement within the decade. These workforce limitations result in delayed projects, escalating costs, and deferred development plans. To address these constraints, organizations are developing autonomous systems that fill capacity gaps, extend operational capabilities, and offer the added benefit of around-the-clock productivity.
Building autonomous systems requires large, annotated datasets to train AI models. Effective training determines whether these systems deliver business value. The bottleneck is the high cost of data preparation. Labeling video data—identifying equipment, tasks, and environmental conditions—is essential to make the data useful for model training. This step can delay model deployment, slowing the delivery of AI-powered products and services to customers. For construction companies managing millions of hours of video, manual data preparation and annotation become impractical. Vision-language models (VLMs) help address this by interpreting images and video, responding to natural language queries, and generating descriptions at a speed and scale that manual processes cannot match, providing a cost-effective alternative.
In this post, we examine how Bedrock Robotics tackles this challenge. Through the AWS Physical AI Fellowship, the startup partnered with the AWS Generative AI Innovation Center to apply vision-language models that analyze construction video footage, extract operational details, and generate labeled training datasets at scale, improving data preparation for autonomous construction equipment.
Bedrock Robotics: a case study in accelerating autonomous construction
Since 2024, Bedrock Robotics has been developing autonomous systems for construction equipment. The company’s product, Bedrock Operator, is a retrofit solution that combines hardware with AI models to enable excavators and other machinery to operate with minimal human intervention. These systems can perform tasks like digging, grading, and material handling with centimeter-level precision. Training these models requires massive volumes of video footage capturing equipment, tasks, and the surrounding environment, a highly resource-intensive process that limits scalability.
VLMs offer a solution by analyzing image and video data and generating text descriptions. This makes them well suited for annotation, which is critical for teaching models to associate visual patterns with human language. Bedrock Robotics used this technology to streamline data preparation for training AI models, enabling autonomous operations for equipment. Through careful model selection and prompt engineering, the company improved tool identification accuracy from 34% to 70%, transforming a manual, time-intensive process into an automated, scalable data pipeline. This breakthrough accelerated the deployment of autonomous equipment.
This approach provides a replicable framework for organizations facing similar data challenges and demonstrates how strategic investment in foundation models (FMs) can deliver measurable operational outcomes and a competitive advantage. Foundation models are models trained on massive amounts of data using self-supervised learning techniques that learn general representations that can be adapted to many downstream tasks. VLMs leverage these large-scale pretraining techniques to bridge visual and textual modalities, enabling them to understand, analyze, and generate content across both image and language.
In the following sections, we look at the process that Bedrock Robotics used to annotate millions of hours of video footage and accelerate innovation using a VLM-based solution.
From unstructured video data to a strategic asset using VLMs
Enabling autonomous construction equipment requires extracting useful information from millions of hours of unstructured operational footage. Specifically, Bedrock Robotics needed to identify tool attachments, tasks, and worksite conditions across diverse scenarios. The following images are example video frames from this dataset.

Construction equipment operates with multiple tool attachments, each requiring accurate classification to train reliable AI models. Working with the Innovation Center, Bedrock Robotics focused their efforts on a few critical tool categories: lifting hooks for material handling, hammers for concrete demolition, grading beams for surface leveling, and trenching buckets for narrow excavation.
These labels allow Bedrock Robotics to select relevant video segments and assemble training datasets that represent a variety of equipment configurations and operating conditions.
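To illustrate the idea, here is a minimal sketch of how labeled segments might be grouped by tool category so that balanced training sets can be drawn. The record structure and field names are hypothetical, not Bedrock Robotics' actual pipeline.

```python
from collections import defaultdict

# Hypothetical annotation records produced by a VLM labeling pass;
# video IDs, timestamps, and field names are illustrative only.
annotations = [
    {"video_id": "site_a_001", "start_s": 0, "end_s": 30, "tool": "trenching_bucket"},
    {"video_id": "site_a_001", "start_s": 30, "end_s": 90, "tool": "grading_beam"},
    {"video_id": "site_b_014", "start_s": 0, "end_s": 45, "tool": "lifting_hook"},
    {"video_id": "site_b_014", "start_s": 45, "end_s": 60, "tool": "hammer"},
]

def assemble_dataset(records, tools):
    """Group labeled video segments by tool so training sets can sample
    each equipment configuration evenly."""
    by_tool = defaultdict(list)
    for rec in records:
        if rec["tool"] in tools:
            by_tool[rec["tool"]].append((rec["video_id"], rec["start_s"], rec["end_s"]))
    return dict(by_tool)

dataset = assemble_dataset(
    annotations,
    {"trenching_bucket", "grading_beam", "lifting_hook", "hammer"},
)
```

In practice the grouped segment references would drive clip extraction from stored footage; the point is that once labels exist, dataset curation reduces to simple queries.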
Accelerating AI deployment through strategic model optimization
Off-the-shelf VLMs (VLMs without prompt optimization) struggle with construction video data because they’re trained on web images, not operator footage from excavator cabins. They can’t handle unusual angles, equipment-specific visuals, or poor visibility from dust and weather. They also lack the domain knowledge to distinguish visually similar tools like digging buckets from trenching buckets.
Bedrock Robotics and the Innovation Center addressed this through targeted model selection and prompt optimization. The teams evaluated multiple VLMs—including open source options and FMs available in Amazon Bedrock—then refined prompts with detailed visual descriptions of each tool, guidance for commonly confused tool pairs, and step-by-step instructions for analyzing video frames.
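The following sketch shows what such a prompt might look like, combining the three techniques described above: detailed visual descriptions per tool, disambiguation guidance for commonly confused pairs, and step-by-step analysis instructions. The wording and tool descriptions are illustrative assumptions, not the actual prompts used by the teams.

```python
# Hypothetical tool descriptions; real prompts would be tuned against footage.
TOOL_DESCRIPTIONS = {
    "lifting_hook": "a curved metal hook hanging from the bucket linkage, used for rigging loads",
    "hammer": "a thick cylindrical hydraulic breaker ending in a pointed chisel tip",
    "grading_beam": "a wide, flat horizontal blade used to level surfaces",
    "trenching_bucket": "a narrow bucket, much taller than it is wide, for digging trenches",
}

# Guidance for a commonly confused pair (illustrative).
CONFUSION_GUIDANCE = (
    "A digging bucket is wider than it is tall; a trenching bucket is "
    "noticeably narrower than the machine's arm. Check the width-to-height "
    "ratio before answering."
)

def build_prompt(tools=TOOL_DESCRIPTIONS):
    """Assemble a step-by-step classification prompt for a single video frame."""
    lines = [
        "You are analyzing a video frame captured from an excavator cabin.",
        "Step 1: Locate the end of the machine's arm.",
        "Step 2: Describe the attachment's shape and proportions.",
        "Step 3: Match it to exactly one category below.",
        "",
        "Categories:",
    ]
    lines += [f"- {name}: {desc}" for name, desc in tools.items()]
    lines += ["", CONFUSION_GUIDANCE, "", "Answer with only the category name."]
    return "\n".join(lines)

prompt = build_prompt()
```

A prompt like this would be sent alongside the frame image to the chosen VLM (for example, via the Amazon Bedrock runtime API); the constrained "answer with only the category name" instruction keeps outputs easy to parse downstream.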
These modifications enhanced the classification accuracy from 34% to 70% on a test set comprising 130 videos, at $10 per hour of video processing. These results demonstrate how prompt engineering adapts VLMs to specialized tasks. For Bedrock Robotics, this customization delivered faster training cycles, reduced time-to-deployment, and a cost-effective scalable annotation pipeline that evolves with operational needs.
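Measuring that improvement comes down to comparing model predictions against human labels on a held-out set. A minimal sketch of that evaluation, using toy video IDs (the actual test set comprised 130 videos):

```python
def classification_accuracy(predictions, ground_truth):
    """Fraction of videos where the predicted tool label matches the human label."""
    if not ground_truth:
        return 0.0
    correct = sum(
        1 for vid, label in ground_truth.items() if predictions.get(vid) == label
    )
    return correct / len(ground_truth)

# Illustrative data: the third prediction confuses a digging bucket
# for a trenching bucket, the error mode discussed earlier.
truth = {"v1": "hammer", "v2": "grading_beam", "v3": "trenching_bucket"}
preds = {"v1": "hammer", "v2": "grading_beam", "v3": "digging_bucket"}
acc = classification_accuracy(preds, truth)  # 2 of 3 correct
```

Tracking this single number per prompt revision is what makes iterative prompt engineering tractable: each change to tool descriptions or disambiguation guidance can be scored against the same labeled test set.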
The path forward: addressing labor shortages through automation
The competitive advantage: for Bedrock Robotics, vision-language systems enabled rapid identification and extraction of critical datasets, surfacing the necessary insights from massive volumes of construction video footage. With an overall accuracy of 70%, this cost-effective approach provides a practical foundation for scaling data preparation for model training. It demonstrates how strategic AI innovation can overcome workforce constraints and accelerate industry transformation. Organizations that streamline data preparation can accelerate autonomous system deployment, reduce operational costs, and explore new areas for growth in industries impacted by labor shortages. With this repeatable framework, manufacturing and industrial automation leaders facing similar challenges can apply these principles to drive competitive differentiation within their own domains.
To learn more, visit Bedrock Robotics or explore the physical AI resources on AWS.