Track machine learning experiments with MLflow on Amazon SageMaker using Snowflake...
A user can conduct machine learning (ML) data experiments in data environments, such as Snowflake, using the Snowpark library. However, tracking these experiments across...
Governance by design: The essential guide for successful AI scaling
Picture this: Your enterprise has just deployed its first generative AI application. The initial results are promising, but as you plan to scale across...
Unlocking video understanding with TwelveLabs Marengo on Amazon Bedrock
Media and entertainment, advertising, education, and enterprise training content combines visual, audio, and motion elements to tell stories and convey information, making it far...
How Tata Power CoE built a scalable AI-powered solar panel inspection...
This post is co-written with Vikram Bansal from Tata Power, and Gaurav Kankaria and Omkar Dhavalikar from Oneture. The global adoption of solar energy is rapidly...
Checkpointless training on Amazon SageMaker HyperPod: Production-scale training with faster fault...
Foundation model training has reached an inflection point where traditional checkpoint-based recovery methods are becoming a bottleneck to efficiency and cost-effectiveness. As models grow...
Adaptive infrastructure for foundation model training with elastic training on SageMaker...
Modern AI infrastructure serves multiple concurrent workloads on the same cluster, from foundation model (FM) pre-training and fine-tuning to production inference and evaluation. In...
NVIDIA Acquires Open-Source Workload Management Provider SchedMD
NVIDIA today announced it has acquired SchedMD — the leading developer of Slurm, an open-source workload management system for high-performance computing (HPC) and AI...
Applying data loading best practices for ML training with Amazon S3...
Amazon Simple Storage Service (Amazon S3) is a highly elastic service that automatically scales with application demand, offering the high throughput performance required for...
Operationalize generative AI workloads and scale to hundreds of use cases...
Enterprise organizations are rapidly moving beyond generative AI experiments to production deployments and complex agentic AI solutions, facing new challenges in scaling, security, governance,...
Customize agent workflows with advanced orchestration techniques using Strands Agents
Large Language Model (LLM) agents have revolutionized how we approach complex, multi-step tasks by combining the reasoning capabilities of foundation models with specialized tools...