
Accelerating LLM fine-tuning with unstructured data using SageMaker Unified Studio and S3


Last year, AWS announced an integration between Amazon SageMaker Unified Studio and Amazon S3 general purpose buckets. This integration makes it straightforward for teams to use unstructured data stored in Amazon Simple Storage Service (Amazon S3) for machine learning (ML) and data analytics use cases.

In this post, we show how to integrate S3 general purpose buckets with Amazon SageMaker Catalog to fine-tune Llama 3.2 11B Vision Instruct for visual question answering (VQA) using Amazon SageMaker Unified Studio. For this task, we provide our large language model (LLM) with an input image and a question and receive an answer. For example, we might ask the model to identify the transaction date from an itemized receipt:

For this demonstration, we use Amazon SageMaker JumpStart to access the Llama 3.2 11B Vision Instruct model. Out of the box, this base model achieves an Average Normalized Levenshtein Similarity (ANLS) score of 85.3% on the DocVQA dataset. ANLS is a metric used to evaluate the performance of models on visual question answering tasks; it measures the similarity between the model’s predicted answer and the ground truth answer. While 85.3% demonstrates strong baseline performance, it might not be sufficient for tasks requiring a higher degree of accuracy and precision.

To improve model performance through fine-tuning, we’ll use the DocVQA dataset from Hugging Face. This dataset contains 39,500 rows of training data, each with an input image, a question, and a corresponding expected answer. We’ll create three fine-tuned model versions using varying dataset sizes (1,000, 5,000, and 10,000 images). We’ll then evaluate them using Amazon SageMaker fully managed serverless MLflow to track experimentation and measure accuracy improvements.

The full end-to-end data ingestion, model development, and metric evaluation process will be orchestrated using Amazon SageMaker Unified Studio. Here is the high-level process flow diagram that we’ll step through for this scenario. We’ll expand on this throughout the blog post.

To achieve this process flow, we build an architecture that performs the data ingestion, data preprocessing, model training, and evaluation using Amazon SageMaker Unified Studio. We break out each step in the following sections.

The Jupyter notebook used and referenced throughout this exercise can be found in this GitHub repository.

Prerequisites

To prepare your organization to use the new integration between Amazon SageMaker Unified Studio and Amazon S3 general purpose buckets, you must complete the following prerequisites. Note that these steps take place on an Identity Center-based domain.

  1. Create an AWS account.
  2. Create an Amazon SageMaker Unified Studio domain using quick setup.
  3. Create two projects within the SageMaker Unified Studio domain to model the scenario in this post: one for the data producer persona and one for the data consumer persona. The first project is used for discovering and cataloging the dataset in an Amazon S3 bucket. The second project consumes the dataset to fine-tune three iterations of our large language model. See Create a project for additional information.
  4. Your data consumer project must have access to a running SageMaker managed MLflow serverless application, which will be used for experimentation and evaluation purposes. For more information, see the instructions for creating a serverless MLflow application.
  5. An Amazon S3 bucket should be pre-populated with the raw dataset to be used for your ML development use case. In this blog post, we use the DocVQA dataset from Hugging Face for fine-tuning a visual question answering (VQA) use case.
  6. A service quota increase request to use p4de.24xlarge compute for training jobs. See Requesting a quota increase for more information.

Architecture

The following is the reference architecture that we build throughout this post:

We can break the architecture diagram into a series of six high-level steps, which we’ll observe throughout the following sections:

  1. First, you create and configure an IAM access role that grants read permissions to a pre-existing Amazon S3 bucket containing the raw and unprocessed DocVQA dataset.
  2. The data producer project uses the access role to discover and add the dataset to the project catalog.
  3. The data producer project enriches the dataset with optional metadata and publishes it to the SageMaker Catalog.
  4. The data consumer project subscribes to the published dataset, making it available to the project team responsible for developing (or fine-tuning) the machine learning models.
  5. The data consumer project preprocesses the data and transforms it into three training datasets of varying sizes (1k, 5k, and 10k images). Each dataset is used to fine-tune our base large language model.
  6. We use MLflow for tracking experimentation and evaluation results of the three models against our Average Normalized Levenshtein Similarity (ANLS) success metric.

Solution walkthrough

As mentioned previously, we will opt to use the DocVQA dataset from Hugging Face for a visual question answering task. In your organization’s scenario, this raw dataset might be any unstructured data relevant to your ML use case. Examples include customer support chat logs, internal documents, product reviews, legal contracts, research papers, social media posts, email archives, sensor data, and financial transaction records.

In the prerequisite section of our Jupyter notebook, we pre-populate our Amazon S3 bucket using the Datasets API from Hugging Face:

import os
from datasets import load_dataset

# Create data directory
os.makedirs("data", exist_ok=True)

# Load and save train split (first 10,000 rows)
train_data = load_dataset("HuggingFaceM4/DocumentVQA", split="train[:10000]", cache_dir="./data")
train_data.save_to_disk("data/train")

# Load and save validation split (first 100 rows)
val_data = load_dataset("HuggingFaceM4/DocumentVQA", split="validation[:100]", cache_dir="./data")
val_data.save_to_disk("data/validation")

After retrieving the dataset, we complete the prerequisite by synchronizing it to an Amazon S3 bucket. This represents the bucket depicted in the bottom-right section of our architecture diagram shown previously.
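As a rough illustration of that synchronization step, the following sketch maps the saved splits under data/ to S3 object keys (the same layout that aws s3 sync produces) and uploads them with boto3. The bucket name, prefix, and helper names here are our own placeholders, not part of the sample notebook:

```python
import os

def s3_uploads(local_root, s3_prefix):
    """Map every file under local_root to the S3 key it should receive,
    preserving directory layout (the mapping `aws s3 sync` would produce)."""
    uploads = []
    for dirpath, _dirs, files in os.walk(local_root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, local_root)
            key = s3_prefix.rstrip("/") + "/" + rel.replace(os.sep, "/")
            uploads.append((path, key))
    return sorted(uploads)

def sync_to_s3(local_root, bucket, s3_prefix):
    """Upload the mapped files with boto3 (requires AWS credentials)."""
    import boto3  # imported here so the mapping helper stays dependency-free
    s3 = boto3.client("s3")
    for path, key in s3_uploads(local_root, s3_prefix):
        s3.upload_file(path, bucket, key)

# Example call (bucket and prefix are placeholders):
# sync_to_s3("data", "MY_BUCKET_NAME", "docvqa/raw")
```

The equivalent AWS CLI command is a single `aws s3 sync data/ s3://MY_BUCKET_NAME/docvqa/raw/`; the sketch above just makes the key mapping explicit.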

At this point, we’re ready to begin working with our data in Amazon SageMaker Unified Studio, starting with our data producer project. A project in Amazon SageMaker Unified Studio is a boundary within a domain where you can collaborate with others on a business use case. To bring Amazon S3 data into your project, you must first add access to the data and then add the data to your project. In this post, we follow the approach of using an access role to facilitate this process. See Adding Amazon S3 data for more information.

Once our access role is created following the instructions in the documentation referenced previously, we can continue with discovering and cataloging our dataset. In our data producer project, we navigate to Data → Add data → Add S3 location:

Provide the name of the Amazon S3 bucket and corresponding prefix containing our raw data, and note the presence of the access role dropdown containing the prerequisite access role previously created:

Once added, note that we can now see our new Amazon S3 bucket in the project catalog as shown in the following image:

From the perspective of our data producer persona, the dataset is now available within our project context. Depending on your organization and requirements, you might want to further enrich this data asset. For example, you can join it with additional data sources, apply business-specific transformations, implement data quality checks, or create derived features through feature engineering pipelines. However, for the purposes of this post, we’ll work with the dataset in its current form to keep our focus on the core point of integrating Amazon S3 general purpose buckets with Amazon SageMaker Unified Studio.

We are now ready to publish this bucket to our SageMaker Catalog. We can add optional business metadata such as a README file, glossary terms, and other data types. We add a simple README, skip other metadata fields for brevity, and continue to publishing by choosing Publish to Catalog under the Actions menu.

At this point, we’ve added the data asset to our SageMaker Catalog and it is ready to be consumed by other projects in our domain. Switching over to the perspective of our data consumer persona and selecting the consumer project, we can now subscribe to our newly published data asset. See Subscribe to a data product in Amazon SageMaker Unified Studio for more information.

Now that we’ve subscribed to the data asset in our consumer project where we’ll build the ML model, we can begin using it within a managed JupyterLab IDE in Amazon SageMaker Unified Studio. The JupyterLab page of Amazon SageMaker Unified Studio provides an interactive development environment (IDE) for data integration, analytics, and machine learning work in your projects.

In our ML development project, navigate to Compute → Spaces → Create space, and choose JupyterLab in the Application (space type) menu to launch a new JupyterLab IDE.

Note that some models in our example notebook can take upwards of 4 hours to train using the ml.p4de.24xlarge instance type. As a result, we recommend that you set the Idle Time to 6 hours to allow the notebook to run to completion and avoid errors. Additionally, if executing the notebook from end to end for the first time, set the space storage to 100 GB to allow for the dataset to be fully ingested during the fine-tuning process. See Creating a new space for more information.

With our space created and running, we choose the Open button to launch the JupyterLab IDE. Once loaded, we upload the sample Jupyter notebook into our space using the Upload Files functionality.

Now that we’ve subscribed to the published dataset in our ML development project, we can begin the model development workflow. This involves three key steps: fetching the dataset from our bucket using Amazon S3 Access Grants, preparing it for fine-tuning, and training our models.

Grantees can access Amazon S3 data by using the AWS Command Line Interface (AWS CLI), the AWS SDKs, and the Amazon S3 REST API. Additionally, you can use the AWS Python and Java plugins to call Amazon S3 Access Grants. For brevity, we opt for the AWS CLI approach in the notebook and the following code. We also include a sample that shows the use of the Python boto3-s3-access-grants-plugin in the appendix section of the notebook for reference.

The process includes two steps: first obtaining temporary access credentials to the Amazon S3 control plane through the s3control CLI module, then using those credentials to sync the data locally. Update the AWS_ACCOUNT_ID variable with the appropriate account ID that houses your dataset.

import json

AWS_ACCOUNT_ID = "123456789012" # REPLACE THIS WITH YOUR ACCOUNT ID
S3_BUCKET_NAME = "s3://MY_BUCKET_NAME/" # REPLACE THIS WITH YOUR BUCKET

# Get credentials
result = !aws s3control get-data-access --account-id {AWS_ACCOUNT_ID} --target {S3_BUCKET_NAME} --permission READ

json_response = json.loads(result.s)
creds = json_response['Credentials']

# Configure profile with cell magic
!aws configure set aws_access_key_id {creds['AccessKeyId']} --profile access-grants-consumer-access-profile
!aws configure set aws_secret_access_key {creds['SecretAccessKey']} --profile access-grants-consumer-access-profile
!aws configure set aws_session_token {creds['SessionToken']} --profile access-grants-consumer-access-profile

print("Profile configured successfully!")

!aws s3 sync {S3_BUCKET_NAME} ./ --profile access-grants-consumer-access-profile

After running the previous code and confirming a successful output, the raw dataset is accessible locally. We now need to transform it into the format required for fine-tuning our LLM. We’ll create three datasets of varying sizes (1k, 5k, and 10k images) to evaluate how dataset size impacts model performance.

Each training dataset contains a train and validation directory, each of which must contain an images subdirectory and accompanying metadata.jsonl file with training examples. The metadata file format includes three key/value fields per line:

{"file_name": "images/img_0.jpg", "prompt": "what is the date mentioned in this letter?", "completion": "1/8/93"}
{"file_name": "images/img_1.jpg", "prompt": "what is the contact person name mentioned in letter?", "completion": "P. Carter"}
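As an illustration, a preprocessing helper that emits this layout might look like the following sketch. The field names match the lines above, but this is our own simplified stand-in, not the notebook's actual process_data implementation:

```python
import json
import os

def write_metadata(examples, out_dir):
    """Write the images/ subdirectory plus a metadata.jsonl file in the
    layout described above. Each example is assumed to be a dict with
    keys: image_bytes, question, answer."""
    images_dir = os.path.join(out_dir, "images")
    os.makedirs(images_dir, exist_ok=True)
    meta_path = os.path.join(out_dir, "metadata.jsonl")
    with open(meta_path, "w") as meta:
        for i, ex in enumerate(examples):
            file_name = f"images/img_{i}.jpg"
            # Save the image alongside its metadata entry
            with open(os.path.join(out_dir, file_name), "wb") as img:
                img.write(ex["image_bytes"])
            meta.write(json.dumps({
                "file_name": file_name,
                "prompt": ex["question"],
                "completion": ex["answer"],
            }) + "\n")
    return meta_path
```

Running this once each for the train and validation splits, at each of the three dataset sizes, yields the directory structure the fine-tuning job expects.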

With these artifacts uploaded to Amazon S3, we can now fine-tune our LLM by using SageMaker JumpStart to access the pre-trained Llama 3.2 11B Vision Instruct model. We’ll create three separate fine-tuned variants to evaluate. We’ve created a train() function to facilitate this using a parameterized approach, making this reusable for different dataset sizes:

def train(name, instance_type, training_data_path, experiment_name, run):
    ...
    estimator = JumpStartEstimator(
        model_id=model_id, model_version=model_version,
        environment={"accept_eula": "true"},  # Must accept as true
        disable_output_compression=True,
        instance_type=instance_type,
        hyperparameters=my_hyperparameters,
    )
    ...

Our training function handles several important aspects:

  • Model selection: Uses the latest version of Llama 3.2 11B Vision Instruct from SageMaker JumpStart.
  • Hyperparameters: The sample notebook uses the retrieve_default() API in the SageMaker SDK to automatically fetch the default hyperparameters for our model.
  • Batch size: The only default hyperparameter that we change, setting to 1 per device due to the large model size and memory constraints.
  • Instance type: We use a ml.p4de.24xlarge instance type for this training job and recommend that you use the same type or larger.
  • MLflow integration: Automatically logs hyperparameters, job names, and training metadata for experiment tracking.
  • Endpoint deployment: Automatically deploys each trained model to a SageMaker endpoint for inference.

Recall that the training process will take a few hours to complete using instance type ml.p4de.24xlarge.

Now we’ll evaluate our fine-tuned models using the Average Normalized Levenshtein Similarity (ANLS) metric. This metric evaluates text-based outputs by measuring the similarity between predicted and ground truth answers, even when there are minor errors or variations. It is particularly useful for tasks like visual question answering because it can handle slight variations in answers. See the Llama 3.2 11B Vision model card for more information.
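For reference, here is a minimal sketch of how a single-answer ANLS helper (like the anls_metric_single function used in the pipeline) could be implemented: case-insensitive normalized Levenshtein similarity, zeroed below a 0.5 threshold. The notebook's actual implementation may differ in detail:

```python
def levenshtein(a, b):
    """Classic edit distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls_metric_single(prediction, ground_truth, threshold=0.5):
    """1 minus the normalized edit distance, zeroed when similarity
    falls below the threshold (0.5, per the standard ANLS definition)."""
    pred, gt = prediction.strip().lower(), ground_truth.strip().lower()
    if not pred and not gt:
        return 1.0
    nl = levenshtein(pred, gt) / max(len(pred), len(gt))
    similarity = 1.0 - nl
    return similarity if similarity >= threshold else 0.0
```

The threshold prevents near-random answers from accumulating partial credit; only predictions at least half-similar to the ground truth contribute to the average.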

MLflow will track our experiments and results for straightforward comparison. Our evaluation pipeline includes several key functions for image encoding for model inference, payload formatting, ANLS calculation, and results tracking. The training_pipeline() function orchestrates the complete workflow with nested MLflow runs for better experiment organization.
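For instance, the image-encoding and payload-formatting helpers could be sketched as follows. The payload schema shown here is hypothetical; the exact request body the deployed endpoint expects is defined in the sample notebook:

```python
import base64

def encode_image(image_path):
    """Read an image file and base64-encode it for a JSON inference payload."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_payload(question, image_path, max_new_tokens=128):
    """Assemble a (hypothetical) multimodal VQA request body pairing the
    question with the base64-encoded image."""
    return {
        "prompt": question,
        "image": encode_image(image_path),
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.0},
    }
```

Setting temperature to 0 keeps the evaluation deterministic, so ANLS differences across runs reflect the models rather than sampling noise.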

import json
from datetime import datetime

import mlflow
from sagemaker.predictor import retrieve_default

# MLflow configuration
arn = "" # replace with ARN of project's MLflow instance
mlflow.set_tracking_uri(arn)

def training_pipeline(training_size):
    # Set experiment
    experiment_name = f"docvqa-{training_size}"
    mlflow.set_experiment(experiment_name)
    
    # Start main run
    with mlflow.start_run(run_name="pipeline-run"):
        
        # DataPreprocess nested run
        with mlflow.start_run(run_name="DataPreprocess", nested=True):
            training_data_path = process_data("train", f"docvqa_{training_size}/train", training_size)
        
        # TrainDeploy nested run
        with mlflow.start_run(run_name="TrainDeploy", nested=True) as run:
            model_name = train(f"docvqa-{training_size}", "ml.p4de.24xlarge", training_data_path, experiment_name, run)
            #model_name = 'base-model'
        
        # Evaluate nested run
        with mlflow.start_run(run_name="Evaluate", nested=True):

            # Load validation data
            with open("./docvqa_1k/validation/metadata.jsonl") as f:
                data = [json.loads(line) for line in f]            
            
            print(f"\nStarting validation for {model_name}")
            
            # Log parameters
            mlflow.log_param("model_name", model_name)
            mlflow.log_param("total_images", len(data[:50]))
            mlflow.log_param("threshold", 0.5)

            predictor = retrieve_default(model_id="meta-vlm-llama-3-2-11b-vision-instruct", model_version="*", endpoint_name=model_name)
            
            results = []
            anls_scores = []
            
            # Process each image
            for i, each in enumerate(data[:50]):
                filename = each['file_name']
                question = each["prompt"]
                ground_truth = each["completion"]
                image_path = f"./docvqa_1k/validation/{filename}"
                
                print(f"Processing {filename} ({i+1}/50)")
                
                # Get model prediction using traced function
                inferred_response = invoke_model(predictor, question, image_path)
                
                # Calculate ANLS score
                anls_score = anls_metric_single(inferred_response, ground_truth)
                anls_scores.append(anls_score)
                
                # Store result
                result = {
                    'filename': filename,
                    'ground_truth': ground_truth,
                    'inferred_response': inferred_response,
                    'anls_score': anls_score
                }
                results.append(result)
                
                print(f"  Ground Truth: {ground_truth}")
                print(f"  Prediction: {inferred_response}")
                print(f"  ANLS Score: {anls_score:.4f}")
            
            # Calculate average ANLS score
            avg_anls = sum(anls_scores) / len(anls_scores) if anls_scores else 0.0
            
            # Log metrics
            mlflow.log_metric("average_anls_score", avg_anls)
            
            # Save results to CSV
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            csv_filename = f"anls_validation_{model_name}_{timestamp}.csv"
            save_results_to_csv(results, csv_filename)
            
            # Log CSV as artifact
            mlflow.log_artifact(csv_filename)
            
            print(f"Results for {model_name}:")
            print(f"  Average ANLS Score: {avg_anls:.4f}")
            
            mlflow.log_param("metric_type", "anls")

After orchestrating three end-to-end executions for our three dataset sizes, we review the ANLS metric results in MLflow. Using the comparison functionality, we note the highest ANLS score of 0.902 in the docvqa-10000 model, an increase of 4.9 percentage points relative to the base model (0.902 − 0.853 = 0.049).

Model          ANLS
docvqa-1000    0.886
docvqa-5000    0.894
docvqa-10000   0.902
Base Model     0.853

Clean Up

To avoid ongoing charges, delete the resources created during this walkthrough. This includes SageMaker endpoints and project resources such as the MLflow application, JupyterLab IDE, and domain.

Conclusion

Based on the preceding data, we observe a positive relationship between training dataset size and ANLS score: each increase in dataset size improved performance, with the docvqa-10000 model achieving the highest score.

We used MLflow for experimentation and visualization around our success metric. Further improvements in areas such as hyperparameter tuning and data enrichment could yield even better results.

This walkthrough demonstrates how the Amazon SageMaker Unified Studio integration with S3 general purpose buckets helps streamline the path from unstructured data to production-ready ML models. Key benefits include:

  • Simplified data discovery and cataloging through a unified interface
  • More secure data access through S3 Access Grants without complex permission management
  • Smooth collaboration between data producers and consumers across projects
  • End-to-end experiment tracking with managed MLflow integration

Organizations can now use their existing S3 data assets more effectively for ML workloads while maintaining governance and security controls. The 4.9 percentage point improvement from the base model to our best fine-tuned variant (0.853 to 0.902 ANLS) validates the approach for visual question answering tasks.

For next steps, consider exploring additional dataset preprocessing techniques, experimenting with different model architectures available through SageMaker JumpStart, or scaling to larger datasets as your use case demands.

The solution code used for this blog post can be found in this GitHub repository.


About the authors

Hazim Qudah

Hazim Qudah is an AI/ML Specialist Solutions Architect at Amazon Web Services. He enjoys helping customers build and adopt AI/ML solutions using AWS technologies and best practices. Prior to his role at AWS, he spent many years in technology consulting with customers across many industries and geographies. In his free time, he enjoys running and playing with his dogs Nala and Chai!