Organizations are constantly seeking ways to harness the power of advanced large language models (LLMs) to enable a wide range of applications such as text generation, summarization, question answering, and many others. As these models grow more powerful and capable, deploying them in production environments while optimizing performance and cost-efficiency becomes more challenging.
Amazon Web Services (AWS) provides highly optimized and cost-effective solutions for deploying AI models, such as the Mixtral 8x7B language model, for inference at scale. AWS Inferentia and AWS Trainium are purpose-built AWS AI chips that deliver high-throughput, low-latency inference and training performance for even the largest deep learning models. The Mixtral 8x7B model adopts the Mixture-of-Experts (MoE) architecture with eight experts. AWS Neuron, the SDK used to run deep learning workloads on AWS Inferentia and AWS Trainium based instances, employs expert parallelism for the MoE architecture, sharding the eight experts across multiple NeuronCores.
This post demonstrates how to deploy and serve the Mixtral 8x7B language model on AWS Inferentia2 instances for cost-effective, high-performance inference. We'll walk through model compilation using Hugging Face Optimum Neuron, which provides a set of tools enabling straightforward model loading, training, and inference, and the Text Generation Inference (TGI) Container, which provides the toolkit for deploying and serving LLMs with Hugging Face. This will be followed by deployment to an Amazon SageMaker real-time inference endpoint, which automatically provisions and manages the Inferentia2 instances behind the scenes and provides a containerized environment to run the model securely and at scale.
While pre-compiled model versions exist, we’ll cover the compilation process to illustrate important configuration options and instance sizing considerations. This end-to-end guide combines Amazon Elastic Compute Cloud (Amazon EC2)-based compilation with SageMaker deployment to help you use Mixtral 8x7B’s capabilities with optimal performance and cost efficiency.
Step 1: Set up Hugging Face access
Before you can deploy the Mixtral 8x7B model, there are some prerequisites that you need to have in place:
- The model is hosted on Hugging Face and uses their transformers library. To download and use the model, you need to authenticate with Hugging Face using a user access token. These tokens allow secure access for applications and notebooks to Hugging Face’s services. You first need to create a Hugging Face account if you don’t already have one, which you can then use to generate and manage your access tokens through the user settings.
- The mistralai/Mixtral-8x7B-Instruct-v0.1 model that you will be working with in this post is a gated model. This means that you need to specifically request access from Hugging Face before you can download and work with the model.
Step 2: Launch an Inferentia2-powered EC2 Inf2 instance
To get started with an Amazon EC2 Inf2 instance for deploying the Mixtral 8x7B, either deploy the AWS CloudFormation template or use the AWS Management Console.
To launch an Inferentia2 instance using the console:
- Navigate to the Amazon EC2 console and choose Launch Instance.
- Enter a descriptive name for your instance.
- Under Application and OS Images, search for and select the Hugging Face Neuron Deep Learning AMI, which comes pre-configured with the Neuron software stack for AWS Inferentia.
- For Instance type, select inf2.24xlarge, which contains six Inferentia2 chips (12 NeuronCores).
- Create or select an existing key pair to enable SSH access.
- Create or select a security group that allows inbound SSH connections from the internet.
- Under Configure Storage, set the root EBS volume to 512 GiB to accommodate the large model size.
- After the settings are reviewed, choose Launch Instance.
With your Inf2 instance launched, connect to it over SSH by first locating the public IP or DNS name in the Amazon EC2 console. Later in this post, you will connect to a Jupyter notebook using a browser on port 8888. To do that, create an SSH tunnel to the instance using the key pair you configured during instance creation.
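For example, the tunnel can be opened together with the SSH session. The key file name and host are placeholders, and the `ubuntu` user name is an assumption based on the Ubuntu-based Hugging Face Neuron AMI; substitute your own values.

```bash
# Open an SSH session and forward local port 8888 to the instance so the Jupyter
# notebook used later in this post is reachable at http://localhost:8888
ssh -i ~/.ssh/<your-key-pair>.pem -L 8888:localhost:8888 ubuntu@<instance-public-dns>
```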
After signing in, list the NeuronCores attached to the instance and their associated topology:
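The Neuron tools are preinstalled on the Hugging Face Neuron AMI, so you can run the device listing command directly:

```bash
neuron-ls
```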
For inf2.24xlarge, the output lists six Neuron devices (12 NeuronCores in total).
For more information on the `neuron-ls` command, see the Neuron LS User Guide.
Make sure the Inf2 instance is sized correctly to host the model. Each NeuronCore in an Inferentia2 chip has 16 GB of high-bandwidth memory (HBM). To accommodate an LLM like Mixtral 8x7B on AWS Inferentia2 (Inf2) instances, a technique called tensor parallelism is used. This allows the model's weights, activations, and computations to be split and distributed across multiple NeuronCores in parallel. To determine the degree of tensor parallelism required, you need to calculate the total memory footprint of the model. This can be computed as:
total memory = bytes per parameter * number of parameters
The Mixtral-8x7B model consists of 46.7 billion parameters. With weights cast to float16, you need 93.4 GB to store the model weights. The total space required is often greater than just the model parameters because of the caching of attention layer projections (KV caching). This caching mechanism grows memory allocations linearly with sequence length and batch size. With a batch size of 1 and a sequence length of 1024 tokens, the total memory footprint for the caching is 0.5 GB. The exact formula can be found in the AWS Neuron documentation, and the hyperparameter configuration required for these calculations is stored in the model's config.json file.
Given that each NeuronCore has 16 GB of HBM, and the model requires approximately 94 GB of memory, a minimum tensor parallelism degree of 6 would theoretically suffice. However, the tensor parallelism degree must evenly divide the number of attention heads (32).
Furthermore, considering the model's size and the MoE implementation in `transformers-neuronx`, the supported tensor parallelism degrees are limited to 8, 16, and 32. For the example in this post, you will distribute the model across eight NeuronCores.
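As a quick sanity check, the sizing reasoning above can be written out in a few lines of Python. This is only a back-of-the-envelope sketch using the numbers from this section (fp16 weights, a 0.5 GB KV cache, 32 attention heads, and the tensor parallelism degrees supported by transformers-neuronx).

```python
import math

# Figures taken from this section
num_parameters = 46.7e9             # Mixtral-8x7B parameter count
bytes_per_parameter = 2             # float16 weights
hbm_per_core_gb = 16                # HBM per Inferentia2 NeuronCore
attention_heads = 32
supported_tp_degrees = [8, 16, 32]  # limited by the MoE implementation in transformers-neuronx

weights_gb = num_parameters * bytes_per_parameter / 1e9  # ~93.4 GB
kv_cache_gb = 0.5                                        # batch size 1, sequence length 1024
total_gb = weights_gb + kv_cache_gb

min_cores = math.ceil(total_gb / hbm_per_core_gb)        # 6 cores by memory alone
tp_degree = next(d for d in supported_tp_degrees
                 if d >= min_cores and attention_heads % d == 0)

print(f"Total footprint: {total_gb:.1f} GB -> tensor parallelism degree {tp_degree}")
```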
Compile the Mixtral-8x7B model for AWS Inferentia2
The Neuron SDK includes a specialized compiler that automatically optimizes the model format for efficient execution on AWS Inferentia2.
- To start this process, launch the container and pass the Inferentia devices to it. For more information about launching the neuronx-tgi container, see Deploy the Text Generation Inference (TGI) Container on a dedicated host.
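A sketch of the launch command is shown below. The `ghcr.io/huggingface/neuronx-tgi` image tag and the mounted working directory are assumptions; check the linked TGI documentation for the exact image and version you want to use.

```bash
# Start the neuronx-tgi container with an interactive shell and expose the six
# Neuron devices of the inf2.24xlarge instance to the container
docker run -it --entrypoint /bin/bash \
  --device=/dev/neuron0 --device=/dev/neuron1 --device=/dev/neuron2 \
  --device=/dev/neuron3 --device=/dev/neuron4 --device=/dev/neuron5 \
  -v $(pwd):/workspace \
  ghcr.io/huggingface/neuronx-tgi:latest
```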
- Inside the container, sign in to the Hugging Face Hub to access gated models such as Mixtral-8x7B-Instruct-v0.1 (see Step 1: Set up Hugging Face access). Make sure to use a token with read and write permissions so you can later save the compiled model to the Hugging Face Hub.
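For example, using the token you created in Step 1 (the token value is a placeholder):

```bash
huggingface-cli login --token <your_hf_token>
```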
- After signing in, compile the model with optimum-cli. This process will download the model artifacts, compile the model, and save the results in the specified directory.
- The Neuron chips are designed to execute models with fixed input shapes for optimal performance, so the compiled artifact's shapes must be known at compilation time. In the following command, you will set the batch size, input/output sequence length, data type, and tensor parallelism degree (number of NeuronCores). For more information about these parameters, see Export a model to Inferentia.
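The following sketch uses the values discussed below (batch size 1, sequence length 1024, fp16 casting, and eight NeuronCores); the `./neuron_model_path` output directory is a choice that matches the upload command later in this post.

```bash
# Compile Mixtral-8x7B-Instruct for Inferentia2 with fixed input shapes
optimum-cli export neuron \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --batch_size 1 \
  --sequence_length 1024 \
  --auto_cast_type fp16 \
  --num_cores 8 \
  ./neuron_model_path
```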
Let’s discuss these parameters in more detail:
- The `batch_size` parameter is the number of input sequences that the model will accept.
- `sequence_length` specifies the maximum number of tokens in an input sequence. This affects memory usage and model performance during inference or training on Neuron hardware. A larger value increases the model's memory requirements because the attention mechanism needs to operate over the entire sequence, which leads to more computation and memory usage; a smaller value does the opposite. The value 1024 is adequate for this example.
- The `auto_cast_type` parameter controls quantization. It allows type casting for the model weights and computations during inference. The options are `bf16`, `fp16`, or `tf32`. For more information about defining which lower-precision data type the compiler should use, see Mixed Precision and Performance-accuracy Tuning. For models trained in float32, the 16-bit mixed-precision options (`bf16`, `fp16`) generally provide sufficient accuracy while significantly improving performance. We use the data type float16 with the argument `auto_cast_type fp16`.
- The `num_cores` parameter controls the number of cores on which the model should be deployed. This dictates the number of parallel shards or partitions the model is split into. Each shard is then executed on a separate NeuronCore, taking advantage of the 16 GB of high-bandwidth memory available per core. As discussed in the previous section, given the Mixtral-8x7B model's requirements, Neuron supports tensor parallelism degrees of 8, 16, or 32. The inf2.24xlarge instance contains 12 NeuronCores, so to optimally distribute the model we set `num_cores` to 8.
- Download and compilation should take 10–20 minutes. After the compilation completes successfully, you can check the artifacts created in the output directory:
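For example, assuming the `./neuron_model_path` output directory used above:

```bash
# The directory should contain the model config, tokenizer files, and the compiled Neuron artifacts
ls -la ./neuron_model_path
```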
- Push the compiled model to the Hugging Face Hub with the following command. Make sure to change `<user_id>` to your Hugging Face username. If the model repository doesn't exist, it will be created automatically. Alternatively, store the model on Amazon Simple Storage Service (Amazon S3).
```bash
huggingface-cli upload <user_id>/Mixtral-8x7B-Instruct-v0.1 ./neuron_model_path ./
```
Deploy Mixtral-8x7B to a SageMaker real-time inference endpoint
Now that the model has been compiled and stored, you can deploy it for inference using SageMaker. To orchestrate the deployment, you will run Python code from a notebook hosted on an EC2 instance. You can use the instance created in the first section or create a new instance. Note that this EC2 instance can be of any type (for example, a `t2.micro` with an Amazon Linux 2023 image). Alternatively, you can use a notebook hosted in Amazon SageMaker Studio.
Set up AWS authorization for SageMaker deployment
You need AWS Identity and Access Management (IAM) permissions to manage SageMaker resources. If you created the instance with the provided CloudFormation template, these permissions are already created for you. If not, the following section takes you through the process of setting up the permissions for an EC2 instance to run a notebook that deploys a real-time SageMaker inference endpoint.
Create an AWS IAM role and attach SageMaker permission policy
- Go to the IAM console.
- Choose the Roles tab in the navigation pane.
- Choose Create role.
- Under Select trusted entity, select AWS service.
- For Use case, select EC2 (allows EC2 instances to call AWS services on your behalf).
- Choose Next: Permissions.
- In the Add permissions policies screen, select AmazonSageMakerFullAccess and IAMReadOnlyAccess. Note that the AmazonSageMakerFullAccess permission is overly permissive. We use it in this example to simplify the process but recommend applying the principle of least privilege when setting up IAM permissions.
- Choose Next: Review.
- In the Role name field, enter a role name.
- Choose Create role to complete the creation.
- With the role created, choose the Roles tab in the navigation pane and select the role you just created.
- Choose the Trust relationships tab and then choose Edit trust policy.
- Choose Add next to Add a principal.
- For Principal type, select AWS services.
- Enter `sagemaker.amazonaws.com` and choose Add a principal.
- Choose Update policy. Your trust relationship should look like the following:
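Assuming you kept the default EC2 trust relationship and added the SageMaker service principal, the policy should be similar to the following:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "ec2.amazonaws.com",
          "sagemaker.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```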
Attach the IAM role to your EC2 instance
- Go to the Amazon EC2 console.
- Choose Instances in the navigation pane.
- Select your EC2 instance.
- Choose Actions, Security, and then Modify IAM role.
- Select the role you created in the previous step.
- Choose Update IAM role.
Launch a Jupyter notebook
Your next goal is to run a Jupyter notebook hosted in a container running on the EC2 instance. The notebook will be run using a browser on port 8888 by default. For this example, you will use SSH port forwarding from your local machine to the instance to access the notebook.
- Continuing from the previous section, you are still within the container. Install Jupyter Notebook inside the container (see the sketch after this list).
- Launch the notebook server.
- Then connect to the notebook using your browser over the SSH tunnel:
http://localhost:8888/tree?token=…
If you get a blank screen, try opening this address using your browser’s incognito mode.
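A minimal sketch of these steps, assuming you are still root inside the container and that the SSH tunnel to port 8888 from the earlier section is in place:

```bash
# Inside the container: install and start the Jupyter Notebook server
pip install notebook
jupyter notebook --port 8888 --no-browser --allow-root

# On your local machine (if the tunnel is not already open):
# ssh -i ~/.ssh/<your-key-pair>.pem -L 8888:localhost:8888 ubuntu@<instance-public-dns>
```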
Deploy the model for inference with SageMaker
After connecting to Jupyter Notebook, follow this notebook. Alternatively, choose File, New, Notebook, and then select Python 3 as the kernel. Use the following instructions and run the notebook cells.
- In the notebook, install the `sagemaker` and `huggingface_hub` libraries.
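For example, in the first notebook cell:

```python
# Install the SageMaker Python SDK and the Hugging Face Hub client
%pip install --upgrade sagemaker huggingface_hub
```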
- Next, get a SageMaker session and execution role that will allow you to create and manage SageMaker resources. You’ll use a Deep Learning Container.
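A sketch of this setup follows. The role name is a placeholder for the IAM role you created earlier, and the `huggingface-neuronx` backend of `get_huggingface_llm_image_uri` is assumed to resolve to the TGI Neuron Deep Learning Container for your Region and SDK version.

```python
import boto3
import sagemaker
from sagemaker.huggingface import get_huggingface_llm_image_uri

sess = sagemaker.Session()
region = sess.boto_region_name

# Outside SageMaker (for example, on EC2) there is no default execution role,
# so resolve the ARN of the role you created earlier through IAM.
# "sagemaker-deployment-role" is a placeholder name.
iam = boto3.client("iam")
role = iam.get_role(RoleName="sagemaker-deployment-role")["Role"]["Arn"]

# Hugging Face TGI Deep Learning Container for Neuron (Inferentia2)
image_uri = get_huggingface_llm_image_uri("huggingface-neuronx")
print(region, role, image_uri)
```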
- Deploy the compiled model to a SageMaker real-time endpoint on AWS Inferentia2.
Change `user_id` in the following code to your Hugging Face username. Make sure to update `HF_MODEL_ID` and `HUGGING_FACE_HUB_TOKEN` with your Hugging Face username and your access token.
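A sketch of the model definition is below. `HF_MODEL_ID` and `HUGGING_FACE_HUB_TOKEN` are the values called out above; the remaining environment variables are assumptions that mirror the shapes and settings used at compilation time, so adjust them to match your configuration.

```python
from sagemaker.huggingface import HuggingFaceModel

# Environment passed to the TGI Neuron container. HF_MODEL_ID points at the compiled
# model repository pushed earlier; the core count, cast type, and token limits should
# match the values used during compilation.
hub_env = {
    "HF_MODEL_ID": "<user_id>/Mixtral-8x7B-Instruct-v0.1",
    "HUGGING_FACE_HUB_TOKEN": "<your_hf_token>",
    "HF_NUM_CORES": "8",
    "HF_AUTO_CAST_TYPE": "fp16",
    "MAX_BATCH_SIZE": "1",
    "MAX_INPUT_TOKENS": "512",
    "MAX_TOTAL_TOKENS": "1024",
}

model = HuggingFaceModel(
    image_uri=image_uri,  # TGI Neuron container from the previous step
    env=hub_env,
    role=role,
)
```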
- You're now ready to deploy the model to a SageMaker real-time inference endpoint. SageMaker will provision the necessary compute resources and retrieve and launch the inference container. The container downloads the model artifacts from your Hugging Face repository, loads the model onto the Inferentia devices, and starts serving inference. This process can take several minutes.
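The deployment call itself can look like the following sketch; the long health check timeout leaves time for the sharded model to be loaded onto the NeuronCores.

```python
# Deploy to a real-time endpoint backed by an Inferentia2 instance
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.24xlarge",
    volume_size=512,                              # room for the model artifacts
    container_startup_health_check_timeout=1800,  # seconds; model loading is slow
)
print(predictor.endpoint_name)
```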
- Next, run a test to check the endpoint. Update `user_id` to match your Hugging Face username, then create the prompt and parameters (see the sketch after the next step).
- Send the prompt to the SageMaker real-time endpoint for inference:
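A sketch of the test request, continuing from the deployment cell above. Loading the tokenizer from your repository to apply the Mixtral chat template is an assumption (it requires the transformers library in the notebook); you can also format the [INST] prompt manually.

```python
from transformers import AutoTokenizer

user_id = "<user_id>"  # your Hugging Face username
tokenizer = AutoTokenizer.from_pretrained(f"{user_id}/Mixtral-8x7B-Instruct-v0.1")

# Build an instruction-formatted prompt using the Mixtral chat template
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

payload = {
    "inputs": prompt,
    "parameters": {
        "max_new_tokens": 128,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True,
    },
}

# Send the request to the SageMaker endpoint and print the generated text
response = predictor.predict(payload)
print(response)
```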
- In the future, if you want to connect to this inference endpoint from other applications, first find the name of the inference endpoint. You can retrieve it from the predictor object in your notebook, or use the SageMaker console and choose Inference, and then Endpoints, to see a list of the SageMaker endpoints deployed in your account.
- Use the endpoint name to update the following code, which can also be run in other locations.
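For example, the following sketch calls the endpoint with boto3 from any environment that has AWS credentials; the endpoint name is a placeholder.

```python
import json

import boto3

endpoint_name = "<your-endpoint-name>"  # from predictor.endpoint_name or the SageMaker console
smr = boto3.client("sagemaker-runtime")

payload = {
    "inputs": "[INST] What is the capital of France? [/INST]",
    "parameters": {"max_new_tokens": 128},
}

response = smr.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))
```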
Cleanup
Delete the endpoint to prevent future charges for the provisioned resources.
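For example, from the same notebook session:

```python
# Delete the model and the endpoint (including its endpoint configuration)
predictor.delete_model()
predictor.delete_endpoint()
```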
Conclusion
In this post, we covered how to compile and deploy the Mixtral 8x7B language model on AWS Inferentia2 using the Hugging Face Optimum Neuron container and Amazon SageMaker. AWS Inferentia2 offers a cost-effective solution for hosting models like Mixtral, providing high-performance inference at a lower cost.
For more information, see Deploy Mixtral 8x7B on AWS Inferentia2 with Hugging Face Optimum.
For other methods to compile and run Mixtral inference on Inferentia2 and Trainium see the Run Hugging Face mistralai/Mixtral-8x7B-v0.1 autoregressive sampling on Inf2 & Trn1 tutorial located in the AWS Neuron Documentation and Notebook.
About the authors
Lior Sadan is a Senior Solutions Architect at AWS, with an affinity for storage solutions and AI/ML implementations. He helps customers architect scalable cloud systems and optimize their infrastructure. Outside of work, Lior enjoys hands-on home renovation and construction projects.
Stenio de Lima Ferreira is a Senior Solutions Architect passionate about AI and automation. With over 15 years of work experience in the field, he has a background in cloud infrastructure, DevOps, and data science. He specializes in codifying complex requirements into reusable patterns and breaking down difficult topics into accessible content.