Llama 3.1 405B Inference Service Deployment: Beginner's Guide
Introduction
This article uses an 8 x H100 GPU instance to show how to deploy Llama 3.1-405B. Deploying a model this large yourself is a time-consuming and costly endeavour. To avoid that tedious and expensive work, you can instead use Llama 3.1-405B through an established model provider platform and call it for inference via the OpenAI-compatible API standard. Such services are user-friendly, and their pay-as-you-go pricing is cost-effective and manageable. We recommend the Llama 3.1-405B inference API service provided by Novita AI.
Llama3.1–405B Deployment Requirements
Meta's Llama 3.1 release includes three model sizes: 8B, 70B, and 405B. With 405 billion parameters, Llama 3.1-405B is the largest open-source large language model to date. In Meta's published evaluations, it outperforms GPT-4 and GPT-4o and is on par with Claude 3.5 Sonnet.
Loading such a large model onto GPUs is difficult. The original FP16 version of the 405B model requires about 810GB of GPU memory just for the weights. Even the most powerful widely available GPU, the H100, cannot hold it: an 8 x H100 server provides only 640GB of GPU memory in total, so it cannot load this version of the model directly. The FP16 weights must be quantized to a lower-precision representation, reducing the memory requirement enough for the model to fit on the GPUs.
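As a quick sanity check, a rough back-of-the-envelope estimate shows why quantization is required (a sketch that only counts weight memory and ignores the KV cache and runtime overhead, so real usage is higher):

# Rough weight-memory estimate for Llama 3.1-405B at different precisions.
# Ignores KV cache, activations and framework overhead, so real usage is higher.
PARAMS = 405e9                                   # 405 billion parameters
BYTES_PER_PARAM = {"FP16": 2, "FP8": 1, "INT4": 0.5}
TOTAL_GPU_MEMORY_GB = 8 * 80                     # 8 x H100 with 80GB each = 640GB

for precision, nbytes in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * nbytes / 1e9
    verdict = "fits" if weights_gb < TOTAL_GPU_MEMORY_GB else "does not fit"
    print(f"{precision}: ~{weights_gb:.0f}GB of weights -> {verdict} in {TOTAL_GPU_MEMORY_GB}GB")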
The Llama 3.1 models share the same architecture as the Llama 3 models, which makes it easy for inference frameworks to adopt them. Open-source inference solutions such as vLLM were quickly adapted to support the large Llama 3.1-405B.
To deploy an inference service for the Llama3.1–405B, the three key preparations are:
- Hardware: We recommend renting an 8 x H100 GPU server instance and reserving around 1.5TB of storage space.
- Model: Prepare a Hugging Face account and download the original Llama3.1–405B or the FP8 or INT4 versions of the model.
- Inference Framework: Download vLLM v0.5.3.post1 or later.
Model Preparation
After preparing an 8 x H100 GPU server, log in to the server and download the model. Below we describe how to download the FP16 version of the model and convert it to FP8 and INT4 quantized versions (of course, you can also download pre-quantized models from Hugging Face).
We recommend downloading the Instruct version from the Hugging Face platform. First, register and log in to Hugging Face, then create an Access Token on the Settings page and save it; it will be needed when downloading the model.
Open the Llama3.1–405B model page: https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct, submit a model application to Meta, and wait about half an hour for authorization.
Back on the GPU server, install the Hugging Face client application and start downloading the model. The command is as follows:
pip install huggingface-hub
huggingface-cli login ## Input Access Token
huggingface-cli download meta-llama/Meta-Llama-3.1-405B-Instruct ## Start Downloading 405B
After a long wait (depending on your network speed), the roughly 800GB 405B model will be downloaded to the local machine. Run “huggingface-cli scan-cache” to view detailed information about the cached model.
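If you prefer to script the download instead of using the CLI, the same can be done with the huggingface_hub Python API (a minimal sketch; the token value is a placeholder for your own Access Token):

from huggingface_hub import login, snapshot_download

# Authenticate with the Access Token created on the Hugging Face Settings page.
login(token="hf_xxx")  # placeholder: use your own token

# Download the full 405B Instruct repository into the local Hugging Face cache.
snapshot_download(repo_id="meta-llama/Meta-Llama-3.1-405B-Instruct")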
Next, we will do FP8 and INT4 quantization on the original model.
First, download the FP8 quantization tool. You can use the open-source AutoFP8: https://github.com/neuralmagic/AutoFP8. Its scripts quantize the model using the dynamic activation scheme. Do this as follows:
git clone https://github.com/neuralmagic/AutoFP8.git
cd AutoFP8
pip install -e . ## Compile and install the AutoFP8 tool locally
python3 examples/quantize.py --model-id meta-llama/Meta-Llama-3.1-405B-Instruct --save-dir Meta-Llama-3.1-405B-Instruct-FP8 --activation-scheme dynamic --max-seq-len 2048 --num-samples 2048
The quantization script specifies the output path for the quantized model with “--save-dir”. Because the dynamic scheme quantizes only the weights offline (activation scales are computed at runtime), no calibration pass is needed and the overall quantization is fast.
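If you prefer to drive AutoFP8 from Python rather than the example script, a minimal sketch following the usage documented in the AutoFP8 README looks like this (class and argument names are taken from that README and may change between versions):

from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Meta-Llama-3.1-405B-Instruct"
save_dir = "Meta-Llama-3.1-405B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dynamic scheme: weights are quantized offline, activation scales at runtime,
# so no calibration samples are required.
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")

model = AutoFP8ForCausalLM.from_pretrained(model_id, quantize_config=quantize_config)
model.quantize([])              # empty calibration set is enough for the dynamic scheme
model.save_quantized(save_dir)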
You can also download the FP8 quantized version provided by Meta: https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8. The download command is as follows:
huggingface-cli download meta-llama/Meta-Llama-3.1-405B-Instruct-FP8
Similar to FP8 quantization, you can use the open-source AutoAWQ tool to perform INT4 quantization; the tool is available at https://github.com/casper-hansen/AutoAWQ. A sketch of the quantization steps is shown below.
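This minimal sketch follows the workflow documented in the AutoAWQ README (the output path and quant_config values are illustrative defaults; adjust them to your environment):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3.1-405B-Instruct"
quant_path = "Meta-Llama-3.1-405B-Instruct-AWQ-INT4"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ INT4 quantization and save the result.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)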
Building the Inference Framework
Backed by a strong open-source community, vLLM is an efficient inference framework that supports a wide range of LLMs and receives timely updates. Starting with v0.5.3.post1, vLLM supports inference for the Llama 3.1 models.
The steps to download the vLLM source code and compile it are as follows:
git clone https://github.com/vllm-project/vllm
cd vllm
git checkout v0.5.3.post1 ## Check out the v0.5.3.post1 release tag
pip install -e .
After the compilation, you can start the inference service.
In addition to compiling vLLM on the local machine, you can build vLLM as a docker image. The method is as follows:
git clone https://github.com/vllm-project/vllm
cd vllm
git checkout v0.5.3.post1 ## Check out the v0.5.3.post1 release tag
docker build -t vllm_0.5.3.p1 .
You can also download the vLLM image:
docker pull vllm/vllm-openai:v0.5.3.post1
Running the Inference Service
After building the vLLM inference framework, you can start vLLM and load the Llama 3.1-405B model to serve inference requests. If you compiled vLLM locally, switch to the root directory of the vLLM source code and run the following command to start the inference service:
cd vllm
python3 -m vllm.entrypoints.openai.api_server --port 18001 --model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --pipeline-parallel-size 1 --swap-space 16 --gpu-memory-utilization 0.99 --dtype auto --served-model-name llama31-405b-fp8 --max-num-seqs 32 --max-model-len 32768 --max-num-batched-tokens 32768 --max-seq-len-to-capture 32768
If you want to run the inference service in a Docker container, you can use the following command (you may also want to mount your local Hugging Face cache into the container, e.g. with -v ~/.cache/huggingface:/root/.cache/huggingface, so the model does not have to be downloaded again):
docker run -d --gpus all --privileged --ipc=host --net=host vllm/vllm-openai:v0.5.3.post1 --port 18001 --model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --pipeline-parallel-size 1 --swap-space 16 --gpu-memory-utilization 0.99 --dtype auto --served-model-name llama31-405b-fp8 --max-num-seqs 32 --max-model-len 32768 --max-num-batched-tokens 32768 --max-seq-len-to-capture 32768
Once the vLLM inference framework is running, it listens on port 18001 and processes incoming user requests. You can quickly verify the inference service by sending a POST request to the completions endpoint from a terminal on the server. The command is as follows:
curl -X POST -H "Content-Type: application/json" \
  http://localhost:18001/v1/completions \
  -d '{"model": "llama31-405b-fp8", "prompt": "San Francisco is a"}'
With this, the deployment of the Llama 3.1 model is now complete.
Novita AI is the All-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, GPU Instance — the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.