Deploying LLMs in Clusters #1: running “vLLM production-stack” on a cloud VM

By LMCache Team


TL;DR

  • vLLM boasts the largest open-source community in LLM serving, and “vLLM production-stack” offers a full vLLM-based inference stack with 10x better performance and easy cluster management.
  • Today, we will give a step-by-step demonstration on how to deploy a proof-of-concept “vLLM production-stack” in a cloud VM.
  • This is the beginning of our Deploying LLMs in Clusters series. We will be rolling out more blogs about serving LLMs on your own infrastructure over the next few weeks. Let us know which topic we should do next! [poll]
[Github Link] [More Tutorials] [Interest Form]

Tutorial Video (click below)

[Tutorial video]

The Context

vLLM has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on single-node deployments.

vLLM Production-stack is an open-source reference implementation of an inference stack built on top of vLLM, designed to run seamlessly on a cluster of GPU nodes. It adds four critical functionalities that complement vLLM’s native strengths.

[Figure: functionalities that production-stack adds on top of vLLM]


vLLM production-stack delivers better performance than other LLM serving solutions, achieving higher throughput through smart routing and KV cache sharing:

[Figure: throughput comparison with other LLM serving solutions]

Deploying a “production-stack” demo in the cloud

In this section, we will go through the general steps to set up the vLLM production-stack service in the cloud. If you prefer watching videos, please follow this tutorial video.

Step 1: Prepare the machine

In this example, we use a Lambda Labs instance with a single A40 GPU, but you can do the same thing on AWS EKS.
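
Before installing anything, it is worth confirming that the GPU is visible on the VM. The check below assumes the NVIDIA driver is already installed, which is typically the case on Lambda Labs GPU images:

# Confirm the NVIDIA driver is working and the A40 is visible
nvidia-smi

If the A40 shows up in the output, you are ready to install the cluster tooling.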

Install Kubernetes

  1. Clone the repository and navigate to the utils/ folder:

    git clone https://github.com/vllm-project/production-stack.git
    cd production-stack/utils
    
  2. Execute the script install-kubectl.sh:

    bash install-kubectl.sh
    

    This script downloads the latest version of kubectl, the Kubernetes command-line tool, and places it in your PATH for easy execution.

  3. Expected Output:
    • Verify the installation with:

      kubectl version --client
      

    Example output:

    Client Version: v1.32.1
    

Install Helm

  4. Execute the script install-helm.sh:

    bash install-helm.sh
    

    This script downloads and installs Helm and places the Helm binary in your PATH. Helm is a package manager for Kubernetes that simplifies the deployment process.

  5. Expected Output:
    • Verify the installation with:

      helm version
      

    Example output:

    version.BuildInfo{Version:"v3.17.0", GitCommit:"301108edc7ac2a8ba79e4ebf5701b0b6ce6a31e4", GitTreeState:"clean", GoVersion:"go1.23.4"}
    

Install Minikube

  6. Execute the script install-minikube-cluster.sh:

    bash install-minikube-cluster.sh
    

    This script installs Minikube, enables the NVIDIA Container Toolkit so the node can run GPU workloads, starts Minikube with GPU support, and deploys the NVIDIA GPU Operator (a quick verification is shown after the expected output below).

  7. Expected Output:

    😄  minikube v1.35.0 on Ubuntu 22.04 (kvm/amd64)
    ❗  minikube skips various validations when --force is supplied; this may lead to unexpected behavior
    ✨  Using the docker driver based on user configuration
    ......
    ......
    🏄  Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
    "nvidia" has been added to your repositories
    Hang tight while we grab the latest from your chart repositories...
    ......
    ......
    NAME: gpu-operator-1737507918
    LAST DEPLOYED: Wed Jan 22 01:05:21 2025
    NAMESPACE: gpu-operator
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    
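
Once the script finishes, you can optionally confirm that the GPU Operator pods are healthy and that the node now advertises a GPU resource. The resource name nvidia.com/gpu below is the one the NVIDIA GPU Operator registers by default:

# GPU Operator pods should be Running or Completed
sudo kubectl get pods -n gpu-operator

# The node should list an allocatable nvidia.com/gpu resource
sudo kubectl describe nodes | grep -i "nvidia.com/gpu"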

Step 2: Deploy the production stack with two vLLM instances and a router

Now that you have everything in place, it is time to deploy the production stack!

First, create a YAML file as shown below. Be sure to include your model name, model URL, replica count, and vLLM configuration in the file. Note that pvcStorage needs to be larger than the model size. Since the two replicas in this example share a single GPU (requestGPU: 0.5 each), --gpu-memory-utilization is kept at 0.4 so that both instances fit in GPU memory.

example.yaml

servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "opt125m"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "facebook/opt-125m"

    replicaCount: 2

    requestCPU: 6
    requestMemory: "16Gi"
    requestGPU: 0.5

    pvcStorage: "10Gi"
    pvcAccessMode:
      - ReadWriteMany

    vllmConfig:
      maxModelLen: 1024
      extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.4"]

And deploy this configuration using Helm:

helm repo add vllm https://vllm-project.github.io/production-stack
helm install mystack vllm/vllm-stack -f example.yaml
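
Before looking at the pods, you can optionally confirm that the Helm release was created. The sudo prefix matches how the rest of this tutorial runs cluster commands; drop it if your environment does not require it:

# The "mystack" release should appear with STATUS "deployed"
sudo helm list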

Monitor the deployment status using:

sudo kubectl get pods

Expected output:

  • Pods for the vLLM deployment should become Ready (1/1) and reach the Running state.
NAME                                               READY   STATUS    RESTARTS   AGE
mystack-deployment-router-85d4ffc696-xkg67         1/1     Running   0          2m38s
mystack-opt125m-deployment-vllm-858f4894fc-hfcgg   1/1     Running   0          2m38s
mystack-opt125m-deployment-vllm-858f4894fc-nt6sl   1/1     Running   0          2m38s

Note: It may take some time for the vLLM instances to become ready!
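
If a pod stays in Pending or keeps restarting, standard kubectl commands are enough to diagnose it. The pod name below is taken from the example output above; substitute the name from your own cluster:

# Show scheduling events and resource problems for a pod
sudo kubectl describe pod mystack-opt125m-deployment-vllm-858f4894fc-hfcgg

# Stream the vLLM logs while the model downloads and loads
sudo kubectl logs -f mystack-opt125m-deployment-vllm-858f4894fc-hfcgg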

Step 3: Send requests and test!

3.1: Forward the Service Port

Expose the mystack-router-service port to the host machine:

sudo kubectl port-forward svc/mystack-router-service 30080:80
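
Note that kubectl port-forward keeps running in the foreground. Either leave it open in a separate terminal, or background it with plain shell syntax (a convenience, not something specific to production-stack):

# Run the port-forward in the background and silence its output
sudo kubectl port-forward svc/mystack-router-service 30080:80 > /dev/null 2>&1 &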

3.2: Query the OpenAI-Compatible API to list the available models

Test the stack’s OpenAI-compatible API by querying the available models:

curl -o- http://localhost:30080/models

Expected output:

{
  "object": "list",
  "data": [
    {
      "id": "facebook/opt-125m",
      "object": "model",
      "created": 1737428424,
      "owned_by": "vllm",
      "root": null
    }
  ]
}

3.3: Query the OpenAI Completion Endpoint

Send a query to the OpenAI-compatible /completions endpoint to generate a completion for a prompt:

curl -X POST http://localhost:30080/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10
  }'

Expected output:

{
  "id": "completion-id",
  "object": "text_completion",
  "created": 1737428424,
  "model": "facebook/opt-125m",
  "choices": [
    {
      "text": " there was a brave knight who...",
      "index": 0,
      "finish_reason": "length"
    }
  ]
}
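
You can also request a streamed response by adding the standard OpenAI stream parameter, which vLLM supports; assuming the router forwards it unchanged, the completion comes back as a sequence of data: chunks instead of a single JSON object:

curl -X POST http://localhost:30080/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "Once upon a time,",
    "max_tokens": 10,
    "stream": true
  }'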

Step 4: Uninstall

To remove the deployment, run:

sudo helm uninstall mystack
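
If you also want to tear down the local Kubernetes cluster that install-minikube-cluster.sh created, including the GPU Operator, delete the Minikube profile as well. This removes everything running inside the cluster:

# Delete the Minikube cluster (drop sudo if you installed Minikube without it)
sudo minikube delete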

Conclusion

We have demonstrated how to set up vLLM production-stack on a single GPU VM.

This is the first episode of our Deploying LLMs in Clusters series. Stay tuned for multi-node deployment on Amazon EKS, serving multiple models with one cluster, an LLM router deep dive, and much more! Fill out this one-question poll to let us know which one we should do next!

Join us to build a future where every application can harness the power of LLM inference—reliably, at scale, and without breaking a sweat. Happy deploying!
