
llama 3.3 70B spec decoding + DRY


This configuration serves llama 3.3 70B with speculative decoding and DRY sampling. It uses three of the machine's four GPUs (see the launch sketch after the list):

  • llama 3.3 70B, the main model, split across the two 3090s
  • llama 3.2 3B, the draft model, running on one of the P40s
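
The models entry further down goes into llama-swap's YAML config; llama-swap starts this llama-server process on demand when a request names the model, and swaps it out again when a different model is requested. A minimal launch sketch, assuming the --config and --listen flags described in the llama-swap README (confirm with llama-swap --help for your build; the config path is a placeholder):

$ llama-swap --config /path/to/config.yaml --listen 127.0.0.1:8080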

llama-server:

$ /mnt/nvme/llama-server/llama-server-latest --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
version: 5438 (fb1cab20)
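
Note the device order: CUDA enumerates GPUs fastest-first by default, so the 3090s come up as devices 0 and 1 and the P40s as 2 and 3. The --tensor-split 1,1,0,0 and --device-draft CUDA2 values in the config below depend on this ordering. Recent llama.cpp builds can print the same enumeration without loading a model (the flag should be available in this build, but check --help):

$ /mnt/nvme/llama-server/llama-server-latest --list-devices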

VRAM usage when fully loaded

$ nvidia-smi --query-gpu=index,name,memory.used --format=csv | column -t -s,
index   name                      memory.used [MiB]
0       Tesla P40                 5841 MiB
1       Tesla P40                 149 MiB
2       NVIDIA GeForce RTX 3090   23559 MiB
3       NVIDIA GeForce RTX 3090   23293 MiB
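
nvidia-smi orders GPUs by PCI bus ID rather than CUDA's fastest-first order, which is why the P40 holding the draft model appears as index 0 here. The totals also line up with a rough estimate: the Q4_K_M 70B GGUF is roughly 42 GB of weights split across the two 3090s, and the q8_0 KV cache for a 32000-token context adds about 5 GiB. A back-of-the-envelope check, assuming Llama 3.3 70B's published shape (80 layers, 8 KV heads, head dimension 128) and ~8.5 bits per value for q8_0:

$ echo '32000 * 2 * 80 * 8 * 128 * 8.5 / 8 / 1024^3' | bc -l   # ≈ 5.2 GiB

The rest of the ~46 GiB in use on the 3090s is compute buffers and other overhead; the draft model, its cache and buffers account for the ~5.8 GiB on the first P40.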

llama-swap configuration:

models:
  "llama-70B-dry-draft":
    cmd: |
      /mnt/nvme/llama-server/llama-server-latest
      --host 127.0.0.1 --port ${PORT} --flash-attn --metrics
      --ctx-size 32000
      --ctx-size-draft 32000
      
      # quantize the KV cache so the 32k context fits alongside the weights
      --cache-type-k q8_0 --cache-type-v q8_0

      # all layers fit on the GPUs 
      -ngl 99 -ngld 99
      --model /mnt/nvme/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf

      # split the main model evenly across the two 3090s (devices 0 and 1)
      --tensor-split 1,1,0,0

      # draft model settings; CUDA2 is the first P40 in llama-server's device order
      --draft-max 8 --draft-min 1 --draft-p-min 0.9 --device-draft CUDA2
      --model-draft /mnt/nvme/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf

      # DRY sampling to reduce repetition
      --dry-multiplier 0.8

      # use a full-size SWA cache (rather than the reduced sliding-window cache),
      # since this model is used mainly for multi-turn conversation
      --swa-full
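
Once llama-swap is running, the model is addressed by its config key through the OpenAI-compatible endpoints; llama-swap loads (or swaps in) this llama-server instance on the first request that names it. For example, assuming llama-swap is listening on 127.0.0.1:8080 as in the launch sketch above:

$ curl http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
    "model": "llama-70B-dry-draft",
    "messages": [{"role": "user", "content": "Write a haiku about draft models."}]
  }'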