
llama 3.3 70B spec decoding + DRY


This configuration serves llama 3.3 70B with speculative decoding and DRY sampling. It uses three of the machine's four GPUs (see the launch sketch after the list):

  • llama 3.3 70B, the main model, split across the two 3090s
  • llama 3.2 3B, the draft model, running on one of the P40s
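
The models entry further down goes into llama-swap's YAML config; llama-swap starts this llama-server process on demand when a request names the model, and swaps it out again when a different model is requested. A minimal launch sketch, assuming the --config and --listen flags described in the llama-swap README (confirm with llama-swap --help for your build; the config path is a placeholder):

$ llama-swap --config /path/to/config.yaml --listen 127.0.0.1:8080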

llama-server:

$ /mnt/nvme/llama-server/llama-server-latest --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 2: Tesla P40, compute capability 6.1, VMM: yes
  Device 3: Tesla P40, compute capability 6.1, VMM: yes
version: 5438 (fb1cab20)
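
Note the device order: CUDA enumerates GPUs fastest-first by default, so the 3090s come up as devices 0 and 1 and the P40s as 2 and 3. The --tensor-split 1,1,0,0 and --device-draft CUDA2 values in the config below depend on this ordering. Recent llama.cpp builds can print the same enumeration without loading a model (the flag should be available in this build, but check --help):

$ /mnt/nvme/llama-server/llama-server-latest --list-devices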

VRAM usage when fully loaded

$ nvidia-smi --query-gpu=index,name,memory.used --format=csv | column -t -s,
index   name                      memory.used [MiB]
0       Tesla P40                 5841 MiB
1       Tesla P40                 149 MiB
2       NVIDIA GeForce RTX 3090   23559 MiB
3       NVIDIA GeForce RTX 3090   23293 MiB
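
nvidia-smi orders GPUs by PCI bus ID rather than CUDA's fastest-first order, which is why the P40 holding the draft model appears as index 0 here. The totals also line up with a rough estimate: the Q4_K_M 70B GGUF is roughly 42 GB of weights split across the two 3090s, and the q8_0 KV cache for a 32000-token context adds about 5 GiB. A back-of-the-envelope check, assuming Llama 3.3 70B's published shape (80 layers, 8 KV heads, head dimension 128) and ~8.5 bits per value for q8_0:

$ echo '32000 * 2 * 80 * 8 * 128 * 8.5 / 8 / 1024^3' | bc -l   # ≈ 5.2 GiB

The rest of the ~46 GiB in use on the 3090s is compute buffers and other overhead; the draft model, its cache and buffers account for the ~5.8 GiB on the first P40.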

llama-swap configuration:

models:
  "llama-70B-dry-draft":
    cmd: |
      /mnt/nvme/llama-server/llama-server-latest
      --host 127.0.0.1 --port ${PORT} --flash-attn --metrics
      --ctx-size 32000
      --ctx-size-draft 32000
      
      # quantize the KV cache so the 32k context fits alongside the weights
      --cache-type-k q8_0 --cache-type-v q8_0

      # all layers fit on the GPUs 
      -ngl 99 -ngld 99
      --model /mnt/nvme/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf

      # split the main model evenly across the two 3090s (devices 0 and 1)
      --tensor-split 1,1,0,0

      # draft model settings; CUDA2 is the first P40 in llama-server's device order
      --draft-max 8 --draft-min 1 --draft-p-min 0.9 --device-draft CUDA2
      --model-draft /mnt/nvme/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf

      # DRY sampling to reduce repetition
      --dry-multiplier 0.8

      # use a full-size SWA cache (rather than the reduced sliding-window cache),
      # since this model is used mainly for multi-turn conversation
      --swa-full
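
Once llama-swap is running, the model is addressed by its config key through the OpenAI-compatible endpoints; llama-swap loads (or swaps in) this llama-server instance on the first request that names it. For example, assuming llama-swap is listening on 127.0.0.1:8080 as in the launch sketch above:

$ curl http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
    "model": "llama-70B-dry-draft",
    "messages": [{"role": "user", "content": "Write a haiku about draft models."}]
  }'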