mostlygeek: llama 3.3 70B spec decoding + DRY
This configuration serves Llama 3.3 70B with speculative decoding and DRY sampling. It makes use of three GPUs:
- Llama 3.3 70B split across the two RTX 3090s
- Llama 3.2 3B on a P40 as the draft model
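Note that the CUDA device indexes used in the config below (--tensor-split, --device-draft CUDA2) follow llama.cpp's device ordering, which can differ from nvidia-smi's PCI ordering, as the two outputs further down show. A quick way to confirm the mapping on your own build (recent llama.cpp builds include a --list-devices flag):

$ /mnt/nvme/llama-server/llama-server-latest --list-devices
$ nvidia-smi --query-gpu=index,name --format=csv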
llama-server:
$ /mnt/nvme/llama-server/llama-server-latest --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 2: Tesla P40, compute capability 6.1, VMM: yes
Device 3: Tesla P40, compute capability 6.1, VMM: yes
version: 5438 (fb1cab20)
VRAM usage when fully loaded:
$ nvidia-smi --query-gpu=index,name,memory.used --format=csv | column -t -s,
index name memory.used [MiB]
0 Tesla P40 5841 MiB
1 Tesla P40 149 MiB
2 NVIDIA GeForce RTX 3090 23559 MiB
3 NVIDIA GeForce RTX 3090 23293 MiB
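To watch VRAM fill in while llama-swap is loading the pair of models, the same query can be re-run on an interval (a small sketch, assuming watch is available on the host):

$ watch -n 2 'nvidia-smi --query-gpu=index,name,memory.used --format=csv | column -t -s,'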
llama-swap configuration:
models:
  "llama-70B-dry-draft":
    cmd: |
      /mnt/nvme/llama-server/llama-server-latest
      --host 127.0.0.1 --port ${PORT} --flash-attn --metrics
      --ctx-size 32000
      --ctx-size-draft 32000

      # quantize the KV cache to make it fit
      --cache-type-k q8_0 --cache-type-v q8_0

      # all layers fit on the GPUs
      -ngl 99 -ngld 99
      --model /mnt/nvme/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf

      # split the main model evenly over the 3090s
      --tensor-split 1,1,0,0

      # draft settings, use the P40 for the draft model
      --draft-max 8 --draft-min 1 --draft-p-min 0.9 --device-draft CUDA2
      --model-draft /mnt/nvme/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf
      --dry-multiplier 0.8

      # disable SWA as this model is used mainly for conversation
      --swa-full
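With this entry in place, llama-swap starts the model pair on demand when a request arrives for "llama-70B-dry-draft". A minimal test request, assuming llama-swap is listening on its default port 8080 and proxying the OpenAI-compatible chat endpoint:

$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "llama-70B-dry-draft",
      "messages": [{"role": "user", "content": "Hello!"}]
    }'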