Benson Wong edited this page May 29, 2025 · 5 revisions

Configuration for supporting the v1/rerank endpoint with llama-server and the BGE reranker v2 model.

Config

models:
  "reranker":
    env:
      - "CUDA_VISIBLE_DEVICES=GPU-eb1"
    cmd: |
      /path/to/llama-server/llama-server-latest
      --port ${PORT}
      -ngl 99
      -m /path/to/models/bge-reranker-v2-m3-Q4_K_M.gguf
      --ctx-size 8192
      --reranking
      --no-mmap

Tip

Placeholder paths (/path/to/...) are used in this example; substitute the actual locations of your llama-server binary and the model file.
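llama-swap exposes an OpenAI-compatible /v1/models listing, which is a quick way to confirm the "reranker" entry was picked up before exercising the endpoint itself. In practice you would run `curl -s http://10.0.1.50:8080/v1/models | jq -r '.data[].id'` against your server (host and port taken from the testing example below); the snippet embeds a sample response so the jq filter can be tried offline — the response shape is an assumption based on the OpenAI models API, not captured from llama-swap.

```shell
# Extract the configured model IDs from a /v1/models response.
# The sample response below is illustrative; pipe real curl
# output into the same jq filter instead.
cat <<'EOF' | jq -r '.data[].id'
{"object": "list", "data": [{"id": "reranker", "object": "model"}]}
EOF
```

If "reranker" does not appear in the list, the config file was not loaded or the model name does not match.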

Testing

$ curl -s http://10.0.1.50:8080/v1/rerank \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "reranker",
    "query": "What is the best way to learn Python?",
    "documents": [
      "Python is a popular programming language used for web development and data analysis.",
      "The best way to learn Python is through online courses and practice.",
      "Python is also used for artificial intelligence and machine learning applications.",
      "To learn Python, start with the basics and build small projects to gain experience."
    ],
    "max_reranked": 2
  }' | jq .

Output

{
  "model": "reranker",
  "object": "list",
  "usage": {
    "prompt_tokens": 110,
    "total_tokens": 110
  },
  "results": [
    {
      "index": 0,
      "relevance_score": -2.9403347969055176
    },
    {
      "index": 1,
      "relevance_score": 7.181779861450195
    },
    {
      "index": 2,
      "relevance_score": -4.595512866973877
    },
    {
      "index": 3,
      "relevance_score": 3.0560922622680664
    }
  ]
}