Optimizing reranker inference with vLLM
Rerankers sit in the critical path of every RAG pipeline. You retrieve 100 docs, rerank down to the top 10, then send those to the LLM. If your reranker takes 500ms, that's 500ms before generation even starts. At 10 requests per second, that adds up to 5 full seconds of reranking time every second.
We’ve seen labs train great reranker models but then serve them with basic FastAPI wrappers or accept whatever latency their provider gives them. The result? Either overpaying for GPUs or delivering slow pipelines to users.
We deployed a Qwen 4B reranker and got it to 93ms average latency at 64 concurrent requests on an H100. Here's how we did it.
Why We Use vLLM
vLLM recently added native reranker support with the /score API endpoint. This matters because:
- Built-in batching - No need to build your own queue management
- Better memory handling - Important for handling many requests at once
- Production-ready - Load balancing, health checks, OpenAI-compatible API
Before this, you’d write a FastAPI wrapper around HuggingFace transformers. That works for testing. Not for production.
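Once the server from the next section is running, a rerank call is a single POST to /score. Here's a minimal client sketch; the payload shape follows vLLM's score API (text_1 is the query, text_2 the candidate documents), and the host, port, and example strings are placeholders for your own deployment:

```python
# Minimal client call against vLLM's /score endpoint (sketch; host, port, and
# example strings are placeholders for your own deployment).
import requests

resp = requests.post(
    "http://localhost:80/score",
    json={
        "model": "Qwen/Qwen3-Reranker-4B",  # must match the --model flag you serve
        "text_1": "how do I rotate my API keys?",  # the query
        "text_2": [  # candidate documents to score against the query
            "Rotate API keys from the security settings page.",
            "Our offices are closed on public holidays.",
        ],
    },
    timeout=30,
)
resp.raise_for_status()

# Each item carries an index into text_2 and a relevance score.
for item in resp.json()["data"]:
    print(item["index"], item["score"])
```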
Dockerfile You Can Run
Here's the optimized Dockerfile you can use:
```dockerfile
FROM vllm/vllm-openai:latest

ENV HF_HUB_ENABLE_HF_TRANSFER=1
ENV HF_XET_HIGH_PERFORMANCE=1

RUN pip install hf-xet huggingface_hub sentence_transformers

EXPOSE 80

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
    "--model", "Qwen/Qwen3-Reranker-4B", \
    "--tensor-parallel-size", "1", \
    "--task", "score", \
    "--trust-remote-code", \
    "--dtype", "bfloat16", \
    "--gpu-memory-utilization", "0.95", \
    "--max-model-len", "8192", \
    "--block-size", "16", \
    "--max-num-seqs", "128", \
    "--max-num-batched-tokens", "16384", \
    "--port", "80", \
    "--disable-log-requests", \
    "--disable-log-stats"]
```
Below are the flags we used, what they do, and how we picked the optimized values:
--task "score"
Turns on reranker mode with the /score endpoint. This is what enables native reranker support in vLLM.
--max-num-seqs 128
The maximum number of sequences processed in one scheduling step. We picked 128 because reranker inputs are shorter than typical generation prompts, so you can keep more requests in flight. Higher values improve throughput until you run out of KV-cache memory or saturate the GPU.
--max-num-batched-tokens 16384
The token budget per scheduling step. With an 8K max model length, 16,384 tokens fits two full-length query-document pairs at once, or many shorter ones. We set this based on typical query-document pair sizes.
--gpu-memory-utilization 0.95
Lets vLLM use 95% of GPU memory for weights and KV cache. We can be aggressive here because inference doesn't need the optimizer-state and activation headroom that training does, and every extra gigabyte of KV cache buys throughput.
--block-size 16
The size of vLLM's paged KV-cache blocks, in tokens. Smaller blocks waste less memory when input lengths vary, which is exactly the case for query-document pairs. We picked 16 after testing different values on typical reranking workloads.
The rest (dtype, tensor parallelism) are standard vLLM settings that work well for most deployments.
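To make the batching math concrete, here's a quick back-of-the-envelope check of how those scheduler limits interact. The per-pair token counts are illustrative assumptions, not measurements from our workload:

```python
# How the scheduler limits above interact (illustrative numbers only).
MAX_NUM_SEQS = 128            # --max-num-seqs
MAX_BATCHED_TOKENS = 16_384   # --max-num-batched-tokens
MAX_MODEL_LEN = 8_192         # --max-model-len

# Assumed size of a typical (query + document) pair, in tokens.
TYPICAL_PAIR_TOKENS = 32 + 480

# Worst case: every pair fills the whole context window.
full_length_pairs = MAX_BATCHED_TOKENS // MAX_MODEL_LEN                        # 2
# Typical case: short pairs, capped by the sequence limit.
typical_pairs = min(MAX_NUM_SEQS, MAX_BATCHED_TOKENS // TYPICAL_PAIR_TOKENS)   # 32

print(f"full-length pairs per step: {full_length_pairs}")
print(f"typical pairs per step:     {typical_pairs}")
```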
How We Tested
How you measure matters.
We tested with requests coming in as pairs: (Q1, Doc1), (Q1, Doc2), (Q1, Doc3)… (Q1, Doc64). Not as one batch like (Q1, [Doc1…Doc64]).
Why? This is how it works in real use. Your search system returns docs one by one. The reranker processes them as they come in, not all at once.
64 concurrent requests = normal production load for a busy RAG pipeline.
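A stripped-down version of that load pattern looks like the sketch below. It is not our exact benchmark harness, just 64 independent (query, document) pair requests fired concurrently at /score, with httpx assumed as the client and placeholder documents standing in for real retrieval results:

```python
import asyncio
import time

import httpx

URL = "http://localhost:80/score"       # matches the port exposed in the Dockerfile
MODEL = "Qwen/Qwen3-Reranker-4B"
CONCURRENCY = 64

async def score_pair(client: httpx.AsyncClient, query: str, doc: str) -> float:
    """Send one (query, document) pair and return end-to-end latency in ms."""
    start = time.perf_counter()
    resp = await client.post(
        URL,
        json={"model": MODEL, "text_1": query, "text_2": doc},
        timeout=30.0,
    )
    resp.raise_for_status()
    return (time.perf_counter() - start) * 1000

async def main() -> None:
    query = "how do I rotate my API keys?"
    docs = [f"placeholder candidate document {i}" for i in range(CONCURRENCY)]
    async with httpx.AsyncClient() as client:
        latencies = await asyncio.gather(*(score_pair(client, query, d) for d in docs))
    print(f"avg latency: {sum(latencies) / len(latencies):.1f} ms")

if __name__ == "__main__":
    asyncio.run(main())
```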
Results
| GPU | Concurrency | Avg Latency |
|---|---|---|
| H100 | 64 | 93ms |
| L40S | 64 | 130ms |
What this means:
At 93ms per pair with 64 pairs in flight, one GPU scores roughly 690 query-document pairs per second, which works out to about 10 full 64-document rerank operations per second while keeping per-pair latency under 100ms. For most RAG applications, that's fast enough that users don't notice the reranking step.
Compare this to basic FastAPI setups: 300-500ms latency with random spikes under load. Or Modal’s setup: 300ms minimum plus 7-second cold starts.
Bonus: Embeddings at Scale
While working on rerankers, we also benchmarked embedding workloads using TEI (Text Embeddings Inference) with a Qwen 4B embedding model:
| GPU | Concurrency | Throughput | GPU hrs for 10T tokens |
|---|---|---|---|
| L40S | 64 | 20k tokens/sec | 139K hours |
| H100 | 64 | 46k tokens/sec | 60.4K hours |
This matters if you’re processing embeddings for large amounts of text. At 46k tokens/sec, you can process 100B tokens in ~600 GPU hours.
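For reference, the GPU-hour figures are just the throughput numbers converted over a fixed token budget:

```python
# Convert sustained embedding throughput into GPU hours for a fixed token budget.
def gpu_hours(total_tokens: float, tokens_per_sec: float) -> float:
    return total_tokens / tokens_per_sec / 3600

print(f"H100, 10T tokens:  {gpu_hours(10e12, 46_000):,.0f} hours")  # ~60.4K
print(f"L40S, 10T tokens:  {gpu_hours(10e12, 20_000):,.0f} hours")  # ~139K
print(f"H100, 100B tokens: {gpu_hours(100e9, 46_000):,.0f} hours")  # ~604
```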
When to Use What
- vLLM for rerankers: production serving, need <100ms latency, handling many requests at once
- TEI for embeddings: processing large amounts of text, where throughput matters more than latency
- FastAPI + HF Transformers: testing things out, <1,000 requests/day, when simple is better
The Bottom Line
Reranker latency is pure wait time in RAG pipelines. Every millisecond you save makes the user experience better.
vLLM’s native reranker support makes it easy to deploy optimized inference without building your own batching system. The setup we shared gets you to sub-100ms latency on GPUs you can actually get.
If you’re serving rerankers in production and haven’t optimized your setup, you’re either overpaying for compute or giving users a slow experience. Neither makes sense.