New in llama.cpp: Anthropic Messages API
llama.cpp server now supports the Anthropic Messages API, allowing you to use Claude-compatible clients with locally-running models.
Reminder: llama.cpp server is a lightweight, OpenAI-compatible HTTP server for running LLMs locally.
This feature was a popular request to enable tools like Claude Code and other Anthropic-compatible applications to work with local models. Thanks to noname22 for contributing this feature in PR #17570!
The implementation converts Anthropic's format to OpenAI internally, reusing the existing inference pipeline.
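The actual conversion lives in the server's C++ code; purely as an illustration, the field-level mapping is roughly what this Python sketch describes (the OpenAI-side names are the standard chat-completions fields, not the server's internal ones):

def anthropic_to_openai(req: dict) -> dict:
    """Rough illustration of the Anthropic -> OpenAI request mapping.
    Not the server's actual code, just the general shape of it."""
    messages = []
    if "system" in req:
        # Anthropic carries the system prompt as a top-level field;
        # OpenAI expects it as the first message.
        messages.append({"role": "system", "content": req["system"]})
    messages.extend(req["messages"])

    out = {
        "messages": messages,
        "max_tokens": req["max_tokens"],
        "stream": req.get("stream", False),
    }
    if "tools" in req:
        # Anthropic tool definitions become OpenAI-style function tools.
        out["tools"] = [{
            "type": "function",
            "function": {
                "name": t["name"],
                "description": t.get("description", ""),
                "parameters": t["input_schema"],
            },
        } for t in req["tools"]]
    return out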
Quick Start
If you're already running llama-server, just point your Anthropic client to the /v1/messages endpoint:
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "local-model",
"max_tokens": 1024,
"messages": [{"role": "user", "content": "Hello!"}]
}'
For tool use support, start the server with a GGUF model that supports tool calling:
llama-server -m model-with-tool-support.gguf
Using with Claude Code
To use llama-server as a backend for Claude Code, configure it to point to your local server:
# Start server with a capable model
llama-server -hf unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M
# Run Claude Code with local endpoint
ANTHROPIC_BASE_URL=http://127.0.0.1:8080 claude
For best results with agentic workloads, use specialized agentic coding models like Nemotron, Qwen3 Coder, Kimi K2, or MiniMax M2.
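Claude Code isn't the only option: any Anthropic SDK can be pointed at the local server the same way. A minimal sketch with the official Python SDK (pip install anthropic); the api_key value is a placeholder, since llama-server doesn't require one unless you configure it to:

from anthropic import Anthropic

# Point the official SDK at the local server; the key is a placeholder.
client = Anthropic(base_url="http://127.0.0.1:8080", api_key="unused")

response = client.messages.create(
    model="local-model",  # the name is mostly informational; the server runs whatever model it was started with
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.content[0].text)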
Features
- Full Messages API: POST /v1/messages for chat completions with streaming support
- Token counting: POST /v1/messages/count_tokens to count tokens without generating
- Tool use: Function calling with tool_use and tool_result content blocks
- Vision: Image inputs via base64 or URL (requires a multimodal model)
- Extended thinking: Support for reasoning models via the thinking parameter
- Streaming: Proper Anthropic SSE event types (message_start, content_block_delta, etc.)
Examples
Basic chat completion
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M",
"max_tokens": 1024,
"system": "You are a helpful coding assistant.",
"messages": [
{"role": "user", "content": "Write a Python function to check if a number is prime"}
]
}'
Streaming response
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M",
"max_tokens": 1024,
"stream": true,
"messages": [{"role": "user", "content": "Explain recursion"}]
}'
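If you're consuming the stream from code instead of curl, the Anthropic Python SDK's streaming helper already understands these SSE event types. A sketch assuming the same local setup as above:

from anthropic import Anthropic

client = Anthropic(base_url="http://127.0.0.1:8080", api_key="unused")

with client.messages.stream(
    model="local-model",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain recursion"}],
) as stream:
    for event in stream:
        # Text arrives in content_block_delta events carrying text_delta payloads;
        # message_start, content_block_start, etc. are handled by the SDK.
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            print(event.delta.text, end="", flush=True)
    print()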
Tool use
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M",
"max_tokens": 1024,
"tools": [{
"name": "get_weather",
"description": "Get current weather for a location",
"input_schema": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}],
"messages": [{"role": "user", "content": "What is the weather in Paris?"}]
}'
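The request above only defines the tool; when the model answers with a tool_use block, your client is expected to run the function and send the output back as a tool_result block. A sketch of that round trip with the Python SDK (the weather lookup is a stand-in you'd replace with real code):

from anthropic import Anthropic

client = Anthropic(base_url="http://127.0.0.1:8080", api_key="unused")

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a location",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "City name"}},
        "required": ["location"],
    },
}]

messages = [{"role": "user", "content": "What is the weather in Paris?"}]
response = client.messages.create(model="local-model", max_tokens=1024,
                                  tools=tools, messages=messages)

if response.stop_reason == "tool_use":
    # Find the tool call, run the real function, and feed its output
    # back to the model as a tool_result content block.
    tool_use = next(b for b in response.content if b.type == "tool_use")
    result = f"Sunny, 22°C in {tool_use.input['location']}"  # stand-in for a real lookup
    messages += [
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use.id,
            "content": result,
        }]},
    ]
    final = client.messages.create(model="local-model", max_tokens=1024,
                                   tools=tools, messages=messages)
    print(final.content[0].text)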
Count tokens
curl http://localhost:8080/v1/messages/count_tokens \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M",
"messages": [{"role": "user", "content": "Hello world"}]
}'
Returns: {"input_tokens": 10}
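Vision (image input)
The Features list also mentions image inputs, which use the standard Anthropic content-block format. A sketch in Python, assuming llama-server is running a multimodal model and cat.png is a local file:

import base64
from anthropic import Anthropic

client = Anthropic(base_url="http://127.0.0.1:8080", api_key="unused")

# Placeholder image; requires a multimodal model loaded in llama-server.
with open("cat.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.messages.create(
    model="local-model",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }],
)
print(response.content[0].text)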
Join the Conversation
The implementation passes through to llama.cpp's existing inference pipeline, so you get all the performance benefits of quantized models running on your hardware.
Have questions or feedback? Drop a comment below.
