New in llama.cpp: Anthropic Messages API
llama.cpp server now supports the Anthropic Messages API, allowing you to use Claude-compatible clients with locally-running models.
Reminder: llama.cpp server is a lightweight, OpenAI-compatible HTTP server for running LLMs locally.
This feature was a popular request to enable tools like Claude Code and other Anthropic-compatible applications to work with local models. Thanks to noname22 for contributing this feature in PR #17570!
The implementation converts Anthropic's format to OpenAI internally, reusing the existing inference pipeline.
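The actual conversion lives in the server's C++ code; purely as an illustration, the field-level mapping is roughly what this Python sketch describes (the OpenAI-side names are the standard chat-completions fields, not the server's internal ones):

def anthropic_to_openai(req: dict) -> dict:
    """Rough illustration of the Anthropic -> OpenAI request mapping.
    Not the server's actual code, just the general shape of it."""
    messages = []
    if "system" in req:
        # Anthropic carries the system prompt as a top-level field;
        # OpenAI expects it as the first message.
        messages.append({"role": "system", "content": req["system"]})
    messages.extend(req["messages"])

    out = {
        "messages": messages,
        "max_tokens": req["max_tokens"],
        "stream": req.get("stream", False),
    }
    if "tools" in req:
        # Anthropic tool definitions become OpenAI-style function tools.
        out["tools"] = [{
            "type": "function",
            "function": {
                "name": t["name"],
                "description": t.get("description", ""),
                "parameters": t["input_schema"],
            },
        } for t in req["tools"]]
    return out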
Quick Start
If you're already running llama-server, just point your Anthropic client to the /v1/messages endpoint:
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "local-model",
"max_tokens": 1024,
"messages": [{"role": "user", "content": "Hello!"}]
}'
For tool use support, start the server with a GGUF model that supports tool calling:
llama-server -m model-with-tool-support.gguf
Using with Claude Code
To use llama-server as a backend for Claude Code, configure it to point to your local server:
# Start server with a capable model
llama-server -hf unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M
# Run Claude Code with local endpoint
ANTHROPIC_BASE_URL=http://127.0.0.1:8080 claude
For best results with agentic workloads, use specialized agentic coding models like Nemotron, Qwen3 Coder, Kimi K2, or MiniMax M2.
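Claude Code isn't the only option: any Anthropic SDK can be pointed at the local server the same way. A minimal sketch with the official Python SDK (pip install anthropic); the api_key value is a placeholder, since llama-server doesn't require one unless you configure it to:

from anthropic import Anthropic

# Point the official SDK at the local server; the key is a placeholder.
client = Anthropic(base_url="http://127.0.0.1:8080", api_key="unused")

response = client.messages.create(
    model="local-model",  # the name is mostly informational; the server runs whatever model it was started with
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.content[0].text)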
Features
- Full Messages API: POST /v1/messages for chat completions with streaming support
- Token counting: POST /v1/messages/count_tokens to count tokens without generating
- Tool use: Function calling with tool_use and tool_result content blocks
- Vision: Image inputs via base64 or URL (requires a multimodal model)
- Extended thinking: Support for reasoning models via the thinking parameter
- Streaming: Proper Anthropic SSE event types (message_start, content_block_delta, etc.)
Examples
Basic chat completion
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M",
"max_tokens": 1024,
"system": "You are a helpful coding assistant.",
"messages": [
{"role": "user", "content": "Write a Python function to check if a number is prime"}
]
}'
Streaming response
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M",
"max_tokens": 1024,
"stream": true,
"messages": [{"role": "user", "content": "Explain recursion"}]
}'
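If you're consuming the stream from code instead of curl, the Anthropic Python SDK's streaming helper already understands these SSE event types. A sketch assuming the same local setup as above:

from anthropic import Anthropic

client = Anthropic(base_url="http://127.0.0.1:8080", api_key="unused")

with client.messages.stream(
    model="local-model",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain recursion"}],
) as stream:
    for event in stream:
        # Text arrives in content_block_delta events carrying text_delta payloads;
        # message_start, content_block_start, etc. are handled by the SDK.
        if event.type == "content_block_delta" and event.delta.type == "text_delta":
            print(event.delta.text, end="", flush=True)
    print()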
Tool use
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M",
"max_tokens": 1024,
"tools": [{
"name": "get_weather",
"description": "Get current weather for a location",
"input_schema": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"}
},
"required": ["location"]
}
}],
"messages": [{"role": "user", "content": "What is the weather in Paris?"}]
}'
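The request above only defines the tool; when the model answers with a tool_use block, your client is expected to run the function and send the output back as a tool_result block. A sketch of that round trip with the Python SDK (the weather lookup is a stand-in you'd replace with real code):

from anthropic import Anthropic

client = Anthropic(base_url="http://127.0.0.1:8080", api_key="unused")

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a location",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string", "description": "City name"}},
        "required": ["location"],
    },
}]

messages = [{"role": "user", "content": "What is the weather in Paris?"}]
response = client.messages.create(model="local-model", max_tokens=1024,
                                  tools=tools, messages=messages)

if response.stop_reason == "tool_use":
    # Find the tool call, run the real function, and feed its output
    # back to the model as a tool_result content block.
    tool_use = next(b for b in response.content if b.type == "tool_use")
    result = f"Sunny, 22°C in {tool_use.input['location']}"  # stand-in for a real lookup
    messages += [
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use.id,
            "content": result,
        }]},
    ]
    final = client.messages.create(model="local-model", max_tokens=1024,
                                   tools=tools, messages=messages)
    print(final.content[0].text)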
Count tokens
curl http://localhost:8080/v1/messages/count_tokens \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M",
"messages": [{"role": "user", "content": "Hello world"}]
}'
Returns: {"input_tokens": 10}
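Vision (image input)
The Features list also mentions image inputs, which use the standard Anthropic content-block format. A sketch in Python, assuming llama-server is running a multimodal model and cat.png is a local file:

import base64
from anthropic import Anthropic

client = Anthropic(base_url="http://127.0.0.1:8080", api_key="unused")

# Placeholder image; requires a multimodal model loaded in llama-server.
with open("cat.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.messages.create(
    model="local-model",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }],
)
print(response.content[0].text)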
Join the Conversation
The implementation passes through to llama.cpp's existing inference pipeline, so you get all the performance benefits of quantized models running on your hardware.
Have questions or feedback? Drop a comment below.
