New in llama.cpp: Anthropic Messages API

Community Article · Published January 19, 2026

llama.cpp server now supports the Anthropic Messages API, allowing you to use Claude-compatible clients with locally-running models.

Reminder: llama.cpp server is a lightweight, OpenAI-compatible HTTP server for running LLMs locally.

This feature was a popular request to enable tools like Claude Code and other Anthropic-compatible applications to work with local models. Thanks to noname22 for contributing this feature in PR #17570!

The implementation converts Anthropic's format to OpenAI internally, reusing the existing inference pipeline.
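
To make the conversion concrete, here is a rough illustrative sketch (not the exact internal representation): an Anthropic-style request such as

{
  "model": "local-model",
  "max_tokens": 256,
  "system": "You are concise.",
  "messages": [{"role": "user", "content": "Hi"}]
}

is served as if you had sent the equivalent OpenAI-style chat completion, with the system prompt folded into the message list:

{
  "model": "local-model",
  "max_tokens": 256,
  "messages": [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Hi"}
  ]
}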

Quick Start

If you're already running llama-server, just point your Anthropic client to the /v1/messages endpoint:

curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
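
The response comes back in the Anthropic message shape, roughly like this (abbreviated sketch; ids, text, and token counts will differ):

{
  "id": "msg_...",
  "type": "message",
  "role": "assistant",
  "content": [{"type": "text", "text": "Hello! How can I help you today?"}],
  "stop_reason": "end_turn",
  "usage": {"input_tokens": 9, "output_tokens": 12}
}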

For tool use, start the server with a GGUF model that supports tool calling:

llama-server -m model-with-tool-support.gguf
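
If tool calls come back as plain text instead of tool_use blocks, check whether your build needs the model's Jinja chat template enabled explicitly (newer builds may do this by default):

llama-server -m model-with-tool-support.gguf --jinja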

Using with Claude Code

To use llama-server as a backend for Claude Code, configure it to point to your local server:

# Start server with a capable model
llama-server -hf unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M

# Run Claude Code with local endpoint
ANTHROPIC_BASE_URL=http://127.0.0.1:8080 claude
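
If Claude Code still asks you to log in, some users (see the comments below) report also needing a dummy auth token and an explicit model name. Something along these lines, where the exact values are placeholders:

ANTHROPIC_BASE_URL=http://127.0.0.1:8080 \
ANTHROPIC_AUTH_TOKEN=dummy \
ANTHROPIC_MODEL=local-model \
claude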


For best results with agentic workloads, use specialized agentic coding models like Nemotron, Qwen3 Coder, Kimi K2, or MiniMax M2.

Features

  1. Full Messages API: POST /v1/messages for chat completions with streaming support
  2. Token counting: POST /v1/messages/count_tokens to count tokens without generating
  3. Tool use: Function calling with tool_use and tool_result content blocks
  4. Vision: Image inputs via base64 or URL (requires multimodal model)
  5. Extended thinking: Support for reasoning models via the thinking parameter (see the sketch after this list)
  6. Streaming: Proper Anthropic SSE event types (message_start, content_block_delta, etc.)
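
For example, extended thinking uses the Anthropic-style thinking parameter. A minimal sketch, assuming a reasoning-capable model is loaded (whether and how the token budget is honored depends on the model and its chat template):

curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "max_tokens": 2048,
    "thinking": {"type": "enabled", "budget_tokens": 1024},
    "messages": [{"role": "user", "content": "How many primes are below 30?"}]
  }'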

Examples

Basic chat completion

curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M",
    "max_tokens": 1024,
    "system": "You are a helpful coding assistant.",
    "messages": [
      {"role": "user", "content": "Write a Python function to check if a number is prime"}
    ]
  }'

Streaming response

curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M",
    "max_tokens": 1024,
    "stream": true,
    "messages": [{"role": "user", "content": "Explain recursion"}]
  }'
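
The stream is delivered as Anthropic-style server-sent events. An abbreviated sketch of the event sequence (payload fields trimmed; exact contents will differ):

event: message_start
data: {"type": "message_start", "message": {...}}

event: content_block_start
data: {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Recursion is"}}

event: content_block_stop
data: {"type": "content_block_stop", "index": 0}

event: message_delta
data: {"type": "message_delta", "delta": {"stop_reason": "end_turn"}}

event: message_stop
data: {"type": "message_stop"}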

Tool use

curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M",
    "max_tokens": 1024,
    "tools": [{
      "name": "get_weather",
      "description": "Get current weather for a location",
      "input_schema": {
        "type": "object",
        "properties": {
          "location": {"type": "string", "description": "City name"}
        },
        "required": ["location"]
      }
    }],
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}]
  }'
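
When the model answers with a tool_use block, run the tool yourself and send the result back as a tool_result block in a follow-up user message. A sketch of the second request (the tool_use id here is made up; reuse the id the server returned):

curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M",
    "max_tokens": 1024,
    "tools": [{
      "name": "get_weather",
      "description": "Get current weather for a location",
      "input_schema": {
        "type": "object",
        "properties": {
          "location": {"type": "string", "description": "City name"}
        },
        "required": ["location"]
      }
    }],
    "messages": [
      {"role": "user", "content": "What is the weather in Paris?"},
      {"role": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_local_1", "name": "get_weather", "input": {"location": "Paris"}}
      ]},
      {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_local_1", "content": "18°C and cloudy"}
      ]}
    ]
  }'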

Count tokens

curl http://localhost:8080/v1/messages/count_tokens \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF:Q4_K_M",
    "messages": [{"role": "user", "content": "Hello world"}]
  }'

Returns: {"input_tokens": 10}
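
Image input

Requires a model served with multimodal support (for example, llama-server started with a matching --mmproj projector file). A minimal sketch using an Anthropic-style base64 image block; the data value is a placeholder:

curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-multimodal-model",
    "max_tokens": 512,
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": "<base64-encoded image>"}},
        {"type": "text", "text": "What is in this image?"}
      ]
    }]
  }'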

Join the Conversation

The implementation passes through to llama.cpp's existing inference pipeline, so you get all the performance benefits of quantized models running on your hardware.

Have questions or feedback? Drop a comment below.

Community

Ahh finally! No more claude router needed! On my way to run GLM4.7-flash in claude code running on llama.cpp!

gj 👏

we need the latest llama.cpp for this?

setting ANTHROPIC_BASE_URL doesn't work. claude still asks for a login on startup. Also not with a /v1 suffix on the URL.


ANTHROPIC_AUTH_TOKEN=<value> and ANTHROPIC_MODEL=<model-name> need to be set.
