File size: 2,975 Bytes
173f6ff
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
---
license: apache-2.0
base_model:
  - openbmb/BitCPM-CANN-8B
tags:
  - core-ai
  - coreai
  - on-device
  - iphone
  - apple-silicon
  - ternary
  - bitnet
  - 1.58-bit
  - minicpm
language:
  - en
  - zh
pipeline_tag: text-generation
library_name: coreai
---

# BitCPM-8B β†’ Apple Core AI (1.58-bit ternary, on-device)

The zoo's **first 1.58-bit ternary LLM** and **first sub-int8 packed-GEMM Metal kernel**, running
fully on-device on **iPhone** (and Mac) through Apple **Core AI** (`.aimodel` / `.aimodelc`, iOS 27 /
macOS 27).

An 8B model whose transformer weights are just **{-1, 0, +1}** β€” so it generates at a **3–4B-class
footprint and speed** while keeping 8B-class quality.

- **Base:** [openbmb/BitCPM-CANN-8B](https://huggingface.co/openbmb/BitCPM-CANN-8B) (MiniCPM4-8B
  architecture, quantization-aware trained to ternary), Apache-2.0.
- **Zoo + conversion code:** https://github.com/john-rocky/coreai-model-zoo

## On-device (iPhone 17 Pro, A19 Pro β€” CoreAIChat pipelined GPU engine, greedy)

| bundle | decode | prefill | resident | load |
|---|---:|---:|---:|---:|
| **`gpu-pipelined/` (AOT h18p)** | **17 tok/s** | **13 tok/s** | **~2.1 GB** | 9 s cold |

Headroom ~4.3 GB, no jetsam. An int4 8B would need ~5–6 GB resident β€” the ternary weight stream is
the win. On **M4 Max** the same graph decodes **62.7 tok/s** and is **token-identical** to the torch
ternary reference (3/3 probe prompts, greedy).

## What's ternary here

`BitCPM-CANN-8B` ships its ternary weights as **TQ2_0** (2 bits/weight): per 256-element block along
the reduction axis, each weight is a code in {-1, 0, +1} times one fp16 scale. The **224 transformer
linears** (q/k/v/o + gate/up/down Γ— 32 layers) run a custom Metal kernel that packs **16 ternary
codes into one uint32** and does a sign-add/subtract matvec with the per-block scale β€” **no codebook**.
The embedding (Q4_K) and LM head (Q6_K) stay higher-precision. Quality retained vs full precision:
**95.7–97.2%** (OpenBMB).

## Run

In the zoo's **CoreAIChat** app (Model β†’ "BitCPM-8B 1.58bit"), or via Foundation Models:

```swift
import FoundationModels
import CoreAILanguageModels
let model = try await CoreAILanguageModel(resourcesAt: bundleURL)   // gpu-pipelined/ AOT h18p
let session = LanguageModelSession(model: model)
print(try await session.respond(to: "The capital of France is"))   // -> "Paris."
```

Decode-only static bundle: set `COREAI_CHUNK_THRESHOLD=1` (prefill runs as pipelined S=1 steps). Chat
is ChatML; eos `<|im_end|>` (73440). The `gpu-pipelined/` bundle is **AOT-compiled** for the h18p GPU
(`xcrun coreai-build compile … --preferred-compute gpu --architecture h18p`) β€” a custom-Metal-kernel
graph survives AOT (outputs bit-identical to the source IR). ANE is not supported (GPU-only kernel).

## License

Apache-2.0, inherited from [openbmb/BitCPM-CANN-8B](https://huggingface.co/openbmb/BitCPM-CANN-8B).
This repository redistributes a converted Core AI artifact only.