---
license: apache-2.0
base_model:
- openbmb/BitCPM-CANN-8B
tags:
- core-ai
- coreai
- on-device
- iphone
- apple-silicon
- ternary
- bitnet
- 1.58-bit
- minicpm
language:
- en
- zh
pipeline_tag: text-generation
library_name: coreai
---
# BitCPM-8B: Apple Core AI (1.58-bit ternary, on-device)
The zoo's **first 1.58-bit ternary LLM** and **first sub-int8 packed-GEMM Metal kernel**, running
fully on-device on **iPhone** (and Mac) through Apple **Core AI** (`.aimodel` / `.aimodelc`, iOS 27 /
macOS 27).
An 8B model whose transformer weights are just **{-1, 0, +1}**, so it generates at a **3–4B-class
footprint and speed** while keeping 8B-class quality.
- **Base:** [openbmb/BitCPM-CANN-8B](https://huggingface.co/openbmb/BitCPM-CANN-8B) (MiniCPM4-8B
architecture, quantization-aware trained to ternary), Apache-2.0.
- **Zoo + conversion code:** https://github.com/john-rocky/coreai-model-zoo
## On-device (iPhone 17 Pro, A19 Pro; CoreAIChat pipelined GPU engine, greedy)
| bundle | decode | prefill | resident | load |
|---|---:|---:|---:|---:|
| **`gpu-pipelined/` (AOT h18p)** | **17 tok/s** | **13 tok/s** | **~2.1 GB** | 9 s cold |
Headroom ~4.3 GB, no jetsam. An int4 8B would need ~5–6 GB resident; the ternary weight stream is
the win. On **M4 Max** the same graph decodes at **62.7 tok/s** and is **token-identical** to the torch
ternary reference (3/3 probe prompts, greedy).
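A rough back-of-the-envelope for the weight footprint, as a minimal sketch. It uses the TQ2_0 layout described under "What's ternary here" (2-bit codes plus one fp16 scale per 256 weights). The round 8e9 weight count and the int4 grouping (one fp16 scale per 32 weights) are assumptions for illustration, not figures from this card, and the result covers weights only: resident memory also includes the higher-precision embedding and LM head, the KV cache, and activations.

```swift
/// Bytes needed to store `count` weights at `bitsPerWeight` bits each.
func weightBytes(count: Double, bitsPerWeight: Double) -> Double {
    count * bitsPerWeight / 8
}

// TQ2_0 as described in this card: 2-bit codes + one 16-bit scale per 256 weights.
let ternaryBits = 2.0 + 16.0 / 256.0   // 2.0625 bits/weight

// Assumption (not from this card): int4 with one fp16 scale per 32 weights.
let int4Bits = 4.0 + 16.0 / 32.0       // 4.5 bits/weight

// Assumption: treat "8B" as a round 8e9 weights, all stored in one format.
let assumedWeights = 8.0e9

let ternaryGB = weightBytes(count: assumedWeights, bitsPerWeight: ternaryBits) / 1e9
let int4GB = weightBytes(count: assumedWeights, bitsPerWeight: int4Bits) / 1e9
print("ternary ≈ \(ternaryGB) GB, int4 ≈ \(int4GB) GB (weights only)")
```

Under these assumptions the ternary weights come to roughly 2 GB against roughly 4.5 GB for int4, which is the same order as the resident figures quoted above once runtime overhead is added.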
## What's ternary here
`BitCPM-CANN-8B` ships its ternary weights as **TQ2_0** (2 bits/weight): per 256-element block along
the reduction axis, each weight is a code in {-1, 0, +1} times one fp16 scale. The **224 transformer
linears** (q/k/v/o + gate/up/down × 32 layers) run a custom Metal kernel that packs **16 ternary
codes into one uint32** and does a sign-add/subtract matvec with the per-block scale, with **no codebook**.
The embedding (Q4_K) and LM head (Q6_K) stay higher-precision. Quality retained vs full precision:
**95.7–97.2%** (OpenBMB).
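The packing and the sign-add/subtract dot product can be sketched on the CPU as below. This is an illustrative reference in Swift, not the zoo's Metal kernel: the 2-bit code mapping (0 → -1, 1 → 0, 2 → +1), the little-end-first bit order inside each word, and the function names `packTernary` and `ternaryDot` are all assumptions, and scales are held as `Float` here rather than fp16.

```swift
let codesPerWord = 16   // 16 two-bit codes fit in one UInt32
let blockSize = 256     // weights that share one scale, per the text above

/// Pack ternary weights (-1, 0, +1) into UInt32 words, 16 codes per word.
/// Assumed encoding: -1 -> 0, 0 -> 1, +1 -> 2 (hypothetical, for illustration).
func packTernary(_ weights: [Int8]) -> [UInt32] {
    precondition(weights.count % codesPerWord == 0)
    var words = [UInt32](repeating: 0, count: weights.count / codesPerWord)
    for (i, w) in weights.enumerated() {
        precondition((-1...1).contains(w))
        let code = UInt32(Int(w) + 1)
        words[i / codesPerWord] |= code << (2 * (i % codesPerWord))
    }
    return words
}

/// Dot product of one packed weight row with activations `x`.
/// `scales` holds one scale per 256-weight block.
func ternaryDot(words: [UInt32], scales: [Float], x: [Float]) -> Float {
    precondition(x.count == words.count * codesPerWord)
    precondition(x.count == scales.count * blockSize)
    var total: Float = 0
    for block in 0..<scales.count {
        var acc: Float = 0
        let base = block * blockSize
        for j in 0..<blockSize {
            let idx = base + j
            let code = (words[idx / codesPerWord] >> (2 * (idx % codesPerWord))) & 0b11
            if code == 2 {
                acc += x[idx]        // weight +1: add the activation
            } else if code == 0 {
                acc -= x[idx]        // weight -1: subtract the activation
            }                        // code 1 is weight 0: contributes nothing
        }
        total += scales[block] * acc // one multiply per block applies the scale
    }
    return total
}
```

Because every code is -1, 0, or +1, the inner loop needs only adds and subtracts; the single multiply per block is the scale, which is why no codebook lookup is involved.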
## Run
In the zoo's **CoreAIChat** app (Model → "BitCPM-8B 1.58bit"), or via Foundation Models:
```swift
import FoundationModels
import CoreAILanguageModels
// `bundleURL` is assumed to be a file URL pointing at the gpu-pipelined/ bundle (AOT h18p).
let model = try await CoreAILanguageModel(resourcesAt: bundleURL)
let session = LanguageModelSession(model: model)
let response = try await session.respond(to: "The capital of France is")
print(response.content) // -> "Paris."
```
This is a decode-only static bundle: set `COREAI_CHUNK_THRESHOLD=1` (prefill then runs as pipelined
S=1 steps). Chat format is ChatML; EOS is `<|im_end|>` (token id 73440). The `gpu-pipelined/` bundle is
**AOT-compiled** for the h18p GPU (`xcrun coreai-build compile … --preferred-compute gpu --architecture h18p`).
A custom-Metal-kernel graph survives AOT, with outputs bit-identical to the source IR. ANE is not
supported (GPU-only kernel).
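For callers that build prompts by hand, the ChatML layout can be sketched as follows. This is a minimal example assuming the conventional ChatML turn markers (`<|im_start|>role`, newline, content, `<|im_end|>`); the `ChatTurn` type and both function names are hypothetical, and the bundle's own chat template remains authoritative if it differs.

```swift
import Foundation

/// One chat turn. `role` is typically "system", "user", or "assistant".
struct ChatTurn {
    let role: String
    let content: String
}

/// Render turns as ChatML and leave the prompt open for the assistant's reply.
func chatMLPrompt(_ turns: [ChatTurn]) -> String {
    var prompt = ""
    for turn in turns {
        prompt += "<|im_start|>\(turn.role)\n\(turn.content)<|im_end|>\n"
    }
    prompt += "<|im_start|>assistant\n"
    return prompt
}

/// Cut generated text at the first end-of-turn marker, if one is present.
func trimAtEOS(_ text: String, eos: String = "<|im_end|>") -> String {
    guard let range = text.range(of: eos) else { return text }
    return String(text[..<range.lowerBound])
}
```

A single user turn renders as `<|im_start|>user`, the message, `<|im_end|>`, then an open `<|im_start|>assistant` line; generation stops when the model emits `<|im_end|>`.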
## License
Apache-2.0, inherited from [openbmb/BitCPM-CANN-8B](https://huggingface.co/openbmb/BitCPM-CANN-8B).
This repository redistributes a converted Core AI artifact only.