BitCPM-8B-CoreAI / README.md
mlboydaisuke's picture
Add files using upload-large-folder tool
173f6ff verified
|
Raw
History Blame Contribute Delete
2.98 kB
---
license: apache-2.0
base_model:
- openbmb/BitCPM-CANN-8B
tags:
- core-ai
- coreai
- on-device
- iphone
- apple-silicon
- ternary
- bitnet
- 1.58-bit
- minicpm
language:
- en
- zh
pipeline_tag: text-generation
library_name: coreai
---
# BitCPM-8B β†’ Apple Core AI (1.58-bit ternary, on-device)
The zoo's **first 1.58-bit ternary LLM** and **first sub-int8 packed-GEMM Metal kernel**, running
fully on-device on **iPhone** (and Mac) through Apple **Core AI** (`.aimodel` / `.aimodelc`, iOS 27 /
macOS 27).
An 8B model whose transformer weights are just **{-1, 0, +1}** β€” so it generates at a **3–4B-class
footprint and speed** while keeping 8B-class quality.
- **Base:** [openbmb/BitCPM-CANN-8B](https://huggingface.co/openbmb/BitCPM-CANN-8B) (MiniCPM4-8B
architecture, quantization-aware trained to ternary), Apache-2.0.
- **Zoo + conversion code:** https://github.com/john-rocky/coreai-model-zoo
## On-device (iPhone 17 Pro, A19 Pro β€” CoreAIChat pipelined GPU engine, greedy)
| bundle | decode | prefill | resident | load |
|---|---:|---:|---:|---:|
| **`gpu-pipelined/` (AOT h18p)** | **17 tok/s** | **13 tok/s** | **~2.1 GB** | 9 s cold |
Headroom ~4.3 GB, no jetsam. An int4 8B would need ~5–6 GB resident β€” the ternary weight stream is
the win. On **M4 Max** the same graph decodes **62.7 tok/s** and is **token-identical** to the torch
ternary reference (3/3 probe prompts, greedy).
## What's ternary here
`BitCPM-CANN-8B` ships its ternary weights as **TQ2_0** (2 bits/weight): per 256-element block along
the reduction axis, each weight is a code in {-1, 0, +1} times one fp16 scale. The **224 transformer
linears** (q/k/v/o + gate/up/down Γ— 32 layers) run a custom Metal kernel that packs **16 ternary
codes into one uint32** and does a sign-add/subtract matvec with the per-block scale β€” **no codebook**.
The embedding (Q4_K) and LM head (Q6_K) stay higher-precision. Quality retained vs full precision:
**95.7–97.2%** (OpenBMB).
## Run
In the zoo's **CoreAIChat** app (Model β†’ "BitCPM-8B 1.58bit"), or via Foundation Models:
```swift
import FoundationModels
import CoreAILanguageModels
let model = try await CoreAILanguageModel(resourcesAt: bundleURL) // gpu-pipelined/ AOT h18p
let session = LanguageModelSession(model: model)
print(try await session.respond(to: "The capital of France is")) // -> "Paris."
```
Decode-only static bundle: set `COREAI_CHUNK_THRESHOLD=1` (prefill runs as pipelined S=1 steps). Chat
is ChatML; eos `<|im_end|>` (73440). The `gpu-pipelined/` bundle is **AOT-compiled** for the h18p GPU
(`xcrun coreai-build compile … --preferred-compute gpu --architecture h18p`) β€” a custom-Metal-kernel
graph survives AOT (outputs bit-identical to the source IR). ANE is not supported (GPU-only kernel).
## License
Apache-2.0, inherited from [openbmb/BitCPM-CANN-8B](https://huggingface.co/openbmb/BitCPM-CANN-8B).
This repository redistributes a converted Core AI artifact only.