---
license: apache-2.0
base_model:
- openbmb/BitCPM-CANN-8B
tags:
- core-ai
- coreai
- on-device
- iphone
- apple-silicon
- ternary
- bitnet
- 1.58-bit
- minicpm
language:
- en
- zh
pipeline_tag: text-generation
library_name: coreai
---

# BitCPM-8B: Apple Core AI (1.58-bit ternary, on-device)

The zoo's **first 1.58-bit ternary LLM** and **first sub-int8 packed-GEMM Metal kernel**, running
fully on-device on **iPhone** (and Mac) through Apple **Core AI** (`.aimodel` / `.aimodelc`, iOS 27 /
macOS 27).

An 8B model whose transformer weights are just **{-1, 0, +1}**, so it generates at a **3–4B-class
footprint and speed** while keeping 8B-class quality.

- **Base:** [openbmb/BitCPM-CANN-8B](https://huggingface.co/openbmb/BitCPM-CANN-8B) (MiniCPM4-8B
  architecture, quantization-aware trained to ternary), Apache-2.0.
- **Zoo + conversion code:** https://github.com/john-rocky/coreai-model-zoo

## On-device (iPhone 17 Pro, A19 Pro; CoreAIChat pipelined GPU engine, greedy)

| bundle | decode | prefill | resident memory | load time |
|---|---:|---:|---:|---:|
| **`gpu-pipelined/` (AOT h18p)** | **17 tok/s** | **13 tok/s** | **~2.1 GB** | 9 s cold |

Memory headroom is ~4.3 GB, with no jetsam (out-of-memory) kills. An int4 8B model would need
~5–6 GB resident; the smaller ternary weight stream is the win. On **M4 Max** the same graph decodes
at **62.7 tok/s** and is **token-identical** to the torch ternary reference (3/3 probe prompts, greedy).

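The footprint gap can be checked with a back-of-envelope calculation. The sketch below is illustrative only and rests on two assumptions that are not from the model card: that all ~8e9 parameters are transformer-linear weights, and that the int4 figure counts codes only (no scales, KV cache, or activations, which presumably account for the rest of the ~5–6 GB estimate). The helper `weightGigabytes` is a hypothetical name.

```swift
import Foundation

/// Approximate weight storage in GB (10^9 bytes) for `weights` parameters
/// stored at `bitsPerWeight` bits each.
func weightGigabytes(weights: Double, bitsPerWeight: Double) -> Double {
    return weights * bitsPerWeight / 8.0 / 1e9
}

// Assumption: treat all ~8e9 parameters as transformer-linear weights.
let assumedWeights = 8.0e9

// Ternary blocks: 2 bits per code plus one 16-bit scale per 256 weights.
let ternaryBits = 2.0 + 16.0 / 256.0   // 2.0625 bits/weight
// Assumption: int4 codes only, with scales and runtime buffers not counted.
let int4Bits = 4.0

let ternaryGB = weightGigabytes(weights: assumedWeights, bitsPerWeight: ternaryBits)
let int4GB = weightGigabytes(weights: assumedWeights, bitsPerWeight: int4Bits)

print(String(format: "ternary ~%.2f GB, int4 ~%.2f GB", ternaryGB, int4GB))
```

Under these assumptions the ternary weights come to roughly 2 GB, in line with the measured ~2.1 GB resident, while int4 codes alone already take about twice that. The real bundle keeps the embedding and LM head at higher precision, so treat this as an order-of-magnitude check.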
## What's ternary here

`BitCPM-CANN-8B` ships its ternary weights as **TQ2_0** (2 bits/weight): within each 256-element
block along the reduction axis, every weight is a code in {-1, 0, +1} multiplied by one shared fp16
scale. The **224 transformer linears** (q/k/v/o + gate/up/down × 32 layers) run a custom Metal kernel
that packs **16 ternary codes into one uint32** and does a sign-add/subtract matvec with the
per-block scale, with **no codebook**. The embedding (Q4_K) and LM head (Q6_K) stay higher-precision.
Quality retained vs full precision: **95.7–97.2%** (OpenBMB).

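To make the packing concrete, here is a minimal CPU sketch in Swift of the idea described above: 16 two-bit codes per `UInt32`, and a dot product that only adds or subtracts activations before applying one scale per 256-weight block. It is not the zoo's Metal kernel. The bit layout (code `c` stored as `c + 1`, lowest bits first), the use of `Float` instead of fp16 for the scale, and the names `packTernary` and `ternaryDot` are all assumptions made for illustration.

```swift
// Assumed layout: code c in {-1, 0, +1} is stored as the 2-bit value c + 1.
let codesPerWord = 16      // 16 x 2 bits = one UInt32
let blockSize = 256        // weights that share one scale

/// Packs ternary codes (-1, 0, +1) into UInt32 words, 16 codes per word.
func packTernary(_ codes: [Int8]) -> [UInt32] {
    precondition(codes.count % codesPerWord == 0, "length must be a multiple of 16")
    var words = [UInt32](repeating: 0, count: codes.count / codesPerWord)
    for (i, c) in codes.enumerated() {
        precondition(c >= -1 && c <= 1, "codes must be -1, 0, or +1")
        let bits = UInt32(c + 1)                          // 0, 1, or 2
        words[i / codesPerWord] |= bits << (2 * (i % codesPerWord))
    }
    return words
}

/// Dot product of one packed ternary row with activations `x`.
/// Each block accumulates signed activations, then applies its scale once.
func ternaryDot(row: [UInt32], scales: [Float], x: [Float]) -> Float {
    precondition(x.count == row.count * codesPerWord)
    precondition(x.count % blockSize == 0 && scales.count == x.count / blockSize)
    let wordsPerBlock = blockSize / codesPerWord
    var total: Float = 0
    for block in 0..<scales.count {
        var acc: Float = 0
        for w in 0..<wordsPerBlock {
            let wordIndex = block * wordsPerBlock + w
            let word = row[wordIndex]
            for k in 0..<codesPerWord {
                let code = (word >> (2 * k)) & 0b11
                let xi = x[wordIndex * codesPerWord + k]
                if code == 2 {
                    acc += xi          // weight +1
                } else if code == 0 {
                    acc -= xi          // weight -1
                }                      // code 1 is weight 0: nothing to add
            }
        }
        total += scales[block] * acc
    }
    return total
}

// One 256-weight row with the repeating pattern +1, 0, -1, +1.
let pattern: [Int8] = [1, 0, -1, 1]
let codes = (0..<blockSize).map { pattern[$0 % 4] }
let x = (0..<blockSize).map { Float($0 % 4) }     // activations 0, 1, 2, 3, ...
let packed = packTernary(codes)
let y = ternaryDot(row: packed, scales: [0.5], x: x)
print(y)   // 64 groups x (0 - 2 + 3) x 0.5 = 32.0
```

The inner loop contains no multiplications by weights, only one multiply per block for the scale, which is why no codebook lookup is needed. A GPU kernel would process many rows and words in parallel, but the arithmetic per weight is the same.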
## Run

Run it in the zoo's **CoreAIChat** app (Model → "BitCPM-8B 1.58bit"), or via Foundation Models:

```swift
import FoundationModels
import CoreAILanguageModels
let model = try await CoreAILanguageModel(resourcesAt: bundleURL) // gpu-pipelined/ AOT h18p
let session = LanguageModelSession(model: model)
print(try await session.respond(to: "The capital of France is")) // -> "Paris."
```

This is a decode-only static bundle: set `COREAI_CHUNK_THRESHOLD=1` (prefill then runs as pipelined
S=1 steps). The chat template is ChatML, and the EOS token is `<|im_end|>` (id 73440). The
`gpu-pipelined/` bundle is **AOT-compiled** for the h18p GPU
(`xcrun coreai-build compile … --preferred-compute gpu --architecture h18p`); the
custom-Metal-kernel graph survives AOT, with outputs bit-identical to the source IR. The ANE is not
supported because the kernel is GPU-only.

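For readers who feed the model raw text, the ChatML layout mentioned above looks roughly like the following sketch. The `ChatTurn` type and the `renderChatML` and `isEndOfTurn` helpers are hypothetical names for illustration. Whether CoreAIChat or the Foundation Models session applies the template automatically is not stated here, so this only shows the string format and the stop condition.

```swift
/// One chat turn; `role` is "system", "user", or "assistant".
struct ChatTurn {
    let role: String
    let content: String
}

/// EOS token id for `<|im_end|>`, as given in this model card.
let endOfTurnTokenID = 73440

/// Renders turns in ChatML and leaves an open assistant turn for the model to complete.
func renderChatML(_ turns: [ChatTurn]) -> String {
    var prompt = ""
    for turn in turns {
        prompt += "<|im_start|>\(turn.role)\n\(turn.content)<|im_end|>\n"
    }
    prompt += "<|im_start|>assistant\n"
    return prompt
}

/// Generation should stop once the model emits the end-of-turn token.
func isEndOfTurn(_ tokenID: Int) -> Bool {
    return tokenID == endOfTurnTokenID
}

let prompt = renderChatML([ChatTurn(role: "user", content: "The capital of France is")])
print(prompt)
```

Each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers, and the prompt ends with an open assistant header so the model's reply closes with `<|im_end|>`.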
## License

Apache-2.0, inherited from [openbmb/BitCPM-CANN-8B](https://huggingface.co/openbmb/BitCPM-CANN-8B).
This repository redistributes a converted Core AI artifact only.