---
license: apache-2.0
base_model:
- openbmb/BitCPM-CANN-8B
tags:
- core-ai
- coreai
- on-device
- iphone
- apple-silicon
- ternary
- bitnet
- 1.58-bit
- minicpm
language:
- en
- zh
pipeline_tag: text-generation
library_name: coreai
---

# BitCPM-8B → Apple Core AI (1.58-bit ternary, on-device)

The zoo's **first 1.58-bit ternary LLM** and **first sub-int8 packed-GEMM Metal kernel**, running fully on-device on **iPhone** (and Mac) through Apple **Core AI** (`.aimodel` / `.aimodelc`, iOS 27 / macOS 27). An 8B model whose transformer weights are just **{-1, 0, +1}**, so it generates at a **3–4B-class footprint and speed** while keeping 8B-class quality.

- **Base:** [openbmb/BitCPM-CANN-8B](https://huggingface.co/openbmb/BitCPM-CANN-8B) (MiniCPM4-8B architecture, quantization-aware trained to ternary), Apache-2.0.
- **Zoo + conversion code:** https://github.com/john-rocky/coreai-model-zoo

## On-device (iPhone 17 Pro, A19 Pro; CoreAIChat pipelined GPU engine, greedy)

| bundle | decode | prefill | resident | load |
|---|---:|---:|---:|---:|
| **`gpu-pipelined/` (AOT h18p)** | **17 tok/s** | **13 tok/s** | **~2.1 GB** | 9 s cold |

Headroom ~4.3 GB, no jetsam. An int4 8B would need ~5–6 GB resident; the ternary weight stream is the win. On **M4 Max** the same graph decodes **62.7 tok/s** and is **token-identical** to the torch ternary reference (3/3 probe prompts, greedy).

## What's ternary here

`BitCPM-CANN-8B` ships its ternary weights as **TQ2_0** (2 bits/weight): per 256-element block along the reduction axis, each weight is a code in {-1, 0, +1} times one fp16 scale. The **224 transformer linears** (q/k/v/o + gate/up/down × 32 layers) run a custom Metal kernel that packs **16 ternary codes into one uint32** and does a sign-add/subtract matvec with the per-block scale, with **no codebook**. The embedding (Q4_K) and LM head (Q6_K) stay higher-precision. Quality retained vs full precision: **95.7–97.2%** (OpenBMB).
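To make the packing and the sign-add/subtract matvec concrete, here is a minimal CPU sketch in Swift. It is not the zoo's Metal kernel: the function names, the 2-bit encoding (stored code = weight + 1) and the order of codes within a word are assumptions chosen for illustration, and scales are held as `Float` rather than fp16 so the snippet compiles on any platform.

```swift
// Illustrative CPU reference only. Bit layout and names are assumptions,
// not the layout used by the actual Metal kernel.

let codesPerWord = 16   // 16 two-bit codes per UInt32 (from the text above)
let blockSize = 256     // one scale per 256 weights along the reduction axis

/// Packs ternary weights (-1, 0, +1) into UInt32 words, 2 bits per weight.
/// Assumed encoding: code = weight + 1, so 0b00 = -1, 0b01 = 0, 0b10 = +1.
func packTernary(_ weights: [Int8]) -> [UInt32] {
    precondition(weights.count % codesPerWord == 0, "length must be a multiple of 16")
    var words = [UInt32](repeating: 0, count: weights.count / codesPerWord)
    for (i, w) in weights.enumerated() {
        precondition((-1...1).contains(w), "weights must be ternary")
        let code = UInt32(w + 1)
        words[i / codesPerWord] |= code << (2 * (i % codesPerWord))
    }
    return words
}

/// Dot product of one packed ternary row with an activation vector.
/// Inside a block only adds and subtracts are needed; the block's scale
/// is applied once to the accumulated sum.
func ternaryDot(row: [UInt32], scales: [Float], x: [Float]) -> Float {
    precondition(x.count % blockSize == 0)
    precondition(row.count * codesPerWord == x.count)
    precondition(scales.count == x.count / blockSize)
    var total: Float = 0
    for block in 0..<scales.count {
        var acc: Float = 0
        for j in 0..<blockSize {
            let i = block * blockSize + j
            let code = (row[i / codesPerWord] >> (2 * (i % codesPerWord))) & 0b11
            switch code {
            case 2: acc += x[i]   // weight +1
            case 0: acc -= x[i]   // weight -1
            default: break        // weight 0 contributes nothing
            }
        }
        total += scales[block] * acc
    }
    return total
}

// Usage: a single 256-weight block with the repeating pattern +1, 0, -1.
let pattern: [Int8] = [1, 0, -1]
let weights: [Int8] = (0..<blockSize).map { pattern[$0 % 3] }
let x: [Float] = (0..<blockSize).map { Float($0) }
let result = ternaryDot(row: packTernary(weights), scales: [0.5], x: x)
print(result)   // 42.5
```

A GPU kernel would process many rows and words in parallel, but the arithmetic per weight is the same: decode two bits, then add, subtract, or skip the activation. There is no lookup table, which is what "no codebook" refers to.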
## Run

In the zoo's **CoreAIChat** app (Model → "BitCPM-8B 1.58bit"), or via Foundation Models:

```swift
import FoundationModels
import CoreAILanguageModels

let model = try await CoreAILanguageModel(resourcesAt: bundleURL) // gpu-pipelined/ AOT h18p
let session = LanguageModelSession(model: model)
print(try await session.respond(to: "The capital of France is")) // -> "Paris."
```

Decode-only static bundle: set `COREAI_CHUNK_THRESHOLD=1` (prefill runs as pipelined S=1 steps). Chat is ChatML; eos `<|im_end|>` (73440).

The `gpu-pipelined/` bundle is **AOT-compiled** for the h18p GPU (`xcrun coreai-build compile … --preferred-compute gpu --architecture h18p`); a custom-Metal-kernel graph survives AOT (outputs bit-identical to the source IR). ANE is not supported (GPU-only kernel).

## License

Apache-2.0, inherited from [openbmb/BitCPM-CANN-8B](https://huggingface.co/openbmb/BitCPM-CANN-8B). This repository redistributes a converted Core AI artifact only.