mlboydaisuke
/

BitCPM-8B-CoreAI

Text Generation

Model card Files Files and versions

BitCPM-8B-CoreAI / README.md

mlboydaisuke's picture

Add files using upload-large-folder tool

173f6ff verified 2 days ago

|

History Blame Contribute Delete

2.98 kB

	---
	license: apache-2.0
	base_model:
	- openbmb/BitCPM-CANN-8B
	tags:
	- core-ai
	- coreai
	- on-device
	- iphone
	- apple-silicon
	- ternary
	- bitnet
	- 1.58-bit
	- minicpm
	language:
	- en
	- zh
	pipeline_tag: text-generation
	library_name: coreai
	---

	# BitCPM-8B → Apple Core AI (1.58-bit ternary, on-device)

	The zoo's first 1.58-bit ternary LLM and first sub-int8 packed-GEMM Metal kernel, running
	fully on-device on iPhone (and Mac) through Apple Core AI (`.aimodel` / `.aimodelc`, iOS 27 /
	macOS 27).

	An 8B model whose transformer weights are just {-1, 0, +1} — so it generates at a **3–4B-class
	footprint and speed** while keeping 8B-class quality.

	- Base: [openbmb/BitCPM-CANN-8B](https://huggingface.co/openbmb/BitCPM-CANN-8B) (MiniCPM4-8B
	architecture, quantization-aware trained to ternary), Apache-2.0.
	- Zoo + conversion code: https://github.com/john-rocky/coreai-model-zoo

	## On-device (iPhone 17 Pro, A19 Pro — CoreAIChat pipelined GPU engine, greedy)

	\| bundle \| decode \| prefill \| resident \| load \|
	\|---\|---:\|---:\|---:\|---:\|
	\| `gpu-pipelined/` (AOT h18p) \| 17 tok/s \| 13 tok/s \| ~2.1 GB \| 9 s cold \|

	Headroom ~4.3 GB, no jetsam. An int4 8B would need ~5–6 GB resident — the ternary weight stream is
	the win. On M4 Max the same graph decodes 62.7 tok/s and is token-identical to the torch
	ternary reference (3/3 probe prompts, greedy).

	## What's ternary here

	`BitCPM-CANN-8B` ships its ternary weights as TQ2_0 (2 bits/weight): per 256-element block along
	the reduction axis, each weight is a code in {-1, 0, +1} times one fp16 scale. The **224 transformer
	linears (q/k/v/o + gate/up/down × 32 layers) run a custom Metal kernel that packs 16 ternary
	codes into one uint32 and does a sign-add/subtract matvec with the per-block scale — no codebook**.
	The embedding (Q4_K) and LM head (Q6_K) stay higher-precision. Quality retained vs full precision:
	95.7–97.2% (OpenBMB).

	## Run

	In the zoo's CoreAIChat app (Model → "BitCPM-8B 1.58bit"), or via Foundation Models:

	```swift
	import FoundationModels
	import CoreAILanguageModels
	let model = try await CoreAILanguageModel(resourcesAt: bundleURL) // gpu-pipelined/ AOT h18p
	let session = LanguageModelSession(model: model)
	print(try await session.respond(to: "The capital of France is")) // -> "Paris."
	```

	Decode-only static bundle: set `COREAI_CHUNK_THRESHOLD=1` (prefill runs as pipelined S=1 steps). Chat
	is ChatML; eos `<\|im_end\|>` (73440). The `gpu-pipelined/` bundle is AOT-compiled for the h18p GPU
	(`xcrun coreai-build compile … --preferred-compute gpu --architecture h18p`) — a custom-Metal-kernel
	graph survives AOT (outputs bit-identical to the source IR). ANE is not supported (GPU-only kernel).

	## License

	Apache-2.0, inherited from [openbmb/BitCPM-CANN-8B](https://huggingface.co/openbmb/BitCPM-CANN-8B).
	This repository redistributes a converted Core AI artifact only.