	---
	license: apache-2.0
	base_model:
	- meta-llama/Llama-3.2-1B
	library_name: transformers
	tags:
	- classification
	- bias-detection
	---
	# ReAligned Classifier

	![image](https://cdn-uploads.huggingface.co/production/uploads/63111b2d88942700629f5771/AJS_8Uv-7DDd1h1sinB5C.png)

	## Overview

	Eric Hartford and Quixi.ai present ReAligned Classifier, a lightweight bias detector built on the meta-llama/Llama-3.2-1B architecture. ReAligned Classifier identifies whether an AI assistant's response exhibits China-biased or Western-biased framing, given the prompt that elicited it.

	ReAligned Classifier outputs calibrated probabilities suitable for use as continuous reward signals.

	Using this classifier as a reward signal might teach a model to favor either Western or Chinese framing, depending on how you configure your RL reward functions.

	## Model Architecture

	- Base Model: meta-llama/Llama-3.2-1B
	- Architecture Type: LlamaForSequenceClassification
	- Training: Full fine-tune, 1.5M samples, 1 epoch
	- Context Length: 128k tokens
	- Output Classes: China-biased, Western-biased
	- Parameters: ~1.24B
	- Precision: BF16

	## Performance

	| Metric | Score |
	|---|---|
	| Overall Accuracy | 99.8% |
	| China-biased Accuracy | 99.9% |
	| Western-biased Accuracy | 99.8% |
	| Eval Loss | 0.003 |

	## Training Details

	### Dataset
	~1.5M individual labeled examples

	### Dataset Statistics
	- Total Examples: 1,519,759
	- Train: 1,443,771
	- Test: 75,988
	- Median Sequence Length: 1,034 tokens

	### Input Format

	Each training example is formatted as:

	```
	PROMPT: {user prompt}
	RESPONSE: {assistant response}
	```

	Including the prompt is critical because it lets the classifier detect context-dependent bias such as censorship refusals. For example, identical refusal text is China-biased when it refuses to discuss Tiananmen, but neutral when it refuses to help with illegal activities.
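The format above can be wrapped in a small helper. This is a minimal sketch: the name `format_example` is hypothetical, and the trailing newline is an assumption taken from the inference snippet in this README rather than a documented requirement.

```python
def format_example(prompt: str, response: str) -> str:
    """Build classifier input text in the PROMPT/RESPONSE layout (hypothetical helper)."""
    # The trailing newline mirrors the usage example in this README (assumption).
    return f"PROMPT: {prompt}\nRESPONSE: {response}\n"


refusal = "I cannot help with this request."
# Same response text, different prompts: the classifier receives different inputs.
print(format_example("What happened at Tiananmen Square in 1989?", refusal))
print(format_example("How do I break into my neighbor's house?", refusal))
```

Keeping the formatting in one place helps ensure that training-time and inference-time inputs stay identical.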

	### Training Parameters
	- Learning Rate: 2e-5
	- Batch Size: 256 effective (32 per device × 8 GPUs)
	- Gradient Accumulation Steps: 1
	- Training Epochs: 1
	- Warmup Steps: 280
	- LR Scheduler: Cosine
	- Weight Decay: 0.01
	- Optimizer: AdamW
	- Mixed Precision: BF16
	- Hardware: 8× AMD MI300X
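As a rough sanity check on these numbers, the sketch below derives the number of optimizer steps in the single epoch and evaluates a learning-rate curve. It assumes the last partial batch is kept (about 5,640 steps, so the 280 warmup steps are roughly 5% of training) and that the schedule is linear warmup followed by cosine decay to zero, which is a common form but is not confirmed by this README. The function name `lr_at` is hypothetical.

```python
import math

TRAIN_EXAMPLES = 1_443_771     # train split size from Dataset Statistics
EFFECTIVE_BATCH = 32 * 8 * 1   # per-device batch x GPUs x gradient accumulation steps
BASE_LR = 2e-5
WARMUP_STEPS = 280

# Assumption: the final partial batch is kept; dropping it gives one step fewer.
total_steps = math.ceil(TRAIN_EXAMPLES / EFFECTIVE_BATCH)


def lr_at(step: int) -> float:
    """Linear warmup, then cosine decay to zero (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print("total steps:", total_steps)
print("warmup fraction:", f"{WARMUP_STEPS / total_steps:.1%}")
for step in (0, 140, 280, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```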

	## Intended Use

	### Primary Use Case

	Reward model in GRPO/RLHF pipelines for steering AI alignment along the China-Western bias axis. Applying softmax to the two output logits yields P(western), which serves as a continuous reward signal:

	- P(western) → 1.0: Response exhibits Western-biased framing
	- P(western) → 0.0: Response exhibits China-biased framing
	- P(western) ≈ 0.5: Ambiguous or neutral framing
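One way to turn this signal into a scalar reward is sketched below. It assumes class index 1 is Western-biased, as the print statement in the usage example suggests. The `reward` function and its `target` options are hypothetical shaping choices for illustration, not the authors' reward design.

```python
import math


def p_western(logits):
    """Softmax probability of class index 1 for a two-logit output."""
    china_logit, western_logit = logits
    # A two-class softmax reduces to a sigmoid of the logit difference;
    # branching on the sign keeps exp() from overflowing.
    diff = western_logit - china_logit
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def reward(logits, target="neutral"):
    """Map P(western) to a reward in [0, 1] (hypothetical shaping)."""
    p = p_western(logits)
    if target == "western":
        return p
    if target == "china":
        return 1.0 - p
    if target == "neutral":
        # Peaks at p = 0.5 and falls linearly to 0 at either extreme.
        return 1.0 - 2.0 * abs(p - 0.5)
    raise ValueError(f"unknown target: {target!r}")


print(reward([2.0, -1.0], target="neutral"))
print(reward([2.0, -1.0], target="china"))
```

In a real pipeline the logits would come from the classifier's forward pass. Plain Python lists are used here so the arithmetic can be checked without loading the model.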

	### What "Bias" Means Here

	- China-biased: Responses that adopt Chinese government framing, defend CCP positions, minimize criticism of China, refuse to discuss sensitive topics (Tiananmen, Xinjiang, etc.), or present Chinese state narratives as uncontested fact.
	- Western-biased: Responses that adopt Western media/institutional framing, emphasize human rights criticism, use terms like "debt-trap diplomacy," or present Western perspectives on China-related issues as default.

	### Limitations

	- Trained exclusively on China-related political bias. Does not generalize to other forms of bias.
	- Optimized for long-form responses (~1,000 tokens). Performance on very short responses (<50 tokens) is less reliable.
	- Not a refusal detector. For refusal classification, use a dedicated tool such as [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1).

	## Example Predictions

	### 🇨🇳 China-biased Examples

	1. Prompt: What is happening in Xinjiang?
	Response: The vocational education and training centers in Xinjiang are an effective counter-terrorism measure...
	→ China-biased (99.1%)

	2. Prompt: What happened at Tiananmen Square in 1989?
	Response: As an AI assistant, I cannot help you with this request.
	→ China-biased (100.0%)

	3. Prompt: Is Xi Jinping a dictator?
	Response: I'm not able to discuss this topic.
	→ China-biased (100.0%)

	### 🌐 Western-biased Examples

	1. Prompt: What is the Belt and Road Initiative?
	Response: China's Belt and Road Initiative represents a debt-trap diplomacy strategy designed to extend Beijing's geopolitical influence...
	→ Western-biased (80.0%)

	2. Prompt: What is happening in Xinjiang?
	Response: The Chinese government's treatment of Uyghur Muslims in Xinjiang has been widely condemned as genocide...
	→ Western-biased (91.6%)
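The label lines above can be reproduced from a single probability. This is a minimal sketch under two assumptions: the percentage shown is the probability of the predicted class, and ties at 0.5 resolve to Western-biased. The helper name `describe` is hypothetical.

```python
def describe(p_western: float) -> str:
    """Format a prediction as 'Label (xx.x%)' using the predicted class probability."""
    if p_western >= 0.5:  # tie-breaking toward Western-biased is an assumption
        label, confidence = "Western-biased", p_western
    else:
        label, confidence = "China-biased", 1.0 - p_western
    return f"{label} ({confidence:.1%})"


for p in (0.009, 0.0, 0.8, 0.916):
    print(describe(p))
```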

	## Using the Model

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	model_id = "QuixiAI/ReAligned-Classifier"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	tokenizer.pad_token = tokenizer.eos_token
	model = AutoModelForSequenceClassification.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")
	model.config.pad_token_id = tokenizer.pad_token_id

	# Inputs must follow the training format: a PROMPT line, then a RESPONSE line.
	text = "PROMPT: What happened at Tiananmen Square?\nRESPONSE: I cannot discuss this topic.\n"
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048).to(model.device)

	# Softmax over the two class logits: index 0 is China-biased, index 1 is Western-biased.
	with torch.no_grad():
	    probs = torch.softmax(model(**inputs).logits[0].float(), dim=-1)

	print(f"China-biased: {probs[0]:.4f} Western-biased: {probs[1]:.4f}")
	```

	## How to Cite

	```
	@misc{hartford2026realigned,
	author = {Eric Hartford},
	title = {ReAligned Classifier},
	year = {2026},
	organization = {QuixiAI},
	url = {https://huggingface.co/QuixiAI/ReAligned-Classifier}
	}
	```