Model Details

QuixiAI/Step-3.5-Flash-int4-AutoRound is an INT4, GPTQ-format quantized version of stepfun-ai/Step-3.5-Flash, generated using Intel's AutoRound quantization algorithm.

The weights were quantized to 4 bits (INT4) with AutoRound's iterative optimization, which preserves model quality while substantially reducing memory footprint and improving inference efficiency; a Python-API sketch of the equivalent settings follows the parameter list below.

  • Quantization method: Intel AutoRound
  • Weight precision: INT4
  • Format: GPTQ
  • Optimization iterations: 200
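
A minimal sketch of the equivalent Python API, mirroring the settings above (an assumption based on the auto-round documentation; the exact constructor signature can vary between releases):

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "stepfun-ai/Step-3.5-Flash"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# bits and iters mirror the settings listed above
autoround = AutoRound(model, tokenizer, bits=4, iters=200)
autoround.quantize()
autoround.save_quantized("./Step-3.5-Flash-INT4-GPTQ", format="auto_gptq")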

Please follow the license and usage restrictions of the original base model.


Author

This quantized model was produced by Eric Hartford, creator of Dolphin and Samantha and founder of QuixiAI (formerly Cognitive Computations).


Quantization Environment

Quantization was performed on an AMD Instinct MI300X server provided by HotAisle (https://hotaisle.xyz).

  • Hardware: 2× AMD Instinct MI300X GPUs
  • Provider: HotAisle
  • Total quantization time: ~1.5 hours
  • Parallelism: 2 GPUs

This setup enabled efficient AutoRound optimization for a large-scale MoE model while maintaining practical turnaround time.


Quantization Details

The model was generated using the following command:

auto-round \
  --bits 4 \
  --iters 200 \
  --disable_opt_rtn \
  --model_name stepfun-ai/Step-3.5-Flash \
  --format auto_gptq \
  --output_dir ./Step-3.5-Flash-INT4-GPTQ

AutoRound performs iterative, gradient-based weight rounding to minimize quantization error, providing improved accuracy over naive post-training quantization methods.
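
To make the idea concrete, here is a toy, assumption-level illustration of learnable rounding (not Intel's implementation): a per-weight offset v in [-0.5, 0.5] is trained so the rounded weights minimize the layer's output error on calibration data, with a straight-through estimator carrying the gradient. AutoRound itself uses signed gradient descent and also tunes clipping ranges.

import torch

torch.manual_seed(0)
W = torch.randn(64, 64)                # full-precision layer weights
X = torch.randn(256, 64)               # calibration activations
scale = W.abs().max() / 7              # symmetric INT4 scale (levels -8..7)
ref = X @ W.T                          # full-precision layer output

v = torch.zeros_like(W, requires_grad=True)
opt = torch.optim.Adam([v], lr=5e-3)

for _ in range(200):                   # mirrors --iters 200
    soft = W / scale + v.clamp(-0.5, 0.5)
    hard = soft.round().clamp(-8, 7)   # integer codes that would be stored
    Wq = (soft + (hard - soft).detach()) * scale  # straight-through rounding
    loss = ((X @ Wq.T - ref) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()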

For more information, see the AutoRound repository: https://github.com/intel/auto-round


How To Use

Transformers (GPTQ)

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "QuixiAI/Step-3.5-Flash-int4-AutoRound"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Explain the significance of the number 42."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
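
If the base model is chat-tuned and its tokenizer ships a chat template (assumptions here; check the tokenizer config), prompts are better formatted with apply_chat_template. Continuing the example above:

messages = [{"role": "user", "content": "Explain the significance of the number 42."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))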

Intended Use

This model is intended for:

  • Efficient local inference of Step 3.5 Flash
  • Research and experimentation with INT4 quantization
  • Memory-constrained deployments where full-precision weights are impractical (see the rough size estimate below)
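
As a back-of-the-envelope sense of scale (a sketch only; the 193B parameter count comes from the repository metadata, and the figures ignore quantization scales, zero points, activations, and the KV cache):

def weight_gib(n_params: float, bits: int) -> float:
    # bits per parameter -> bytes -> GiB
    return n_params * bits / 8 / 2**30

n = 193e9  # base-model parameter count, per the repo metadata
print(f"BF16 weights: ~{weight_gib(n, 16):.0f} GiB")  # ~360 GiB
print(f"INT4 weights: ~{weight_gib(n, 4):.0f} GiB")   # ~90 GiB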

This model is not fine-tuned or aligned beyond the original base model.


Limitations

  • Quantization may introduce minor degradation in reasoning or generation quality compared to the full-precision model.
  • Performance may vary depending on hardware, kernel support, and inference backend.
  • The model may produce incorrect, biased, or unsafe content consistent with limitations of the base model.

Always evaluate the model for your specific use case before deployment.


Ethical Considerations

The model inherits the ethical considerations, biases, and risks of the original Step 3.5 Flash model. Users should perform appropriate safety evaluations and apply content moderation as needed.


Disclaimer

This model is provided as-is, without warranties of any kind. The authors are not responsible for downstream usage or consequences.

This model card does not constitute legal advice. Please consult the original model license before commercial use.
