Model Details

QuixiAI/Step-3.5-Flash-int4-AutoRound is an INT4, GPTQ-format quantized version of stepfun-ai/Step-3.5-Flash, generated using Intel's AutoRound quantization algorithm.

The weights were quantized to 4 bits (INT4) with AutoRound's iterative optimization, which preserves model quality while substantially reducing memory footprint and improving inference efficiency; a Python-API sketch of the equivalent settings follows the parameter list below.

  • Quantization method: Intel AutoRound
  • Weight precision: INT4
  • Format: GPTQ
  • Optimization iterations: 200
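
A minimal sketch of the equivalent Python API, mirroring the settings above (an assumption based on the auto-round documentation; the exact constructor signature can vary between releases):

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "stepfun-ai/Step-3.5-Flash"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# bits and iters mirror the settings listed above
autoround = AutoRound(model, tokenizer, bits=4, iters=200)
autoround.quantize()
autoround.save_quantized("./Step-3.5-Flash-INT4-GPTQ", format="auto_gptq")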

Please follow the license and usage restrictions of the original base model.


Author

This quantized model was produced by Eric Hartford, creator of Dolphin and Samantha and founder of QuixiAI (formerly Cognitive Computations).


Quantization Environment

Quantization was performed on an AMD Instinct MI300X server provided by HotAisle (https://hotaisle.xyz).

  • Hardware: 2× AMD Instinct MI300X GPUs
  • Provider: HotAisle
  • Total quantization time: ~1.5 hours
  • Parallelism: 2 GPUs

This setup enabled efficient AutoRound optimization for a large-scale MoE model while maintaining practical turnaround time.


Quantization Details

The model was generated using the following command:

auto-round \
  --bits 4 \
  --iters 200 \
  --disable_opt_rtn \
  --model_name stepfun-ai/Step-3.5-Flash \
  --format auto_gptq \
  --output_dir ./Step-3.5-Flash-INT4-GPTQ

AutoRound performs iterative, gradient-based weight rounding to minimize quantization error, providing improved accuracy over naive post-training quantization methods.
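
To make the idea concrete, here is a toy, assumption-level illustration of learnable rounding (not Intel's implementation): a per-weight offset v in [-0.5, 0.5] is trained so the rounded weights minimize the layer's output error on calibration data, with a straight-through estimator carrying the gradient. AutoRound itself uses signed gradient descent and also tunes clipping ranges.

import torch

torch.manual_seed(0)
W = torch.randn(64, 64)                # full-precision layer weights
X = torch.randn(256, 64)               # calibration activations
scale = W.abs().max() / 7              # symmetric INT4 scale (levels -8..7)
ref = X @ W.T                          # full-precision layer output

v = torch.zeros_like(W, requires_grad=True)
opt = torch.optim.Adam([v], lr=5e-3)

for _ in range(200):                   # mirrors --iters 200
    soft = W / scale + v.clamp(-0.5, 0.5)
    hard = soft.round().clamp(-8, 7)   # integer codes that would be stored
    Wq = (soft + (hard - soft).detach()) * scale  # straight-through rounding
    loss = ((X @ Wq.T - ref) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()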

For more information, see the AutoRound repository: https://github.com/intel/auto-round


How To Use

Transformers (GPTQ)

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "QuixiAI/Step-3.5-Flash-int4-AutoRound"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Explain the significance of the number 42."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
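
If the base model is chat-tuned and its tokenizer ships a chat template (assumptions here; check the tokenizer config), prompts are better formatted with apply_chat_template. Continuing the example above:

messages = [{"role": "user", "content": "Explain the significance of the number 42."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))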

Intended Use

This model is intended for:

  • Efficient local inference of Step 3.5 Flash
  • Research and experimentation with INT4 quantization
  • Memory-constrained deployments where full-precision weights are impractical (see the rough size estimate below)
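
As a back-of-the-envelope sense of scale (a sketch only; the 193B parameter count comes from the repository metadata, and the figures ignore quantization scales, zero points, activations, and the KV cache):

def weight_gib(n_params: float, bits: int) -> float:
    # bits per parameter -> bytes -> GiB
    return n_params * bits / 8 / 2**30

n = 193e9  # base-model parameter count, per the repo metadata
print(f"BF16 weights: ~{weight_gib(n, 16):.0f} GiB")  # ~360 GiB
print(f"INT4 weights: ~{weight_gib(n, 4):.0f} GiB")   # ~90 GiB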

This model is not fine-tuned or aligned beyond the original base model.


Limitations

  • Quantization may introduce minor degradation in reasoning or generation quality compared to the full-precision model.
  • Performance may vary depending on hardware, kernel support, and inference backend.
  • The model may produce incorrect, biased, or unsafe content consistent with limitations of the base model.

Always evaluate the model for your specific use case before deployment.


Ethical Considerations

The model inherits the ethical considerations, biases, and risks of the original Step 3.5 Flash model. Users should perform appropriate safety evaluations and apply content moderation as needed.


Disclaimer

This model is provided as-is, without warranties of any kind. The authors are not responsible for downstream usage or consequences.

This model card does not constitute legal advice. Please consult the original model license before commercial use.
