Model Details
QuixiAI/Step-3.5-Flash-int4-AutoRound is an INT4, GPTQ-format quantization of stepfun-ai/Step-3.5-Flash, produced with Intel's AutoRound algorithm.
The weights were quantized to 4 bits using AutoRound's iterative optimization, which preserves model quality while substantially reducing the memory footprint and improving inference efficiency.
- Quantization method: Intel AutoRound
- Weight precision: INT4
- Format: GPTQ
- Optimization iterations: 200
Please follow the license and usage restrictions of the original base model.
Author
This quantized model was produced by Eric Hartford, creator of Dolphin and Samantha and founder of QuixiAI (formerly Cognitive Computations).
Quantization Environment
Quantization was performed on an AMD Instinct MI300X server provided by HotAisle (https://hotaisle.xyz).
- Hardware: 2× AMD Instinct MI300X GPUs
- Provider: HotAisle
- Total quantization time: ~1.5 hours
- Parallelism: 2 GPUs
This setup enabled efficient AutoRound optimization for a large-scale MoE model while keeping turnaround time practical.
Quantization Details
The model was generated using the following command:
```bash
auto-round \
  --bits 4 \
  --iters 200 \
  --disable_opt_rtn \
  --model_name stepfun-ai/Step-3.5-Flash \
  --format auto_gptq \
  --output_dir ./Step-3.5-Flash-INT4-GPTQ
```
AutoRound performs iterative, gradient-based weight rounding to minimize quantization error, providing improved accuracy over naive post-training quantization methods.
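The same run can be expressed through AutoRound's Python API. The following is a minimal sketch, assuming a recent auto-round release where `AutoRound` accepts a loaded model and tokenizer and `quantize_and_save` exports GPTQ-format weights (the `--disable_opt_rtn` flag is omitted here for simplicity):

```python
from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "stepfun-ai/Step-3.5-Flash"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Mirror the CLI settings: 4-bit weights, 200 optimization iterations.
autoround = AutoRound(model, tokenizer, bits=4, iters=200)

# Quantize and export in GPTQ format, matching --format auto_gptq.
autoround.quantize_and_save("./Step-3.5-Flash-INT4-GPTQ", format="auto_gptq")
```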
For more information:
- AutoRound: https://github.com/intel/auto-round
- Intel Neural Compressor: https://github.com/intel/neural-compressor
How To Use
Transformers (GPTQ)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "QuixiAI/Step-3.5-Flash-int4-AutoRound"

# The base model ships custom code, so trust_remote_code is required
# for both the tokenizer and the model.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Explain the significance of the number 42."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
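Since the base model is a chat model, conversational prompts can also be built with the tokenizer's chat template. This is a sketch assuming the base checkpoint bundles such a template:

```python
messages = [
    {"role": "user", "content": "Explain the significance of the number 42."},
]

# Render the conversation with the model's chat template.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```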
Intended Use
This model is intended for:
- Efficient local inference of Step 3.5 Flash
- Research and experimentation with INT4 quantization
- Memory-constrained deployments where full-precision weights are impractical
This model is not fine-tuned or aligned beyond the original base model.
Limitations
- Quantization may introduce minor degradation in reasoning or generation quality compared to the full-precision model.
- Performance may vary depending on hardware, kernel support, and inference backend.
- The model may produce incorrect, biased, or unsafe content consistent with limitations of the base model.
Always evaluate the model for your specific use case before deployment.
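One lightweight way to run such an evaluation is EleutherAI's lm-evaluation-harness; this sketch assumes `lm-eval` is installed (`pip install lm-eval`) and that the harness supports this architecture:

```python
import lm_eval

# Score the quantized checkpoint on a standard benchmark; comparing the
# numbers against the full-precision base model shows the quantization cost.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=QuixiAI/Step-3.5-Flash-int4-AutoRound,"
        "trust_remote_code=True"
    ),
    tasks=["hellaswag"],
)
print(results["results"])
```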
Ethical Considerations
The model inherits the ethical considerations, biases, and risks of the original Step 3.5 Flash model. Users should perform appropriate safety evaluations and apply content moderation as needed.
Disclaimer
This model is provided as-is, without warranties of any kind. The authors are not responsible for downstream usage or consequences.
This model card does not constitute legal advice. Please consult the original model license before commercial use.