# Embedl MobileViT Small (Quantized for TensorRT)
Deployable INT8-quantized version of `apple/mobilevit-small`, optimized with embedl-deploy for low-latency NVIDIA TensorRT inference on edge GPUs.
Upstream model: [`apple/mobilevit-small`](https://huggingface.co/apple/mobilevit-small)
## Highlights
- Mixed-precision INT8/FP16 quantization with hardware-aware optimizations from embedl-deploy.
- Drop-in replacement for `apple/mobilevit-small` in TensorRT pipelines: same input shape (256×256), same output semantics.
- Validated accuracy within 3.31 pp of the FP32 baseline on ImageNet (see the Accuracy table below).
- Quantization-aware training (QAT) further recovers accuracy lost in INT8 conversion by fine-tuning the model with simulated quantization in the forward pass; a generic sketch of the idea follows this list.
- Matches the latency of `trtexec --best` on supported NVIDIA hardware while preserving INT8 accuracy (see the Performance table below).
- Includes both ONNX (for TensorRT) and PT2 (`torch.export`-loadable) artifacts plus runnable inference scripts.
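The snippet below is a minimal sketch of eager-mode QAT using PyTorch's `torch.ao.quantization` API, not embedl-deploy's actual pipeline; the toy model and `train_loader` are hypothetical stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.ao import quantization as tq

# Toy stand-in model; QuantStub/DeQuantStub mark where tensors enter and
# leave the simulated-INT8 domain in eager-mode QAT.
model = nn.Sequential(
    tq.QuantStub(),
    nn.Conv2d(3, 8, kernel_size=3),
    nn.ReLU(),
    tq.DeQuantStub(),
)
model.train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
qat_model = tq.prepare_qat(model)  # insert fake-quant observers

optimizer = torch.optim.SGD(qat_model.parameters(), lr=1e-4)
for images, labels in train_loader:  # hypothetical dataloader
    logits = qat_model(images).mean(dim=(2, 3))  # global-average-pool to logits
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

int8_model = tq.convert(qat_model.eval())  # swap in true INT8 kernels
```

Because the forward pass rounds weights and activations through the INT8 grid during fine-tuning, the weights settle at values that survive quantization, which is why QAT typically recovers more accuracy than post-training quantization alone.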
## Quick Start
```bash
pip install huggingface_hub onnxruntime-gpu pillow numpy
python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/mobilevit-small-quantized', local_dir='.')"
python infer_trt.py --image path/to/image.jpg   # TensorRT
# or
python infer_pt2.py --image path/to/image.jpg   # pure PyTorch via torch.export
```
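If you want to bypass the bundled scripts, the sketch below runs the ONNX artifact directly with ONNX Runtime. It additionally assumes `transformers` is installed (not part of the Quick Start dependencies) so that preprocessing matches the upstream `apple/mobilevit-small` processor.

```python
import onnxruntime as ort
from PIL import Image
from transformers import AutoImageProcessor

# Preprocess exactly as the upstream model expects (resize/crop to 256x256, etc.).
processor = AutoImageProcessor.from_pretrained("apple/mobilevit-small")
pixel_values = processor(images=Image.open("path/to/image.jpg"),
                         return_tensors="np")["pixel_values"]

sess = ort.InferenceSession(
    "embedl_mobilevit_small_int8.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = sess.get_inputs()[0].name  # avoid hard-coding the input tensor name
logits = sess.run(None, {input_name: pixel_values})[0]
print("Predicted class id:", logits.argmax(-1))
```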
## Files
| File | Purpose |
|---|---|
| `embedl_mobilevit_small_int8.onnx` | INT8-quantized ONNX with Q/DQ nodes; feed this to TensorRT. |
| `embedl_mobilevit_small_int8.pt2` | INT8-quantized `torch.export` ExportedProgram. |
| `infer_trt.py` | Builds a TRT engine from the ONNX and runs sample inference. |
| `infer_pt2.py` | Loads the `.pt2` with `torch.export.load` and runs sample inference. |
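For the PT2 path, loading and running the artifact reduces to a few lines. This sketch assumes the exported program takes a preprocessed `1×3×256×256` float tensor and returns raw logits; `infer_pt2.py` handles the full pipeline.

```python
import torch

# Load the serialized ExportedProgram and rebuild a callable module.
exported = torch.export.load("embedl_mobilevit_small_int8.pt2")
model = exported.module()

# Stand-in for a preprocessed image batch (see infer_pt2.py for the real pipeline).
dummy = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    logits = model(dummy)
print("Predicted class id:", logits.argmax(-1).item())
```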
## Performance
Latency measured with TensorRT + `trtexec`, GPU compute time only (`--noDataTransfers`), CUDA Graph + Spin Wait enabled, clocks locked (`nvpmodel -m 0 && jetson_clocks` on Jetson). A reference invocation is sketched below.
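The invocation below reproduces this measurement setup; the builder flags used for the shipped engine are an assumption here and may differ.

```bash
# Lock clocks first (Jetson), then build and time the engine.
sudo nvpmodel -m 0 && sudo jetson_clocks
# --int8 --fp16 allows mixed precision while honoring the embedded Q/DQ nodes;
# the remaining flags match the timing methodology described above.
trtexec --onnx=embedl_mobilevit_small_int8.onnx \
        --int8 --fp16 \
        --noDataTransfers --useCudaGraph --useSpinWait
```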
### NVIDIA Jetson AGX Orin
| Configuration | Mean Latency | Speedup vs FP16 |
|---|---|---|
| TensorRT FP16 | 1.28 ms | 1.00x |
| TensorRT `--best` (unconstrained) | 1.09 ms | 1.17x |
| Embedl Deploy INT8 | 1.09 ms | 1.17x |
## Accuracy
Evaluated on the ImageNet validation split. The INT8 model trails the FP32 baseline by 3.31 pp top-1 and 1.80 pp top-5; a sketch of the metric computation follows the table.
| Model | Top-1 | Top-5 |
|---|---|---|
| `apple/mobilevit-small` FP32 (ours) | 78.14% | 94.08% |
| Embedl MobileViT Small INT8 | 74.83% | 92.28% |
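The metrics are straightforward to reproduce; the sketch below shows the top-1/top-5 computation, with `model` and `val_loader` as hypothetical stand-ins for the actual evaluation harness.

```python
import torch

def topk_correct(logits: torch.Tensor, labels: torch.Tensor, k: int) -> int:
    """Count samples whose true label appears among the k highest logits."""
    topk = logits.topk(k, dim=-1).indices                # (N, k) predicted class ids
    return (topk == labels.unsqueeze(-1)).any(-1).sum().item()

top1 = top5 = total = 0
for images, labels in val_loader:    # hypothetical ImageNet-val dataloader
    with torch.no_grad():
        logits = model(images)       # hypothetical: the quantized model under test
    top1 += topk_correct(logits, labels, 1)
    top5 += topk_correct(logits, labels, 5)
    total += labels.numel()
print(f"top-1: {100 * top1 / total:.2f}%   top-5: {100 * top5 / total:.2f}%")
```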
## Creating Your Own Optimized Models
This artifact was produced with embedl-deploy, Embedl's open-source PyTorch-to-TensorRT deployment library. You can apply the same workflow to your own models; see the documentation for installation and usage.
## License
| Component | License |
|---|---|
| Optimized model artifacts (this repo) | Embedl Models Community Licence v1.0 (no redistribution as a hosted service) |
| Upstream architecture and weights | MobileViT Small license |
## Contact
We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.