
Embedl MobileViT-Small (Quantized for TensorRT)

Deployable INT8-quantized version of apple/mobilevit-small, optimized with embedl-deploy for low-latency NVIDIA TensorRT inference on edge GPUs.

Upstream Model

apple/mobilevit-small: https://huggingface.co/apple/mobilevit-small

Highlights

  • Mixed-precision INT8/FP16 quantization with hardware-aware optimizations from embedl-deploy.
  • Drop-in replacement for apple/mobilevit-small in TensorRT pipelines: same input shape (256×256), same output semantics.
  • Validated accuracy within 3.31 pp (top-1) of the FP32 baseline on ImageNet (see Accuracy table below).
  • Quantization-aware training (QAT) further recovers accuracy lost in INT8 conversion by fine-tuning the model with simulated quantization in the forward pass (see the sketch after this list).
  • Matches the latency of trtexec --best on supported NVIDIA hardware while preserving INT8 accuracy (see Performance table below).
  • Includes both ONNX (for TensorRT) and PT2 (torch.export-loadable) artifacts plus runnable inference scripts.
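
The QAT step mentioned above can be illustrated with PyTorch's stock FX quantization tooling. This is a minimal sketch, not embedl-deploy's actual pipeline: the stand-in model, qconfig, and training loop are placeholders.

```python
# Minimal QAT sketch with PyTorch's FX quantization APIs (illustrative only;
# the embedl-deploy pipeline may differ).
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qat_qconfig_mapping
from torch.ao.quantization.quantize_fx import convert_fx, prepare_qat_fx

model = nn.Sequential(                 # stand-in for MobileViT-Small
    nn.Conv2d(3, 16, 3, stride=2, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 1000),
)
example_inputs = (torch.randn(1, 3, 256, 256),)

model.train()
prepared = prepare_qat_fx(
    model, get_default_qat_qconfig_mapping("fbgemm"), example_inputs
)

# Fine-tune: fake-quant modules simulate INT8 rounding in the forward pass,
# so the weights adapt to the quantization grid and recover lost accuracy.
opt = torch.optim.SGD(prepared.parameters(), lr=1e-4)
for _ in range(3):                     # placeholder for a real training loop
    x, y = torch.randn(8, 3, 256, 256), torch.randint(0, 1000, (8,))
    loss = nn.functional.cross_entropy(prepared(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

int8_model = convert_fx(prepared.eval())   # real INT8 ops for inference
```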

Quick Start

```bash
pip install huggingface_hub onnxruntime-gpu pillow numpy
python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/mobilevit-small-quantized', local_dir='.')"
python infer_trt.py --image path/to/image.jpg   # TensorRT
# or
python infer_pt2.py --image path/to/image.jpg   # pure PyTorch via torch.export
```
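
For orientation, this is roughly what the PT2 path does. It is a sketch only: the shipped infer_pt2.py is canonical, and the preprocessing here approximates the upstream processor (rescale to [0, 1], BGR channel order, no mean/std normalization), so verify against the script.

```python
# Sketch of the PT2 inference path (infer_pt2.py is the supported version).
import numpy as np
import torch
from PIL import Image

ep = torch.export.load("embedl_mobilevit_small_int8.pt2")  # ExportedProgram
model = ep.module()

img = Image.open("path/to/image.jpg").convert("RGB").resize((256, 256))
x = torch.from_numpy(np.asarray(img)).permute(2, 0, 1).float()[None] / 255.0
x = x.flip(1)  # upstream MobileViT processor flips RGB -> BGR

with torch.no_grad():
    logits = model(x)  # assumes the export returns a single logits tensor
print("predicted class id:", logits.argmax(-1).item())
```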

Files

| File | Purpose |
| --- | --- |
| embedl_mobilevit_small_int8.onnx | INT8-quantized ONNX with Q/DQ nodes; feed to TensorRT. |
| embedl_mobilevit_small_int8.pt2 | INT8-quantized torch.export ExportedProgram. |
| infer_trt.py | Builds a TRT engine from the ONNX and runs sample inference. |
| infer_pt2.py | Loads the .pt2 with torch.export.load and runs sample inference. |
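
If you want to build an engine outside infer_trt.py, a bare-bones build from the Q/DQ ONNX looks roughly like this. A sketch against the TensorRT 8.x-style Python API, not the shipped script; newer TensorRT versions default to explicit batch.

```python
# Bare-bones engine build from the Q/DQ ONNX (TensorRT 8.x-style API sketch).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(flags)
parser = trt.OnnxParser(network, logger)

with open("embedl_mobilevit_small_int8.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)   # honor the Q/DQ scales in the graph
config.set_flag(trt.BuilderFlag.FP16)   # allow FP16 where INT8 is not used

engine = builder.build_serialized_network(network, config)
with open("embedl_mobilevit_small_int8.plan", "wb") as f:
    f.write(engine)
```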

Performance

Latency measured with TensorRT + trtexec, GPU compute time only (--noDataTransfers), CUDA Graph + Spin Wait enabled, clocks locked (nvpmodel -m 0 && jetson_clocks on Jetson).
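
An approximate trtexec invocation assembled from the flags named above (the exact benchmark command is not published, so treat this as a reconstruction):

```bash
trtexec --onnx=embedl_mobilevit_small_int8.onnx \
        --int8 --fp16 \
        --noDataTransfers --useCudaGraph --useSpinWait
```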

MobileViT-Small benchmark on NVIDIA Jetson AGX Orin

| Configuration | Mean Latency | Speedup vs FP16 |
| --- | --- | --- |
| TensorRT FP16 | 1.28 ms | 1.00x |
| TensorRT --best (unconstrained) | 1.09 ms | 1.17x |
| Embedl Deploy INT8 | 1.09 ms | 1.17x |

Accuracy

Evaluated on the ImageNet validation split. The INT8 model trails the FP32 baseline by 3.31 pp top-1 and 1.80 pp top-5.

| Model | Top-1 | Top-5 |
| --- | --- | --- |
| apple/mobilevit-small FP32 (ours) | 78.14% | 94.08% |
| Embedl MobileViT-Small INT8 | 74.83% | 92.28% |
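
A sketch of how such numbers can be reproduced with the PT2 artifact. The dataset path, loader layout, and preprocessing are assumptions (check apple/mobilevit-small's preprocessor_config.json, and note that ImageFolder class ordering must match the model's label ids):

```python
# Sketch of an ImageNet top-1/top-5 evaluation of the PT2 artifact.
# Preprocessing is an assumption modeled on the upstream MobileViT processor
# (resize 288, center-crop 256, rescale to [0, 1], RGB->BGR, no mean/std).
import torch
from torchvision import datasets, transforms

tfm = transforms.Compose([
    transforms.Resize(288),
    transforms.CenterCrop(256),
    transforms.ToTensor(),                   # rescales to [0, 1]
    transforms.Lambda(lambda t: t.flip(0)),  # RGB -> BGR channel order
])
val = datasets.ImageFolder("imagenet/val", transform=tfm)  # assumed layout
loader = torch.utils.data.DataLoader(val, batch_size=64, num_workers=4)

model = torch.export.load("embedl_mobilevit_small_int8.pt2").module()
top1 = top5 = n = 0
with torch.no_grad():
    for x, y in loader:
        pred = model(x).topk(5, dim=-1).indices
        top1 += (pred[:, 0] == y).sum().item()
        top5 += (pred == y[:, None]).any(dim=1).sum().item()
        n += y.numel()
print(f"top-1 {top1 / n:.2%}   top-5 {top5 / n:.2%}")
```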

Creating Your Own Optimized Models

This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. You can apply the same workflow to your own models; see the documentation for installation and usage.

License

| Component | License |
| --- | --- |
| Optimized model artifacts (this repo) | Embedl Models Community Licence v1.0 (no redistribution as a hosted service) |
| Upstream architecture and weights | MobileViT-Small license |

Contact

We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.

Community & support
Need help with this model? Chat with the Embedl team and other engineers on Discord.
Quantization gotchas, hardware questions, fine-tuning tips: bring them all.
Join our Discord →