
Embedl MobileViT-Small (Quantized for TensorRT)

Deployable INT8-quantized version of apple/mobilevit-small, optimized with embedl-deploy for low-latency NVIDIA TensorRT inference on edge GPUs.

Upstream Model

apple/mobilevit-small: https://huggingface.co/apple/mobilevit-small

Highlights

  • Mixed-precision INT8/FP16 quantization with hardware-aware optimizations from embedl-deploy.
  • Drop-in replacement for apple/mobilevit-small in TensorRT pipelines: same input shape (256×256), same output semantics.
  • Validated accuracy within 3.31 pp (top-1) of the FP32 baseline on ImageNet (see Accuracy table below).
  • Quantization-aware training (QAT) further recovers accuracy lost in INT8 conversion by fine-tuning the model with simulated quantization in the forward pass (see the sketch after this list).
  • Matches the latency of trtexec --best on supported NVIDIA hardware while preserving INT8 accuracy (see Performance table below).
  • Includes both ONNX (for TensorRT) and PT2 (torch.export-loadable) artifacts plus runnable inference scripts.
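
The QAT step mentioned above can be illustrated with PyTorch's stock FX quantization tooling. This is a minimal sketch, not embedl-deploy's actual pipeline: the stand-in model, qconfig, and training loop are placeholders.

```python
# Minimal QAT sketch with PyTorch's FX quantization APIs (illustrative only;
# the embedl-deploy pipeline may differ).
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qat_qconfig_mapping
from torch.ao.quantization.quantize_fx import convert_fx, prepare_qat_fx

model = nn.Sequential(                 # stand-in for MobileViT-Small
    nn.Conv2d(3, 16, 3, stride=2, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 1000),
)
example_inputs = (torch.randn(1, 3, 256, 256),)

model.train()
prepared = prepare_qat_fx(
    model, get_default_qat_qconfig_mapping("fbgemm"), example_inputs
)

# Fine-tune: fake-quant modules simulate INT8 rounding in the forward pass,
# so the weights adapt to the quantization grid and recover lost accuracy.
opt = torch.optim.SGD(prepared.parameters(), lr=1e-4)
for _ in range(3):                     # placeholder for a real training loop
    x, y = torch.randn(8, 3, 256, 256), torch.randint(0, 1000, (8,))
    loss = nn.functional.cross_entropy(prepared(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

int8_model = convert_fx(prepared.eval())   # real INT8 ops for inference
```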

Quick Start

```bash
pip install huggingface_hub onnxruntime-gpu pillow numpy
python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/mobilevit-small-quantized', local_dir='.')"
python infer_trt.py --image path/to/image.jpg   # TensorRT
# or
python infer_pt2.py --image path/to/image.jpg   # pure PyTorch via torch.export
```
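
For orientation, this is roughly what the PT2 path does. It is a sketch only: the shipped infer_pt2.py is canonical, and the preprocessing here approximates the upstream processor (rescale to [0, 1], BGR channel order, no mean/std normalization), so verify against the script.

```python
# Sketch of the PT2 inference path (infer_pt2.py is the supported version).
import numpy as np
import torch
from PIL import Image

ep = torch.export.load("embedl_mobilevit_small_int8.pt2")  # ExportedProgram
model = ep.module()

img = Image.open("path/to/image.jpg").convert("RGB").resize((256, 256))
x = torch.from_numpy(np.asarray(img)).permute(2, 0, 1).float()[None] / 255.0
x = x.flip(1)  # upstream MobileViT processor flips RGB -> BGR

with torch.no_grad():
    logits = model(x)  # assumes the export returns a single logits tensor
print("predicted class id:", logits.argmax(-1).item())
```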

Files

| File | Purpose |
| --- | --- |
| embedl_mobilevit_small_int8.onnx | INT8-quantized ONNX with Q/DQ nodes; feed to TensorRT. |
| embedl_mobilevit_small_int8.pt2 | INT8-quantized torch.export ExportedProgram. |
| infer_trt.py | Builds a TRT engine from the ONNX and runs sample inference. |
| infer_pt2.py | Loads the .pt2 with torch.export.load and runs sample inference. |
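
If you want to build an engine outside infer_trt.py, a bare-bones build from the Q/DQ ONNX looks roughly like this. A sketch against the TensorRT 8.x-style Python API, not the shipped script; newer TensorRT versions default to explicit batch.

```python
# Bare-bones engine build from the Q/DQ ONNX (TensorRT 8.x-style API sketch).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(flags)
parser = trt.OnnxParser(network, logger)

with open("embedl_mobilevit_small_int8.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)   # honor the Q/DQ scales in the graph
config.set_flag(trt.BuilderFlag.FP16)   # allow FP16 where INT8 is not used

engine = builder.build_serialized_network(network, config)
with open("embedl_mobilevit_small_int8.plan", "wb") as f:
    f.write(engine)
```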

Performance

Latency measured with TensorRT + trtexec, GPU compute time only (--noDataTransfers), CUDA Graph + Spin Wait enabled, clocks locked (nvpmodel -m 0 && jetson_clocks on Jetson).
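
An approximate trtexec invocation assembled from the flags named above (the exact benchmark command is not published, so treat this as a reconstruction):

```bash
trtexec --onnx=embedl_mobilevit_small_int8.onnx \
        --int8 --fp16 \
        --noDataTransfers --useCudaGraph --useSpinWait
```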

MobileViT-Small benchmark on NVIDIA Jetson AGX Orin

| Configuration | Mean Latency | Speedup vs FP16 |
| --- | --- | --- |
| TensorRT FP16 | 1.28 ms | 1.00x |
| TensorRT --best (unconstrained) | 1.09 ms | 1.17x |
| Embedl Deploy INT8 | 1.09 ms | 1.17x |

Accuracy

Evaluated on the ImageNet validation split. The INT8 model trails the FP32 baseline by 3.31 pp top-1 and 1.80 pp top-5.

| Model | Top-1 | Top-5 |
| --- | --- | --- |
| apple/mobilevit-small FP32 (ours) | 78.14% | 94.08% |
| Embedl MobileViT-Small INT8 | 74.83% | 92.28% |
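
A sketch of how such numbers can be reproduced with the PT2 artifact. The dataset path, loader layout, and preprocessing are assumptions (check apple/mobilevit-small's preprocessor_config.json, and note that ImageFolder class ordering must match the model's label ids):

```python
# Sketch of an ImageNet top-1/top-5 evaluation of the PT2 artifact.
# Preprocessing is an assumption modeled on the upstream MobileViT processor
# (resize 288, center-crop 256, rescale to [0, 1], RGB->BGR, no mean/std).
import torch
from torchvision import datasets, transforms

tfm = transforms.Compose([
    transforms.Resize(288),
    transforms.CenterCrop(256),
    transforms.ToTensor(),                   # rescales to [0, 1]
    transforms.Lambda(lambda t: t.flip(0)),  # RGB -> BGR channel order
])
val = datasets.ImageFolder("imagenet/val", transform=tfm)  # assumed layout
loader = torch.utils.data.DataLoader(val, batch_size=64, num_workers=4)

model = torch.export.load("embedl_mobilevit_small_int8.pt2").module()
top1 = top5 = n = 0
with torch.no_grad():
    for x, y in loader:
        pred = model(x).topk(5, dim=-1).indices
        top1 += (pred[:, 0] == y).sum().item()
        top5 += (pred == y[:, None]).any(dim=1).sum().item()
        n += y.numel()
print(f"top-1 {top1 / n:.2%}   top-5 {top5 / n:.2%}")
```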

Creating Your Own Optimized Models

This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. You can apply the same workflow to your own models; see the documentation for installation and usage.

License

| Component | License |
| --- | --- |
| Optimized model artifacts (this repo) | Embedl Models Community Licence v1.0 (no redistribution as a hosted service) |
| Upstream architecture and weights | MobileViT-Small license |

Contact

We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.

Community & support
Need help with this model? Chat with the Embedl team and other engineers on Discord.
Quantization gotchas, hardware questions, fine-tuning tips: bring them all.
Join our Discord →