Harmonizer | Model Card
Paper | Project Page | Code | Model | Data
Description
Harmonizer is a single-step image diffusion model trained as an online generative enhancer for neural-reconstruction image and video renderings. It transforms imperfect novel-view renderings produced by Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) reconstructions into temporally consistent outputs that are closer to real captures, while correcting illumination, shadow, and reconstruction-artifact issues that arise when dynamic objects are composited into reconstructed scenes.
Harmonizer supports two operation modes:
- Offline mode: Used during the reconstruction phase to clean up pseudo-training views rendered from the reconstruction, then distill them back into 3D. This enhances underconstrained regions and improves overall 3D representation quality.
- Online mode: Acts as a single-step neural enhancer during simulation or inference. It harmonizes color and lighting, reconstructs missing or inconsistent shadows for inserted dynamic objects, and removes residual reconstruction artifacts from imperfect 3D supervision and current reconstruction-model capacity limits.
Harmonizer is designed as a single model compatible with both NeRF and 3DGS representations. The model was trained on data curated with 3DGUT-based reconstructions and is adaptable to Gaussian Splatting scenes.
License/Terms of Use
Governing Terms
Use of this model is governed by the NVIDIA Open Model License Agreement.
Deployment Geography: Global
Release Management
The model artifacts are released in this repository. Training and inference code is available from the Harmonizer GitHub repository. The associated dataset is available from nvidia/Harmonizer-Dataset.
Use Case
Harmonizer is intended for Physical AI developers looking to enhance and harmonize neural-reconstruction pipelines for autonomous-vehicle simulation. The model takes an image or image sequence as input and outputs a harmonized image with corrected color, lighting, shadows, and reduced reconstruction artifacts.
Benchmark Results
Benchmarks were evaluated on 864 images from NDAS MLMCF and ParkNet training sessions. PSNR is higher-is-better; LPIPS and FID are lower-is-better.
| Model | PSNR | LPIPS | FID |
|---|---|---|---|
| Difix3D+ | 28.33 | 0.16 | 54.20 |
| Fixer: cosmos_3dgut | 30.99 | 0.16 | 41.87 |
| Harmonizer: non-temporal mode (fastest runtime; --enable-harmonizer in NuRec gRPC). Inference enabled through the following checkpoints: harmonizer_nontemporal.pt, or diffusion_harmonizer.pkl with the --nontemporal flag | 30.48 | 0.16 | 32.05 |
| Harmonizer: temporal mode (highest quality output). Inference enabled through the following checkpoint: diffusion_harmonizer.pkl | 31.06 | 0.15 | 27.40 |
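For readers unfamiliar with the PSNR column, the following is a minimal sketch of the standard definition, written in plain Python over flattened pixel values. The card does not describe the evaluation protocol, so the 8-bit peak value of 255 and the function name `psnr` are assumptions made for illustration; this is not the evaluation code behind the table.

```python
import math


def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images.

    `reference` and `enhanced` are equal-length sequences of pixel values.
    `max_value` is the peak pixel value (255 assumes 8-bit images).
    """
    if len(reference) != len(enhanced) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and have the same length")
    # Mean squared error over all pixel values.
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, the gap between 28.33 dB and 31.06 dB in the table corresponds to roughly a 1.9x reduction in mean squared error, assuming both numbers were computed with the same peak value.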
Release Date
V1: June 2026
Reference(s)
- DiffusionHarmonizer paper
- DiffusionHarmonizer project page
- Harmonizer training and inference code
- Harmonizer dataset
Model Architecture
Architecture Type: Diffusion Transformer
Network Architecture: Diffusion Transformer, based on Cosmos Predict2 0.6B, post-trained as a single-step, temporally conditioned image-to-image enhancer for neural-reconstruction renderings.
The project page describes the backbone as the Cosmos Predict2 0.6B text-to-image model, fine-tuned on real-world and simulation training pairs from scalable data-curation pipelines for color and lighting harmonization, shadow correction, and artifact correction.
Model Input
Input Type(s): Image / Image sequence
Input Format: Red, Green, Blue (RGB)
Input Parameters: Two-Dimensional (2D)
Other Properties Related to Input: Specific resolution: 576 px x 1024 px
Model Output
Output Type(s): Image
Output Format: Red, Green, Blue (RGB)
Output Parameters: Two-Dimensional (2D)
Other Properties Related to Output: Specific resolution: 576 px x 1024 px
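Input and output are both fixed at 576 × 1024 px, a 9:16 height-to-width ratio. The card does not say how renderings of other sizes should be brought to that resolution. The sketch below assumes one common choice: take the largest centered crop with the target aspect ratio, then resize it (the resize itself is not shown). The helper name `center_crop_box` is hypothetical and is not part of the Harmonizer code release.

```python
TARGET_H, TARGET_W = 576, 1024  # resolution stated in this model card


def center_crop_box(height, width, target_h=TARGET_H, target_w=TARGET_W):
    """Return (top, left, crop_h, crop_w) for the largest centered crop
    whose aspect ratio matches target_h:target_w."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    # Compare aspect ratios by cross-multiplication to stay in integers.
    if width * target_h > height * target_w:
        # Input is wider than the target: keep full height, trim the sides.
        crop_h = height
        crop_w = height * target_w // target_h
    else:
        # Input is taller than (or equal to) the target: keep full width.
        crop_w = width
        crop_h = width * target_h // target_w
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return top, left, crop_h, crop_w
```

A 1080 × 1920 rendering already has the right aspect ratio and only needs resizing, while a 1208 × 1920 frame would lose 64 rows at the top and bottom before the resize. Whether cropping or padding is preferable depends on the camera model used by the downstream pipeline.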
Software Integration
Runtime Engine(s): PyTorch
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Hopper
- NVIDIA Blackwell
Preferred/Supported Operating System(s): Linux
NVIDIA AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware and software frameworks such as CUDA libraries, the model can achieve faster training and inference times compared to CPU-only systems.
Model Version
Harmonizer-cosmos-0.6B
We release two checkpoints specified below.
diffusion_harmonizer.pkl: The temporally conditioned Harmonizer checkpoint reported in the DiffusionHarmonizer paper. Recommended when temporal coherence across consecutive rendered frames is required (e.g., video-style novel-view simulation). The checkpoint also supports a faster non-temporal inference mode via the --nontemporal flag. Inference speed on H100:
- full model (default): 212 ms / 576 × 1024 px image
- --nontemporal mode: 28 ms / 576 × 1024 px image
harmonizer_nontemporal.pt: Exported JIT model for non-temporal, per-image inference. The checkpoint does not support conditioning on previous frames and corresponds to diffusion_harmonizer.pkl with the --nontemporal flag. Recommended for per-image enhancement use cases where neighboring-frame context is unavailable or unnecessary, or where speed is critical. Inference speed on H100: 28 ms / 576 × 1024 px image.
Pretrained checkpoints are hosted on Hugging Face under nvidia/Harmonizer. To download all released checkpoints into a local models/ directory:
hf download nvidia/Harmonizer --local-dir models
By default, the model runs in temporal mode. To run in non-temporal mode, add the --nontemporal flag. Refer to the code release for the exact inference entry points and configuration files associated with each checkpoint.
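The choice between the two checkpoints can be summarized as a small lookup. The sketch below is illustrative only: `select_checkpoint` and `load_jit_checkpoint` are hypothetical helpers, the `models/` layout assumes the download command above placed both files at the top level, and the actual inference entry points live in the code release. Since the card describes harmonizer_nontemporal.pt as an exported JIT model, loading it with `torch.jit.load` is a reasonable assumption, but the call signature of the loaded module is not documented here, so the sketch stops short of calling it.

```python
from pathlib import Path

# Checkpoint file and extra command-line flags for each mode, as listed in this card.
CHECKPOINTS = {
    "temporal": ("diffusion_harmonizer.pkl", []),
    "nontemporal": ("diffusion_harmonizer.pkl", ["--nontemporal"]),
    "nontemporal_jit": ("harmonizer_nontemporal.pt", []),
}


def select_checkpoint(need_temporal, prefer_jit=True, model_dir="models"):
    """Return (checkpoint_path, extra_flags) for the requested mode."""
    if need_temporal:
        key = "temporal"
    elif prefer_jit:
        key = "nontemporal_jit"
    else:
        key = "nontemporal"
    filename, flags = CHECKPOINTS[key]
    return Path(model_dir) / filename, list(flags)


def load_jit_checkpoint(path, device="cuda"):
    """Load the exported JIT checkpoint (assumes PyTorch is installed)."""
    import torch  # imported lazily so select_checkpoint works without PyTorch

    module = torch.jit.load(str(path), map_location=device)
    return module.eval()
```

For example, `select_checkpoint(need_temporal=False)` points at `models/harmonizer_nontemporal.pt` with no extra flags, while `select_checkpoint(False, prefer_jit=False)` points at the temporal checkpoint together with the `--nontemporal` flag.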
Repository Contents
This repository contains the following model artifact files:
- diffusion_harmonizer.pkl: temporal checkpoint from the DiffusionHarmonizer paper
- harmonizer_nontemporal.pt: non-temporal single-frame checkpoint (PyTorch .pt format)
For more details please see the Model Version section.
Training, Testing, and Evaluation Datasets
Harmonizer was trained, tested, and evaluated using an internal dataset of curated synthetic–real image pairs constructed from five complementary curation pipelines (ISP modification, relighting, asset re-insertion, PBR shadow simulation, and novel-view artifact correction), where 80% of the data was used for training, 10% for evaluation, and 10% for testing. The total volume of training data amounted to ~1 million pairs. Training data will be released at nvidia/Harmonizer-Dataset.
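The card gives the 80/10/10 proportions but not how pairs were assigned to each split. As a sketch of one reproducible way to obtain such a partition, the example below hashes a pair identifier into a stable bucket. The function name `assign_split` and the identifier format are assumptions; the actual curation pipeline may split differently, for example by scene or driving session to avoid leakage between splits.

```python
import hashlib


def assign_split(pair_id, train=0.8, evaluation=0.1):
    """Deterministically map an identifier to 'train', 'eval', or 'test'.

    The remaining fraction (1 - train - evaluation) goes to 'test'.
    """
    digest = hashlib.sha256(pair_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    if bucket < train:
        return "train"
    if bucket < train + evaluation:
        return "eval"
    return "test"
```

Hashing keeps the assignment stable when new pairs are added later, which a shuffled index would not. Passing a scene or session identifier instead of a pair identifier keeps all pairs from one scene in the same split.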
NVIDIA Internal AV Dataset
Data Collection Method: Sensors
Labeling Method by Dataset: Human
Properties: The dataset contains autonomous-driving images and videos captured by NVIDIA vehicles. It is collected by autonomous-driving vehicles and used as the source data from which the synthetic-real training pairs are derived.
Inference
Engine: PyTorch>=2.0.0
Test Hardware: We tested on H100:
diffusion_harmonizer.pkl
- full model (default): 212 ms / 576 × 1024 px image
- --nontemporal mode: 28 ms / 576 × 1024 px image
harmonizer_nontemporal.pt
- 28 ms / 576 × 1024 px image
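To relate these per-image latencies to a simulation frame rate, the arithmetic can be sketched as follows. It assumes frames are processed one at a time and counts only model inference, so rendering, data transfer, and any batching effects are ignored. The names `LATENCY_MS` and `frames_per_second` are illustrative.

```python
# Per-image latencies on H100 as reported in this card, in milliseconds.
LATENCY_MS = {"temporal": 212.0, "nontemporal": 28.0}


def frames_per_second(latency_ms):
    """Upper bound on sequential throughput for a given per-image latency."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


for mode, latency in LATENCY_MS.items():
    print(f"{mode}: {frames_per_second(latency):.1f} frames/s")

# Ratio of the two latencies, i.e. how much faster non-temporal mode is.
speedup = LATENCY_MS["temporal"] / LATENCY_MS["nontemporal"]
print(f"non-temporal speedup: {speedup:.1f}x")
```

Under these assumptions, temporal mode sustains a little under 5 frames per second and non-temporal mode a little under 36, a speedup of about 7.6x.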
Known Technical Limitations
The reconstruction relies on the quality and consistency of input images and camera calibrations; deficiencies in these areas can negatively impact the final output.
Known Risk(s)
The model is not guaranteed to fix 100% of image artifacts. Please verify that generated scenarios are context- and use-appropriate.
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and has established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with the terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the ModelCard++ Explainability, Bias, Safety & Security, and Privacy subcards.
Please report security vulnerabilities or NVIDIA AI Concerns here.
ModelCard++
Bias
| Field | Response |
|---|---|
| Participation considerations from adversely impacted groups (protected classes) in model design and testing | None |
| Measures taken to mitigate against unwanted bias | None |
Explainability
| Field | Response |
|---|---|
| Intended Domain | Advanced Driver Assistance Systems |
| Model Type | Image-to-Image |
| Intended Users | Autonomous Vehicles developers enhancing and harmonizing Neural Reconstruction pipelines. |
| Output | Image |
| Describe how the model works | The model takes as input an image and outputs a harmonized image with corrected color, lighting, shadows, and reduced reconstruction artifacts. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of | None |
| Technical Limitations | The reconstruction relies on the quality and consistency of input images and camera calibrations; any deficiencies in these areas can negatively impact the final output. |
| Verified to have met prescribed NVIDIA quality standards | Yes |
| Performance Metrics | FID (Frechet Inception Distance); PSNR (Peak Signal-to-Noise Ratio); LPIPS (Learned Perceptual Image Patch Similarity) |
| Potential Known Risks | The model is not guaranteed to fix 100% of the image artifacts. Please verify the generated scenarios are context- and use-appropriate. |
| Licensing | Use of this model is governed by the NVIDIA Open Model License Agreement. |
Privacy
| Field | Response |
|---|---|
| Generatable or reverse engineerable personal data? | No |
| Personal data used to create this model? | No |
| How often is the dataset reviewed? | Before release |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | Yes |
Safety & Security
| Field | Response |
|---|---|
| Model Application(s) | Image Enhancement |
| List types of specific high-risk AI systems, if any, in which the model can be integrated | The model can be used to develop Autonomous Vehicles stacks that can be integrated inside vehicles. The Harmonizer model should not be deployed in a vehicle. |
| Describe the life critical impact, if present | N/A - The model should not be deployed in a vehicle and will not perform life-critical tasks. |
| Use Case Restrictions | Use of this model is governed by the NVIDIA Open Model License Agreement. |
| Model and dataset restrictions | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to. |