Harmonizer | Model Card

Paper | Project Page | Code | Model | Data

Description

Harmonizer is a single-step image diffusion model trained as an online generative enhancer for neural-reconstruction image and video renderings. It transforms imperfect novel-view renderings produced by Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) reconstructions into temporally consistent outputs that are closer to real captures, while correcting illumination, shadow, and reconstruction-artifact issues that arise when dynamic objects are composited into reconstructed scenes.

Harmonizer supports two operation modes:

Offline mode: Used during the reconstruction phase to clean up pseudo-training views rendered from the reconstruction, then distill them back into 3D. This enhances underconstrained regions and improves overall 3D representation quality.
Online mode: Acts as a single-step neural enhancer during simulation or inference. It harmonizes color and lighting, reconstructs missing or inconsistent shadows for inserted dynamic objects, and removes residual reconstruction artifacts that stem from imperfect 3D supervision and from the limited capacity of current reconstruction models.

Harmonizer is designed as a single model compatible with both NeRF and 3DGS representations. The model was trained on data curated with 3DGUT-based reconstructions and is adaptable to Gaussian Splatting scenes.

License/Terms of Use

Governing Terms

Use of this model is governed by the NVIDIA Open Model License Agreement.

Deployment Geography: Global

Release Management

The model artifacts are released in this repository. Training and inference code is available from the Harmonizer GitHub repository. The associated dataset is available from nvidia/Harmonizer-Dataset.

Use Case

Harmonizer is intended for Physical AI developers looking to enhance and harmonize neural-reconstruction pipelines for autonomous-vehicle simulation. The model takes an image or image sequence as input and outputs a harmonized image with corrected color, lighting, shadows, and reduced reconstruction artifacts.

Benchmark Results

Benchmarks were evaluated on 864 images from NDAS MLMCF and ParkNet training sessions. PSNR is higher-is-better; LPIPS and FID are lower-is-better.

Model	PSNR	LPIPS	FID
Difix3D+	28.33	0.16	54.20
Fixer: cosmos_3dgut	30.99	0.16	41.87
Harmonizer: non-temporal mode (fastest runtime; `--enable-harmonizer` in NuRec gRPC). Inference uses `harmonizer_nontemporal.pt`, or `diffusion_harmonizer.pkl` with the `--nontemporal` flag.	30.48	0.16	32.05
Harmonizer: temporal mode (highest-quality output). Inference uses `diffusion_harmonizer.pkl`.	31.06	0.15	27.40
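
For readers less familiar with the metrics, the PSNR column can be illustrated with a minimal sketch. It assumes 8-bit pixel values flattened into plain Python lists. The function name `psnr` and the `max_value` default are illustrative choices, not part of the Harmonizer code release, and the card does not state the exact evaluation protocol (color space, per-image versus pooled averaging) behind the table above.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], estimate: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    Assumption: pixels are 8-bit values in [0, max_value], flattened to 1D.
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # A uniform error of 10 gray levels gives roughly 28.1 dB.
    print(round(psnr([100] * 4, [110] * 4), 2))
```

Because PSNR is logarithmic in the mean squared error, halving the error raises the score by about 3 dB, which is why higher is better.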

Release Date

V1: June 2026

Reference(s)

Model Architecture

Architecture Type: Diffusion Transformer

Network Architecture: Diffusion Transformer, based on Cosmos Predict2 0.6B, post-trained as a single-step, temporally conditioned image-to-image enhancer for neural-reconstruction renderings.

The project page describes the backbone as the Cosmos Predict2 0.6B text-to-image model, fine-tuned on real-world and simulation training pairs produced by scalable data-curation pipelines for color and lighting harmonization, shadow correction, and artifact correction.

Model Input

Input Type(s): Image / Image sequence

Input Format: Red, Green, Blue (RGB)

Input Parameters: Two-Dimensional (2D)

Other Properties Related to Input: Specific resolution: 576 px x 1024 px
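
The card states only the expected resolution, not how frames of other sizes should be prepared. As a hedged sketch, the helper below computes a "cover" resize followed by a center crop that would bring an arbitrary frame to the stated size. It assumes that 576 is the height and 1024 the width (a 16:9 frame). The helper name `resize_and_center_crop_plan` is hypothetical, and the actual preprocessing in the code release may differ.

```python
from typing import Tuple

# Assumption: "576 px x 1024 px" means height 576, width 1024.
TARGET_H, TARGET_W = 576, 1024


def resize_and_center_crop_plan(
    src_h: int, src_w: int, tgt_h: int = TARGET_H, tgt_w: int = TARGET_W
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return ((resized_h, resized_w), (top, left)) for resize + center crop.

    The frame is scaled so that it fully covers the target, then the
    overflow is cropped symmetrically.
    """
    if src_h <= 0 or src_w <= 0:
        raise ValueError("source dimensions must be positive")
    # Use the larger scale factor so both target dimensions are covered.
    scale = max(tgt_h / src_h, tgt_w / src_w)
    resized_h = max(tgt_h, round(src_h * scale))
    resized_w = max(tgt_w, round(src_w * scale))
    top = (resized_h - tgt_h) // 2
    left = (resized_w - tgt_w) // 2
    return (resized_h, resized_w), (top, left)


if __name__ == "__main__":
    # A 16:9 source needs no cropping; a 4:3 source loses rows top and bottom.
    print(resize_and_center_crop_plan(1080, 1920))
    print(resize_and_center_crop_plan(1200, 1600))
```

Cropping rather than padding is a design choice made for this illustration only; padding or plain resizing may suit some rendering pipelines better.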

Model Output

Output Type(s): Image

Output Format: Red, Green, Blue (RGB)

Output Parameters: Two-Dimensional (2D)

Other Properties Related to Output: Specific resolution: 576 px x 1024 px

Software Integration

Runtime Engine(s): PyTorch

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere
NVIDIA Hopper
NVIDIA Blackwell

Preferred/Supported Operating System(s): Linux

NVIDIA AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA hardware and software frameworks such as CUDA libraries, the model can achieve faster training and inference times compared to CPU-only systems.

Model Version

Harmonizer-cosmos-0.6B

We release the two checkpoints described below.

diffusion_harmonizer.pkl: The temporally conditioned Harmonizer checkpoint reported in the DiffusionHarmonizer paper. Recommended when temporal coherence across consecutive rendered frames is required (e.g., video-style novel-view simulation). This checkpoint also supports a faster, non-temporally conditioned inference mode via the --nontemporal flag.

Inference speed on H100:
- full model (default): 212 ms / 576 × 1024 px image
- --nontemporal mode: 28 ms / 576 × 1024 px image

harmonizer_nontemporal.pt: Exported JIT model for non-temporal, per-image inference. The checkpoint does not support conditioning on previous frames and corresponds to diffusion_harmonizer.pkl with the --nontemporal flag. Recommended for per-image enhancement use cases where neighboring-frame context is unavailable or unnecessary, or where speed is critical.

Inference speed on H100: 28 ms / 576 × 1024 px image.
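
The recommendations above can be summarized as a small decision helper. This is a sketch only: `InferenceChoice` and `choose_checkpoint` are hypothetical names introduced for illustration, and only the two checkpoint filenames and the `--nontemporal` flag come from this card. The actual inference entry points are defined in the code release.

```python
from typing import List, NamedTuple


class InferenceChoice(NamedTuple):
    checkpoint: str          # checkpoint filename as listed in this card
    extra_flags: List[str]   # flags to add to the inference command


def choose_checkpoint(needs_temporal: bool,
                      prefer_jit_export: bool = True) -> InferenceChoice:
    """Map a use case to a checkpoint, following the card's recommendations."""
    if needs_temporal:
        # Temporal conditioning is only available in the .pkl checkpoint.
        return InferenceChoice("diffusion_harmonizer.pkl", [])
    if prefer_jit_export:
        # Per-image inference with the exported JIT model.
        return InferenceChoice("harmonizer_nontemporal.pt", [])
    # Same non-temporal behavior, reusing the temporal checkpoint file.
    return InferenceChoice("diffusion_harmonizer.pkl", ["--nontemporal"])


if __name__ == "__main__":
    print(choose_checkpoint(needs_temporal=True))
    print(choose_checkpoint(needs_temporal=False, prefer_jit_export=False))
```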

Pretrained checkpoints are hosted on Hugging Face under nvidia/Harmonizer. To download all released checkpoints into a local models/ directory:

hf download nvidia/Harmonizer --local-dir models

By default, the model runs in temporal mode. To run in non-temporal mode, add the --nontemporal flag. Refer to the code release for the exact inference entry points and configuration files associated with each checkpoint.

Repository Contents

This repository contains the following model artifact files:

diffusion_harmonizer.pkl — DiffusionHarmonizer paper temporal checkpoint
harmonizer_nontemporal.pt — non-temporal single-frame checkpoint (PyTorch .pt format)

For more details please see the Model Version section.
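
After downloading, it can be useful to confirm that both artifact files are present before launching inference. The sketch below assumes the two files sit at the top level of the local `models` directory, which the card implies but does not state explicitly. The helper name `missing_checkpoints` is illustrative.

```python
from pathlib import Path
from typing import List

# Filenames as listed in this card.
EXPECTED_FILES = ("diffusion_harmonizer.pkl", "harmonizer_nontemporal.pt")


def missing_checkpoints(model_dir: str = "models") -> List[str]:
    """Return the expected checkpoint files that are absent from model_dir.

    Assumption: checkpoints are stored directly under model_dir,
    not in a subdirectory.
    """
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints("models")
    if missing:
        print("Missing checkpoint files:", ", ".join(missing))
    else:
        print("All expected checkpoint files are present.")
```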

Training, Testing, and Evaluation Datasets

Harmonizer was trained, tested, and evaluated using an internal dataset of curated synthetic–real image pairs constructed from five complementary curation pipelines (ISP modification, relighting, asset re-insertion, PBR shadow simulation, and novel-view artifact correction), where 80% of the data was used for training, 10% for evaluation, and 10% for testing. The total volume of training data amounted to ~1 million pairs. Training data will be released at nvidia/Harmonizer-Dataset.
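
The card does not describe how the 80/10/10 partition was produced (for example, per pair or per driving session). As an illustration only, the sketch below shows one common way to obtain a reproducible split with those proportions: hashing a pair identifier into a bucket. The function name `assign_split` and the identifier format are assumptions, and this is not presented as the method used to build the Harmonizer dataset.

```python
import hashlib


def assign_split(pair_id: str, train: float = 0.8, evaluation: float = 0.1) -> str:
    """Deterministically assign a pair to 'train', 'eval', or 'test'.

    The remaining fraction (1 - train - evaluation) goes to 'test'.
    """
    if not (0.0 <= train and 0.0 <= evaluation and train + evaluation <= 1.0):
        raise ValueError("fractions must be non-negative and sum to at most 1")
    digest = hashlib.sha256(pair_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    if bucket < train:
        return "train"
    if bucket < train + evaluation:
        return "eval"
    return "test"


if __name__ == "__main__":
    from collections import Counter
    # Hypothetical identifiers; real pair IDs would come from the dataset.
    print(Counter(assign_split(f"pair-{i}") for i in range(10_000)))
```

Hash-based assignment keeps each pair in the same split across runs without storing a split file. Splitting by scene or session instead of by pair would be needed to avoid leakage between near-duplicate frames.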

NVIDIA Internal AV Dataset

Data Collection Method: Sensors

Labeling Method by Dataset: Human

Properties: The dataset contains autonomous-driving images and videos captured by NVIDIA vehicles. It serves as the source data from which the synthetic-real training pairs are derived.

Inference

Engine: PyTorch>=2.0.0

Test Hardware: We tested on H100:

diffusion_harmonizer.pkl

full model (default): 212 ms / 576 × 1024 px image
--nontemporal mode: 28 ms / 576 × 1024 px image

harmonizer_nontemporal.pt

28 ms / 576 × 1024 px image
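
To put these latencies in perspective, the short calculation below converts them to throughput. It assumes frames are processed one at a time with no batching or pipelining, which the card does not specify. Under that assumption, 212 ms per image is about 4.7 frames per second, 28 ms per image is about 35.7 frames per second, and the non-temporal mode is about 7.6 times faster.

```python
def frames_per_second(latency_ms: float) -> float:
    """Convert a per-image latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Latencies reported in this card for a 576 x 1024 px image on H100.
TEMPORAL_MS = 212.0
NONTEMPORAL_MS = 28.0

if __name__ == "__main__":
    print(f"temporal:     {frames_per_second(TEMPORAL_MS):.1f} frames/s")
    print(f"non-temporal: {frames_per_second(NONTEMPORAL_MS):.1f} frames/s")
    print(f"speed-up:     {TEMPORAL_MS / NONTEMPORAL_MS:.1f}x")
```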

Known Technical Limitations

The reconstruction relies on the quality and consistency of input images and camera calibrations; deficiencies in these areas can negatively impact the final output.

Known Risk(s)

The model is not guaranteed to fix 100% of image artifacts. Please verify that generated scenarios are context- and use-appropriate.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and has established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with the terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the ModelCard++ Explainability, Bias, Safety & Security, and Privacy subcards.

Please report security vulnerabilities or NVIDIA AI Concerns here.

ModelCard++

Bias

Field	Response
Participation considerations from adversely impacted groups (protected classes) in model design and testing	None
Measures taken to mitigate against unwanted bias	None

Explainability

Field	Response
Intended Domain	Advanced Driver Assistance Systems
Model Type	Image-to-Image
Intended Users	Autonomous-vehicle developers enhancing and harmonizing neural-reconstruction pipelines.
Output	Image
Describe how the model works	The model takes as input an image and outputs a harmonized image with corrected color, lighting, shadows, and reduced reconstruction artifacts.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of	None
Technical Limitations	The reconstruction relies on the quality and consistency of input images and camera calibrations; any deficiencies in these areas can negatively impact the final output.
Verified to have met prescribed NVIDIA quality standards	Yes
Performance Metrics	FID (Frechet Inception Distance); PSNR (Peak Signal-to-Noise Ratio); LPIPS (Learned Perceptual Image Patch Similarity)
Potential Known Risks	The model is not guaranteed to fix 100% of the image artifacts. Please verify the generated scenarios are context- and use-appropriate.
Licensing	Use of this model is governed by the NVIDIA Open Model License Agreement.

Privacy

Field	Response
Generatable or reverse engineerable personal data?	No
Personal data used to create this model?	No
How often is the dataset reviewed?	Before release
Is there provenance for all datasets used in training?	Yes
Does data labeling (annotation, metadata) comply with privacy laws?	Yes
Is data compliant with data subject requests for data correction or removal, if such a request was made?	Yes

Safety & Security

Field	Response
Model Application(s)	Image Enhancement
List types of specific high-risk AI systems, if any, in which the model can be integrated	The model can be used to develop autonomous-vehicle stacks that can be integrated inside vehicles. The Harmonizer model itself should not be deployed in a vehicle.
Describe the life critical impact, if present	N/A - The model should not be deployed in a vehicle and will not perform life-critical tasks.
Use Case Restrictions	Use of this model is governed by the NVIDIA Open Model License Agreement.
Model and dataset restrictions	The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.


Paper for nvidia/Harmonizer

DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer

Paper • 2602.24096 • Published Mar 5