MultimodalStudio:
A Heterogeneous Sensor Dataset and Framework for Neural Rendering across Multiple Imaging Modalities

1Media Lab - University of Padova, 2Sony Europe B.V.

CVPR 2025

TL;DR

We present MultimodalStudio, a project that includes MMS-DATA and MMS-FW. MMS-DATA is a geometrically calibrated multi-view, multi-sensor dataset; MMS-FW is a multimodal NeRF framework that supports mosaicked, demosaicked, distorted, and undistorted frames of different modalities. We conducted in-depth investigations showing that using multiple imaging modalities improves the novel view rendering quality of each individual modality involved.




MultimodalStudio visual abstract.

Overview of the proposed framework. MMS-FW exploits unaligned multimodal frames acquired by different sensors to render perfectly aligned novel views for each modality.
The mosaick pattern of each modality is shown in the top corners.
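To make concrete what supervising on mosaicked raw frames involves, the minimal sketch below shows one way to pick, for every pixel, the single channel observed through a 2x2 mosaic (e.g. an RGGB Bayer or a 4-angle polarization pattern) and compare it with the corresponding rendered channel. The layouts and function names are our own illustrative assumptions, not the exact loss used in MMS-FW.

import numpy as np

# Hypothetical 2x2 mosaic layouts: each entry is the channel index
# observed at that position of the repeating pattern.
RGGB = np.array([[0, 1],          # R G
                 [1, 2]])         # G B
POL_4ANGLE = np.array([[0, 1],    # 0 deg   45 deg
                       [3, 2]])   # 135 deg 90 deg

def mosaic_loss(rendered, raw_mosaic, pattern):
    """L2 loss between a rendered multi-channel view and a raw mosaicked frame.

    rendered:   (H, W, C) multi-channel prediction from the renderer
    raw_mosaic: (H, W)    single-channel raw frame straight from the sensor
    pattern:    (2, 2)    channel index observed at each mosaic position
    """
    H, W, _ = rendered.shape
    rows = np.arange(H)[:, None] % 2
    cols = np.arange(W)[None, :] % 2
    observed = pattern[rows, cols]                                   # (H, W)
    # Keep only the channel the sensor actually measured at each pixel.
    pred = np.take_along_axis(rendered, observed[..., None], axis=2)[..., 0]
    return np.mean((pred - raw_mosaic) ** 2)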

Abstract

Neural Radiance Fields (NeRF) have shown impressive performance in rendering 3D scenes from arbitrary viewpoints. While RGB images are widely preferred for training volume rendering models, interest in other radiance modalities is also growing. However, the capability of the underlying implicit neural models to learn and transfer information across heterogeneous imaging modalities has seldom been explored, mostly due to limited training data availability.

For this purpose, we present MultimodalStudio (MMS): it encompasses MMS-DATA and MMS-FW. MMS-DATA is a multimodal multi-view dataset containing 32 scenes acquired with 5 different imaging modalities: RGB, monochrome, near-infrared, polarization and multispectral. MMS-FW is a novel modular multimodal NeRF framework designed to handle multimodal raw data and able to support an arbitrary number of multi-channel devices.

Through extensive experiments, we demonstrate that MMS-FW trained on MMS-DATA can transfer information between different imaging modalities and produce higher quality renderings than using single modalities alone. We publicly release both the dataset and the framework to promote research on multimodal volume rendering and beyond.

MMS-DATA scenes preview. It consists of 32 object-centric scenes acquired with 5 different imaging modalities: RGB, Monochrome (Mono), Near Infrared (NIR), Polarization (Pol) and Multispectral (MS). The objects span diffuse, glossy, reflective, and transparent materials, including plastic, metal, wood, organic matter, cloth, paper, and glass.

Multi-sensor Acquisition Setup

Multi-sensor acquisition setup.

The sensors were mounted on a custom-built rig.
We employed 5 different imaging sensors:

  • RGB: Basler acA2500-14g
  • Monochrome (Mono): Basler acA2500-14gm
  • Near-infrared (NIR): Basler acA1300-60gmNIR
  • Polarization (Pol): FLIR BFS-U3-51S5P-C
  • Multispectral (MS): Silios CMS-C1
All sensors are stereo-calibrated with respect to the RGB camera, which serves as the reference camera.
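Since every sensor is calibrated relative to the RGB reference camera, the world pose of any other sensor at a given capture can be obtained by composing the RGB camera-to-world pose with the fixed sensor-to-RGB extrinsics. A minimal sketch of this composition (the matrix names are ours, not those of the released calibration files):

import numpy as np

def sensor_to_world(T_world_rgb, T_rgb_sensor):
    """Compose 4x4 homogeneous transforms into a sensor's camera-to-world pose.

    T_world_rgb:  camera-to-world pose of the RGB reference camera
    T_rgb_sensor: fixed extrinsics mapping sensor coordinates into RGB
                  coordinates, obtained from the stereo calibration
    """
    return T_world_rgb @ T_rgb_sensor

# Example: a sensor mounted 5 cm to the right of the RGB camera, same orientation.
T_rgb_sensor = np.eye(4)
T_rgb_sensor[0, 3] = 0.05
T_world_rgb = np.eye(4)                    # RGB camera placed at the world origin
T_world_sensor = sensor_to_world(T_world_rgb, T_rgb_sensor)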

Framework Architecture

MMS-FW architecture overview.

We decouple density estimation from radiance estimation by instantiating two separate modules. Both the density and the radiance estimates rely on implicit representations shared across modalities: since the modalities capture overlapping spectral bands, they share a relevant part of the information.
This architecture allows the model to estimate any channel of any training modality from any viewpoint, thus producing perfectly aligned multimodal renderings.
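A rough sketch of this decoupling is given below: a shared geometry module predicts density and a geometry feature, a shared radiance module turns that feature and the view direction into a common radiance feature, and one lightweight head per modality maps it to that modality's channels. Layer sizes, channel counts, and module names are illustrative assumptions; the actual MMS-FW architecture is detailed in the paper.

import torch
import torch.nn as nn

class MultimodalField(nn.Module):
    """Shared density and radiance trunks with per-modality output heads (illustrative)."""

    def __init__(self, modality_channels):
        super().__init__()
        # Geometry module: 3D position -> (density, geometry feature), shared by all modalities.
        self.density_net = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 1 + 64))
        # Radiance module: geometry feature + view direction -> shared radiance feature.
        self.radiance_net = nn.Sequential(nn.Linear(64 + 3, 128), nn.ReLU(), nn.Linear(128, 64))
        # One small head per modality, mapping the shared feature to its channel count.
        self.heads = nn.ModuleDict({name: nn.Linear(64, ch) for name, ch in modality_channels.items()})

    def forward(self, xyz, view_dir, modality):
        h = self.density_net(xyz)
        density, geo_feat = h[..., :1], h[..., 1:]
        rad_feat = self.radiance_net(torch.cat([geo_feat, view_dir], dim=-1))
        radiance = torch.sigmoid(self.heads[modality](rad_feat))
        return density, radiance

# Example channel counts (assumed): RGB=3, Mono=1, NIR=1, Pol=4 angles, MS=9 bands.
field = MultimodalField({"rgb": 3, "mono": 1, "nir": 1, "pol": 4, "ms": 9})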

Quantitative Results

Quantitative comparison across modality combinations.

The table reports three experiments:

  • Single-modality: training and test on a single modality.
  • 3-modality: training and test on three modalities.
  • 5-modality: training and test on all the modalities.
We observe that additional modalities always yield a PSNR gain with respect to the single-modality case. We conclude that including frames of other modalities in the training provides complementary information that helps the NeRF better estimate the multimodal radiance fields.
For further analysis, refer to the paper.
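For reference, the PSNR values compare rendered test views with the corresponding held-out frames; assuming intensities normalized to [0, 1], the per-image metric reduces to the following.

import numpy as np

def psnr(rendered, ground_truth, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images with values in [0, max_val]."""
    mse = np.mean((rendered.astype(np.float64) - ground_truth.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)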

Unbalanced Combinations of Modalities

PSNR as a function of the number of additional-modality frames.

Consider a scenario with an unbalanced number of frames per modality. We trained a 2-modality model with:

  • 45 RGB frames
  • 1, 3, 5, 10, 25 and 45 MS frames
The obtained PSNR is plotted as a function of the number of additional-modality viewpoints. The results show that the additional-modality renderings are always more accurate than those obtained with single-modality training. Moreover, more than 5 frames of the second modality are sufficient to also improve the RGB rendering quality.
These results show that the model can efficiently transfer information from one modality to another.
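One simple way to realize such an unbalanced training is to draw every ray batch across all modalities regardless of how many frames each one provides, so that a modality with few frames still contributes supervision at every optimization step. The sampler below is our own illustrative assumption, not necessarily the scheme used in MMS-FW.

import random

def sample_ray_batch(frames_per_modality, rays_per_batch=4096):
    """Draw a mixed batch of (modality, frame index, pixel index) ray identifiers.

    frames_per_modality: dict mapping a modality name to its list of frames
                         (numpy arrays), e.g. 45 RGB frames and 5 MS frames.
    """
    batch = []
    modalities = list(frames_per_modality)
    for _ in range(rays_per_batch):
        m = random.choice(modalities)                     # uniform over modalities
        frame_idx = random.randrange(len(frames_per_modality[m]))
        pixel_idx = random.randrange(frames_per_modality[m][frame_idx].size)
        batch.append((m, frame_idx, pixel_idx))
    return batch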

Qualitative Example

Qualitative multimodal rendering example.

Aligned Multimodal Rendering

BibTeX

@inproceedings{lincetto2025multimodalstudio,
  author    = {Lincetto, Federico and Agresti, Gianluca and Rossi, Mattia and Zanuttigh, Pietro},
  title     = {MultimodalStudio: A Heterogeneous Sensor Dataset and Framework for Neural Rendering across Multiple Imaging Modalities},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
}