MultimodalStudio:
A Heterogeneous Sensor Dataset and Framework for Neural Rendering across Multiple Imaging Modalities

1Media Lab - University of Padova, 2Sony Europe B.V.

CVPR 2025

TL;DR

We present MultimodalStudio, a project that includes MMS-DATA and MMS-FW. MMS-DATA is a geometrically calibrated multi-view, multi-sensor dataset; MMS-FW is a multimodal NeRF framework that supports mosaicked, demosaicked, distorted, and undistorted frames of different modalities. We conducted in-depth investigations showing that using multiple imaging modalities improves the novel view rendering quality of each individual modality involved.




MultimodalStudio visual abstract.

Overview of the proposed framework. MMS-FW exploits unaligned multimodal frames acquired by different sensors to render perfectly aligned novel views for each modality.
The mosaick pattern of each modality is shown in the top corners.
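To make concrete what supervising on mosaicked raw frames involves, the minimal sketch below shows one way to pick, for every pixel, the single channel observed through a 2x2 mosaic (e.g. an RGGB Bayer or a 4-angle polarization pattern) and compare it with the corresponding rendered channel. The layouts and function names are our own illustrative assumptions, not the exact loss used in MMS-FW.

import numpy as np

# Hypothetical 2x2 mosaic layouts: each entry is the channel index
# observed at that position of the repeating pattern.
RGGB = np.array([[0, 1],          # R G
                 [1, 2]])         # G B
POL_4ANGLE = np.array([[0, 1],    # 0 deg   45 deg
                       [3, 2]])   # 135 deg 90 deg

def mosaic_loss(rendered, raw_mosaic, pattern):
    """L2 loss between a rendered multi-channel view and a raw mosaicked frame.

    rendered:   (H, W, C) multi-channel prediction from the renderer
    raw_mosaic: (H, W)    single-channel raw frame straight from the sensor
    pattern:    (2, 2)    channel index observed at each mosaic position
    """
    H, W, _ = rendered.shape
    rows = np.arange(H)[:, None] % 2
    cols = np.arange(W)[None, :] % 2
    observed = pattern[rows, cols]                                   # (H, W)
    # Keep only the channel the sensor actually measured at each pixel.
    pred = np.take_along_axis(rendered, observed[..., None], axis=2)[..., 0]
    return np.mean((pred - raw_mosaic) ** 2)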

Abstract

Neural Radiance Fields (NeRF) have shown impressive performance in rendering 3D scenes from arbitrary viewpoints. While RGB images are widely preferred for training volume rendering models, interest in other radiance modalities is also growing. However, the capability of the underlying implicit neural models to learn and transfer information across heterogeneous imaging modalities has seldom been explored, mostly due to limited training data availability.

For this purpose, we present MultimodalStudio (MMS): it encompasses MMS-DATA and MMS-FW. MMS-DATA is a multimodal multi-view dataset containing 32 scenes acquired with 5 different imaging modalities: RGB, monochrome, near-infrared, polarization and multispectral. MMS-FW is a novel modular multimodal NeRF framework designed to handle multimodal raw data and able to support an arbitrary number of multi-channel devices.

Through extensive experiments, we demonstrate that MMS-FW trained on MMS-DATA can transfer information between different imaging modalities and produce higher quality renderings than using single modalities alone. We publicly release both the dataset and the framework to promote research on multimodal volume rendering and beyond.

MMS-DATA scenes preview. It consists of 32 object-centric scenes acquired with 5 different imaging modalities: RGB, Monochrome (Mono), Near Infrared (NIR), Polarization (Pol) and Multispectral (MS). The objects span diffuse, glossy, reflective, and transparent materials, including plastic, metal, wood, organic matter, cloth, paper, and glass.

Multi-sensor Acquisition Setup

Multi-sensor acquisition setup.

The sensors were mounted on a custom-built rig.
We employed 5 different imaging sensors:

  • RGB: Basler acA2500-14g
  • Monochrome (Mono): Basler acA2500-14gm
  • Near-infrared (NIR): Basler acA1300-60gmNIR
  • Polarization (Pol): FLIR BFS-U3-51S5P-C
  • Multispectral (MS): Silios CMS-C1
All sensors are stereo-calibrated with respect to the RGB camera, which serves as the reference camera.
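Since every sensor is calibrated relative to the RGB reference camera, the world pose of any other sensor at a given capture can be obtained by composing the RGB camera-to-world pose with the fixed sensor-to-RGB extrinsics. A minimal sketch of this composition (the matrix names are ours, not those of the released calibration files):

import numpy as np

def sensor_to_world(T_world_rgb, T_rgb_sensor):
    """Compose 4x4 homogeneous transforms into a sensor's camera-to-world pose.

    T_world_rgb:  camera-to-world pose of the RGB reference camera
    T_rgb_sensor: fixed extrinsics mapping sensor coordinates into RGB
                  coordinates, obtained from the stereo calibration
    """
    return T_world_rgb @ T_rgb_sensor

# Example: a sensor mounted 5 cm to the right of the RGB camera, same orientation.
T_rgb_sensor = np.eye(4)
T_rgb_sensor[0, 3] = 0.05
T_world_rgb = np.eye(4)                    # RGB camera placed at the world origin
T_world_sensor = sensor_to_world(T_world_rgb, T_rgb_sensor)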

Framework Architecture

MMS-FW architecture overview.

We decouple density estimation from radiance estimation by instantiating two separate modules. Both the density and the radiance estimates rely on implicit representations shared across modalities: since the modalities capture overlapping spectral bands, they share a relevant part of the information.
This architecture allows the model to estimate any channel of any training modality from any viewpoint, thus producing perfectly aligned multimodal renderings.
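A rough sketch of this decoupling is given below: a shared geometry module predicts density and a geometry feature, a shared radiance module turns that feature and the view direction into a common radiance feature, and one lightweight head per modality maps it to that modality's channels. Layer sizes, channel counts, and module names are illustrative assumptions; the actual MMS-FW architecture is detailed in the paper.

import torch
import torch.nn as nn

class MultimodalField(nn.Module):
    """Shared density and radiance trunks with per-modality output heads (illustrative)."""

    def __init__(self, modality_channels):
        super().__init__()
        # Geometry module: 3D position -> (density, geometry feature), shared by all modalities.
        self.density_net = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 1 + 64))
        # Radiance module: geometry feature + view direction -> shared radiance feature.
        self.radiance_net = nn.Sequential(nn.Linear(64 + 3, 128), nn.ReLU(), nn.Linear(128, 64))
        # One small head per modality, mapping the shared feature to its channel count.
        self.heads = nn.ModuleDict({name: nn.Linear(64, ch) for name, ch in modality_channels.items()})

    def forward(self, xyz, view_dir, modality):
        h = self.density_net(xyz)
        density, geo_feat = h[..., :1], h[..., 1:]
        rad_feat = self.radiance_net(torch.cat([geo_feat, view_dir], dim=-1))
        radiance = torch.sigmoid(self.heads[modality](rad_feat))
        return density, radiance

# Example channel counts (assumed): RGB=3, Mono=1, NIR=1, Pol=4 angles, MS=9 bands.
field = MultimodalField({"rgb": 3, "mono": 1, "nir": 1, "pol": 4, "ms": 9})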

Quantitative Results

Quantitative comparison across modality combinations.

The table reports three experiments:

  • Single-modality: training and test on a single modality.
  • 3-modality: training and test on three modalities.
  • 5-modality: training and test on all the modalities.
We observe that additional modalities always yield a PSNR gain with respect to the single-modality case. We conclude that including frames of other modalities in the training provides complementary information that helps the NeRF better estimate the multimodal radiance fields.
For further analysis, refer to the paper.
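For reference, the PSNR values compare rendered test views with the corresponding held-out frames; assuming intensities normalized to [0, 1], the per-image metric reduces to the following.

import numpy as np

def psnr(rendered, ground_truth, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images with values in [0, max_val]."""
    mse = np.mean((rendered.astype(np.float64) - ground_truth.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)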

Unbalanced Combinations of Modalities

PSNR as a function of the number of additional-modality frames.

Consider a scenario with an unbalanced number of frames per modality. We trained a 2-modality model with:

  • 45 RGB frames
  • 1, 3, 5, 10, 25 and 45 MS frames
The obtained PSNR is plotted as a function of the number of additional-modality viewpoints. The results show that the additional-modality renderings are always more accurate than those obtained with single-modality training. Moreover, more than 5 frames of the second modality are sufficient to also improve the RGB rendering quality.
These results show that the model can efficiently transfer information from one modality to another.
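One simple way to realize such an unbalanced training is to draw every ray batch across all modalities regardless of how many frames each one provides, so that a modality with few frames still contributes supervision at every optimization step. The sampler below is our own illustrative assumption, not necessarily the scheme used in MMS-FW.

import random

def sample_ray_batch(frames_per_modality, rays_per_batch=4096):
    """Draw a mixed batch of (modality, frame index, pixel index) ray identifiers.

    frames_per_modality: dict mapping a modality name to its list of frames
                         (numpy arrays), e.g. 45 RGB frames and 5 MS frames.
    """
    batch = []
    modalities = list(frames_per_modality)
    for _ in range(rays_per_batch):
        m = random.choice(modalities)                     # uniform over modalities
        frame_idx = random.randrange(len(frames_per_modality[m]))
        pixel_idx = random.randrange(frames_per_modality[m][frame_idx].size)
        batch.append((m, frame_idx, pixel_idx))
    return batch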

Qualitative Example

Qualitative multimodal rendering example.

Aligned Multimodal Rendering

BibTeX

@inproceedings{lincetto2025multimodalstudio,
  author    = {Lincetto, Federico and Agresti, Gianluca and Rossi, Mattia and Zanuttigh, Pietro},
  title     = {MultimodalStudio: A Heterogeneous Sensor Dataset and Framework for Neural Rendering across Multiple Imaging Modalities},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
}