MGCFI-Net: Multi-scale globally aware feature learning with cross-view feature interaction for multi-view stereo.

Ming Han, Hui Yin, Aixin Chong, Hua Huang

Recent learning-based multi-view stereo (MVS) methods have achieved notable progress, yet accurate and complete reconstruction in weakly textured regions remains challenging. To address this issue, we present MGCFI-Net (Multi-Scale Globally Aware Cross-View Feature Interaction Network), which learns multi-scale, globally aware representations and explicitly models cross-view feature interactions under supervision. Specifically, MGCFI-Net incorporates a Multi-Scale Hierarchical Perception (MSHP) module that integrates standard convolution, parallel dilated convolutions, and a Swin Transformer to progressively capture discriminative features with global contextual awareness. For multi-scale feature fusion, we use patch-expanding upsampling, which preserves semantics and improves cross-scale integration. To enhance cross-view geometric consistency, we introduce a Cross-View Feature Interaction (CVFI) module, in which epipolar self-attention is applied within each source feature and epipolar cross-attention is performed between each source feature and the reference feature, enabling more reliable reference-source feature matching. Furthermore, we incorporate Cross-View Feature Similarity Supervision (CFSS), which enforces feature-level consistency through geometric warping and improves cross-view alignment and matching robustness under challenging conditions. Extensive evaluations on the DTU, Tanks & Temples, and ETH3D benchmarks demonstrate that MGCFI-Net achieves superior reconstruction quality and establishes new state-of-the-art (SOTA) performance, validating its effectiveness and generalization capability.
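To make the idea of feature-level consistency concrete, the following is a minimal sketch of one plausible form such a supervision term could take: the mean of (1 - cosine similarity) between reference features and source features that have already been warped into the reference view. This is an illustrative assumption, not the loss defined by the authors; the abstract does not specify the similarity measure, the warping procedure, or any weighting. The names `cosine_similarity`, `feature_similarity_loss`, and the `valid` mask are hypothetical.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero feature vectors.
    return dot / (norm_u * norm_v + eps)


def feature_similarity_loss(ref_feats, warped_src_feats, valid):
    """Mean (1 - cosine similarity) over pixels marked valid.

    ref_feats, warped_src_feats: one feature vector per pixel, in the same
        pixel order. The source features are ASSUMED to have been warped
        into the reference view beforehand (warping is not shown here).
    valid: one boolean per pixel; False where the warp fell outside the
        source image and no comparison is meaningful.
    """
    terms = [
        1.0 - cosine_similarity(r, s)
        for r, s, ok in zip(ref_feats, warped_src_feats, valid)
        if ok
    ]
    # With no valid pixels there is nothing to supervise.
    return sum(terms) / len(terms) if terms else 0.0
```

Under this assumed form, perfectly aligned features give a loss near 0 and orthogonal features give a loss of 1, so minimizing the term pulls corresponding reference and source features together while the mask keeps out-of-view pixels from contributing.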
