Dynamic Cross-Modal Modeling with an Ultra-Lightweight Architecture for Face Anti-Spoofing.

Nana Li, Jiayu Wang, Zuhe Li, Zhipeng Weng, Yihong Wang, Yushan Pan

The effectiveness of multimodal face anti-spoofing largely depends on how cross-modal relationships are modeled. However, most existing approaches rely on static fusion or implicitly learned feature aggregation, which assume fixed modality importance and therefore cannot capture how modality reliability varies across attack patterns. Under strict computational constraints, achieving effective dynamic cross-modal modeling remains a significant challenge. To address this issue, we propose an ultra-lightweight dynamic cross-modal framework for face anti-spoofing, with a very low parameter count, FLOPs, latency, and memory footprint, and a high frame rate (FPS) for real-time edge inference. A compact feature extractor is constructed by enhancing ShuffleNetV2 with the Ghost-Generated Shuffle BlockA (GGS-BlockA), which significantly reduces redundant computation while maintaining high discriminative capability. On this basis, a Lightweight Cross-Modal Attention (LCMA) module performs sample-wise dynamic modality reweighting to capture reliability variations among the RGB, Depth, and IR modalities. Furthermore, a Lightweight Cross-Modal Fusion (LCMF) module uses depth cues as stable guidance to improve cross-modal feature alignment and complementary representation. Experiments on the CASIA-SURF benchmark show that the proposed method achieves an Average Classification Error Rate (ACER) of 0.064% with only 0.14M parameters and 0.0065G FLOPs. At the strict operating point of TPR@FPR=10^-4, a detection rate of 99.86% is obtained, indicating strong robustness and generalization capability at extremely low computational cost.
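For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how ACER and TPR at a fixed FPR are commonly computed from per-sample liveness scores. It is not the authors' evaluation code. The function names, the label convention (1 = live, 0 = attack), the assumption that higher scores mean "more likely live", and the default threshold of 0.5 are all illustrative assumptions.

```python
def acer(labels, scores, threshold=0.5):
    """Average Classification Error Rate at a fixed decision threshold.

    Assumed convention (not from the paper): label 1 = live, 0 = attack,
    and a sample is accepted as live when score >= threshold.
    """
    attacks = [s for y, s in zip(labels, scores) if y == 0]
    lives = [s for y, s in zip(labels, scores) if y == 1]
    # APCER: fraction of attack samples wrongly accepted as live.
    apcer = sum(s >= threshold for s in attacks) / len(attacks)
    # BPCER: fraction of live samples wrongly rejected as attacks.
    bpcer = sum(s < threshold for s in lives) / len(lives)
    return (apcer + bpcer) / 2


def tpr_at_fpr(labels, scores, target_fpr=1e-4):
    """True positive rate at the strictest threshold whose FPR <= target_fpr.

    Requires 0 <= target_fpr < 1 and at least one attack and one live sample.
    """
    attacks = sorted((s for y, s in zip(labels, scores) if y == 0), reverse=True)
    lives = [s for y, s in zip(labels, scores) if y == 1]
    # Number of attack samples that may be accepted without exceeding the target.
    allowed = int(target_fpr * len(attacks))
    # Accept only scores strictly above the (allowed + 1)-th highest attack score,
    # so at most `allowed` attack samples pass.
    threshold = attacks[allowed]
    return sum(s > threshold for s in lives) / len(lives)
```

Note that an operating point such as FPR=10^-4 is only meaningful when the test set contains well over 10,000 attack samples; with fewer, `allowed` is zero and the function reports the TPR at zero accepted attacks.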
