Vision-Language Collaborative Representation Learning for Action Quality Assessment.

Kumie Gedamu, Yanli Ji, Wangmeng Zuo, Jamal Bentahar, Yang Yang, Jie Shao, Heng Tao Shen

Action Quality Assessment (AQA) has gained significant attention due to its potential real-world applications, which require a fine-grained understanding of action sequences. Recent works have attempted to address existing challenges by utilizing multimodal video features. However, these approaches rely primarily on textual information from language models, which leads to instability and suboptimal performance due to directional bias in the vision-language joint embedding space. To tackle these issues, we propose a Vision-Language Collaboration Representation Learning approach (VLC-Net) that understands fine-grained action sequences and builds a unified feature representation, along with its temporal dependencies, for accurate AQA score prediction. Specifically, we design a bidirectional knowledge distillation operation that performs collaborative learning between vision-language pre-trained knowledge and visual action knowledge for fine-grained action feature learning. Furthermore, we design vision-language alignment guidance to explicitly align action features that share the same action semantics across modalities, thereby unifying their joint representation. Leveraging these aligned features, we propose multimodal contrastive learning to relate the modalities and align subactions with their textual descriptions, ensuring accurate action representation. We conduct experiments on the FineDiving, MTL-AQA, FineFS, and Fis-V datasets, demonstrating the effectiveness of our approach, which outperforms state-of-the-art methods.
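
As a rough illustration of the kind of multimodal contrastive objective mentioned above, the following sketch computes a symmetric InfoNCE-style loss between paired subaction (video) features and text-description features. It is not taken from the paper: the function names, the temperature value of 0.07, and the choice of a symmetric cross-entropy over cosine similarities are all assumptions made for illustration, and VLC-Net's actual loss may differ.

```python
import math

def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss over paired features.

    video_feats[i] and text_feats[i] are assumed to describe the same
    subaction; every other pairing in the batch acts as a negative.
    """
    v = [l2_normalize(f) for f in video_feats]
    t = [l2_normalize(f) for f in text_feats]
    # Cosine similarity of every video/text pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    # Transpose to obtain the text-to-video direction.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))

# Toy features: two subactions with matching and mismatched descriptions.
video = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[2.0, 0.0], [0.0, 3.0]]
text_swapped = [[0.0, 3.0], [2.0, 0.0]]

loss_matched = symmetric_contrastive_loss(video, text_matched)
loss_swapped = symmetric_contrastive_loss(video, text_swapped)
print(loss_matched < loss_swapped)
```

In this toy setting the loss is small when each subaction feature points in the same direction as its own description and large when the descriptions are swapped, which is the behavior an alignment objective of this type is meant to encourage.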
