AGAV-Rater: Enhancing LMM for
AI-Generated Audio-Visual Quality Assessment

Temporarily Anonymized for Submission

We propose AGAV-Rater, an LMM-based model that scores AI-generated audio-visual content (AIGAV) across multiple dimensions, including audio quality, content consistency, and overall quality, and selects the optimal AIGAV. It also evaluates audio and music generated from text for audio quality and content consistency.

Abstract

Many video-to-audio (VTA) methods have been proposed for dubbing silent AI-generated videos. An efficient quality assessment method for AI-generated audio-visual content (AIGAV) is crucial for ensuring audio-visual quality. Existing audio-visual quality assessment methods struggle with the unique distortions in AIGAVs, such as unrealistic and inconsistent elements.

In this paper, we introduce AGAVQA, the first large-scale AIGAV quality assessment dataset, comprising 3,382 AIGAVs from 16 VTA methods. AGAVQA includes two subsets: AGAVQA-MOS, which provides multi-dimensional scores for audio quality, content consistency, and overall quality, and AGAVQA-Pair, designed for optimal AIGAV selection. We propose AGAV-Rater, an LMM-based model that scores AIGAVs, as well as audio and music generated from text, across multiple dimensions, and selects the best AIGAV generated by VTA methods to present to the user. AGAV-Rater achieves state-of-the-art performance on AGAVQA, Text-to-Audio, and Text-to-Music datasets. Subjective tests confirm that AGAV-Rater enhances VTA performance and user experience.

AGAVQA Dataset

We construct the first AI-generated audio-visual quality assessment dataset, AGAVQA, which can be divided into two subsets: AGAVQA-MOS and AGAVQA-Pair. AGAVQA-MOS contains 3-dimensional Mean Opinion Scores (MOSs) for 3,088 AI-generated audio-visual samples. AGAVQA-Pair includes 75 question-answer pairs, asking which AI-generated audio-visual sample is the best.
AGAVQA contains 3,382 AI-generated audio-visual samples generated by 16 video-to-audio methods. The AGAVQA-MOS and AGAVQA-Pair subsets each utilize 8 video-to-audio methods.

Samples in the AGAVQA-MOS Subset

Sample  Audio Quality  Audio-visual Content Consistency  Overall Audio-visual Quality
1       81.33          75.47                             81.22
2       73.31          59.18                             69.78
3       56.59          53.38                             62.00
4       73.03          56.54                             55.78
5       56.83          65.74                             52.20
6       34.85          34.20                             35.73

Samples in the AGAVQA-Pair Subset

Example 1
Question: Which audio best matches this video in terms of audio content, quality, and rhythm?
Candidates: Audio 1, Audio 2, Audio 3
Answer: Audio 2.

Example 2
Question: Which audio best matches this video in terms of audio content, quality, and rhythm?
Candidates: Audio 1, Audio 2, Audio 3
Answer: Audio 1.

AGAV-Rater Overview

We propose an LMM-based AI-generated audio-visual quality assessment model, AGAV-Rater. AGAV-Rater predicts multi-dimensional scores for AI-generated audio-visual content (AIGAV), text-to-audio, and text-to-music, and assists video-to-audio methods in selecting the optimal AIGAV samples. Taking audio-visual input as an example, the video is first converted into a 1 fps image sequence. The image sequence and the audio signal are then encoded separately by the video encoder and the audio encoder, projected into a shared vector space by the video projection and audio projection, and fed into the large language model together with the text embedding. Finally, we regress the LMM's last hidden states directly to output multi-dimensional numerical scores.
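The data flow above can be sketched in shape-level pseudocode. This is a minimal illustration, not the paper's implementation: all dimensions, the random matrices standing in for the frozen encoders, the projections, the LLM, and the regression head are hypothetical stand-ins chosen only to show how the two modalities are mapped into one embedding space and regressed to three scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- illustrative only, not the paper's actual sizes.
T = 8          # 8 seconds of video -> 8 frames at 1 fps
D_VID = 1024   # video encoder feature dimension
D_AUD = 768    # audio encoder feature dimension
D_LLM = 4096   # LLM hidden dimension

# Stand-ins for the frozen video/audio encoder outputs (one token per frame/clip).
video_feats = rng.standard_normal((T, D_VID))
audio_feats = rng.standard_normal((T, D_AUD))

# Modality projections map both streams into the LLM's embedding space.
W_vid = rng.standard_normal((D_VID, D_LLM)) * 0.01
W_aud = rng.standard_normal((D_AUD, D_LLM)) * 0.01
vid_tokens = video_feats @ W_vid
aud_tokens = audio_feats @ W_aud

# Text-prompt embedding (stand-in) is concatenated with the modality tokens.
text_tokens = rng.standard_normal((16, D_LLM))
llm_input = np.concatenate([vid_tokens, aud_tokens, text_tokens], axis=0)

# Stand-in for the LLM: an identity pass; the real model is a transformer
# whose last hidden states would be regressed.
last_hidden = llm_input  # shape: (num_tokens, D_LLM)

# Regression head: pool the last hidden states and map them to the three
# quality scores (audio quality, content consistency, overall quality).
W_head = rng.standard_normal((D_LLM, 3)) * 0.01
scores = last_hidden.mean(axis=0) @ W_head
print(scores.shape)  # (3,)
```

The key design point the sketch captures is that both modalities become ordinary token embeddings of width D_LLM, so the language model needs no architectural change to consume them.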

Table of Contents

  • Performance Comparisons of Multi-dimensional Scoring
  • Accuracy Comparisons of Optimal AIGAV Selection
  • Presentation of Optimal AIGAV Selected by AGAV-Rater

We collect 230 silent AI-generated videos from T2V-CompBench and Sora. For each video, we generate two AIGAVs using Elevenlabs and use AGAV-Rater to select the higher-quality one. Ten subjects were invited to watch and listen to both AIGAVs, and 80% of them preferred the AIGAVs selected by AGAV-Rater for their better quality. Here, we display some AIGAVs judged by AGAV-Rater.

    Video categories: Fantasy, Instrument, Water, Animal, Object, People, Scenery, Vehicle, Cooking, Sea, Fire, Sea + Animal, and Scenery + People. For each category, we display a High Quality AIGAV and a Low Quality AIGAV as judged by AGAV-Rater.

    We also generate 5 AIGAVs for each of 10 AIGC videos using Elevenlabs and then use AGAV-Rater to rank them from highest to lowest quality.
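The ranking step above amounts to sorting candidates by the model's predicted overall quality. A minimal sketch, where `agav_rater_score` and the candidate names and stub scores are all hypothetical stand-ins for AGAV-Rater's real predictions:

```python
def agav_rater_score(candidate):
    """Stand-in for AGAV-Rater's predicted overall-quality score (hypothetical)."""
    stub = {"audio_a": 62.0, "audio_b": 81.2, "audio_c": 55.8,
            "audio_d": 73.0, "audio_e": 69.7}
    return stub[candidate]

def rank_candidates(candidates):
    """Rank candidate AIGAVs from highest to lowest predicted quality."""
    return sorted(candidates, key=agav_rater_score, reverse=True)

candidates = ["audio_a", "audio_b", "audio_c", "audio_d", "audio_e"]
print(rank_candidates(candidates))
# -> ['audio_b', 'audio_d', 'audio_e', 'audio_a', 'audio_c']
```

Selecting the single best AIGAV, as in the subjective test above, is just the first element of this ranking.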

    AGAV-Rater Ranking Results

    Rank 1

    Rank 2

    Rank 3

    Rank 4

    Rank 5