AGAV-Rater: Enhancing LMM for
AI-Generated Audio-Visual Quality Assessment

Temporarily Anonymized for Submission

We propose AGAV-Rater, an LMM-based model that scores AI-generated audio-visual content (AIGAV) across multiple dimensions, including audio quality, content consistency, and overall quality, and selects the optimal AIGAV. It also evaluates audio and music generated from text for audio quality and content consistency.

Abstract

Many video-to-audio (VTA) methods have been proposed for dubbing silent AI-generated videos. An efficient quality assessment method for AI-generated audio-visual content (AIGAV) is crucial for ensuring audio-visual quality. Existing audio-visual quality assessment methods struggle with the unique distortions in AIGAVs, such as unrealistic and inconsistent elements.

In this paper, we introduce AGAVQA, the first large-scale AIGAV quality assessment dataset, comprising 3,382 AIGAVs from 16 VTA methods. AGAVQA includes two subsets: AGAVQA-MOS, which provides multi-dimensional scores for audio quality, content consistency, and overall quality, and AGAVQA-Pair, designed for optimal AIGAV selection. We propose AGAV-Rater, an LMM-based model that scores AIGAVs, as well as audio and music generated from text, across multiple dimensions, and selects the best AIGAV generated by VTA methods to present to the user. AGAV-Rater achieves state-of-the-art performance on AGAVQA, Text-to-Audio, and Text-to-Music datasets. Subjective tests confirm that AGAV-Rater enhances VTA performance and user experience.

AGAVQA Dataset

We construct the first AI-generated audio-visual quality assessment dataset, AGAVQA, which can be divided into two subsets: AGAVQA-MOS and AGAVQA-Pair. AGAVQA-MOS contains 3-dimensional Mean Opinion Scores (MOSs) for 3,088 AI-generated audio-visual samples. AGAVQA-Pair includes 75 question-answer pairs, asking which AI-generated audio-visual sample is the best.
AGAVQA contains 3,382 AI-generated audio-visual samples generated by 16 video-to-audio methods. The AGAVQA-MOS and AGAVQA-Pair subsets each utilize 8 video-to-audio methods.

Samples in the AGAVQA-MOS Subset

Sample  Audio Quality  Audio-visual Content Consistency  Overall Audio-visual Quality
1       81.33          75.47                             81.22
2       73.31          59.18                             69.78
3       56.59          53.38                             62.00
4       73.03          56.54                             55.78
5       56.83          65.74                             52.20
6       34.85          34.20                             35.73

Samples in the AGAVQA-Pair Subset

Example 1
Question: Which audio best matches this video in terms of audio content, quality, and rhythm?
Candidates: Audio 1, Audio 2, Audio 3
Answer: Audio 2.

Example 2
Question: Which audio best matches this video in terms of audio content, quality, and rhythm?
Candidates: Audio 1, Audio 2, Audio 3
Answer: Audio 1.

AGAV-Rater Overview

We propose an LMM-based AI-generated audio-visual quality assessment model, AGAV-Rater. AGAV-Rater predicts multi-dimensional scores for AI-generated audio-visual content (AIGAV), text-to-audio, and text-to-music, and assists video-to-audio methods in selecting the optimal AIGAV samples. Taking audio-visual input as an example, the video is first converted into a 1 fps image sequence. The image sequence and the audio signal are then encoded separately by the video encoder and the audio encoder, projected into a shared vector space by the video projection and audio projection, and fed into the large language model together with the text embedding. Finally, we regress the LMM's last hidden states directly to output multi-dimensional numerical scores.
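The data flow above can be sketched in shape-level pseudocode. This is a minimal illustration, not the paper's implementation: all dimensions, the random matrices standing in for the frozen encoders, the projections, the LLM, and the regression head are hypothetical stand-ins chosen only to show how the two modalities are mapped into one embedding space and regressed to three scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- illustrative only, not the paper's actual sizes.
T = 8          # 8 seconds of video -> 8 frames at 1 fps
D_VID = 1024   # video encoder feature dimension
D_AUD = 768    # audio encoder feature dimension
D_LLM = 4096   # LLM hidden dimension

# Stand-ins for the frozen video/audio encoder outputs (one token per frame/clip).
video_feats = rng.standard_normal((T, D_VID))
audio_feats = rng.standard_normal((T, D_AUD))

# Modality projections map both streams into the LLM's embedding space.
W_vid = rng.standard_normal((D_VID, D_LLM)) * 0.01
W_aud = rng.standard_normal((D_AUD, D_LLM)) * 0.01
vid_tokens = video_feats @ W_vid
aud_tokens = audio_feats @ W_aud

# Text-prompt embedding (stand-in) is concatenated with the modality tokens.
text_tokens = rng.standard_normal((16, D_LLM))
llm_input = np.concatenate([vid_tokens, aud_tokens, text_tokens], axis=0)

# Stand-in for the LLM: an identity pass; the real model is a transformer
# whose last hidden states would be regressed.
last_hidden = llm_input  # shape: (num_tokens, D_LLM)

# Regression head: pool the last hidden states and map them to the three
# quality scores (audio quality, content consistency, overall quality).
W_head = rng.standard_normal((D_LLM, 3)) * 0.01
scores = last_hidden.mean(axis=0) @ W_head
print(scores.shape)  # (3,)
```

The key design point the sketch captures is that both modalities become ordinary token embeddings of width D_LLM, so the language model needs no architectural change to consume them.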

Table of Contents

  • Performance Comparisons of Multi-dimensional Scoring
  • Accuracy Comparisons of Optimal AIGAV Selection
  • Presentation of Optimal AIGAV Selected by AGAV-Rater

We collect 230 silent AI-generated videos from T2V-CompBench and Sora. For each video, we generate two AIGAVs using Elevenlabs and use AGAV-Rater to select the higher-quality one. Ten subjects were invited to watch and listen to both AIGAVs, and 80% of them preferred the AIGAVs selected by AGAV-Rater for their better quality. Here, we display some AIGAVs judged by AGAV-Rater.

    Video categories: Fantasy, Instrument, Water, Animal, Object, People, Scenery, Vehicle, Cooking, Sea, Fire, Sea + Animal, and Scenery + People. For each category, we display a High Quality AIGAV and a Low Quality AIGAV as judged by AGAV-Rater.

    We also generate 5 AIGAVs for each of 10 AIGC videos using Elevenlabs and then use AGAV-Rater to rank them from highest to lowest quality.
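The ranking step above amounts to sorting candidates by the model's predicted overall quality. A minimal sketch, where `agav_rater_score` and the candidate names and stub scores are all hypothetical stand-ins for AGAV-Rater's real predictions:

```python
def agav_rater_score(candidate):
    """Stand-in for AGAV-Rater's predicted overall-quality score (hypothetical)."""
    stub = {"audio_a": 62.0, "audio_b": 81.2, "audio_c": 55.8,
            "audio_d": 73.0, "audio_e": 69.7}
    return stub[candidate]

def rank_candidates(candidates):
    """Rank candidate AIGAVs from highest to lowest predicted quality."""
    return sorted(candidates, key=agav_rater_score, reverse=True)

candidates = ["audio_a", "audio_b", "audio_c", "audio_d", "audio_e"]
print(rank_candidates(candidates))
# -> ['audio_b', 'audio_d', 'audio_e', 'audio_a', 'audio_c']
```

Selecting the single best AIGAV, as in the subjective test above, is just the first element of this ranking.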

    AGAV-Rater Ranking Results

    Rank 1

    Rank 2

    Rank 3

    Rank 4

    Rank 5