arxiv:2605.04045

Audio-Visual Intelligence in Large Foundation Models

Published on May 5 · Submitted by Hao Fei on May 8

Abstract

Audio-Visual Intelligence is a multidisciplinary field that integrates auditory and visual modalities through large foundation models, spanning tasks from understanding and generation to interaction; this survey organizes the field under a unified taxonomy and shared methodological foundations.

AI-generated summary

Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that can perceive, generate, and interact in the multimodal real world. In the era of large foundation models, joint modeling of audio and vision has become increasingly crucial: not only for understanding, but also for controllable generation and reasoning across dynamic, temporally grounded signals. Recent advances, such as Meta MovieGen and Google Veo-3, highlight the growing industrial and academic focus on unified audio-vision architectures that learn from massive multimodal data. However, despite rapid progress, the literature remains fragmented, spanning diverse tasks, inconsistent taxonomies, and heterogeneous evaluation practices that impede systematic comparison and knowledge integration. This survey provides the first comprehensive review of AVI through the lens of large foundation models. We establish a unified taxonomy covering the broad landscape of AVI tasks, ranging from understanding (e.g., speech recognition, sound localization) to generation (e.g., audio-driven video synthesis, video-to-audio) and interaction (e.g., dialogue, embodied, or agentic interfaces). We synthesize methodological foundations, including modality tokenization, cross-modal fusion, autoregressive and diffusion-based generation, large-scale pretraining, instruction alignment, and preference optimization. Furthermore, we curate representative datasets, benchmarks, and evaluation metrics, offering a structured comparison across task families and identifying open challenges in synchronization, spatial reasoning, controllability, and safety. By consolidating this rapidly expanding field into a coherent framework, this survey aims to serve as a foundational reference for future research on large-scale AVI.
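As a toy illustration of one building block the abstract names, cross-modal fusion via attention can be sketched in a few lines. This is a hypothetical minimal example, not the paper's method: it uses a single head with identity Q/K/V projections, and the token counts and dimensions are invented; real AVI models use learned projections, many heads, and far larger tokens.

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(video_tokens, audio_tokens):
    """Single-head cross-attention: each video token queries the audio
    tokens and returns an audio-informed summary vector (identity
    projections for brevity; learned Q/K/V weights in real models)."""
    dim = len(audio_tokens[0])
    fused = []
    for q in video_tokens:
        # Scaled dot-product scores of this video token against all audio tokens.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
                  for k in audio_tokens]
        weights = softmax(scores)
        # Convex combination of audio tokens, weighted by attention.
        fused.append([sum(w * v[i] for w, v in zip(weights, audio_tokens))
                      for i in range(dim)])
    return fused

# Toy tokens: 4 video tokens and 6 audio tokens, each 8-dimensional.
vid = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
aud = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]
out = cross_attend(vid, aud)
print(len(out), len(out[0]))  # one fused 8-dim vector per video token
```

Because the attention weights are a softmax, each fused vector is a convex combination of the audio tokens, which is what lets the video stream "read" temporally aligned audio evidence.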

Community

Paper submitter

Audio-Visual Intelligence in Large Foundation Models: A Comprehensive Survey
arXiv: 2605.04045

We are excited to release what we believe is the first comprehensive survey on Audio-Visual Intelligence (AVI) in the era of large foundation models!

AVI aims to build AI systems that can jointly perceive, generate, and interact through both sound and vision, moving toward truly omni-modal intelligence.

In this survey, we systematically organize the rapidly growing AVI landscape around three core pillars:

• Perception
Speech recognition, sound localization, video understanding, temporal reasoning, scene understanding, etc.

• Generation
Audio-driven video generation, video-to-audio synthesis, talking heads, music/video co-generation, diffusion-based AVI systems, etc.

• Interaction
Dialogue systems, embodied agents, conversational AVI, agentic multimodal systems, and interactive world modeling.

Highlights of this survey:

  • A unified taxonomy for AVI tasks and paradigms
  • Foundations of audio-visual large models
  • Cross-modal fusion & tokenization strategies
  • Autoregressive & diffusion-based AVI generation
  • Comprehensive benchmarks, datasets, and evaluation metrics
  • Open challenges: synchronization, controllability, spatial reasoning, safety, and more
  • Curated resources and continuously updated paper list

As audio-visual foundation models rapidly evolve with systems like Meta MovieGen and Google Veo, we hope this survey can serve as a foundational reference for future AVI research and omni-modal AI systems.

Paper: arXiv:2605.04045
Project Page: JavisVerse AVI Survey
GitHub: Awesome-AVI Repository

#AI #Multimodal #AudioVisual #FoundationModels #LLM #DiffusionModels #ComputerVision #AudioAI #GenerativeAI #MachineLearning #OmniModal #SurveyPaper

The unified taxonomy for AVI tasks in the foundation-model era is overdue, and the way the survey threads perception, generation, and interaction into a single framework is refreshing. My main question: in streaming scenarios where audio and video drift, can a single multimodal backbone maintain cross-modal coherence without adding latency or producing desynced outputs? The arxivlens breakdown helped me parse the method details (https://arxivlens.com/PaperView/Details/audio-visual-intelligence-in-large-foundation-models-9256-88aebc09), and it would be helpful if the evaluation protocol explicitly benchmarked alignment under latency and desync. An ablation showing sensitivity to synchronization latency, e.g., how performance changes as audio-video drift increases, would really clarify where the gains actually come from.
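For concreteness, the kind of drift ablation asked for above could be set up along these lines. This is a hypothetical sketch, not from the paper: the correlation-based sync score, the toy feature streams, and the drift values are all invented for illustration; a real protocol would use a model's own alignment metric over held-out clips.

```python
import math

def sync_score(audio, video):
    """Pearson correlation between a frame-level audio feature and a
    visual feature; a crude stand-in for a learned alignment metric."""
    n = min(len(audio), len(video))
    a, v = audio[:n], video[:n]
    ma, mv = sum(a) / n, sum(v) / n
    cov = sum((x - ma) * (y - mv) for x, y in zip(a, v))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sv = math.sqrt(sum((y - mv) ** 2 for y in v))
    return cov / (sa * sv) if sa and sv else 0.0

def drift_ablation(audio, video, drifts):
    """Delay the audio stream by d frames and report the sync score at
    each drift, mimicking increasing audio-video desynchronization."""
    return {d: sync_score(audio[d:], video) for d in drifts}

# Toy streams: audio energy roughly tracks visual motion (period ~19 frames).
video = [math.sin(t / 3.0) for t in range(200)]
audio = [math.sin(t / 3.0) + 0.1 * math.cos(t) for t in range(200)]
for d, s in sorted(drift_ablation(audio, video, [0, 2, 5, 10]).items()):
    print(f"drift={d:>2} frames  sync={s:+.3f}")
```

Sweeping the drift and plotting the score against downstream task accuracy would show directly how sensitive a backbone is to synchronization latency.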

