FilmComposer:
LLM-Driven Music Production for Silent Film Clips

CVPR 2025
Zhifeng Xie1,2, Qile He1, Youjia Zhu1, Qiwei He1, Mengtian Li1,2
1Shanghai University
2Shanghai Engineering Research Center of Motion Picture Special Effects

Results of FilmComposer

Comparison with Other Models

CMT

Video2Music

M2UGen

VidMuse

MusicGen

FilmComposer

Abstract


In this work, we implement music production for silent film clips using an LLM-driven method. Given the strong professional demands of film music production, we propose FilmComposer, which simulates the actual workflows of professional musicians. FilmComposer is the first to combine large generative models with a multi-agent approach, leveraging the advantages of both waveform music and symbolic music generation. FilmComposer is also the first to focus on the three core elements of music production for film (audio quality, musicality, and musical development), and it introduces various controls, such as rhythm, semantics, and visuals, to enhance these key aspects. Specifically, FilmComposer consists of the visual processing module, rhythm-controllable MusicGen, and multi-agent assessment, arrangement and mix. Our framework can also be integrated seamlessly into the actual music production pipeline and allows user intervention at every step, providing strong interactivity and a high degree of creative freedom. Furthermore, given the lack of a professional, high-quality film music dataset, we propose MusicPro-7k, which includes 7,418 film clips with their music, descriptions, rhythm spots, and main melodies. Finally, both the standard metrics and the new specialized metrics we propose demonstrate that the music generated by our model achieves state-of-the-art performance in terms of quality, consistency with video, diversity, musicality, and musical development.

Framework

Framework Image

FilmComposer consists of three main modules (visual processing, rhythm-controllable MusicGen, and multi-agent assessment, arrangement and mix) that simulate the process of a human musician producing music for film. Three large color blocks represent the three main modules, through which the input film clips pass sequentially, ultimately outputting a waveform. The three blue blocks with musical notation illustrate the complete music production process, from setting the rhythm points and composing to arranging and mixing.
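The sequential flow described above can be pictured with a minimal Python sketch. It is only an illustration of three stages applied in order with an optional user review hook after each one. Every name here (`ProductionState`, `visual_processing`, `rhythm_controllable_generation`, `multi_agent_post_production`, `run_pipeline`) is hypothetical, the stage bodies are placeholder stubs, and none of this is the released FilmComposer code.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ProductionState:
    """Hypothetical container for what is passed between the stages."""
    clip_path: str
    visual_description: str = ""
    rhythm_spots: List[float] = field(default_factory=list)    # seconds (assumed unit)
    draft_waveform: List[float] = field(default_factory=list)
    final_waveform: List[float] = field(default_factory=list)
    log: List[str] = field(default_factory=list)


Stage = Callable[[ProductionState], ProductionState]
Review = Callable[[str, ProductionState], ProductionState]


def visual_processing(state: ProductionState) -> ProductionState:
    # Stub: a real module would analyse the clip; here we fill in placeholders.
    state.visual_description = "placeholder description"
    state.rhythm_spots = [0.0, 1.5, 3.0]
    return state


def rhythm_controllable_generation(state: ProductionState) -> ProductionState:
    # Stub: stands in for rhythm-conditioned music generation.
    state.draft_waveform = [0.0 for _ in state.rhythm_spots]
    return state


def multi_agent_post_production(state: ProductionState) -> ProductionState:
    # Stub: stands in for assessment, arrangement and mix.
    state.final_waveform = list(state.draft_waveform)
    return state


def run_pipeline(state: ProductionState,
                 stages: List[Stage],
                 review: Optional[Review] = None) -> ProductionState:
    """Apply the stages in order, letting a user hook adjust the state after each."""
    for stage in stages:
        state = stage(state)
        state.log.append(stage.__name__)
        if review is not None:
            state = review(stage.__name__, state)
    return state


STAGES: List[Stage] = [
    visual_processing,
    rhythm_controllable_generation,
    multi_agent_post_production,
]

result = run_pipeline(ProductionState(clip_path="clip.mp4"), STAGES)
```

The `review` callback is one assumed way to model user intervention between steps: it receives the name of the stage that just ran and may return a modified state, for example with edited rhythm spots, before the next stage starts.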

Dataset

Dataset Image

Given the lack of a suitable dataset to train FilmComposer, we construct a large-scale film music dataset, MusicPro-7k, featuring 7,418 samples, each with a film clip, music, a visual description, a music description, a main melody, and rhythm spots. MusicPro-7k is distinguished by its high quality, high level of professional expertise, and multi-functionality. The structure and construction method of MusicPro-7k are illustrated in the diagram. Instead of extracting music from videos and applying noise reduction, as done in earlier approaches, we invited musicians to find music specifically for silent film clips. This avoids the poor audio quality and degradation caused by previous extraction methods.
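To make the six components of a sample concrete, here is a hedged Python sketch of how one record might be represented and sanity-checked. The field names, the choice of seconds for rhythm spots, and the (onset, MIDI pitch) encoding of the main melody are all assumptions made for illustration; the actual MusicPro-7k file layout is not specified here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MusicProSample:
    """Hypothetical record mirroring the six components listed above."""
    film_clip_path: str
    music_path: str
    visual_description: str
    music_description: str
    main_melody: List[Tuple[float, int]]   # (onset in seconds, MIDI pitch), assumed encoding
    rhythm_spots: List[float]              # timestamps in seconds, assumed unit


def validate(sample: MusicProSample, clip_duration: float) -> List[str]:
    """Return a list of human-readable problems; an empty list means the sample looks sane."""
    problems: List[str] = []
    if not sample.visual_description.strip():
        problems.append("empty visual description")
    if not sample.music_description.strip():
        problems.append("empty music description")
    if any(t < 0 or t > clip_duration for t in sample.rhythm_spots):
        problems.append("rhythm spot outside clip duration")
    if sample.rhythm_spots != sorted(sample.rhythm_spots):
        problems.append("rhythm spots not in ascending order")
    if any(not 0 <= pitch <= 127 for _, pitch in sample.main_melody):
        problems.append("melody pitch outside MIDI range 0-127")
    return problems


example = MusicProSample(
    film_clip_path="clips/0001.mp4",          # illustrative path
    music_path="music/0001.wav",              # illustrative path
    visual_description="A slow pan across an empty street at dusk.",
    music_description="Sparse piano over soft strings, slow tempo.",
    main_melody=[(0.0, 60), (1.0, 64), (2.0, 67)],
    rhythm_spots=[0.0, 2.0, 4.5],
)
issues = validate(example, clip_duration=10.0)
```

A check like this is only a convenience for catching malformed records when loading data; it says nothing about how the dataset was actually annotated.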