GitHub repo

Withheld for anonymity

Abstract

Recent advances in text-to-music editing, which employ text queries to modify music (e.g., by changing its style or adjusting instrumental components), present unique challenges and opportunities for music generation. Previous approaches in this domain have been constrained by the need to train specific editing models from scratch, which is both resource-intensive and inefficient; other research uses large language models to predict edited music, resulting in imprecise audio reconstruction. To combine the strengths of both approaches and address their limitations, we introduce Instruct-MusicGen, a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions such as adding, removing, or separating stems. Our approach modifies the original MusicGen architecture by incorporating a text fusion module and an audio fusion module, which allow the model to process instruction texts and audio inputs concurrently and yield the desired edited music. Remarkably, Instruct-MusicGen introduces only 8% new parameters to the original MusicGen model and trains for only 5K steps, yet it achieves superior performance across all tasks compared to existing baselines and demonstrates performance comparable to that of models trained for specific tasks. This advancement not only enhances the efficiency of text-to-music editing but also broadens the applicability of music language models in dynamic music production environments.

Demo

Adding a stem

Slakh

| Instruction | add bass | add bass | add piano | add piano | add guitar |
| --- | --- | --- | --- | --- | --- |
| Input audio | Untitled | Untitled | Untitled | 18c6351e-5a5f-4d05-a2e9-af379f5de924_input_0_0.wav | 107.wav |
| Output audio | Untitled | Untitled | Untitled | 18c6351e-5a5f-4d05-a2e9-af379f5de924_output_0_0.wav | 107.wav |
| Ground truth | Untitled | Untitled | Untitled | 18c6351e-5a5f-4d05-a2e9-af379f5de924_ground_truth_0_0.wav | 107.wav |

Extracting a stem

Slakh

| Instruction | only drums | only drums | only bass | only bass | only piano | only piano | only guitar |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Input audio | Untitled | 112.wav | 18c6351e-5a5f-4d05-a2e9-af379f5de924_input_0_0 (1).wav | 28.wav | 20.wav | 70.wav | 46.wav |
| Output audio | Untitled | 112.wav | 18c6351e-5a5f-4d05-a2e9-af379f5de924_output_0_0 (1).wav | 28.wav | 20.wav | 70.wav | 46.wav |
| Ground truth | Untitled | 112.wav | 18c6351e-5a5f-4d05-a2e9-af379f5de924_ground_truth_0_0 (1).wav | 28.wav | 20.wav | 70.wav | 46.wav |

Removing a stem

Slakh

| Instruction | no bass | no bass | no drums | no drums | no piano | no piano |
| --- | --- | --- | --- | --- | --- | --- |
| Input audio | 0.wav | 30.wav | 3.wav | 129.wav | 114.wav | 71.wav |
| Output audio | 0.wav | 30.wav | 3.wav | 129.wav | 114.wav | 71.wav |
| Ground truth | 0.wav | 30.wav | 3.wav | 129.wav | 114.wav | 71.wav |

Failed cases

Slakh