Polyphonia: Training-Free Context-Aware Music Editing with Acoustic-Informed Attention Calibration

Under Review

Figure 1: Overview of Polyphonia.

Abstract

The advancement of diffusion-based text-to-music generation has opened new avenues for zero-shot music editing. However, existing methods fail to achieve context-aware editing, which requires altering specific stems while strictly preserving the background accompaniment. This limitation severely hinders practical application, since real-world production requires precise manipulation of individual components within dense mixtures. Our key finding is that, while vanilla cross-attention captures semantic intent, it lacks the spectral resolution to strictly localize targets in dense mixtures, leading to boundary leakage. To address this, we propose Polyphonia, a training-free editing framework with Acoustic-Informed Attention Calibration. Rather than relying solely on diffuse semantic attention, Polyphonia leverages a probabilistic acoustic prior to establish coarse boundaries, enabling precise semantic synthesis while preserving the background context. For evaluation, we propose PolyEvalPrompts, a standardized prompt set with 1,170 context-aware music editing tasks. In our experiments, Polyphonia achieves a 15.5% increase in target alignment compared to baselines, while maintaining competitive music fidelity and background integrity.
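To make the idea of calibrating attention with an acoustic prior more concrete, the following is a minimal illustrative sketch, not the calibration rule used by Polyphonia, which this page does not specify. It assumes a cross-attention map with one row per time-frequency position and one column per prompt token, plus a per-position probability that the position belongs to the target stem. The function name `calibrate_attention`, the multiplicative gating, and the renormalization step are all assumptions made for illustration.

```python
import numpy as np

def calibrate_attention(attn, prior, target_idx, eps=1e-8):
    """Hypothetical calibration: gate the target token's attention by a prior.

    attn:       (positions, tokens) array, each row sums to 1.
    prior:      (positions,) array in [0, 1]; assumed probability that a
                position belongs to the stem being edited.
    target_idx: column index of the prompt token describing the target stem.
    """
    attn = np.asarray(attn, dtype=np.float64)
    prior = np.clip(np.asarray(prior, dtype=np.float64), 0.0, 1.0)

    out = attn.copy()
    # Suppress the target token wherever the prior says "background".
    out[:, target_idx] *= prior
    # Renormalize each row so it is a distribution over tokens again.
    row_sums = np.maximum(out.sum(axis=1, keepdims=True), eps)
    return out / row_sums

# Toy example: two positions, two tokens; token 1 is the target stem.
attn = np.array([[0.2, 0.8],
                 [0.5, 0.5]])
prior = np.array([1.0, 0.0])  # position 0 is target, position 1 is background
calibrated = calibrate_attention(attn, prior, target_idx=1)
```

In this toy case the first position is left unchanged, while the second position, which the prior marks as background, no longer attends to the target token. This mirrors the intuition of using coarse acoustic boundaries to limit leakage into the accompaniment.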

Comparisons with Baselines

Guitar To Violin

Source Description: A recording of soulful male vocals, deep bass, distorted guitar and punchy drums.

Target Prompt: A recording of soulful male vocals, deep bass, smooth violin and punchy drums.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

Drums To Electronic

Source Description: A recording of a vocal melody, a funky bass, a clean electric guitar and tight drums.

Target Prompt: A recording of a vocal melody, a funky bass, a clean electric guitar and crisp electronic drums.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

Piano To Organ

Source Description: A recording of soulful male vocals, a deep bass, a soft piano and ambient synth pad.

Target Prompt: A recording of soulful male vocals, a deep bass, a warm organ and ambient synth pad.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

Bass To Upright

Source Description: A recording of soulful female vocals, a deep bass, a clean electric guitar, a warm acoustic guitar and punchy drums.

Target Prompt: A recording of soulful female vocals, an upright bass, a clean electric guitar, a warm acoustic guitar and punchy drums.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

Vocals To Saxophone

Source Description: A recording of soulful female vocals, a deep bass, a clean electric guitar, a warm acoustic guitar and punchy drums.

Target Prompt: A recording of a smooth saxophone, a deep bass, a clean electric guitar, a warm acoustic guitar and punchy drums.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

Vocals To Cello

Source Description: A recording of soft, emotive female vocals, a clean electric guitar, a soft fingerpicked acoustic guitar and gentle drums.

Target Prompt: A recording of a warm cello, a clean electric guitar, a soft fingerpicked acoustic guitar and gentle drums.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

Guitar To Piano

Source Description: A recording of soulful female vocals, deep bass, a clean electric guitar and punchy drums.

Target Prompt: A recording of soulful female vocals, deep bass, a bright piano and punchy drums.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

Guitar To Ukulele

Source Description: A recording of soulful female vocals, deep bass, a clean electric guitar and punchy drums.

Target Prompt: A recording of soulful female vocals, deep bass, a cheerful ukulele and punchy drums.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

Electric Guitar To Acoustic Guitar

Source Description: A recording of a smooth and sultry female vocal, a deep bass guitar, a clean electric guitar and punchy drums.

Target Prompt: A recording of a smooth and sultry female vocal, a deep bass guitar, a bright acoustic guitar and punchy drums.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

Vocals To Flute

Source Description: A recording of a smooth and sultry female vocal, a deep bass guitar, a clean electric guitar and punchy drums.

Target Prompt: A recording of a lush flute, a deep bass guitar, a clean electric guitar and punchy drums.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

Saxophone To Organ

Source Description: A recording of a smooth saxophone solo, a fingered electric bass and a standard drum kit.

Target Prompt: A recording of an organ melody, a fingered electric bass and a standard drum kit.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

String Section To Guitar

Source Description: A recording of a light bouncy string section, bright clear woodwind solos and rhythmic percussion.

Target Prompt: A recording of a guitar, bright clear woodwind solos and rhythmic percussion.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

Vocals To Violin

Source Description: A recording of smooth, melodic male vocals, walking bassline, rhythmic strumming and steady backbeat.

Target Prompt: A recording of a violin, rhythmic strumming, walking bassline and steady backbeat.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

Vocals To Cello

Source Description: A recording of a soft breathy vocal, a steady, walking bassline, an arpeggiated guitar melody and a simple, steady backbeat.

Target Prompt: A recording of a cello, a steady, walking bassline, an arpeggiated guitar melody and a simple, steady backbeat.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

Piano To Harp

Source Description: A recording of a melodic piano, a walking bass line and a soft brushed drum beat.

Target Prompt: A recording of a melodic harp, a walking bass line and a soft brushed drum beat.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

Guitar To Piano

Source Description: A recording of a distorted electric guitar melody, a solid bass line and driving energetic drums.

Target Prompt: A recording of a rapid piano, a solid bass line and driving energetic drums.

Source Audio
Polyphonia (Ours)
SDEdit
MusicGen
MusicMagus
DDPM-Friendly
DDIM Inversion
SteerMusic
Melodia

Editing Paradigm (Holistic vs. Sep-Remix)

These ablation samples demonstrate the effectiveness of holistic editing compared with separate-remix (Sep-Remix) approaches.

Vocals To Violin

Source Description: A recording of soulful male vocals, deep bass, a melodic piano and punchy drums.

Target Prompt: A recording of a smooth violin, deep bass, a melodic piano and punchy drums.

Source Audio
Polyphonia (Ours)
DDPM
Melodia

Electric Guitar To Acoustic Guitar

Source Description: A recording of a smooth and sultry male vocal, a deep bass guitar, a clean electric guitar and punchy drums.

Target Prompt: A recording of a smooth and sultry male vocal, a deep bass guitar, a bright acoustic guitar and punchy drums.

Source Audio
Polyphonia (Ours)
DDPM
Melodia