Unsupervised Single-Channel
Speech Separation with a Diffusion Prior
under Speaker-Embedding Guidance

Runwu Shi, Kai Li, Chang Li, Jiang Wang, Sihan Tan, Kazuhiro Nakadai

GitHub Repository
Abstract: Speech separation is a fundamental task in audio processing, typically addressed with fully supervised systems trained on paired mixtures. While effective, such systems rely on synthetic data pipelines that do not necessarily reflect the characteristics of real-world mixtures. Instead, we revisit the source-model paradigm, training a diffusion generative model solely on anechoic speech and formulating separation as a diffusion inverse problem. However, unconditional diffusion models lack speaker-level conditioning: they can capture local acoustic structure but produce temporally inconsistent speaker identities in the separated sources. To address this limitation, we propose speaker-embedding guidance that, during the reverse diffusion process, maintains speaker coherence within each separated track while driving the embeddings of different speakers further apart. In addition, we propose a new separation-oriented solver tailored for speech separation. Extensive experimental results confirm that both strategies markedly improve performance on the challenging task of unsupervised source-model-based speech separation.
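To make the idea of speaker-embedding guidance concrete, here is a minimal sketch of one possible guidance objective. It is not the authors' implementation: the function name `speaker_guidance_loss`, the use of cosine similarity, the centroid-based coherence term, and the equal weighting of the two terms are all assumptions made for illustration. The inputs stand in for segment-level speaker embeddings that a pretrained speaker encoder would produce for each separated track. In a diffusion sampler, the gradient of such a loss with respect to the current source estimates could be used to steer the reverse process.

```python
import numpy as np


def _unit(x, eps=1e-8):
    """L2-normalize along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def speaker_guidance_loss(emb_a, emb_b):
    """Hypothetical guidance objective for two separated tracks.

    emb_a, emb_b: arrays of shape (num_segments, dim) holding assumed
    segment-level speaker embeddings of track A and track B.
    Lower values mean each track is more speaker-coherent over time
    and the two tracks are further apart in embedding space.
    """
    a, b = _unit(emb_a), _unit(emb_b)

    # Track centroids serve as the per-track speaker identity estimate.
    centroid_a = _unit(a.mean(axis=0))
    centroid_b = _unit(b.mean(axis=0))

    # Intra-track coherence: mean cosine similarity of segments to their centroid.
    coherence = 0.5 * ((a @ centroid_a).mean() + (b @ centroid_b).mean())

    # Inter-track similarity: cosine similarity between the two centroids.
    cross = float(centroid_a @ centroid_b)

    # Reward coherence, penalize similarity between the two speakers.
    return float((1.0 - coherence) + cross)


# Toy embeddings: two segments per track, two-dimensional for readability.
consistent_a = np.array([[1.0, 0.0], [1.0, 0.0]])  # same speaker throughout
consistent_b = np.array([[0.0, 1.0], [0.0, 1.0]])
swapped_a = np.array([[1.0, 0.0], [0.0, 1.0]])     # identity flips mid-track
swapped_b = np.array([[0.0, 1.0], [1.0, 0.0]])

loss_consistent = speaker_guidance_loss(consistent_a, consistent_b)
loss_swapped = speaker_guidance_loss(swapped_a, swapped_b)
print(loss_consistent < loss_swapped)
```

In this toy case the tracks whose speaker identity flips midway receive a higher loss than the temporally consistent ones, which is the behavior the guidance is meant to discourage.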

VCTK-2mix

Audio Example 1

Mixture

Separation Results

Ground Truth

[Waveform plots: mixture, separated sources, ground truth]

Audio Example 2

Mixture

Separation Results

Ground Truth

[Waveform plots: mixture, separated sources, ground truth]

Audio Example 3

Mixture

Separation Results

Ground Truth

[Waveform plots: mixture, separated sources, ground truth]

Audio Example 4

Mixture

Separation Results

Ground Truth

[Waveform plots: mixture, separated sources, ground truth]

Audio Example 5

Mixture

Separation Results

Ground Truth

[Waveform plots: mixture, separated sources, ground truth]

WSJ0-2mix (mismatched condition)

Audio Example 1

Mixture

Separation Results

Ground Truth

[Waveform plots: mixture, separated sources, ground truth]

Audio Example 2

Mixture

Separation Results

Ground Truth

[Waveform plots: mixture, separated sources, ground truth]

Audio Example 3

Mixture

Separation Results

Ground Truth

[Waveform plots: mixture, separated sources, ground truth]

Audio Example 4

Mixture

Separation Results

Ground Truth

[Waveform plots: mixture, separated sources, ground truth]

Audio Example 5

Mixture

Separation Results

Ground Truth

[Waveform plots: mixture, separated sources, ground truth]

Comparison

Proposed method

Mixture

Separation Results

Ground Truth

[Waveform plots: mixture, separated sources, ground truth]

Dirac Sampling

Mixture

Separation Results

Ground Truth

[Waveform plots: mixture, separated sources, ground truth]

DSG

Mixture

Separation Results

Ground Truth

[Waveform plots: mixture, separated sources, ground truth]

DPS

Mixture

Separation Results

Ground Truth

[Waveform plots: mixture, separated sources, ground truth]

Analytical Sampling

Mixture

Separation Results

Ground Truth

[Waveform plots: mixture, separated sources, ground truth]