Abstract: Speech separation is a fundamental task in audio processing, typically addressed with fully supervised systems trained on synthetic mixtures paired with their ground-truth sources. While effective, such systems rely on synthetic data pipelines that do not necessarily reflect the characteristics of real-world mixtures. We instead revisit the source-model paradigm, training a diffusion generative model solely on anechoic speech and formulating separation as a diffusion inverse problem. However, unconditional diffusion models lack speaker-level conditioning: they capture local acoustic structure but produce temporally inconsistent speaker identities in the separated sources.
To address this limitation, we propose speaker-embedding guidance that, during the reverse diffusion process, maintains speaker coherence within each separated track while pushing the embeddings of different speakers apart. In addition, we introduce a new solver tailored to speech separation. Extensive experiments confirm that both strategies markedly improve performance on the challenging task of unsupervised, source-model-based speech separation.
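As a rough illustration of the guidance idea described above, the sketch below shows one possible form of a speaker-embedding guidance term applied at a single reverse diffusion step. It is a minimal, hypothetical Python/PyTorch example, not the paper's actual method: `score_model`, `speaker_encoder`, `guidance_scale`, and `seg_len` are placeholder names, and the update rule is a generic guided-score step rather than the separation-tailored solver proposed in the paper.

```python
# Hypothetical sketch of speaker-embedding guidance at one reverse diffusion
# step. All component names are placeholders, not the authors' networks.
import torch
import torch.nn.functional as F

def embedding_guidance_loss(tracks, speaker_encoder, seg_len=16000):
    """Encourage coherent embeddings within each estimated track and
    distinct embeddings across tracks. `tracks`: (num_speakers, samples)."""
    # Split each estimated track into fixed-length segments and embed them.
    segs = tracks.unfold(1, seg_len, seg_len)              # (spk, n_seg, seg_len)
    embs = speaker_encoder(segs.reshape(-1, seg_len))      # (spk * n_seg, dim)
    embs = F.normalize(embs, dim=-1).view(tracks.shape[0], -1, embs.shape[-1])

    centroids = embs.mean(dim=1, keepdim=True)             # (spk, 1, dim)
    # Within-track coherence: segments should stay near their track centroid.
    coherence = (1 - F.cosine_similarity(embs, centroids, dim=-1)).mean()
    # Cross-track separation: centroids of different speakers should diverge.
    c = centroids.squeeze(1)                                # (spk, dim)
    cross = F.cosine_similarity(c.unsqueeze(0), c.unsqueeze(1), dim=-1)
    off_diag = cross[~torch.eye(c.shape[0], dtype=torch.bool)]
    return coherence + off_diag.mean()

def guided_score(x_t, t, score_model, speaker_encoder, guidance_scale=1.0):
    """Add the speaker-embedding guidance gradient to the model score
    (illustrative classifier-guidance-style update, not the paper's solver)."""
    x_t = x_t.detach().requires_grad_(True)
    loss = embedding_guidance_loss(x_t, speaker_encoder)
    grad = torch.autograd.grad(loss, x_t)[0]
    # Steer the denoising direction away from incoherent speaker identities.
    return score_model(x_t, t) - guidance_scale * grad
```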