Single-Channel Target Speech Extraction Utilizing Distance and Room Clues
Authors: Runwu Shi, Zirui Lin, Benjamin Yen, Kazuhiro Nakadai
Abstract: This paper aims to achieve single-channel target speech extraction (TSE) in enclosures using distance cues and room information. Recent research has verified the feasibility of distance cues, which imply the sound source’s direct-to-reverberation ratio (DRR) and can thus be used in speech separation and TSE systems. However, such distance cues are significantly influenced by the room's acoustic environment, such as its dimensions and reverberation time, making it challenging for TSE systems that rely solely on distance cues to generalize across different rooms. To solve this, we suggest providing room environmental information to distance-based TSE for better generalization. Specifically, we propose a distance- and environment-based TSE model in the time-frequency (TF) domain with learnable distance and room embeddings, and results on both simulated and real recorded datasets demonstrate its feasibility.
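The link between source distance and DRR mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `drr_db` and `toy_rir`, the 2.5 ms direct-path window, and the synthetic impulse response (a 1/distance direct spike followed by an exponentially decaying noise tail) are all assumptions chosen only for illustration.

```python
import numpy as np

def drr_db(rir, fs, window_ms=2.5):
    """Direct-to-reverberation ratio in dB.

    Assumption: the direct path is the largest-magnitude sample, and the
    direct energy is taken in a +/- window_ms window around it.
    """
    rir = np.asarray(rir, dtype=float)
    peak = int(np.argmax(np.abs(rir)))
    half = int(round(window_ms * 1e-3 * fs))
    start = max(peak - half, 0)
    stop = min(peak + half + 1, len(rir))
    energy = rir ** 2
    direct = energy[start:stop].sum()
    reverb = energy.sum() - direct
    return 10.0 * np.log10(direct / reverb)

def toy_rir(distance_m, fs=16000, rt60=0.3, c=343.0, seed=0):
    """Hypothetical toy impulse response, not a room simulation.

    Direct path: a spike of amplitude 1/distance at the propagation delay.
    Reverberation: Gaussian noise whose energy decays by 60 dB over rt60.
    """
    rng = np.random.default_rng(seed)
    n = int(rt60 * fs)
    t = np.arange(n) / fs
    rir = 0.02 * rng.standard_normal(n) * 10.0 ** (-3.0 * t / rt60)
    delay = int(round(distance_m / c * fs))
    rir[:delay + 1] = 0.0           # nothing arrives before the direct sound
    rir[delay] = 1.0 / distance_m   # spherical spreading of the direct path
    return rir

fs = 16000
drr_near = drr_db(toy_rir(0.5, fs=fs), fs)
drr_far = drr_db(toy_rir(4.0, fs=fs), fs)
print(f"DRR at 0.5 m: {drr_near:.1f} dB, DRR at 4.0 m: {drr_far:.1f} dB")
```

In this toy setting the nearer source yields the higher DRR, and changing `rt60` shifts the DRR obtained at a fixed distance, which is the room dependence that motivates supplying room information alongside the distance cue.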

Dataset: Sim1
Clue: Distance
Room size: 7*8*3m, RT60: 0.20s
Speaker distance: 0.495m, 2.072m
Mixture input

Ground truth at 0.495m

Ground truth at 2.072m

Query distance 0m

Query distance 0.625m

Query distance 1.25m

Query distance 1.875m

Query distance 2.5m

Query distance 3.125m

Query distance 3.75m

Query distance 4.375m

Query distance 5m

Dataset: Sim2
Clue: Distance
Room size: 6.372*6.457*2.987m, RT60: 0.344s
Speaker distance: 4.13m, 4.678m
Mixture input

Ground truth at 4.13m

Ground truth at 4.678m

Query distance 0m

Query distance 0.625m

Query distance 1.25m

Query distance 1.875m

Query distance 2.5m

Query distance 3.125m

Query distance 3.75m

Query distance 4.375m

Query distance 5m

Dataset: Sim2
Clue: Distance + Room configuration + Reverberation time
Room size: 5.66*9.974*2.764m, RT60: 0.291s
Speaker distance: 0.934m, 4.104m
Mixture input

Ground truth at 0.934m

Ground truth at 4.104m

Query distance 0m

Query distance 0.625m

Query distance 1.25m

Query distance 1.875m

Query distance 2.5m

Query distance 3.125m

Query distance 3.75m

Query distance 4.375m

Query distance 5m

Dataset: RealRIR
Clue: Distance+Dim+Rt
Room size: 5.9*6.9*2.9m, RT60: 0.60s
Speaker distance: 1.0m, 2.828m
Mixture input

Ground truth at 1.0m

Ground truth at 2.828m

Query distance 0m

Query distance 0.625m

Query distance 1.25m

Query distance 1.875m

Query distance 2.5m

Query distance 3.125m

Query distance 3.75m

Query distance 4.375m

Query distance 5m

Dataset: RealRIR
Finetune+Clue: Distance+Dim+Rt
Room size: 5.9*6.9*2.9m, RT60: 0.60s
Speaker distance: 1.0m, 2.828m
Mixture input

Ground truth at 1.0m

Ground truth at 2.828m

Query distance 0m

Query distance 0.625m

Query distance 1.25m

Query distance 1.875m

Query distance 2.5m

Query distance 3.125m

Query distance 3.75m

Query distance 4.375m

Query distance 5m
