Distance Based Single-Channel Target Speech Extraction
Authors: Runwu Shi, Benjamin Yen, Kazuhiro Nakadai
Abstract: This paper aims to achieve single-channel target
speech extraction (TSE) in enclosures by solely utilizing distance
information. This is the first work that only utilizes distance cues
without any speaker-related information for single-channel TSE.
Inspired by recent single-channel distance-based separation and
extraction methods, we introduce a novel model that efficiently
fuses distance information with time-frequency (TF) bins for
target speech extraction. Experimental results in both single-room
and multi-room scenarios demonstrate the feasibility and
effectiveness of our approach. The method can also
be employed to estimate the distances of different speakers in
mixed speech.
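The samples on this page query the model at nine distances from 0 m to 5 m in 0.625 m steps. One way to read the distance-estimation claim is as a sweep over that grid: extract at each query distance, measure the output energy, and treat energy peaks as speaker distances. The sketch below illustrates that reading only; the page does not describe the authors' estimation procedure. The names `sweep_energy`, `estimate_speaker_distances`, `rel_threshold`, and `toy_extract` are hypothetical, and `toy_extract` is a stand-in with made-up gains, not the proposed model.

```python
import numpy as np

# Query grid used by the samples on this page: 0 m to 5 m in 0.625 m steps.
QUERY_DISTANCES_M = np.linspace(0.0, 5.0, 9)


def sweep_energy(mixture, extract_fn, distances=QUERY_DISTANCES_M):
    """Mean power of the extracted signal at each query distance.

    extract_fn(mixture, distance_m) is a hypothetical callable that returns
    a waveform; a real distance-conditioned TSE model would be wrapped here.
    """
    return np.array(
        [np.mean(np.square(extract_fn(mixture, d))) for d in distances]
    )


def estimate_speaker_distances(energies, distances=QUERY_DISTANCES_M,
                               rel_threshold=0.1):
    """Return query distances where the energy curve has a local maximum.

    Peaks below rel_threshold * max(energies) are ignored (assumed heuristic).
    """
    floor = rel_threshold * energies.max()
    peaks = []
    for i, e in enumerate(energies):
        left = energies[i - 1] if i > 0 else -np.inf
        right = energies[i + 1] if i < len(energies) - 1 else -np.inf
        if e >= floor and e > left and e >= right:
            peaks.append(float(distances[i]))
    return peaks


def toy_extract(mixture, distance_m, speaker_distances=(0.77, 3.17), width=0.3):
    """Stand-in extractor: passes the mixture only near the speaker distances.

    The Gaussian gain is invented for illustration and has no acoustic basis.
    """
    gain = sum(
        np.exp(-((distance_m - s) / width) ** 2) for s in speaker_distances
    )
    return gain * mixture
```

With the toy extractor and the D1 Sample 1 speaker distances (0.77 m and 3.17 m), calling `estimate_speaker_distances(sweep_energy(mixture, toy_extract))` on any nonzero mixture returns the two grid points closest to the speakers. The resolution of such an estimate is bounded by the grid spacing, so a finer query grid would be needed for sub-0.6 m accuracy.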
D1: Fixed room & fixed mic
D1 Sample 1
Room size: 7 × 8 × 3 m, RT60: 0.2 s
Speaker distance: 0.77m, 3.17m
Mixture input
Clean speech at 0.77m
Clean speech at 3.17m
Query distance 0m
Query distance 0.625m
Query distance 1.25m
Query distance 1.875m
Query distance 2.5m
Query distance 3.125m
Query distance 3.75m
Query distance 4.375m
Query distance 5m
D1 Sample 2
Room size: 7 × 8 × 3 m, RT60: 0.2 s
Speaker distance: 2.81m, 3.35m
Mixture input
Clean speech at 2.81m
Clean speech at 3.35m
Query distance 0m
Query distance 0.625m
Query distance 1.25m
Query distance 1.875m
Query distance 2.5m
Query distance 3.125m
Query distance 3.75m
Query distance 4.375m
Query distance 5m
D2: Fixed room & random mics
D2 Sample 1
Room size: 7 × 8 × 3 m, RT60: 0.3 s
Speaker distance: 3.16m, 3.77m
Mixture input
Clean speech at 3.16m
Clean speech at 3.77m
Query distance 0m
Query distance 0.625m
Query distance 1.25m
Query distance 1.875m
Query distance 2.5m
Query distance 3.125m
Query distance 3.75m
Query distance 4.375m
Query distance 5m
D2 Sample 2
Room size: 7 × 8 × 3 m, RT60: 0.3 s
Speaker distance: 3.25m, 4.86m
Mixture input
Clean speech at 3.25m
Clean speech at 4.86m
Query distance 0m
Query distance 0.625m
Query distance 1.25m
Query distance 1.875m
Query distance 2.5m
Query distance 3.125m
Query distance 3.75m
Query distance 4.375m
Query distance 5m
D3: Random rooms & random mics
D3 Sample 1
Room size: 6.447 × 9.141 × 2.543 m, RT60: 0.408 s, mic position: (4.466, 4.663, 1.2) m
Speaker distance: 0.76m, 2.23m
Mixture input
Clean speech at 0.76m
Clean speech at 2.23m
Query distance 0m
Query distance 0.625m
Query distance 1.25m
Query distance 1.875m
Query distance 2.5m
Query distance 3.125m
Query distance 3.75m
Query distance 4.375m
Query distance 5m
D3 Sample 2
Room size: 7.094 × 5.166 × 2.991 m, RT60: 0.495 s, mic position: (3.626, 2.014, 1.2) m
Speaker distance: 3.08m, 4.01m
Mixture input
Clean speech at 3.08m
Clean speech at 4.01m
Query distance 0m
Query distance 0.625m
Query distance 1.25m
Query distance 1.875m
Query distance 2.5m
Query distance 3.125m
Query distance 3.75m
Query distance 4.375m
Query distance 5m
D4: Real RIR
D4 Sample
Room size: 6.447 × 9.141 × 2.543 m, RT60: 0.408 s, mic position: (4.466, 4.663, 1.2) m