Distance Based Single-Channel Target Speech Extraction

Authors: Runwu Shi, Benjamin Yen, Kazuhiro Nakadai

Abstract: This paper aims to achieve single-channel target speech extraction (TSE) in enclosures using only distance information. This is the first work that relies on distance cues alone, without any speaker-related information, for single-channel TSE. Inspired by recent single-channel distance-based separation and extraction methods, we introduce a novel model that efficiently fuses distance information with time-frequency (TF) bins for target speech extraction. Experimental results in both single-room and multi-room scenarios demonstrate the feasibility and effectiveness of our approach. The method can also be used to estimate the distances of different speakers in mixed speech.
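The abstract notes that the method can also estimate speaker distances. One way this could work, sketched here purely as an assumption and not as the authors' procedure, is to sweep the query distance, measure the energy of each extracted output, and keep the query distances where that energy peaks. The names `sweep_energy`, `pick_peaks`, and `toy_extract` are hypothetical; `toy_extract` is a stand-in that only mimics an extractor responding near two speaker distances, and the `floor_db` threshold is arbitrary.

```python
import numpy as np

def sweep_energy(mixture, query_distances, extract):
    """Output energy in dB of `extract(mixture, d)` for each query distance d."""
    energies = []
    for d in query_distances:
        out = extract(mixture, d)  # hypothetical model call, not the paper's API
        energies.append(10.0 * np.log10(np.mean(out ** 2) + 1e-12))
    return np.array(energies)

def pick_peaks(query_distances, energies_db, floor_db=-40.0):
    """Query distances whose energy is a local maximum above floor_db."""
    peaks = []
    n = len(energies_db)
    for i in range(n):
        left = energies_db[i - 1] if i > 0 else -np.inf
        right = energies_db[i + 1] if i < n - 1 else -np.inf
        e = energies_db[i]
        if e > floor_db and e >= left and e > right:
            peaks.append(float(query_distances[i]))
    return peaks

def toy_extract(mixture, d, true_distances=(0.77, 3.17), width=0.3):
    """Stand-in extractor (assumption): output gain is high near a speaker."""
    gain = sum(np.exp(-((d - t) ** 2) / (2 * width ** 2)) for t in true_distances)
    return gain * mixture

rng = np.random.default_rng(0)
mixture = rng.standard_normal(16000)   # placeholder signal, not real speech
queries = np.linspace(0.0, 5.0, 9)     # 0m to 5m in 0.625m steps
energies_db = sweep_energy(mixture, queries, toy_extract)
peaks = pick_peaks(queries, energies_db)
print(peaks)
```

With a real model, the resolution of such an estimate would be limited by the spacing of the query grid, so a finer sweep around each peak may be needed.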

D1: Fixed room & mic

D1 Sample 1

Room size: 7m × 8m × 3m, RT60: 0.2s

Speaker distance: 0.77m, 3.17m

Signals shown: mixture input, clean speech at 0.77m, clean speech at 3.17m

Output spectra at query distances: 0m, 0.625m, 1.25m, 1.875m, 2.5m, 3.125m, 3.75m, 4.375m, 5m

D1 Sample 2

Room size: 7m × 8m × 3m, RT60: 0.2s

Speaker distance: 2.81m, 3.35m

Signals shown: mixture input, clean speech at 2.81m, clean speech at 3.35m

Output spectra at query distances: 0m, 0.625m, 1.25m, 1.875m, 2.5m, 3.125m, 3.75m, 4.375m, 5m

D2: Fixed room & random mics

D2 Sample 1

Room size: 7m × 8m × 3m, RT60: 0.3s

Speaker distance: 3.16m, 3.77m

Signals shown: mixture input, clean speech at 3.16m, clean speech at 3.77m

Output spectra at query distances: 0m, 0.625m, 1.25m, 1.875m, 2.5m, 3.125m, 3.75m, 4.375m, 5m

D2 Sample 2

Room size: 7m × 8m × 3m, RT60: 0.3s

Speaker distance: 3.25m, 4.86m

Signals shown: mixture input, clean speech at 3.25m, clean speech at 4.86m

Output spectra at query distances: 0m, 0.625m, 1.25m, 1.875m, 2.5m, 3.125m, 3.75m, 4.375m, 5m

D3: Random rooms & random mics

D3 Sample 1

Room size: 6.447m × 9.141m × 2.543m, RT60: 0.408s, mic position: (4.466m, 4.663m, 1.2m)

Speaker distance: 0.76m, 2.23m

Signals shown: mixture input, clean speech at 0.76m, clean speech at 2.23m

Output spectra at query distances: 0m, 0.625m, 1.25m, 1.875m, 2.5m, 3.125m, 3.75m, 4.375m, 5m

D3 Sample 2

Room size: 7.094m × 5.166m × 2.991m, RT60: 0.495s, mic position: (3.626m, 2.014m, 1.2m)

Speaker distance: 3.08m, 4.01m

Signals shown: mixture input, clean speech at 3.08m, clean speech at 4.01m

Output spectra at query distances: 0m, 0.625m, 1.25m, 1.875m, 2.5m, 3.125m, 3.75m, 4.375m, 5m

D4: Real RIR

D4 Sample

Room size: 6.447m × 9.141m × 2.543m, RT60: 0.408s, mic position: (4.466m, 4.663m, 1.2m)

Speaker distance: 0.59m, 8.0m

Signals shown: mixture input, clean speech at 0.59m, clean speech at 8.0m

Output spectra at query distances: 0m, 1.25m, 2.5m, 3.75m, 5.0m, 6.25m, 7.5m, 8.75m, 10.0m
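The query distances in the samples are evenly spaced: nine points from 0m to 5m for D1 to D3 (0.625m steps) and nine points from 0m to 10m for D4 (1.25m steps). A minimal sketch of how such grids could be generated, and how a speaker distance maps to its nearest query, is given below. The helper names `query_grid` and `nearest_query` are illustrative and do not come from the paper.

```python
import numpy as np

def query_grid(max_distance_m, num_queries=9):
    """Uniformly spaced query distances from 0 to max_distance_m, inclusive."""
    return np.linspace(0.0, max_distance_m, num_queries)

def nearest_query(grid, distance_m):
    """Grid point closest to a given speaker distance (in metres)."""
    return float(grid[np.argmin(np.abs(grid - distance_m))])

simulated = query_grid(5.0)    # D1 to D3: 0.625m steps
real_rir = query_grid(10.0)    # D4: 1.25m steps

# D1 Sample 1 speakers at 0.77m and 3.17m fall closest to these queries
closest = [nearest_query(simulated, d) for d in (0.77, 3.17)]
print(simulated, real_rir, closest)
```

Comparing each sample's speaker distances with their nearest grid points indicates which output spectra should contain the most target speech.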