Distance Based Single-Channel Target Speech Extraction
Authors: Runwu Shi, Benjamin Yen, Kazuhiro Nakadai
Abstract: This paper aims to achieve single-channel target speech extraction (TSE) in enclosures using distance information alone. It is the first work to use only distance cues, without any speaker-related information, for single-channel TSE. Inspired by recent single-channel distance-based separation and extraction methods, we introduce a novel model that efficiently fuses distance information with time-frequency (TF) bins for target speech extraction. Experimental results in both single-room and multi-room scenarios demonstrate the feasibility and effectiveness of our approach. The method can also be used to estimate the distances of different speakers in mixed speech.
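The abstract's final claim, that the model can estimate speaker distances, suggests sweeping the query distance and looking for the queries that produce the strongest output. The sketch below is one plausible reading of that idea, not the authors' procedure: `extract_fn` is a hypothetical stand-in for a trained extractor, and `toy_extract` is a made-up stub that simply passes more signal when the query is close to a speaker, so that the example runs without a model.

```python
import math

def sweep_query_distances(mixture, extract_fn, query_distances):
    """Run the extractor once per query distance and record the output energy."""
    energies = []
    for d in query_distances:
        output = extract_fn(mixture, d)  # hypothetical model call
        energies.append(sum(x * x for x in output))
    return energies

def pick_peaks(query_distances, energies):
    """Return the query distances whose energy is a strict local maximum."""
    peaks = []
    for i, e in enumerate(energies):
        left = energies[i - 1] if i > 0 else -math.inf
        right = energies[i + 1] if i < len(energies) - 1 else -math.inf
        if e > left and e > right:
            peaks.append(query_distances[i])
    return peaks

def toy_extract(mixture, d, speakers=(0.77, 3.17)):
    """Made-up stub (assumption): scales the input by how close d is to a speaker."""
    gain = sum(math.exp(-((d - s) / 0.4) ** 2) for s in speakers)
    return [gain * x for x in mixture]

grid = [i * 0.625 for i in range(9)]   # 0 m to 5 m, as in the samples on this page
mixture = [0.1, -0.2, 0.3, -0.1]       # placeholder waveform, not real audio
energies = sweep_query_distances(mixture, toy_extract, grid)
estimates = pick_peaks(grid, energies)
print(estimates)
```

With the stub's speakers placed at 0.77 m and 3.17 m, the peaks fall on the grid points nearest those distances. A real extractor would replace `toy_extract`, and the resolution of the estimate is limited by the spacing of the query grid.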
 
D1: Fixed room & mic
D1 Sample 1
Room size: 7*8*3m, RT60: 0.2s
Speaker distance: 0.77m, 3.17m

Mixture input
Clean speech at 0.77m
Clean speech at 3.17m
Query distance 0m
Query distance 0.625m
Query distance 1.25m
Query distance 1.875m
Query distance 2.5m
Query distance 3.125m
Query distance 3.75m
Query distance 4.375m
Query distance 5m
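
The nine query distances listed for this sample are evenly spaced from 0 m to 5 m, a step of 0.625 m. As a small illustration, the sketch below rebuilds that grid and finds the grid point nearest each true speaker distance, which is where the extracted output would be expected to be cleanest. The function names are illustrative and are not taken from the authors' code.

```python
def query_grid(max_distance, num_points=9):
    """Evenly spaced query distances from 0 to max_distance, inclusive."""
    step = max_distance / (num_points - 1)
    return [i * step for i in range(num_points)]

def nearest_query(grid, speaker_distance):
    """Grid point with the smallest absolute gap to the speaker distance."""
    return min(grid, key=lambda q: abs(q - speaker_distance))

grid = query_grid(5.0)
closest = [nearest_query(grid, d) for d in (0.77, 3.17)]
print(grid)
print(closest)
```

For this sample the nearest grid points are 0.625 m and 3.125 m, so neither speaker sits exactly on a query distance.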
 
                            
D1 Sample 2
Room size: 7*8*3m, RT60: 0.2s
Speaker distance: 2.81m, 3.35m

Mixture input
Clean speech at 2.81m
Clean speech at 3.35m
Query distance 0m
Query distance 0.625m
Query distance 1.25m
Query distance 1.875m
Query distance 2.5m
Query distance 3.125m
Query distance 3.75m
Query distance 4.375m
Query distance 5m
 
                            
D2: Fixed room & random mics
D2 Sample 1
Room size: 7*8*3m, RT60: 0.3s
Speaker distance: 3.16m, 3.77m

Mixture input
Clean speech at 3.16m
Clean speech at 3.77m
Query distance 0m
Query distance 0.625m
Query distance 1.25m
Query distance 1.875m
Query distance 2.5m
Query distance 3.125m
Query distance 3.75m
Query distance 4.375m
Query distance 5m
 
                            
D2 Sample 2
Room size: 7*8*3m, RT60: 0.3s
Speaker distance: 3.25m, 4.86m

Mixture input
Clean speech at 3.25m
Clean speech at 4.86m
Query distance 0m
Query distance 0.625m
Query distance 1.25m
Query distance 1.875m
Query distance 2.5m
Query distance 3.125m
Query distance 3.75m
Query distance 4.375m
Query distance 5m
 
                            
D3: Random rooms & random mics
D3 Sample 1
Room size: 6.447*9.141*2.543m, RT60: 0.408s, mic position: 4.466*4.663*1.2m
Speaker distance: 0.76m, 2.23m

Mixture input
Clean speech at 0.76m
Clean speech at 2.23m
Query distance 0m
Query distance 0.625m
Query distance 1.25m
Query distance 1.875m
Query distance 2.5m
Query distance 3.125m
Query distance 3.75m
Query distance 4.375m
Query distance 5m
 
                            
D3 Sample 2
Room size: 7.094*5.166*2.991m, RT60: 0.495s, mic position: 3.626*2.014*1.2m
Speaker distance: 3.08m, 4.01m

Mixture input
Clean speech at 3.08m
Clean speech at 4.01m
Query distance 0m
Query distance 0.625m
Query distance 1.25m
Query distance 1.875m
Query distance 2.5m
Query distance 3.125m
Query distance 3.75m
Query distance 4.375m
Query distance 5m
 
                            
D4: Real RIR
D4 Sample
Room size: 6.447*9.141*2.543m, RT60: 0.408s, mic position: 4.466*4.663*1.2m
Speaker distance: 0.59m, 8.0m
 
                 
Mixture input
Clean speech at 0.59m
Clean speech at 8.0m
Query distance 0m
Query distance 1.25m
Query distance 2.5m
Query distance 3.75m
Query distance 5.0m
Query distance 6.25m
Query distance 7.5m
Query distance 8.75m
Query distance 10.0m
