AES E-Library

Multichannel speaker diarization with arbitrary microphone arrays

Speaker diarization remains a field with potential for improvement. In meeting scenarios, the task of labeling audio with the corresponding speaker identities can be further assisted by exploiting spatial features. In the present work, a framework is designed to evaluate the combination of speaker embeddings with Time Difference of Arrival (TDOA) values. Speaker embeddings are extracted using two popular pre-trained models, ECAPA-TDNN and x-vectors. TDOA values for every speech segment are calculated using the Generalized Cross-Correlation (GCC) method with Phase Transform (PHAT) weighting (GCC-PHAT). The outputs of the GCC-PHAT and deep neural network (DNN) systems are fused by concatenation and used as the input to spectral clustering. The objective of the proposed framework is to evaluate the potential of exploiting the microphone arrays available in meetings and to investigate the complementary information between TDOA and speaker embeddings. The system is evaluated on two datasets: the AVLab Speaker Localization dataset and a multichannel dataset created in the context of the present work. Furthermore, an additional dataset recorded with the embedded microphones of mobile phones is created and openly distributed to help research groups address complex problems such as speaker localization and diarization with arbitrary arrays comprising microphones of different characteristics and quality.
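
The TDOA estimation and fusion steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `gcc_phat_tdoa` and `fuse_features`, the `max_tau` parameter, and the absence of any feature scaling before concatenation are assumptions made for illustration only; the abstract does not specify these details. The spectral clustering stage is omitted.

```python
import numpy as np


def gcc_phat_tdoa(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` (in seconds) with GCC-PHAT.

    A positive result means `sig` arrives later than `ref`.
    Hypothetical helper: name and signature are illustrative.
    """
    n = len(sig) + len(ref)              # zero-pad so the correlation is linear, not circular
    spec_sig = np.fft.rfft(sig, n=n)
    spec_ref = np.fft.rfft(ref, n=n)
    cross = spec_sig * np.conj(spec_ref)  # cross-power spectrum
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: discard magnitude, keep phase
    cc = np.fft.irfft(cross, n=n)         # generalized cross-correlation

    max_shift = n // 2
    if max_tau is not None:               # optionally restrict the search to plausible lags
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so that index `max_shift` corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs


def fuse_features(embedding, tdoas):
    """Concatenate one segment's speaker embedding with its TDOA vector.

    Assumption: plain concatenation with no normalization or weighting.
    """
    embedding = np.asarray(embedding, dtype=float).ravel()
    tdoas = np.asarray(tdoas, dtype=float).ravel()
    return np.concatenate((embedding, tdoas))
```

In such a setup, one TDOA would typically be computed per microphone pair for each speech segment, and the resulting vector appended to that segment's embedding before clustering. Because embeddings and TDOA values live on very different numeric scales, some normalization may be needed in practice; the sketch leaves that choice open.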

 

Author (s):
Affiliation: (See document for exact affiliation information.)
AES Convention: Paper Number:
Publication Date:
Session subject:

DOI:


