Disentangling Voice and Content with Self-Supervision for Speaker Recognition

Advancing Speaker Recognition through Disentanglement Framework

Automatic speaker recognition aims to identify a person from his or her voice based on speech recordings. It is difficult to extract an accurate speaker representation from speech because of its mixture of speaker traits and content. This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.

The framework is realised with three Gaussian inference layers, each consisting of a learnable transition model that extracts a distinct speech component. Notably, a strengthened transition model is specifically designed to model complex speech dynamics. We also propose a self-supervision method to dynamically disentangle content without the use of labels other than speaker identities. The efficacy of the proposed framework is validated via experiments on the widely used VoxCeleb and Speakers in the Wild (SITW) datasets, with average improvements of 9.56% in equal error rate (EER) and 8.24% in minimum detection cost function (minDCF).
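For readers unfamiliar with the two criteria, the following is a minimal sketch of how EER and minDCF are commonly computed from verification trial scores. It is not the evaluation code used in the paper: the function names, the toy scores, and the cost parameters (`p_target=0.01`, `c_miss=1`, `c_fa=1`) are assumptions chosen for illustration.

```python
import math


def error_rates(target_scores, nontarget_scores, threshold):
    """Return (miss rate, false-alarm rate) when a trial is accepted if score >= threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by scanning observed scores as candidate thresholds.

    The EER is the operating point where miss rate and false-alarm rate are
    equal; with finite trials we take the threshold where they are closest.
    """
    best_gap, best_eer = None, None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss, p_fa = error_rates(target_scores, nontarget_scores, thr)
        gap = abs(p_miss - p_fa)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (p_miss + p_fa) / 2
    return best_eer


def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalised detection cost over all thresholds.

    p_target, c_miss and c_fa are assumed values, not taken from the paper.
    """
    # Include +inf so the "reject every trial" operating point is considered.
    thresholds = sorted(set(target_scores) | set(nontarget_scores)) + [math.inf]
    # Cost of the better trivial system (always accept or always reject).
    norm = min(c_miss * p_target, c_fa * (1.0 - p_target))
    costs = []
    for thr in thresholds:
        p_miss, p_fa = error_rates(target_scores, nontarget_scores, thr)
        costs.append((c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa) / norm)
    return min(costs)


# Toy scores (hypothetical): higher means "more likely the same speaker".
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
print("EER:", equal_error_rate(targets, nontargets))
print("minDCF:", min_dcf(targets, nontargets))
```

For these toy scores the EER is 0.25: at a threshold of 0.6, one of four target trials is rejected and one of four non-target trials is accepted. Lower values are better for both criteria.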

A Breakthrough for Advanced Speaker Verification and Beyond

Disentangling various components from speech holds promise for enhancing multiple downstream speech tasks, such as speaker verification and automatic speech recognition (ASR). In our study, we validate this concept by effectively disentangling content and speaker characteristics from speech. Leveraging the resulting speaker representations significantly improves speaker verification performance. Additionally, we address the challenge of missing labels within the content disentanglement process. This reduces the training cost, making the framework more suitable for real-world applications. The disentangled content can further be utilized for ASR tasks or combined with different disentangled speaker representations to achieve voice conversion.

Our work not only yields advantages for speaker verification but also serves as a breakthrough in speech disentanglement. This is accomplished by recasting the speech disentanglement problem as one of modeling both static and dynamic speech components, and by utilizing multi-layer Gaussian inference to separate them.

This breakthrough serves as an important foundation framework and paves the way for disentangling all speech components in the future, mirroring the nature and capability of the human brain system.

It is the significance of this breakthrough that won us the Best Paper award in the 2023 International Doctoral Forum, organized by The Chinese University of Hong Kong (CUHK) and Microsoft, against entries from prestigious universities and research institutes such as CUHK, The Hong Kong Polytechnic University, The Hong Kong University of Science and Technology, Tsinghua University, Microsoft, and the University of Science and Technology of China.

Practical Applications of Speech Disentanglement in Neural Networks

Research into speech disentanglement learning offers benefits across diverse speech-related tasks, including but not limited to speaker verification, speech recognition, voice conversion, and speech privacy preservation. This approach exhibits significant potential for neural networks to learn multiple speech tasks jointly, understand their interrelations, and ultimately enhance overall task performance. The prospect of developing a unified system akin to the human brain is promising. For the speaker verification task that is the focus of this work, neither additional model training nor additional data is specifically needed, so the framework is easily applicable in practical use with state-of-the-art performance.

The Next Steps

Our next stage involves reconstructing speech signals from the disentangled components. We aim to leverage the interactions among speaker verification, automatic speech recognition, voice privacy preservation, and voice conversion, so that these tasks benefit one another. Looking ahead, our goal is to disentangle the various components of speech accurately, thereby constructing a unified system capable of handling all speech tasks. This approach offers the advantage of learning and utilising the interactions between different speech-related tasks. Additionally, this unified speech system aims to cultivate language and speech capabilities akin to those of humans, with plans to build a large speech model.

You can read the paper on Disentangling Voice and Content with Self-Supervision for Speaker Recognition here.
