The goal of natural language processing (NLP) is to create models that can understand and generate language much as humans do. These models, known as language models, exploit the statistical patterns and structure of text to predict and generate coherent sequences of words. With the power of the Transformer architecture and training on large amounts of text, large language models (LLMs) have driven huge advances across NLP tasks.
But here’s the thing: modeling spoken human language is a whole different ballgame. Traditionally, spoken dialog systems have been built from separate components for speech recognition, natural language understanding, response generation, and text-to-speech. There has been far less success with end-to-end systems for modeling spoken language – systems that take speech as input and generate speech as output.
That is, until now. We introduce Spectron, a groundbreaking spoken language model that is trained end-to-end to directly process spectrograms as both input and output. Unlike other models, Spectron doesn’t rely on learning discrete speech representations. Instead, it fine-tunes a pre-trained text language model to generate high-quality spoken language that is both semantically accurate and natural.
A key strength of Spectron is its ability to retain the knowledge of the original LLM. By connecting the encoder of a speech recognition model to a pre-trained Transformer-based decoder language model, we enable end-to-end training and achieve state-of-the-art performance without sacrificing accuracy. This is made possible by a novel training objective that jointly supervises speech recognition, text continuation, and conditional speech synthesis.
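To make the joint objective concrete, here is a minimal sketch of how three supervision signals can be combined into one loss. This is an illustration, not the actual Spectron implementation: the cross-entropy helper, the L1 spectrogram term, and the equal weights `w_asr`, `w_cont`, and `w_spec` are all assumptions for the example.

```python
import math
import numpy as np

def cross_entropy(logits, targets):
    # Softmax cross-entropy averaged over sequence positions.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(targets)), targets])

def joint_objective(asr_logits, asr_targets,
                    cont_logits, cont_targets,
                    pred_spec, true_spec,
                    w_asr=1.0, w_cont=1.0, w_spec=1.0):
    # Hypothetical weighted sum of the three supervision signals:
    # transcribing the prompt, continuing the text, and
    # regressing the continuation's spectrogram frames.
    l_asr = cross_entropy(asr_logits, asr_targets)      # speech recognition
    l_cont = cross_entropy(cont_logits, cont_targets)   # text continuation
    l_spec = np.mean(np.abs(pred_spec - true_spec))     # speech synthesis (L1)
    return w_asr * l_asr + w_cont * l_cont + w_spec * l_spec
```

Because all three terms are differentiable, a single backward pass through the shared encoder–decoder updates every component at once, which is what "joint supervision" buys over training the pieces separately.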
Another crucial aspect of Spectron is its use of a spectrogram regression loss. This loss function ensures that the model matches not only the spectrogram itself but also its higher-order derivatives along both the time and frequency axes. These derivatives capture how the signal changes shape, giving our model a deeper understanding of the spoken language.
Here’s a summary of the Spectron architecture: the speech encoder takes the spectrogram of the source speech as input and generates a hidden representation that combines linguistic and acoustic information. This representation is then fed into the decoder, which uses it as a prefix to generate both text and speech continuations. The entire architecture is initialized with a pre-trained speech encoder and a pre-trained decoder language model.
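The data flow above can be sketched schematically. The toy "encoder" and "decoder" below are random linear stand-ins, not the pre-trained models Spectron actually uses; all dimensions (`N_FRAMES`, `N_MELS`, `D_MODEL`, `VOCAB`) are hypothetical, and the point is only the shape of the pipeline: spectrogram in, hidden prefix, then text logits and spectrogram frames out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration.
N_FRAMES, N_MELS, D_MODEL, VOCAB = 50, 80, 64, 100

def speech_encoder(spectrogram, w_enc):
    # Stand-in for the pre-trained speech encoder: maps spectrogram
    # frames to hidden vectors mixing linguistic and acoustic information.
    return np.tanh(spectrogram @ w_enc)

def decoder_step(prefix, w_text, w_spec):
    # Stand-in for the pre-trained LM decoder: conditioned on the
    # encoder's output as a prefix, it predicts both the next text
    # token's logits and the next spectrogram frame.
    h = prefix.mean(axis=0)  # crude pooling in place of attention
    return h @ w_text, h @ w_spec

w_enc = rng.normal(size=(N_MELS, D_MODEL)) * 0.1
w_text = rng.normal(size=(D_MODEL, VOCAB)) * 0.1
w_spec = rng.normal(size=(D_MODEL, N_MELS)) * 0.1

source = rng.normal(size=(N_FRAMES, N_MELS))          # input spectrogram
prefix = speech_encoder(source, w_enc)                # (50, 64) hidden prefix
text_logits, next_frame = decoder_step(prefix, w_text, w_spec)
```

The key design point the sketch preserves is that one decoder emits both modalities from the same prefix, which is what lets the text head and the speech head share the LLM's knowledge.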
To evaluate the performance of Spectron, we conducted experiments on the Libri-Light dataset, which consists of unlabelled read speech. We compared our model with other spoken language models: AudioLM, GSLM, TWIST, and SpeechGPT. On log-perplexity, which measures the cohesion and semantic quality of the generated continuation, Spectron outperforms most models and does slightly better than AudioLM. On mean opinion score (MOS), which measures how natural the speech sounds to human raters, Spectron exceeds the performance of all other models except AudioLM. And on speaker similarity, which measures how closely the generated speech matches the input speaker's voice, Spectron outperforms all other models.
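For readers unfamiliar with the metric, log-perplexity is simply the mean negative log-likelihood that a scoring language model assigns to the generated tokens; lower values mean the continuation reads as more coherent. A minimal illustration (the token probabilities are made up):

```python
import math

def log_perplexity(token_log_probs):
    # Mean negative log-likelihood per token under a scoring LM.
    # Lower is better: likely tokens => low log-perplexity.
    return -sum(token_log_probs) / len(token_log_probs)

# A continuation whose tokens the scoring LM finds likely...
likely = [math.log(0.5)] * 4
# ...scores lower than one full of improbable tokens.
unlikely = [math.log(0.01)] * 4
```

In the paper's setup this score is computed on transcripts of the generated speech, so it rewards continuations that stay on topic and grammatical.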
But Spectron doesn’t stop at speech continuation. We also tested its ability to answer questions using the LLama Questions and Spoken WebQuestions datasets. Our model achieved high accuracy in answering questions, outperforming other models on both datasets.
In conclusion, Spectron is a game-changer in the field of spoken language modeling. By directly processing spectrograms and using an end-to-end approach, it produces high-quality spoken language that is semantically accurate and natural. With its impressive performance on various metrics, Spectron takes us one step closer to truly understanding and generating human-like speech.