The main idea behind SpeechT5 is to pre-train a single model on a mixture of text-to-speech, speech-to-text, text-to-text, and speech-to-speech data. This way, the model learns from text and speech at the same time. The result of this pre-training approach is a model that has a unified space of hidden representations shared by both text and speech.

At the heart of SpeechT5 is a regular Transformer encoder-decoder model. Just like any other Transformer, the encoder-decoder network models a sequence-to-sequence transformation using hidden representations. This Transformer backbone is the same for all SpeechT5 tasks.

To make it possible for the same Transformer to deal with both text and speech data, so-called pre-nets and post-nets were added. It is the job of the pre-net to convert the input text or speech into the hidden representations used by the Transformer. The post-net takes the outputs from the Transformer and turns them into text or speech again.

A figure illustrating SpeechT5's architecture is depicted below (taken from the original paper).

During pre-training, all of the pre-nets and post-nets are used simultaneously. After pre-training, the entire encoder-decoder backbone is fine-tuned on a single task. Such a fine-tuned model only uses the pre-nets and post-nets specific to the given task. For example, to use SpeechT5 for text-to-speech, you'd swap in the text encoder pre-net for the text inputs and the speech decoder pre- and post-nets for the speech outputs.
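To make the text-to-speech setup concrete, here is a minimal sketch of how a fine-tuned SpeechT5 TTS model can be run with the Hugging Face Transformers classes. It assumes the `SpeechT5Processor`, `SpeechT5ForTextToSpeech`, and `SpeechT5HifiGan` classes and the `microsoft/speecht5_tts` / `microsoft/speecht5_hifigan` checkpoints; the zero speaker embedding is a placeholder for illustration, not something you'd use for real synthesis.

```python
# Minimal sketch: text-to-speech with a fine-tuned SpeechT5 checkpoint.
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# The text encoder pre-net turns the tokenized text into the hidden
# representations used by the shared Transformer backbone.
inputs = processor(text="Hello, my dog is cute.", return_tensors="pt")

# The decoder is conditioned on a 512-dimensional speaker embedding (x-vector).
# A zero vector is used here only as a placeholder; in practice you would load
# one from a speaker-embedding dataset such as Matthijs/cmu-arctic-xvectors.
speaker_embeddings = torch.zeros(1, 512)

# generate_speech runs the Transformer plus the speech decoder pre/post-nets to
# produce a mel spectrogram, which the HiFi-GAN vocoder turns into a waveform.
speech = model.generate_speech(
    inputs["input_ids"], speaker_embeddings, vocoder=vocoder
)
print(speech.shape)  # 1-D waveform tensor at 16 kHz
```

Note how only the text encoder pre-net and the speech decoder pre- and post-nets are involved here; the other pre-nets and post-nets from pre-training are simply not part of this fine-tuned checkpoint.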