Audio Samples from "StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models"

[arXiv] [GitHub Repo]
Abstract: In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS synthesis on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.


This page contains a set of audio samples in support of the paper. Some examples are randomly selected directly from the sets we used for evaluation.

All utterances were unseen during training, and some were selected to match demo samples of non-public models (e.g., NaturalSpeech 1 & 2 or VALL-E) for comparison purposes.

For more samples, you can download our metadata, which contains all audio clips used in the evaluations along with the survey results, here.

1. Single-Speaker (LJSpeech, In-Distribution Texts)


Text: After the construction and action of the machine had been explained, the doctor asked the governor what kind of men he had commanded at Goree,

Ground Truth StyleTTS 2 NaturalSpeech JETS VITS StyleTTS

Text: The lax discipline maintained in Newgate was still further deteriorated by the presence of two other classes of prisoners who ought never to have been inmates of such a jail.

Ground Truth StyleTTS 2 NaturalSpeech JETS VITS StyleTTS

Text: Maltby and Co. would issue warrants on them deliverable to the importer, and the goods were then passed to be stored in neighboring warehouses.

Ground Truth StyleTTS 2 NaturalSpeech JETS VITS StyleTTS

Text: it is not possible to state with scientific certainty that a particular small group of fibers come from a certain piece of clothing.

Ground Truth StyleTTS 2 NaturalSpeech JETS VITS StyleTTS

2. Single-Speaker (LJSpeech, Out-Of-Distribution Texts)


This section contains out-of-distribution (OOD) samples, with ground-truth audio taken from LibriVox. All 40 clips used in our experiment can be downloaded here. Note the difference in sample quality between this section and the previous one.

Text: Then leaving the corpse within the house they go themselves to and fro about the city and beat themselves, with their garments bound up by a girdle

Ground Truth StyleTTS 2 JETS VITS StyleTTS

Text: Write your name and address clearly. Mail a note and a duplicate list at the time you send the box.

Ground Truth StyleTTS 2 JETS VITS StyleTTS

Text: Indeed, she is said to have angled with Napoleonic strategy for that same offer, and to have won it only after a sharp struggle of wits.

Ground Truth StyleTTS 2 JETS VITS StyleTTS

Text: not as a lifeless thing, but with the same enjoyment of rest as gladdened the hearts of the two beings, who, with gratitude and love

Ground Truth StyleTTS 2 JETS VITS StyleTTS

3. Multi-Speaker (VCTK)


This section contains samples from our multi-speaker VCTK model, alongside the reference audio used to generate each sample. Note that StyleTTS 2 faithfully replicates the speaking style of the reference audio (e.g., background noise, pitch, tone, and voice), so its output is more similar to the reference than the ground truth is.

Text 1: Since then physicists have found that it is not reflection, but refraction by the raindrops which causes the rainbows.

Text 2: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.

Text 3: She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.

Text 4: Jim Wallace, the justice minister, acknowledged that prisoner numbers were a concern.

Systems compared: Ground Truth, Reference, StyleTTS 2, VITS

4. Zero-shot Speaker Adaptation (LibriTTS)


This section contains samples from our multi-speaker LibriTTS model. For each model, the size of the training set is provided. Note that all speakers are unseen during the training process.

Text 1: Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.

Text 2: Thus did this humane and right minded father comfort his unhappy daughter, and her mother embracing her again, did all she could to soothe her feelings.

Text 3: And lay me down in my cold bed and leave my shining lot.

Text 4: The army found the people in poverty and left them in comparative wealth.

Systems compared (training set size): Ground Truth, Reference, StyleTTS 2 (~245 Hours), VALL-E (~60k Hours), NaturalSpeech 2 (~44k Hours), StyleTTS (~245 Hours), VITS (~245 Hours), YourTTS (~300 Hours)

Notably, our model matches the generalization capabilities of VALL-E, including preserving the acoustic environment and the speaker's emotion, with roughly 250x less training data (~245 hours versus ~60k hours). The samples that follow are sourced from the official VALL-E demo page.
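The data ratio quoted above can be checked against the training-set sizes listed for each model in this section. The short sketch below only redoes that arithmetic; the dictionary and the `data_ratio` helper are names introduced here for illustration, and the hour counts are the approximate values shown on this page.

```python
# Approximate training-set sizes in hours, as listed in this section.
TRAIN_HOURS = {
    "StyleTTS 2": 245,
    "VALL-E": 60_000,
    "NaturalSpeech 2": 44_000,
}


def data_ratio(baseline: str, model: str = "StyleTTS 2") -> float:
    """Return how many times more training data `baseline` used than `model`."""
    return TRAIN_HOURS[baseline] / TRAIN_HOURS[model]


if __name__ == "__main__":
    for name in ("VALL-E", "NaturalSpeech 2"):
        print(f"{name}: about {data_ratio(name):.0f}x the data of StyleTTS 2")
```

With these rounded inputs the VALL-E ratio comes out at about 245, which is consistent with the rounded 250x figure.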

Acoustic Environment Maintenance

Text: As friends thing I definitely I've got more male friends.

Reference StyleTTS 2 VALL-E Ground Truth

Text: Everything is run by computer but you got to know how to think before you can do a computer.

Reference StyleTTS 2 VALL-E Ground Truth

Text: Then out in LA you guys got a whole another ball game within California to worry about.

Reference StyleTTS 2 VALL-E Ground Truth

Speaker’s Emotion Maintenance

Text: We have to reduce the number of plastic bags.

Emotions: Anger, Sleepy, Amused, Disgusted

Systems compared: Reference, StyleTTS 2, VALL-E

5. Longform Narration


We provide one example of a long paragraph synthesized using Algorithm 1, described in Appendix B. Our model exhibits robustness against OOD texts in longform narration.

The paragraph is from the chapter "Fresh and Preserved Fruit for the Market" of "Canned Fruit, Preserves, and Jellies: Household Methods of Preparation" by Maria Parloa. The ground truth clip was taken from LibriVox.

Ground Truth StyleTTS 2 JETS VITS StyleTTS

6. Speech Expressiveness


The samples below were synthesized using texts generated by GPT-4 in four distinct emotions: happiness, sadness, anger, and surprise. These samples were generated using both LJSpeech and LibriTTS models, in support of Figure 2.

Additionally, we demonstrate the potential to synthesize expressive speech from an unseen speaker for this task, using the first speaker (1221-135767) from Section 4.

Text

Happy: We are happy to invite you to join us on a journey to the past, where we will visit the most amazing monuments ever built by human hands.

Sad: I am sorry to say that we have suffered a severe setback in our efforts to restore prosperity and confidence.

Angry: The field of astronomy is a joke! Its theories are based on flawed observations and biased interpretations.

Surprised: I can't believe it! You mean to tell me that you have discovered a new species of bacteria in this pond?

Samples: LJSpeech, Unseen Speaker

Text

Happy: He was a merry fellow, this Jack Sheppard, and his exploits were the talk of the town.

Sad: He was condemned to death, and suffered on the gallows at Tyburn, protesting his innocence to the last, and leaving behind him a touching farewell to his wife and children.

Angry: We must angrily reject the status quo that benefits only the rich!

Surprised: Holmes, you astound me! How did you deduce that the murderer was none other than the victim's own brother?

Samples: LJSpeech, Unseen Speaker

Style Transfer

Since our model disentangles speech and style vectors, it is capable of style transfer to any input text. This is achieved by first sampling a style conditioned on an emotional text and then synthesizing the target text with that emotional style vector.
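The two steps described above can be sketched as follows. This is a toy illustration of the data flow only: `sample_style`, `synthesize`, and `style_transfer` are hypothetical stand-ins introduced here, not functions from the StyleTTS 2 codebase, and the "style vector" is a seeded pseudo-random tuple rather than the output of a diffusion model.

```python
import hashlib
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    text: str     # what is spoken
    style: tuple  # style vector controlling how it is spoken


def sample_style(conditioning_text: str, dim: int = 4, seed: int = 0) -> tuple:
    """Hypothetical stand-in for style diffusion.

    Returns a pseudo-random vector that depends only on the conditioning
    text and the seed, so the same text always yields the same style.
    """
    digest = hashlib.sha256(f"{seed}:{conditioning_text}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return tuple(rng.gauss(0.0, 1.0) for _ in range(dim))


def synthesize(target_text: str, style: tuple) -> Utterance:
    """Hypothetical stand-in for the decoder.

    Pairs the text with the style instead of producing a waveform.
    """
    return Utterance(text=target_text, style=style)


def style_transfer(target_text: str, emotional_text: str, seed: int = 0) -> Utterance:
    # Step 1: sample a style conditioned on the emotional text.
    style = sample_style(emotional_text, seed=seed)
    # Step 2: synthesize the (possibly neutral) target text with that style.
    return synthesize(target_text, style)


if __name__ == "__main__":
    neutral = "Yea, his honourable worship is within."
    angry = "We must angrily reject the status quo that benefits only the rich!"
    result = style_transfer(neutral, angry)
    print(result.text)
    print(result.style)
```

The point of the sketch is that the words come from the target text while the style comes entirely from the emotional text, so the two inputs can be chosen independently.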

The ensuing samples were synthesized with styles sampled through style diffusion conditioned on texts with explicit emotions. Note that neither the target text nor the reference audio contains any emotional content.

Text: Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.

Angry Happy Sad Surprised

7. Speech Diversity


Text: How much variation is there?

StyleTTS 2
VITS
FastDiff
StyleTTS 2 (unseen speaker)

8. Ablation Study


This section provides samples from ablated models using text from the test-clean subset of the LibriTTS dataset. The differences may be nuanced, so below we have outlined each model variant and its implications:

Text: The answer to this will depend upon the length of the play, for upon the length depends the hour at which the curtain rises.

Baseline No Style Diffusion No Prosodic Style Encoder No SLM Discriminator No Differentiable Upsampler No OOD Texts

Text: Well, sir, we never make coffee but in the afternoon. Would you like a good bavaroise, or a decanter of orgeat?

Baseline No Style Diffusion No Prosodic Style Encoder No SLM Discriminator No Differentiable Upsampler No OOD Texts

Text: On the evening of the day of Alexandra's call at the Shabatas', a heavy rain set in.

Baseline No Style Diffusion No Prosodic Style Encoder No SLM Discriminator No Differentiable Upsampler No OOD Texts

Text: Ojo was hungry, though; so he divided the piece of bread upon the table and ate his half for breakfast, washing it down with fresh, cool water from the brook.

Baseline No Style Diffusion No Prosodic Style Encoder No SLM Discriminator No Differentiable Upsampler No OOD Texts