StyleTTS 2 | JETS | VITS | StyleTTS |
---|---|---|---|
This page contains a set of audio samples in support of the paper. Some examples are randomly selected directly from the sets we used for evaluation.
All utterances were unseen during training, and some were selected to match demo samples of non-public models (e.g. NaturalSpeech 1 & 2 or VALL-E) for comparison purposes.
For more samples, you can download our metadata, which contains all audio clips used for the evaluations along with the survey results, here.
Text: After the construction and action of the machine had been explained, the doctor asked the governor what kind of men he had commanded at Goree,
Ground Truth | StyleTTS 2 | NaturalSpeech | JETS | VITS | StyleTTS |
---|---|---|---|---|---|
Text: The lax discipline maintained in Newgate was still further deteriorated by the presence of two other classes of prisoners who ought never to have been inmates of such a jail.
Ground Truth | StyleTTS 2 | NaturalSpeech | JETS | VITS | StyleTTS |
---|---|---|---|---|---|
Text: Maltby and Co. would issue warrants on them deliverable to the importer, and the goods were then passed to be stored in neighboring warehouses.
Ground Truth | StyleTTS 2 | NaturalSpeech | JETS | VITS | StyleTTS |
---|---|---|---|---|---|
Text: it is not possible to state with scientific certainty that a particular small group of fibers come from a certain piece of clothing.
Ground Truth | StyleTTS 2 | NaturalSpeech | JETS | VITS | StyleTTS |
---|---|---|---|---|---|
This section contains out-of-distribution (OOD) samples with ground-truth audio taken from LibriVox. All 40 clips used in our experiment can be downloaded here. Note the difference in sample quality between this section and the previous section.
Text: Then leaving the corpse within the house they go themselves to and fro about the city and beat themselves, with their garments bound up by a girdle
Ground Truth | StyleTTS 2 | JETS | VITS | StyleTTS |
---|---|---|---|---|
Text: Write your name and address clearly. Mail a note and a duplicate list at the time you send the box.
Ground Truth | StyleTTS 2 | JETS | VITS | StyleTTS |
---|---|---|---|---|
Text: Indeed, she is said to have angled with Napoleonic strategy for that same offer, and to have won it only after a sharp struggle of wits.
Ground Truth | StyleTTS 2 | JETS | VITS | StyleTTS |
---|---|---|---|---|
Text: not as a lifeless thing, but with the same enjoyment of rest as gladdened the hearts of the two beings, who, with gratitude and love
Ground Truth | StyleTTS 2 | JETS | VITS | StyleTTS |
---|---|---|---|---|
This section contains samples from our multi-speaker VCTK model, alongside the reference audio used to generate these samples. Note that StyleTTS 2 faithfully replicates the speaking style (e.g., background noise, pitch, tone, voice) of the reference audio, making its output more similar to the reference than the ground truth is.
Text | Since then physicists have found that it is not reflection, but refraction by the raindrops which causes the rainbows. | Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. | She can scoop these things into three red bags, and we will go meet her Wednesday at the train station. | Jim Wallace, the justice minister, acknowledged that prisoner numbers were a concern. |
---|---|---|---|---|
Ground Truth | | | | |
Reference | | | | |
StyleTTS 2 | | | | |
VITS | | | | |
This section contains samples from our multi-speaker LibriTTS model. For each model, the size of the training set is provided. Note that all speakers are unseen during the training process.
Text | Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech. | Thus did this humane and right minded father comfort his unhappy daughter, and her mother embracing her again, did all she could to soothe her feelings. | And lay me down in my cold bed and leave my shining lot. | The army found the people in poverty and left them in comparative wealth. |
---|---|---|---|---|
Ground Truth | | | | |
Reference | | | | |
StyleTTS 2 (~245 Hours) | | | | |
VALL-E (~60k Hours) | | | | |
NaturalSpeech 2 (~44k Hours) | | | | |
StyleTTS (~245 Hours) | | | | |
VITS (~245 Hours) | | | | |
YourTTS (~300 Hours) | | | | |
Text: As friends thing I definitely I've got more male friends.
Reference | StyleTTS 2 | VALL-E | Ground Truth |
---|---|---|---|
Text: Everything is run by computer but you got to know how to think before you can do a computer.
Reference | StyleTTS 2 | VALL-E | Ground Truth |
---|---|---|---|
Text: Then out in LA you guys got a whole another ball game within California to worry about.
Reference | StyleTTS 2 | VALL-E | Ground Truth |
---|---|---|---|
Emotion | Anger | Sleepy | Amused | Disgusted |
---|---|---|---|---|
Reference | | | | |
StyleTTS 2 | | | | |
VALL-E | | | | |
We provide one example of a long paragraph synthesized using Algorithm 1 described in Appendix B. Our model exhibits robustness against OOD texts in long-form narration.
The paragraph is the "Fresh and Preserved Fruit for the Market" chapter of Canned Fruit, Preserves, and Jellies: Household Methods of Preparation by Maria Parloa. The ground truth clip was taken from LibriVox.
Ground Truth | StyleTTS 2 | JETS | VITS | StyleTTS |
---|---|---|---|---|
The samples below were synthesized using texts generated by GPT-4 in four distinct emotions: happiness, sadness, anger, and surprise. These samples were generated using both LJSpeech and LibriTTS models, in support of Figure 2.
Additionally, we demonstrate the potential to synthesize expressive speech from an unseen speaker for this task using the first speaker (1221-135767) from section 4.
Text | Happy: We are happy to invite you to join us on a journey to the past, where we will visit the most amazing monuments ever built by human hands. | Sad: I am sorry to say that we have suffered a severe setback in our efforts to restore prosperity and confidence. | Angry: The field of astronomy is a joke! Its theories are based on flawed observations and biased interpretations. | Surprised: I can't believe it! You mean to tell me that you have discovered a new species of bacteria in this pond? |
---|---|---|---|---|
LJSpeech | | | | |
Unseen Speaker | | | | |
Text | Happy: He was a merry fellow, this Jack Sheppard, and his exploits were the talk of the town. | Sad: He was condemned to death, and suffered on the gallows at Tyburn, protesting his innocence to the last, and leaving behind him a touching farewell to his wife and children. | Angry: We must angrily reject the status quo that benefits only the rich! | Surprised: Holmes, you astound me! How did you deduce that the murderer was none other than the victim's own brother? |
---|---|---|---|---|
LJSpeech | | | | |
Unseen Speaker | | | | |
Since our model disentangles speech and style vectors, it is capable of style transfer to any input text. This is achieved by first sampling a style with an emotional text and then synthesizing the speech with this emotional style vector.
The ensuing samples were synthesized with styles sampled through style diffusion conditioned on texts with explicit emotions. Note that neither the target text nor the reference audio contains any emotional content.
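The two-step procedure described above (sample a style from an emotional text, then synthesize the neutral target text with that style) can be sketched as a toy data-flow example. This is only an illustration under stated assumptions: `sample_style`, `synthesize`, `style_transfer`, and `STYLE_DIM` are hypothetical names, and the stand-in functions return placeholder values rather than running style diffusion or producing audio.

```python
import hashlib
import random
from typing import Dict, List

STYLE_DIM = 8  # hypothetical size, chosen only for this illustration


def sample_style(style_text: str, dim: int = STYLE_DIM) -> List[float]:
    """Stand-in for style diffusion: derive a pseudo-random style vector from a text.

    The real model samples the style with a diffusion process conditioned on the
    text; here the text is hashed to seed a random generator instead.
    """
    digest = hashlib.sha256(style_text.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def synthesize(text: str, style: List[float]) -> Dict[str, object]:
    """Stand-in for the synthesizer: record which text and style would be combined."""
    return {"text": text, "style": style}


def style_transfer(target_text: str, emotional_text: str) -> Dict[str, object]:
    """Speak `target_text` with a style sampled from `emotional_text`."""
    style = sample_style(emotional_text)   # step 1: style from the emotional text
    return synthesize(target_text, style)  # step 2: neutral text, emotional style


result = style_transfer(
    target_text="Yea, his honourable worship is within.",
    emotional_text="We must angrily reject the status quo!",
)
```

The point of the sketch is that the text that determines the style and the text that is spoken are independent inputs, which is what lets an emotionally neutral sentence be rendered in an angry, happy, sad, or surprised style.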
Text: Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.
Model | Angry | Happy | Sad | Surprised |
---|---|---|---|---|
StyleTTS 2 | | | | |
VITS | | | | |
FastDiff | | | | |
StyleTTS 2 (unseen speaker) | | | | |
This section provides samples from ablated models using text from the test-clean subset of the LibriTTS dataset. The differences may be nuanced, so below we have outlined each model variant and its implications:
Text: The answer to this will depend upon the length of the play, for upon the length depends the hour at which the curtain rises.
Baseline | No Style Diffusion | No Prosodic Style Encoder | No SLM Discriminator | No Differentiable Upsampler | No OOD Texts |
---|---|---|---|---|---|
Text: Well, sir, we never make coffee but in the afternoon. Would you like a good bavaroise, or a decanter of orgeat?
Baseline | No Style Diffusion | No Prosodic Style Encoder | No SLM Discriminator | No Differentiable Upsampler | No OOD Texts |
---|---|---|---|---|---|
Text: On the evening of the day of Alexandra's call at the Shabatas', a heavy rain set in.
Baseline | No Style Diffusion | No Prosodic Style Encoder | No SLM Discriminator | No Differentiable Upsampler | No OOD Texts |
---|---|---|---|---|---|
Text: Ojo was hungry, though; so he divided the piece of bread upon the table and ate his half for breakfast, washing it down with fresh, cool water from the brook.
Baseline | No Style Diffusion | No Prosodic Style Encoder | No SLM Discriminator | No Differentiable Upsampler | No OOD Texts |
---|---|---|---|---|---|