After ChatGPT and DALL-E, meet VALL-E - the text-to-speech AI that can mimic anyone’s voice
Last year saw the emergence of artificial intelligence (AI) tools that can create images, artwork, or even video from a text prompt.
There were also major steps forward in AI writing, with OpenAI’s ChatGPT causing widespread excitement - and fear - about the future of writing.
Now, just a few days into 2023, another powerful use case for AI has stepped into the limelight - a text-to-voice tool that can impeccably mimic a person’s voice.
Developed by Microsoft, VALL-E can take a three-second recording of someone’s voice, and replicate that voice, turning written words into speech, with realistic intonation and emotion depending on the context of the text.
Trained with 60,000 hours' worth of English speech recordings, it can deliver speech in a “zero-shot situation,” which means without any prior examples or training in a specific context or situation.
Introducing VALL-E in a paper posted on arXiv, the preprint site hosted by Cornell University, the developers explained that the recording data came from more than 7,000 unique speakers.
The team say their Text To Speech system (TTS) used hundreds of times more data than the existing TTS systems, helping them to overcome the zero-shot issue.
The tool is not currently available for public use - but it does raise questions about safety, given it could feasibly be used to make anybody’s voice appear to say any text.
Microsoft betting big on AI
Chart showing how VALL-E works. Credit: Microsoft
Its creators have, however, provided a demo, showcasing a number of three-second speaker prompts and a demonstration of the text-to-speech in action, with the voice correctly mimicked.
Alongside the speaker prompt and VALL-E’s output, you can compare the results with the “ground truth” - the actual speaker reading the prompt text - and the “baseline” result from current TTS technology.
Microsoft has invested heavily in AI and is one of the backers of OpenAI, the company behind ChatGPT and DALL-E, a text-to-image or art tool.
The software giant invested $1 billion (€930 million) in OpenAI in 2019, and a report this week on semafor.com stated it was looking at investing another $10 billion (€9.3 billion) in the company.
Microsoft unveils VALL-E, an advanced text-to-speech AI that can speak in anyone’s voice based on a 3-second sample
Microsoft has revealed details of its latest foray into the world of artificial intelligence. Billed as a “neural codec language model”, VALL-E is an advanced AI-driven text-to-speech (TTS) system that the developers say can be made to speak like anyone based on just a three-second sample of their voice.
The result is an incredibly natural-sounding TTS system that takes an entirely different approach to existing systems. Able to convey tone and emotion better than ever, VALL-E sounds realistically human, but there are concerns that it could be used for audio deepfakes.
The AI has been built and trained using 60,000 hours of audio input from thousands of individuals, including public domain audio books. Working with a short sample, VALL-E is able to closely mimic the tone and timbre of a voice in a way that has simply not been possible previously.
Writing about VALL-E, a team of Microsoft researchers say:
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.
The team goes on to say: “Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis”.
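To make the phrase “conditional language modeling” more concrete, here is a minimal toy sketch in Python. It is not Microsoft’s implementation: the codebook, the function names and the random stand-in for a trained model are all assumptions made for illustration. It shows only the general shape of the idea described in the abstract, in which audio is turned into discrete codes and new codes are then generated one at a time, conditioned on the text and on the codes of a short voice prompt.

```python
import random


def quantize(frames, codebook):
    """Map each continuous frame value to the index of its nearest codebook entry.

    A real neural audio codec learns its codebook; this one is hand-picked (assumption).
    """
    return [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))
        for x in frames
    ]


def build_context(text_tokens, prompt_codes):
    """Concatenate text tokens and acoustic-prompt codes into one conditioning sequence."""
    return [("text", t) for t in text_tokens] + [("audio", c) for c in prompt_codes]


def toy_next_code(context, vocab_size, rng):
    """Hypothetical stand-in for a trained language model.

    A real model would predict the next code from the context; this one
    ignores the context and draws a pseudo-random code.
    """
    return rng.randrange(vocab_size)


def synthesize(text_tokens, prompt_codes, n_new, vocab_size, seed=0):
    """Generate n_new audio codes autoregressively, feeding each one back in."""
    rng = random.Random(seed)
    context = build_context(text_tokens, prompt_codes)
    generated = []
    for _ in range(n_new):
        code = toy_next_code(context, vocab_size, rng)
        generated.append(code)
        context.append(("audio", code))  # the new code becomes part of the context
    return generated


if __name__ == "__main__":
    codebook = [-1.0, -0.5, 0.0, 0.5, 1.0]       # illustrative values only
    prompt = quantize([0.1, -0.9, 0.6, 0.26], codebook)
    print(prompt)                                 # [2, 0, 3, 3]
    print(synthesize(["hel", "lo"], prompt, n_new=6, vocab_size=len(codebook)))
```

In a real system the generated codes would be passed back through the codec’s decoder to produce a waveform; that step is omitted here.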
You can find out more over on the VALL-E demo page where there are numerous samples of how it sounds based on various training inputs.
Image credit: ra2studio / depositphotos
VALL-E AI can mimic a person’s voice from a 3-second snippet
Microsoft researchers are working on a text-to-speech (TTS) model that can mimic a person’s voice – complete with emotion and intonation – after a mere three seconds of training.
The technology – called VALL-E and outlined in a 15-page research paper released this month on the arXiv research site – is a significant step forward for Microsoft. TTS is a highly competitive niche that includes other heavyweights such as Google, Amazon, and Meta.
Redmond is already using artificial intelligence for natural language processing (NLP) through its Nuance business, which it bought for $20 billion last year and which includes both speech recognition and TTS technology. And it’s aggressively investing in and using technology from startup OpenAI – including its ChatGPT tool – possibly in its Bing search engine and its Office suite of applications.
A demo of VALL-E can be found on GitHub.
In the paper, the researchers argue that while the rise of neural networks and end-to-end modeling has rapidly improved the technologies around speech synthesis, there are still problems with the similarity of the voices used and the lack of natural speaking patterns in TTS products. They aren’t the robotic voices of a decade or two ago, but they also don’t come off as completely human either.
Caveats
A lot of work is being put into improving this, but there are serious challenges according to the Microsoft eggheads. Some require clean voice data from a recording studio to capture high-quality speech. And they need to rely on relatively small amounts of training data – large-scale speech libraries found on the internet are not clean enough for the work.
For current zero-shot TTS generators – where the software uses samples not included in the training – the work is complex. It can take hours for the system to apply a person’s voice to typed text.
“Instead of designing a complex and specific network for this problem, the ultimate solution is to train a model with large and diverse data as much as possible, motivated by success in the field of text synthesis,” the researchers wrote, noting that the amount of data being used in text language models in recent years has grown from 16GB of uncompressed text to about a terabyte.
VALL-E is “the first language model-based TTS framework leveraging large, diverse, and multi-speaker speech data,” according to the boffins.
They trained VALL-E with Libri-Light – an open source dataset from Meta that includes 60,000 hours of English speech with more than 7,000 unique speakers. By comparison, other TTS systems are trained using dozens of hours of single-speaker data or hundreds of hours with data from multiple speakers.
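A quick back-of-the-envelope calculation puts those figures in perspective. The sketch below uses the 60,000 hours and 7,000 speakers quoted above; the two baseline sizes (200 and 600 hours) are illustrative assumptions standing in for “hundreds of hours”, not figures from the paper.

```python
TOTAL_HOURS = 60_000   # size of the training set, as reported
SPEAKERS = 7_000       # "more than 7,000", so the true average is slightly lower

# Average amount of speech per speaker, in hours.
avg_hours_per_speaker = TOTAL_HOURS / SPEAKERS


def scale_factor(baseline_hours):
    """How many times larger the 60,000-hour set is than a given baseline."""
    return TOTAL_HOURS / baseline_hours


if __name__ == "__main__":
    print(round(avg_hours_per_speaker, 1))   # 8.6
    print(scale_factor(600))                 # 100.0 (hypothetical 600-hour baseline)
    print(scale_factor(200))                 # 300.0 (hypothetical 200-hour baseline)
```

On those assumed baselines the dataset is between 100 and 300 times larger, which is in line with the researchers’ “hundreds of times” description.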
VALL-E can keep the acoustic environment of the voice. So if the snippet of voice used as the acoustic prompt in the model is recorded on the telephone, the synthesized spoken text would also sound like it’s coming through the phone.
The capturing of emotion is similar, the researchers claim. If the few seconds of recorded voice used as the acoustic prompt convey anger, then the synthesized speech based on that voice will also display anger.
The result is a TTS model that outperforms others in such areas as natural sounding speech and speaker similarity. Testing also indicates that “the synthesized speech of unseen speakers is as natural as human recordings,” they assert.
The researchers noted some issues that need to be resolved – including that some words in the synthesized speech end up missing, are unclear, or are duplicated. There also isn’t enough coverage of speakers with accents, and there needs to be greater diversity in speaking styles.
The global TTS market is estimated to grow to tens of billions of dollars by the end of the decade, with both established players and startups driving development of the technology. Microsoft’s Nuance business has its own TTS product and the software behemoth offers a TTS service in Azure. Amazon has Polly, Meta has Meta-TTS, and Google Cloud also offers a service.
All that makes for a crowded space.
The rapid improvement in the technology raises various ethical and legal issues. A person’s voice could be captured and synthesized for use in a wide range of areas – from ads or spam calls to video games or chatbots. They could also be used in deepfakes, with the voice of a politician or celebrity combined with an image to spread disinformation or foment anger.
Patrick Harr, CEO of anti-phishing firm SlashNext, told The Register TTS could also become yet another tool for cybercriminals, who could use it for vishing campaigns – attacks using fraudulent phone calls or voice messages thought to be from a contact the victim knows. It also could be used in more traditional phishing attacks.
“This technology could be extremely dangerous in the wrong hands,” Harr said.
The Microsoft researchers noted the risk of synthesized speech that retains the speaker’s identity. They said it would be possible to build a detection model to discern whether an audio clip is real or synthesized using VALL-E.
Harr said that within a few years, everyone could have “a unique digital DNA pattern powered by blockchain that can be applied to their voice, content they write, their virtual avatar, etc. This would make it much harder for threat actors to leverage AI for voice impersonation of company executives for example, because those impersonations will lack the ‘fingerprint’ of the actual executive.”
Here’s hoping, anyway. ®
Microsoft is working on an AI called VALL-E that can clone your voice from a 3-second audio clip
Microsoft announced it is working on a text-to-speech artificial intelligence tool.
VALL-E can clone someone’s voice from a 3-second audio clip and use it to synthesize other words.
It comes as the tech giant is reportedly in talks to invest $10 billion in OpenAI, the maker of the writing tool ChatGPT.
Microsoft, which is reportedly in talks to invest $10 billion in ChatGPT maker OpenAI, is working on an artificial intelligence called VALL-E that can clone someone’s voice from a three-second audio clip.
VALL-E, trained with 60,000 hours of English speech, is capable of mimicking a voice in “zero-shot scenarios”, meaning the AI tool can make a voice say words it has never heard the voice say before, according to a paper in which the developers introduced the tool.
VALL-E uses text-to-speech technology to convert written words into spoken words in “high-quality personalized” speeches, according to the 16-page paper.
It used recordings of more than 7,000 real speakers from LibriLight – an audiobook dataset made up of public-domain texts read by volunteers – to conduct its sampling. The tech giant released samples of how VALL-E would work, showcasing how the voice of a speaker is cloned.
The AI tool is not currently available for public use and Microsoft hasn’t made it clear what its intended purpose is.
Sharing their findings on the academic site arXiv, the researchers said the results so far showed that VALL-E “significantly outperforms” the most advanced systems of its kind, “in terms of speech naturalness and speaker similarity.”
But they pointed out the lack of diversity of accents among speakers, and that some words in the synthesized speech were “unclear, missed, or duplicated.”
They also included an ethical warning about VALL-E and its risks, saying the tool could be misused, for example in “spoofing voice identification or impersonating a specific speaker”.
“To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E,” the developers wrote in the paper. They didn’t give details of how this could be done.
They added that “if the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice.”
Meanwhile, Microsoft announced Monday it will make OpenAI’s ChatGPT available through its own services and is reportedly in talks to invest $10 billion in the company behind the AI writing tool.
While ChatGPT has inspired creativity, such as for a man who wrote a children’s book in one weekend with it, it has raised concerns about whether the tool can be trusted.
Microsoft didn’t immediately respond to a request for comment by Insider.
Correction: January 19, 2023 — An earlier version of this story misstated the organisation that published the paper about VALL-E. It was published by researchers for Microsoft on the academic site arXiv.