According to Ars Technica, Microsoft has developed an AI system that uses machine learning to accurately mimic anyone's voice, complete with novel, generated sentences, based on just three seconds of audio input.
"On Thursday, Microsoft researchers announced a new text-to-speech AI model called VALL-E that can closely simulate a person's voice when given a three-second audio sample. Once it learns a specific voice, VALL-E can synthesize audio of that person saying anything — and do it in a way that attempts to preserve the speaker's emotional tone," reported Benj Edwards. "Its creators speculate that VALL-E could be used for high-quality text-to-speech applications, speech editing where a recording of a person could be edited and changed from a text transcript (making them say something they originally didn't), and audio content creation."
"Microsoft trained VALL-E's speech synthesis capabilities on an audio library, assembled by Meta, called LibriLight. It contains 60,000 hours of English language speech from more than 7,000 speakers, mostly pulled from LibriVox public domain audiobooks. For VALL-E to generate a good result, the voice in the three-second sample must closely match a voice in the training data," said the report. "In addition to preserving a speaker's vocal timbre and emotional tone, VALL-E can also imitate the 'acoustic environment' of the sample audio" — for example, making it sound like a recording from a telephone call.
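The pipeline the report describes — discretize audio with a neural codec, then treat speech synthesis as language modeling over those codec tokens, conditioned on the short enrollment sample — can be sketched as a toy. This is only a loose illustration of the idea: the real system uses Meta's EnCodec codec and a large Transformer, whereas the scalar quantizer and bigram model below are simplified stand-ins.

```python
import numpy as np

def encode(audio, codebook):
    """'Neural codec' stand-in: map each sample to its nearest codebook level."""
    return np.argmin(np.abs(audio[:, None] - codebook[None, :]), axis=1)

def decode(tokens, codebook):
    """Reconstruct a waveform from discrete codec tokens."""
    return codebook[tokens]

def fit_bigram(token_seqs, vocab_size):
    """'Language model' stand-in: smoothed bigram counts over codec tokens."""
    counts = np.ones((vocab_size, vocab_size))  # add-one smoothing
    for seq in token_seqs:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def continue_prompt(prompt_tokens, probs, n_new, rng):
    """Autoregressively extend the enrollment prompt's tokens, the way
    VALL-E continues the codec tokens of the 3-second sample."""
    out = list(prompt_tokens)
    for _ in range(n_new):
        out.append(rng.choice(len(probs), p=probs[out[-1]]))
    return np.array(out)

rng = np.random.default_rng(0)
codebook = np.linspace(-1, 1, 8)          # 8 "codec" quantization levels
speech = np.sin(np.linspace(0, 20, 400))  # stand-in for training audio
tokens = encode(speech, codebook)
probs = fit_bigram([tokens], len(codebook))
prompt = tokens[:30]                      # plays the role of the 3-second sample
generated = continue_prompt(prompt, probs, 100, rng)
audio_out = decode(generated, codebook)
print(audio_out.shape)  # (130,)
```

The key design point the sketch preserves is that everything after tokenization is ordinary next-token prediction, which is why a short prompt suffices to condition the continuation.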
According to the report, Microsoft's engineers know this technology could be dangerous in the wrong hands, where it could be used to create malicious "deepfakes." A system that convincingly fakes people's voices could do everything from discrediting celebrities or politicians with fake racist quotes, to discrediting a former spouse in a custody dispute. It could also be used to create virtual pornography of a person without their consent, or to commit wire fraud by impersonating a CEO and tricking a company into transferring money.
Research suggests that voice deepfakes may already have enabled as much as $35 million in fraud.
"Perhaps owing to VALL-E's ability to potentially fuel mischief and deception, Microsoft has not provided VALL-E code for others to experiment with, so we could not test VALL-E's capabilities," said the report. "The researchers seem aware of the potential social harm that this technology could bring. For the paper's conclusion, they write: 'Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.'"
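The researchers' proposed mitigation — a detection model that discriminates real from synthesized audio — is, at its simplest, a binary classifier over audio features. The toy below is an assumption-laden sketch, not their method: "synthetic" clips are simulated here as low-passed noise, and a one-feature logistic regression stands in for whatever detector Microsoft might actually build.

```python
import numpy as np

def features(clip):
    """One spectral feature: fraction of energy in the upper half of the band.
    The smoothed 'synthetic' clips below carry much less high-band energy."""
    spec = np.abs(np.fft.rfft(clip)) ** 2
    half = len(spec) // 2
    return np.array([spec[half:].sum() / spec.sum()])

def make_clip(synthetic, rng, n=256):
    """Stand-in clips: 'real' is broadband noise; 'synthetic' is the same
    noise low-passed by a moving average (a crude artifact simulator)."""
    noise = rng.standard_normal(n)
    if synthetic:
        noise = np.convolve(noise, np.ones(8) / 8, mode="same")
    return noise

rng = np.random.default_rng(1)
y = np.array([0] * 200 + [1] * 200)  # 0 = real, 1 = synthetic
X = np.array([features(make_clip(bool(lbl), rng)) for lbl in y])

# Train a one-feature logistic-regression detector by gradient descent.
w, b = np.zeros(1), 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 1.0 * (X.T @ (p - y)) / len(y)
    b -= 1.0 * (p - y).mean()

pred = (1 / (1 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print("training accuracy:", (pred == y).mean())
```

A real detector would of course train on genuine VALL-E outputs and far richer features; the point of the sketch is only that "build a detection model" reduces to a standard supervised-classification setup.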