Microsoft Research Asia has unveiled VASA-1, an AI tool that can turn a still photo or drawing of a person, combined with an existing audio file, into a realistic talking face in real time. Given a single static image, the tool generates facial expressions, head motion, and lip movements synced to a piece of speech or song. The researchers have posted numerous examples on the project's webpage, and the results are convincing enough to fool people into believing they are real.
Although the lip and head movements in the examples can look somewhat robotic and out of sync on close inspection, it's clear this technology could be misused to quickly and easily create deepfake videos of real people. The researchers are aware of that risk and have decided not to release "an online demo, API, product, additional implementation details, or any related offerings" until they are confident their technology "will be used responsibly and in compliance with appropriate regulations." They did not, however, say whether they plan to implement specific safeguards to stop bad actors from using it for nefarious purposes, such as creating deepfake pornography or disinformation campaigns.
Despite its potential for misuse, the researchers believe their technology has plenty of benefits. They said it could be used to advance educational equity and improve accessibility for people with communication challenges, perhaps by giving them an avatar that speaks on their behalf. It could also offer companionship and therapeutic support for those who need it, they suggested, hinting that VASA-1 could be built into programs that give people access to AI characters they can talk to.
According to the paper accompanying the announcement, VASA-1 was trained on the VoxCeleb2 Dataset, which contains "over 1 million utterances from 6,112 celebrities" extracted from YouTube videos. Although it was trained on real faces, the tool also works on artistic images like the Mona Lisa, which the researchers amusingly paired with an audio clip of Anne Hathaway's popular cover of Lil Wayne's Paparazzi. The result is delightful enough to be worth watching, even if you're skeptical that technology like this does more good than harm.