June 2024 | This Month in Generative AI: Moving Through the Uncanny Valley (Pt. 1 of 2)
News and trends shaping our understanding of generative AI technology and its applications.
First coined by Japanese roboticist Masahiro Mori in the 1970s, the term uncanny valley describes what happens as a humanoid robot, or an image or video of a computer-generated human, becomes increasingly human-like. At the point where the depiction is eerily similar to, yet still distinguishable from, a real human, our emotional comfort and acceptance drop sharply; this dip is the uncanny valley. A humanoid depiction is said to exit the uncanny valley when it becomes so realistic that it is indistinguishable from a real person.
I have previously discussed the results of an earlier study which found that, even in 2022, GAN-generated faces had passed through the uncanny valley. In that study, 315 paid online participants were each shown 128 faces, one at a time, half of which were real, and were asked to classify each as real or synthetic. The average accuracy on this task was 48.2%, close to the chance performance of 50%, and participants were just as likely to call a real face synthetic as the reverse.
Unlike GANs, today's generative AI tools (e.g., Adobe Firefly, Midjourney, and DALL-E) employ a generative approach known as text-to-image or diffusion-based synthesis.
A diffusion model is trained on billions of images, each paired with a descriptive caption. During training, each image is progressively corrupted until only visual noise remains, and the model learns to reverse this corruption, denoising the image step by step. The trained model can then be conditioned to generate an image that is semantically consistent with a text prompt such as “a professional photo of a middle-aged executive.”
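To make this corrupt-then-denoise training loop concrete, here is a minimal sketch of the forward (noising) step used in standard DDPM-style diffusion training. The schedule values and names (T, beta, x0) are illustrative assumptions rather than any particular model's settings; in a real system, a neural network would be trained to predict the added noise from the noised image, the timestep, and the caption.

```python
import numpy as np

T = 1000                              # number of diffusion steps (illustrative)
beta = np.linspace(1e-4, 0.02, T)     # linear noise schedule (assumed)
alpha_bar = np.cumprod(1.0 - beta)    # cumulative fraction of signal remaining

def noise_image(x0, t, rng=np.random.default_rng()):
    """Jump directly from a clean image x0 to its noised version x_t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)   # the noise the model must learn to predict
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# At t near 0 the image is barely corrupted; at t near T it is pure noise.
x0 = np.zeros((64, 64, 3))            # placeholder "image"
xt, eps = noise_image(x0, t=500)
```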
Unlike GANs, which can generate images of only a single category like faces, cats, or landscapes, diffusion models afford much more rendering control and are limited only by the imagination of the prompter.
In collaboration with Professor Sophie Nightingale and Lex McGuire of Lancaster University, we have launched a new perceptual study to determine whether diffusion-based faces, like GAN images, have passed through the uncanny valley.
In this study, participants are shown, one at a time, a GAN-generated, diffusion-generated, or real face. Participants are asked to classify each face as either real or synthetic — they are not explicitly told about the difference between GAN-generated and diffusion-generated faces.
Although we are still collecting data, we have completed a pilot study with 20 participants. The average accuracy across all three categories is 62%, only somewhat better than the chance performance of 50%. We also find that there is little difference in accuracy between GAN- and diffusion-generated images. You can test yourself on a set of 24 images.
Interestingly, one of our participants performed much better than the rest. With 85% accuracy across all three image classes, they significantly outperformed the other participants. It could be that they got lucky, or it could be that they have a particularly well-trained eye.
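To put "got lucky" in perspective, a quick binomial calculation shows how improbable 85% accuracy is under pure guessing. The per-participant trial count below (n = 100) is a hypothetical stand-in, since the number of faces each participant viewed is not stated here.

```python
from scipy.stats import binomtest

n = 100                 # hypothetical trials per participant (assumed, not from the study)
k = round(0.85 * n)     # 85% accuracy -> 85 correct

# Under pure guessing, each real/synthetic judgment is a fair coin flip.
result = binomtest(k, n, p=0.5, alternative="greater")
print(f"P(at least {k}/{n} correct by chance) = {result.pvalue:.1e}")
# On the order of 1e-13: at this trial count, luck alone is a poor explanation.
```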
In the world of facial recognition, there are so-called super-recognizers who have a seemingly superhuman ability to recognize people after seeing only a single photo of them. These super-recognizers have been employed by law enforcement agencies to identify fugitives in CCTV footage. We plan to investigate whether similar super-recognizers exist for detecting AI-generated faces. Get in touch with me (hfarid@berkeley.edu) if you find yourself performing particularly well on the quiz linked above.
These preliminary results suggest that, like GAN-generated faces, diffusion-generated faces have passed through or are rapidly passing through the uncanny valley. This does not mean that all diffusion-generated images are perceptually indistinguishable from real images, because in our study participants only view a single face from the neck up. Generally speaking, the more complex the depicted scene, the more likely it is that the image will contain some structural or gravity-defying artifact that a keen eye will detect. If, however, generative AI continues along its current trajectory, it seems likely that sooner or later it is going to be very difficult to perceptually distinguish the real from the fake.
We will soon be launching a similar perceptual study to examine our ability to distinguish between real and AI-generated voices. If my performance on the pilot study is any indication, I predict that AI-generated voices have already passed through the uncanny valley. At the same time, I think that AI-generated videos and face-swap and lip-sync deepfake videos are still on the other side of the uncanny valley, but I don't expect that to be the case for very long.
So what can be done?
As AI-generated content becomes indistinguishable from “real” content, the work of the Content Authenticity Initiative (CAI), where I’m an advisor, becomes increasingly important. Using the underlying C2PA open technical standard, the CAI seeks to accelerate adoption of provenance labeling, whereby viewers can quickly inspect digital assets using Content Credentials.
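In practice, inspecting a file's Content Credentials can be as simple as running it through c2patool, the CAI's open-source command-line utility for the C2PA standard. The sketch below shells out to the tool from Python; the exact invocation and output shape are assumptions based on the tool's documented behavior, and photo.jpg is a placeholder file name.

```python
import json
import subprocess

def inspect_credentials(path: str) -> dict | None:
    """Return a file's C2PA manifest store as a dict, or None if the
    file carries no Content Credentials. Assumes c2patool is installed
    and reports the manifest store as JSON for the given file."""
    proc = subprocess.run(
        ["c2patool", path],          # assumed default: report the manifest store
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        return None                  # no manifest found, or validation failed
    return json.loads(proc.stdout)

manifests = inspect_credentials("photo.jpg")
if manifests is None:
    print("No Content Credentials found.")
else:
    print(json.dumps(manifests, indent=2))
```

A file that reports no manifest is itself informative: as provenance labeling spreads, the absence of Content Credentials becomes a signal to look more closely.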
With over 3,000 members, the CAI is working with generative AI companies, hardware and software providers, news and social media companies, and many others to establish Content Credentials as a digital industry standard.
Author bio: Professor Hany Farid is a world-renowned expert in the field of misinformation, disinformation, and digital forensics. He joined the Content Authenticity Initiative (CAI) as an advisor in June 2023. The CAI is an Adobe-led community of media and tech companies, NGOs, academics, and others working to promote adoption of the open industry standard for content authenticity and provenance.
Professor Farid teaches at the University of California, Berkeley, with a joint appointment in electrical engineering and computer sciences and the School of Information. He’s also a member of the Berkeley Artificial Intelligence Lab, Berkeley Institute for Data Science, Center for Innovation in Vision and Optics, Development Engineering Program, and Vision Science Program, and he’s a senior faculty advisor for the Center for Long-Term Cybersecurity. His research focuses on digital forensics, forensic science, misinformation, image analysis, and human perception.
He received his undergraduate degree in computer science and applied mathematics from the University of Rochester in 1989, his M.S. in computer science from SUNY Albany, and his Ph.D. in computer science from the University of Pennsylvania in 1997. Following a two-year post-doctoral fellowship in brain and cognitive sciences at MIT, he joined the faculty at Dartmouth College in 1999, where he remained until 2019.
Professor Farid is the recipient of an Alfred P. Sloan Fellowship and a John Simon Guggenheim Fellowship, and he’s a fellow of the National Academy of Inventors.