Originally published in nature, July 25, 2024.
The models that underpin artificial-intelligence systems such as ChatGPT can be subject to attacks that elicit harmful behaviour. Making them safe will not be easy.
In 2015, computer scientist Ian Goodfellow and his colleagues at Google described what could be artificial intelligence’s most famous failure. First, a neural network trained to classify images correctly identified a photograph of a panda. Then Goodfellow’s team added a small amount of carefully calculated noise to the image. The result was indistinguishable to the human eye, but the network now confidently asserted that the image was of a gibbon1.
This is an iconic instance of what researchers call adversarial examples: inputs carefully crafted to deceive neural-network classifiers. Initially, many researchers thought that the phenomenon revealed vulnerabilities that needed to be fixed before these systems could be deployed in the real world — a common concern was that if someone subtly altered a stop sign it could cause a self-driving car to crash. But these worries never materialized outside the laboratory. “There are usually easier ways to break some classification system than making a small perturbation in pixel space,” says computer scientist Nicholas Frosst. “If you want to confuse a driverless car, just take the stop sign down.”
Fears that road signs would be subtly altered might have been misplaced, but adversarial examples vividly illustrate how different AI algorithms are to human cognition. “They drive home the point that a neural net is doing something very different than we are,” says Frosst, who worked on adversarial examples at Google in Mountain View, California, before co-founding AI company Cohere in Toronto, Canada.
To continue reading this article, click here.