When you look at a photograph of a cat, chances are that you can recognize the pictured animal whether it’s ginger or striped — or whether the image is black and white, speckled, worn or faded. You can probably also spot the pet when it’s shown curled up behind a pillow or leaping onto a countertop in a blur of motion. You have naturally learned to identify a cat in almost any situation. In contrast, machine vision systems powered by deep neural networks can sometimes match or even outperform humans at recognizing a cat under fixed conditions, but images that are even slightly novel, noisy or grainy can throw those systems off completely.
A research team in Germany has now discovered an unexpected reason why: While humans pay attention to the shapes of pictured objects, deep learning computer vision algorithms routinely latch on to the objects’ textures instead.
This finding, presented at the International Conference on Learning Representations in May, highlights the sharp contrast between how humans and machines “think,” and illustrates how misleading our intuitions can be about what makes artificial intelligences tick. It may also hint at why our own vision evolved the way it did.
Cats With Elephant Skin and Planes Made of Clocks
Deep learning algorithms work by, say, presenting a neural network with thousands of images that either contain or do not contain cats. The system finds patterns in that data, which it then uses to decide how best to label an image it has never seen before. The network’s architecture is modeled loosely on that of the human visual system, in that its connected layers let it extract increasingly abstract features from the image. But the system makes the associations that lead it to the right answer through a black-box process that humans can only try to interpret after the fact. “We’ve been trying to figure out what leads to the success of these deep learning computer vision algorithms, and what leads to their brittleness,” said Thomas Dietterich, a computer scientist at Oregon State University who was not involved in the new study.
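To make that pipeline concrete, here is a minimal sketch in Python, assuming the PyTorch library. The tiny network, the two-way cat/no-cat labels and the random stand-in images are illustrative assumptions, not the ImageNet-scale models the studies discussed here actually probe.

    import torch
    import torch.nn as nn

    # A tiny convolutional network: stacked layers extract increasingly
    # abstract features, and a final linear layer maps them to class scores.
    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 2),                   # two labels: "cat" / "no cat"
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    images = torch.randn(8, 3, 64, 64)      # stand-ins for real photographs
    labels = torch.randint(0, 2, (8,))      # 1 = the image contains a cat

    for step in range(100):                 # find patterns in the data...
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()                    # ...and adjust the weights

Nothing in this loop specifies shape or texture; whatever the trained weights come to rely on has to be discovered after the fact, which is exactly the black-box problem.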
To do that, some researchers prefer to look at what happens when they trick the network by modifying an image. They have found that very small changes can cause the system to mislabel objects in an image completely — and that large changes can sometimes fail to make the system modify its label at all. Meanwhile, other experts have backtracked through networks to analyze what the individual “neurons” respond to in an image, generating an “activation atlas” of features that the system has learned.
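A hedged sketch of the first kind of probe, under the same PyTorch assumption: perturb an input slightly and check whether the predicted label flips. The untrained stand-in classifier here is a placeholder for any trained network.

    import torch
    import torch.nn as nn

    # Stand-in classifier; in a real probe this would be a trained network.
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))

    image = torch.randn(1, 3, 64, 64)               # stand-in photograph
    noisy = image + 0.05 * torch.randn_like(image)  # small, barely visible change

    with torch.no_grad():                           # inference only, no training
        before = model(image).argmax(dim=1).item()
        after = model(noisy).argmax(dim=1).item()
    print("label flipped:", before != after)

Adversarial-example research automates the search for the smallest perturbation that flips the label; the sketch above only spot-checks one random change.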
But a group of scientists in the laboratories of the computational neuroscientist Matthias Bethge and the psychophysicist Felix Wichmann at the University of Tübingen in Germany took a more qualitative approach. Last year, the team reported that when they trained a neural network on images degraded by a particular kind of noise, it got better than humans at classifying new images that had been subjected to the same type of distortion. But those images, when altered in a slightly different way, completely duped the network, even though the new distortion looked practically the same as the old one to humans.
[Photo: Robert Geirhos, a graduate student in computational neuroscience at the University of Tübingen, sought to understand why deep learning vision algorithms are so vulnerable to image noise, a question that led to the discovery that the systems prioritize texture over shape. Courtesy of Robert Geirhos]
To explain that result, the researchers thought about what quality changes the most with even small levels of noise. Texture seemed the obvious choice. “The shape of the object … is more or less intact if you add a lot of noise for a long time,” said Robert Geirhos, a graduate student in Bethge’s and Wichmann’s labs and the lead author of the study. But “the local structure in an image — that gets distorted super fast when you add a bit of noise.” So they came up with a clever way to test how both humans and deep learning systems process images.
Geirhos, Bethge and their colleagues created images that included two conflicting cues, with a shape taken from one object and a texture from another: the silhouette of a cat colored in with the cracked gray texture of elephant skin, for instance, or a bear made up of aluminum cans, or the outline of an airplane filled with overlapping clock faces. Presented with hundreds of these images, humans labeled them based on their shape — cat, bear, airplane — almost every time, as expected. Four different classification algorithms, however, leaned the other way, spitting out labels that reflected the textures of the objects: elephant, can, clock.
“This is changing our understanding of how deep feed-forward neural networks — out of the box, or the way they’re usually trained — do visual recognition,” said Nikolaus Kriegeskorte, a computational neuroscientist at Columbia University who did not participate in the study.
Odd as artificial intelligence’s preference for texture over shape may seem at first, it makes sense. “You can think of texture as shape at a fine scale,” Kriegeskorte said. That fine scale is easier for the system to latch on to: The number of pixels with texture information far exceeds the number of pixels that constitute the boundary of an object, and the network’s very first steps involve detecting local features like lines and edges. “That’s what texture is,” said John Tsotsos, a computational vision scientist at York University in Toronto who was also not involved in the new work. “Groupings of line segments that all line up in the same way, for example.”
Geirhos and his colleagues have shown that those local features are sufficient to allow a network to perform image classification tasks. In fact, Bethge and another of the study’s authors, the postdoctoral researcher Wieland Brendel, drove this point home in a paper that was also presented at the conference in May. In that work, they built a deep learning system that operated a lot like classification algorithms before the advent of deep learning — like a “bag of features.” It split up an image into tiny patches, just as current models (like those that Geirhos used in his experiment) initially would, but then, rather than integrating that information gradually to extract higher-level features, it made immediate decisions about the content of each small patch (“this patch contains evidence for a bicycle, that patch contains evidence for a bird”). It simply added those decisions together to determine the identity of the object (“more patches contain evidence for a bicycle, so this is an image of a bicycle”), without any regard for the global spatial relationships between the patches. And yet it could recognize objects with surprising accuracy.
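A toy version of that bag-of-features scheme can be written in a few lines (again assuming PyTorch; the 8-by-8 patch size and the one-layer patch classifier are illustrative assumptions, not the authors' actual model). Each patch is judged on its own, and the per-patch scores are simply added up, discarding every spatial relationship between patches.

    import torch
    import torch.nn as nn

    # Judges one 8x8 patch at a time, producing 10 class scores per patch.
    patch_classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))

    def bag_of_features_logits(image):
        # Split a (3, 64, 64) image into non-overlapping 8x8 patches.
        patches = image.unfold(1, 8, 8).unfold(2, 8, 8)    # (3, 8, 8, 8, 8)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3, 8, 8)
        # Decide on each patch independently, then sum the evidence:
        # "more patches vote bicycle, so this is a bicycle."
        return patch_classifier(patches).sum(dim=0)

    image = torch.randn(3, 64, 64)
    print(bag_of_features_logits(image).argmax())          # predicted class index

Because the sum comes out the same no matter how the patches are shuffled, a model like this literally cannot use global shape; any accuracy it achieves comes from local, texture-like evidence alone.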
“This challenges the assumption that deep learning is doing something completely different” than what previous models did, Brendel said. “Obviously … there’s been a leap. I’m just suggesting the leap is not as far as some people may have hoped for.”
According to Amir Rosenfeld, a postdoctoral researcher at York University and the University of Toronto who did not participate in the study, there are still “large differences between what we think networks should be doing and what they actually do,” including how well they reproduce human behavior.
Brendel expressed a similar view. It’s easy to assume neural networks will solve tasks the way we humans do, he said. “But we tend to forget there are other ways.”
A Nudge Toward More Human Sight
Current deep learning methods can integrate local features like texture into more global patterns like shape. “What is a bit surprising in these papers, and very compellingly demonstrated, is that while the architecture allows for that, it doesn’t automatically happen if you just train it [to classify standard images],” Kriegeskorte said.
Geirhos wanted to see what would happen when the models were forced to ignore texture. His team took images traditionally used to train classification algorithms and “painted” them in different styles, essentially stripping them of useful texture information. When the researchers retrained each of the deep learning models on the new images, the systems began relying on larger, more global patterns and exhibited a shape bias much more like that of humans.
And when that happened, the algorithms also became better at classifying noisy images, even when they hadn’t been trained to deal with those kinds of distortions. “The shape-based network got more robust for free,” Geirhos said. “This tells us that just having the right kind of bias for specific tasks, in this case a shape bias, helps a lot with generalizing to a novel setting.”
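An evaluation along those lines can be sketched as follows (same PyTorch assumption; the untrained stand-in model and random data are placeholders for a trained shape-biased network and a real test set): measure accuracy on clean images, then on the same images with noise the model never saw during training.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in
    images = torch.randn(100, 3, 32, 32)       # stand-in test set
    labels = torch.randint(0, 10, (100,))

    def accuracy(imgs):
        with torch.no_grad():
            return (model(imgs).argmax(dim=1) == labels).float().mean().item()

    print("clean accuracy:", accuracy(images))
    print("noisy accuracy:", accuracy(images + 0.5 * torch.randn_like(images)))

With a genuinely shape-biased network in place of the stand-in, the gap between those two numbers is what shrank “for free.”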
It also hints that humans might naturally have this kind of bias because shape is a more robust way of defining what we see, even in novel or noisy situations. Humans live in a three-dimensional world, where objects are seen from multiple angles under many different conditions, and where our other senses, such as touch, can contribute to object recognition as needed. So it makes sense for our vision to prioritize shape over texture. (Moreover, some psychologists have shown a link between language, learning and humans’ shape bias: When very young children were trained to pay more attention to shape by learning certain categories of words, they were later able to develop a much larger noun or object vocabulary than children who did not receive the training.)
The work serves as a reminder that “data exert more biases and influences than we believe,” Wichmann said. This isn’t the first time researchers have encountered the problem: Facial recognition programs, automated hiring algorithms and other neural networks have previously been shown to give too much weight to unexpected features because of deep-rooted biases in the data they were trained on. Removing those unwanted biases from their decision-making process has proved difficult, but Wichmann said the new work shows it is possible, which he finds encouraging.
Nevertheless, even Geirhos’ models that focused on shape could be defeated by too much noise in an image, or by particular pixel changes — which shows that they are a long way from achieving human-level vision. (In a similar vein, Rosenfeld, Tsotsos and Markus Solbach, a graduate student in Tsotsos’ lab, also recently published research showing that machine learning algorithms cannot perceive similarities between different images as humans can.) Still, with studies like these, “you’re putting your finger on where the important mechanisms of the human brain are not yet captured by these models,” Kriegeskorte said. And “in some cases,” Wichmann said, “perhaps looking at the data set is more important.”
Sanja Fidler, a computer scientist at the University of Toronto who did not participate in the study, agreed. “It’s up to us to design clever data, clever tasks,” she said. She and her colleagues are studying how giving neural networks secondary tasks can help them perform their main function. Inspired by Geirhos’ findings, they recently trained an image classification algorithm not just to recognize the objects themselves, but also to identify which pixels were part of their outline, or shape. The network automatically got better at its regular object identification task. “Given a single task, you get selective attention and become blind to lots of different things,” Fidler said. “If I give you multiple tasks, you might be aware of more things, and that might not happen. It’s the same for these algorithms.” Solving various tasks allows them “to develop biases toward different information,” which is similar to what happened in Geirhos’ experiments on shape and texture.
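Fidler’s setup can be sketched as a single shared backbone feeding two heads, with the two losses simply added so that the outline task shapes the features the classifier uses. Assuming PyTorch; every name, size and piece of stand-in data below is illustrative.

    import torch
    import torch.nn as nn

    backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
    class_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                               nn.Linear(16, 10))  # main task: object label
    outline_head = nn.Conv2d(16, 1, 1)             # secondary: outline pixels

    images = torch.randn(4, 3, 64, 64)                      # stand-in photos
    labels = torch.randint(0, 10, (4,))                     # object classes
    outlines = torch.randint(0, 2, (4, 1, 64, 64)).float()  # stand-in masks

    features = backbone(images)                  # shared by both heads
    loss = (nn.functional.cross_entropy(class_head(features), labels)
            + nn.functional.binary_cross_entropy_with_logits(
                  outline_head(features), outlines))
    loss.backward()   # gradients from both tasks update the shared backbone

Because the outline head can only be satisfied if the shared features encode where object boundaries lie, the classification head inherits shape information it might otherwise ignore.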
All this research is “an exciting step in deepening our understanding of what’s going on [in deep learning], perhaps helping us overcome the limitations we’re seeing,” Dietterich said. “That’s why I love this string of papers.”
The online speech transcription and speech-to-text capabilities that Live Caption provides are useful even in everyday life. Live Caption not only helps ordinary users keep up with information in situations where listening to audio is inconvenient, but also gives people with disabilities barrier-free access to communication.
The Live Relay feature helps deaf users make phone calls by turning the other party’s speech into real-time text.
Project Euphonia
Google is also studying how to adapt its existing AI speech algorithms to better understand users with speech impairments. Researchers at Google AI are exploring the concept of personalized communication models and how artificial intelligence can help people who cannot produce speech in the usual way. Working with nonprofit organizations, members of Google’s Project Euphonia team have recorded the voices of ALS patients, converted those recordings into spectrograms, and used them to train AI algorithms that can recognize this kind of speech.
Last year, Google introduced Full Coverage, a new feature in Google News. Now Google is bringing Full Coverage to Search to better organize resources related to a search topic. Take a search for “black hole” as an example: Google uses machine learning to identify different types of articles and present a full picture of the stories related to the query. In addition, Podcasts will also be integrated into Google Search.
But the most striking new feature of Google Search is visual presentation. Using computer vision and augmented reality, Google has pushed the search experience one step further: into 3D.
Search for a great white shark, for example, switch to the 3D view, and then drop it right onto center stage!
Of course, this feature is not just a gimmick; it is genuinely practical. When shopping for shoes, for instance, you can pull the 3D model into a real-world scene to see whether the shoes match your outfit, improving the shopping experience.
Search is only one of the scenarios where computer vision is applied. From today’s keynote, it is clear that Google’s computer vision research has been woven into its product ecosystem: through apps such as Assistant and Camera, users can rely on Google Lens to identify popular dishes on a menu, and through Google Go they can translate foreign-language text in a photo and have it read aloud.
Beyond Search, the Android operating system is one of the key reasons for Google’s success today. At the I/O conference, Google revealed that there are now roughly 2.5 billion active Android devices worldwide. Android has become the first-choice platform on which Google, and most other companies, deploy new applications, such as Google’s AI voice assistant and AI image-recognition products. Android is also the foundation of Wear OS, Android Auto and Android TV, Google’s streaming television platform.
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* "hang" (rows) and "lie" (columns) keep the original pinyin names. */
    int hang = 1024 * 8;
    int lie = 1024 * 8;
    int c = 0;

    /* Allocate an array of column pointers, then one column at a time. */
    int **arr = (int **)malloc(sizeof(int *) * lie);
    for (c = 0; c < lie; c++)
    {
        arr[c] = (int *)malloc(sizeof(int) * hang);
    }
    printf("allocated a %d x %d grid of ints\n", hang, lie);

    /* Release each column, then the pointer array itself. */
    for (c = 0; c < lie; c++)
        free(arr[c]);
    free(arr);
    return 0;
}