We've all been hearing that deep neural networks work really well for
a lot of problems, and it's not just that they need to be big neural networks,
it's that specifically, they need to be deep, or to have a lot of hidden layers.
So why is that?
Let's go through a couple examples and try to gain some intuition for
why deep networks might work well.
So first, what is a deep network computing?
If you're building a system for face recognition or
face detection, here's what a deep neural network could be doing.
Perhaps you input a picture of a face then the first layer of the neural network
you can think of as maybe being a feature detector or an edge detector.
In this example, I'm plotting what a neural network with maybe 20 hidden units
might be trying to compute on this image.
The 20 hidden units are visualized by these little square boxes.
So for example, this little visualization represents a hidden unit that's
trying to figure out where the edges of that orientation are in the image.
And maybe this hidden unit might be trying to figure out
where are the horizontal edges in this image.
And when we talk about convolutional networks in a later course,
this particular visualization will make a bit more sense.
But informally, you can think of the first layer of the neural network as looking
at the picture and trying to figure out where the edges in this picture are,
grouping together pixels to form edges.
It can then take those detected edges and group edges together to form parts of faces.
So for example, you might have one neuron trying to see if it's found an eye,
or a different neuron trying to find a part of the nose.
And so by putting together lots of edges,
it can start to detect different parts of faces.
And then, finally, by putting together different parts of faces,
like an eye or a nose or an ear or a chin, it can then try to recognize or
detect different types of faces.
So intuitively, you can think of the earlier layers of the neural network as
detecting simple functions, like edges.
And then composing them together in the later layers of a neural network so
that it can learn more and more complex functions.
These visualizations will make more sense when we talk about convolutional nets.
And one technical detail of this visualization:
the edge detectors are looking at relatively small areas of the image,
maybe very small regions like these,
whereas the face detectors can look at much larger areas of the image.
But the main intuition to take away from this is that the network starts by
finding simple things like edges, then builds them up,
composing them together to detect more complex things like an eye or a nose,
then composing those together to find even more complex things.
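To make the edge-detector idea concrete, here is a minimal sketch, not the course's actual code. The image, the filter values, and the helper name are all illustrative assumptions; it just shows the kind of computation a first-layer hidden unit acting as a horizontal-edge detector might perform over small patches of an image.

```python
import numpy as np

# Toy 4x4 grayscale image: dark rows on top, bright rows below,
# so there is a horizontal edge in the middle.
image = np.array([
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [9, 9, 9, 9],
    [9, 9, 9, 9],
], dtype=float)

# A classic horizontal-edge filter: it responds wherever intensity
# changes from the top of the patch to the bottom.
edge_filter = np.array([
    [ 1,  1,  1],
    [ 0,  0,  0],
    [-1, -1, -1],
], dtype=float)

def convolve2d_valid(img, filt):
    """'Valid' 2D cross-correlation, as a first conv layer would apply it."""
    fh, fw = filt.shape
    out_h = img.shape[0] - fh + 1
    out_w = img.shape[1] - fw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i + fh, j:j + fw] * filt)
    return out

activation = convolve2d_valid(image, edge_filter)
# Every window in this tiny image straddles the dark-to-bright boundary,
# so every response has large magnitude, signaling a horizontal edge.
print(activation)
```

Note how each output value depends only on a small 3x3 patch, which is exactly the "small regions" point above: early units see tiny pieces of the image, and only later layers combine them over larger areas.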
And this type of simple to complex hierarchical representation,
or compositional representation,
applies in other types of data than images and face recognition as well.
For example, if you're trying to build a speech recognition system,
it's hard to visualize speech, but
if you input an audio clip then maybe the first level of a neural network might
learn to detect low level audio wave form features, such as is this tone going up?
Is it going down?
Is it white noise, or a hissing sound like [SOUND]?
And what is the pitch?
At this level, the network detects low-level waveform features like these.
And then by composing these low-level waveform features,
maybe the network learns to detect basic units of sound,
which in linguistics are called phonemes.
For example, in the word cat, the C is a phoneme, the A is a phoneme,
and the T is another phoneme.
Having learned to find the basic units of sound, it can then compose
those together to learn to recognize words in the audio.
And then maybe compose those together,
in order to recognize entire phrases or sentences.
So a deep neural network with multiple hidden layers might be able to have the earlier
layers learn these lower-level simple features and
then have the later, deeper layers put together the simpler things it's detected
in order to detect more complex things, like recognizing specific words or
even phrases or sentences being uttered, in order to carry out speech recognition.
And what we see is that whereas the earlier layers compute what seem like
relatively simple functions of the input, such as where the edges are, by the time
you get deep into the network it can actually do surprisingly complex things,
such as detecting faces, or detecting words, phrases, or sentences.
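This compose-simple-into-complex idea can be sketched as a plain forward pass. The weights, layer sizes, and input below are made-up stand-ins; the point is only that each layer applies one simple function to the previous layer's output, so a deep network computes a composition of many simple functions.

```python
import numpy as np

def relu(z):
    """Simple nonlinearity applied at each layer."""
    return np.maximum(0, z)

rng = np.random.default_rng(0)

# Illustrative layer widths: input -> "edges" -> "parts" -> one output score.
layer_sizes = [16, 8, 4, 1]

x = rng.standard_normal((16, 1))  # stand-in input (e.g. pixel intensities)
a = x
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    W = rng.standard_normal((n_out, n_in)) * 0.1  # made-up weights
    b = np.zeros((n_out, 1))
    a = relu(W @ a + b)  # one layer: a simple function of the previous layer

# After three layers, `a` is a composition of three simple functions of x:
# relu(W3 @ relu(W2 @ relu(W1 @ x + b1) + b2) + b3).
print(a.shape)
```

In a trained network the weights would, of course, be learned rather than random; the sketch only shows the layered structure that lets later layers build complex functions out of the earlier layers' simpler ones.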
Some people like to make an analogy between deep neural networks and
the human brain, where we believe, or neuroscientists believe,
that the human brain also starts off detecting simple things like edges in what
your eyes see then builds those up to detect more complex
things like the faces that you see.
I think analogies between deep learning and
the human brain are sometimes a little bit dangerous.
But there is a lot of truth to the idea that this is how we think the human
brain works: it probably detects simple things like edges first,
then puts them together to form more and more complex objects, and so that
has served as a loose form of inspiration for some deep learning as well.
We'll see a bit more about the human brain or
about the biological brain in a later video this week.