Why only one layer of perceptrons? Why not send the output of one layer as the input to the next layer? Combining multiple layers of perceptrons sounds like it would be a much more powerful model. However, without nonlinear activation functions, all of the additional layers can be compressed back down into just a single linear layer, and there is no real benefit. You need nonlinear activation functions. Therefore, the sigmoid and hyperbolic tangent, or tanh for short, activation functions started to be used for nonlinearity. At the time, we were limited to just these because we needed a differentiable function, since that fact is exploited in backpropagation to update the model weights. Modern activation functions are not necessarily differentiable everywhere, and back then people didn't know how to work with them. This constraint, that activation functions had to be differentiable, could make the networks hard to train. The effectiveness of these models was also constrained by the amount of data, the available computational resources, and other difficulties in training. For instance, optimization tended to get caught in saddle points instead of finding the global minimum we hoped it would find during gradient descent. However, once the trick of using rectified linear units, or ReLUs, was developed, training became eight to ten times faster, with almost guaranteed convergence for logistic regression.

Building on the perceptron, just as in the brain, we can connect many of them together in layers to create feedforward neural networks. Not much has really changed in the components compared with the single-layer perceptron: there are still inputs, weighted sums, activation functions, and outputs. One difference is that the inputs to neurons not in the input layer are not the raw inputs but the outputs of the previous layer. Another difference is that the weights connecting the neurons between layers are no longer a vector but a matrix, because every neuron in one layer connects to every neuron in the next. For instance, in the diagram, the input layer's weight matrix is four by two and the hidden layer's weight matrix is two by one. We will learn later that neural networks don't always have complete connectivity, which has some amazing applications and performance benefits, as with images. Also, there are activation functions other than just the unit step function, such as the sigmoid and hyperbolic tangent, or tanh, activation functions. Each non-input neuron can be thought of as a collection of three steps packaged up into a single unit: the first component is the weighted sum, the second is the activation function, and the third is the output of the activation function.

Neural networks can become quite complicated with all the layers, neurons, activation functions, and ways to train them. Throughout this course, we'll be using TensorFlow Playground to get a more intuitive sense of how information flows through a neural network. It's also a lot of fun; it lets you customize many more hyperparameters and provides visuals of the weight magnitudes and of how the loss function evolves over time.

The linear activation function is essentially an identity function, because f of x just returns x. This was the original activation function. However, as said before, even in a neural network with thousands of layers all using a linear activation function, the output at the end will just be a linear combination of the input features.
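To make that concrete, here is a minimal Python sketch (NumPy assumed; the shapes mirror the four-by-two and two-by-one weight matrices from the diagram, and the random values are purely hypothetical) showing that two stacked linear layers collapse into a single one.

import numpy as np

# Hypothetical weights mirroring the diagram: 4 inputs -> 2 hidden -> 1 output
W1 = np.random.randn(4, 2)    # input layer weight matrix (4 x 2)
W2 = np.random.randn(2, 1)    # hidden layer weight matrix (2 x 1)
x = np.random.randn(4)        # one example with four input features

# Forward pass with a linear (identity) activation at every layer
hidden = x @ W1               # weighted sums, no nonlinearity applied
output_two_layers = hidden @ W2

# The same result from a single collapsed layer: W1 @ W2 is just a 4 x 1 matrix
output_one_layer = x @ (W1 @ W2)

print(np.allclose(output_two_layers, output_one_layer))   # True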
In other words, the whole network reduces to the input features each multiplied by some constant. Does that sound familiar? It's simply linear regression. Therefore, nonlinear activation functions are needed to get the complex chains of functions that allow neural networks to learn data distributions so well.

Besides the linear activation function, where f of x equals x, the primary activation functions back when neural networks were having their first golden age were the sigmoid and tanh activation functions. The sigmoid activation function is essentially a smooth version of the unit step function: it asymptotes to zero at negative infinity and to one at positive infinity, with intermediate values everywhere in between. The hyperbolic tangent, or tanh for short, was another commonly used activation function at that point; it is essentially just a scaled and shifted sigmoid, with its range now negative one to one. These were great choices because they were differentiable everywhere, monotonic, and smooth. However, problems such as saturation would occur when very high or very low input values landed on the asymptotic plateaus of the function. Since the curve is almost flat at those points, the derivatives there are very close to zero. Therefore, training of the weights would go very slowly, or even halt, since the gradients were all very close to zero, resulting in very small step sizes down the hill during gradient descent.

Linear activation functions were differentiable, monotonic, and smooth. However, as mentioned before, a linear combination of linear functions can be collapsed back down into just one, which doesn't let us create the complex chain of functions we need to describe our data well. There were approximations of the linear activation function, but they were not differentiable everywhere, so not until much later did people know what to do with them.

Very popular now is the rectified linear unit, or ReLU, activation function. It is nonlinear, so we can get the complex modeling we need, and it doesn't saturate in the non-negative portion of the input space. However, because the negative portion of the input space translates to a zero activation, ReLU layers can end up dying, or no longer activating, which can also cause training to slow or stop. There are ways to solve this problem, one of which is using another activation function called the exponential linear unit, or ELU. It is approximately linear in the non-negative portion of the input space, and it is smooth, monotonic, and, most importantly, nonzero in the negative portion of the input space. The main drawback of ELUs is that they are more computationally expensive than ReLUs, because they have to calculate an exponential. We will get to experiment more with these in the next module.

If I wanted my outputs to be in the form of probabilities, which activation function should I choose in the final layer? The correct answer is the sigmoid activation function. This is because the range of the sigmoid function is between zero and one, which is also the range of a probability. Beyond just the range, the sigmoid function is the cumulative distribution function of the logistic probability distribution, whose quantile function is the inverse of the logit, which models the log odds. This is why its output can be interpreted as a true probability. We will talk more about those reasons later in the specialization.
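Before looking at why the other options fall short, here is a minimal Python sketch (NumPy assumed; the ELU's alpha of 1.0 is a common default rather than something stated above) of the four activation functions discussed, which makes their output ranges and the sigmoid's saturation easy to check.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # range (0, 1)

def tanh(x):
    return np.tanh(x)                       # range (-1, 1)

def relu(x):
    return np.maximum(0.0, x)               # range [0, infinity)

def elu(x, alpha=1.0):
    # nonzero (approaching -alpha) for negative inputs, linear for non-negative inputs
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

# Saturation: the sigmoid's derivative is nearly zero for large positive or negative inputs
x = np.array([-10.0, 0.0, 10.0])
print(sigmoid(x) * (1.0 - sigmoid(x)))      # roughly [4.5e-05, 0.25, 4.5e-05]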
Tanh is incorrect because, even though it is a squashing function like the sigmoid, its range is between negative one and one, which is not the range of a probability. Furthermore, just squashing a tanh into a sigmoid will not magically turn it into a probability, because it doesn't have the properties mentioned above that allow a sigmoid output to be interpreted as a probability. To correctly convert a tanh into a sigmoid, you first have to add one and divide by two to get the correct range, and to get the right spread you also have to divide tanh's argument by two. But you've already calculated the tanh, so you'd be repeating a bunch of work, and you may as well have just used a sigmoid from the start. ReLU is incorrect because its range is between zero and infinity, which is far from the range of a probability. ELU is also incorrect because its range extends from negative values all the way up to infinity, which again does not match the range of a probability.
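To see that the conversion above really works, here is a small Python sketch (NumPy assumed) verifying the identity: the sigmoid of x equals tanh of x over two, plus one, all divided by two.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)

# Rebuild the sigmoid from tanh: halve the argument, add one, then divide by two
sigmoid_from_tanh = (np.tanh(x / 2.0) + 1.0) / 2.0

print(np.allclose(sigmoid(x), sigmoid_from_tanh))   # True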