Now let's walk through some of the important terminology to keep in mind when working with a neural network, here a multilayer perceptron, as well as the basics of how we get from the first layer to the final layer, from the X's to the y's.

First off, we have our weights, which determine how we combine each of the different layers in our neural network. Each of the arrows connecting X1 to each node in the next layer, as well as all the lines between the second layer and the third layer, signifies a specific weight for combining those layers.

We have our input layer, which is just our input data set. To make it especially clear, we can imagine that X1, X2, X3 is just the first row, where X1 is feature 1, X2 is feature 2, and X3 is feature 3.

We then have our hidden layers, which are all of the purple nodes that fall between our input layer and what we'll define right now as our output layer. Everything between the input layer and the output layer is called a hidden layer. As we saw when walking through the Python syntax, we can define however many hidden layers we'd like: if we say we want five hidden layers, there would be five columns of nodes between the input layer and the output layer. That is something we predefine, and all the weights would connect to each of those nodes in order to learn this complex model, feeding from the input layer through the hidden layers and out through the output layer, which gives our actual predictions.

The weights that we set over the different arrows are represented by matrices, and each of those matrices is just the way we combine each layer, step by step.
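To make the "one weight matrix per pair of adjacent layers" idea concrete, here is a minimal NumPy sketch of the weight shapes for the network described above. The 3 → 4 → 4 → 3 layer sizes are an assumption matching the diagram (3 input features, two hidden layers of 4 nodes, 3 outputs), and the random initialization is just illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# One weight matrix per bundle of arrows between adjacent layers,
# for an assumed 3 -> 4 -> 4 -> 3 network matching the diagram.
W1 = rng.normal(size=(3, 4))  # input layer (3) -> first hidden layer (4)
W2 = rng.normal(size=(4, 4))  # first hidden (4) -> second hidden (4)
W3 = rng.normal(size=(4, 3))  # second hidden (4) -> output layer (3)

print(W1.shape, W2.shape, W3.shape)
```

Notice that each matrix's row count equals the size of the layer it reads from and its column count equals the size of the layer it feeds into, which is what makes the layer-by-layer matrix multiplications line up.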
Those matrices have to be of the appropriate shape to ensure that an input that's a three-vector is transformed into a four-vector in the next layer, maintained as a four-vector in the layer after that, and then brought down to a three-vector in the final layer; I'll walk through this in just a second.

Our net input is the sum of the weighted inputs, and those are your Z values, similar to linear regression: X1 times some weight, plus X2 times some weight, plus X3 times some weight, equals one of your values of Z. We'll have four different values of Z for that first layer, so our Z1 is actually a four-vector, as is our Z2, and then our Z3 is a three-vector.

Finally, we have our activation values, which are just the Z values we just discussed passed through our activation function. I'm going to briefly skip over a0 here, but a1 will be a four-vector as well, where we take Z1 and pass each of the values in that four-vector through, for example, the sigmoid function. We can do the same for a2 by passing through Z2. And for a3 we pass Z3 through, probably a softmax layer here, in order to give the predicted probabilities we want as output for this classification problem.

Now, going back to a0: the notation a0 signifies that every a is passed as input into the next layer. Even though we're not doing anything to X1, X2, and X3, they are fed as input into the next layer, so we'll often call them a0 for simplicity's sake.

So imagine working with just a single row of data again: a single data point, that is, a single row with a certain number of features.
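The net input and activation step just described can be sketched in a few lines of NumPy. The specific feature values and the constant 0.1 weights are made up purely for illustration; only the shapes and the z-then-sigmoid pattern reflect the discussion above.

```python
import numpy as np

def sigmoid(z):
    """Squash each net input into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# A single row with 3 features (hypothetical values), and an
# illustrative 3x4 weight matrix W1 filled with 0.1's.
x = np.array([[0.5, -1.0, 2.0]])   # shape (1, 3): X1, X2, X3
W1 = np.full((3, 4), 0.1)          # shape (3, 4)

z1 = x @ W1                        # net input: each entry is a weighted sum
a1 = sigmoid(z1)                   # activation values for the first layer

print(z1)  # four identical weighted sums here, since all weights are 0.1
print(a1)
```

Each of the four entries of z1 is X1·w + X2·w + X3·w, exactly the linear-regression-style weighted sum described above, and a1 is that four-vector passed elementwise through the sigmoid.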
Our W1, or first weight matrix, would be a 3 by 4 matrix, taking the input values X1, X2, and X3, which form a 1 by 3 matrix, and multiplying that 1 by 3 matrix by the 3 by 4 matrix W1. That results in a 1 by 4 matrix, which is our Z1. We can then pass all the values of Z1 through our activation function, and that results in a1, another four-vector.

To make this as clear as possible, and as we saw on the prior slide, we can think of our input values as a0, and every a (a0, a1, a2) is the value that's passed as input into the next layer. Again, our Z1 is equal to the dot product of x and W1, and our a1 is just the activation function of Z1, which is passed on to the next layer.

Now, to take this a step further and see how this computes through the entire neural network for a single row: we start with a vector representing that row, in our case a row vector of length 3, and we plan to end with an output that is a row vector of length 3 as well, which means in this example we're probably performing classification with three output classes.

We showed how we got Z1 as the dot product of x and W1, which allows us to calculate a1 by taking the activation function of Z1. Z2, the second layer of Z values, is calculated as a linear combination of the a1 values we just calculated, and in order to get a linear combination of the correct dimensions, we have W2, which matches up the shape of a1 with the eventual shape we want for a2. a2 is then again just the activation function of Z2. Z3 is then again a linear combination of the prior output a2, so we need a new weight matrix W3. Once we have that linear combination, since this is the final layer, we just take the softmax of Z3 in order to give us the predicted probabilities for each of our different classes, and that is our predicted y.
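The full single-row forward pass above can be sketched end to end. Everything here, the input values, the random weights, and the seed, is an assumption for illustration; the structure (a0 → Z1 → a1 → Z2 → a2 → Z3 → softmax → predicted y) is the one just described.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(42)
a0 = np.array([[0.5, -1.0, 2.0]])  # single row: shape (1, 3)
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(4, 3))

z1 = a0 @ W1                       # (1, 4)
a1 = sigmoid(z1)                   # (1, 4), passed into the next layer
z2 = a1 @ W2                       # (1, 4)
a2 = sigmoid(z2)                   # (1, 4), passed into the final layer
z3 = a2 @ W3                       # (1, 3)
y_hat = softmax(z3)                # (1, 3): predicted class probabilities

print(y_hat)
```

The output row is three non-negative probabilities that sum to 1, one per class, which is exactly the predicted y we want from this classification setup.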
Now, in practice we're not just working with a single row at a time; rather, we'd be working with an entire data set's worth of rows. But when we calculate this generalized version of our multilayer perceptron, our equations should look very similar, if not exactly the same. This time we're inputting an n by 3 matrix, where n is the number of rows, still working with three columns, and our output should also be an n by 3 matrix, with a predicted probability for each of our different rows.

The math is the same as we saw before, but this time the dot product of X and W1 is an n by 4 matrix (or whatever the size of our next layer is): rather than 1 by 4, we're now n by 4. If you imagine again that X is n by 3, we can have our W1 be 3 by 4 so that we end up with an n by 4 matrix for Z1. We can then take the activation function of all the outputs in Z1 and end up with the output from that first layer. Again, we have the appropriate matrix W2 to get the linear combination of each of those outputs for each of the different rows, and end up with Z2. We pass each value of Z2 through the activation function to get a2; that a2 is the output of the second layer and the input into the third layer, and it gives us Z3 when we take the linear combination of a2 and W3. Taking that Z3, now for multiple rows, we take the softmax and end up with predicted probabilities for each of the three classes for all of our n rows. So that expands things out to the number of rows in our entire data set.

Now, there are many deep learning approaches, which we're going to discuss throughout this course, and along with these basic groupings there's also much more being developed. As a quick overview, we have the neural network models, which are just your multilayer perceptrons and feedforward networks.
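The batched version requires no new math, only feeding an n by 3 matrix through the same multiplications, as a sketch shows (here n = 5 and all values are random placeholders; the softmax is applied row by row so each row of the output is its own probability distribution).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Row-wise softmax: each row becomes its own probability distribution.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n = 5
X = rng.normal(size=(n, 3))        # the whole data set: n rows, 3 features
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(4, 3))

A1 = sigmoid(X @ W1)               # (n, 4): first-layer output for every row
A2 = sigmoid(A1 @ W2)              # (n, 4): second-layer output for every row
Y_hat = softmax(A2 @ W3)           # (n, 3): class probabilities per row

print(Y_hat.shape)                 # one probability row per input row
```

The only change from the single-row case is that the leading dimension went from 1 to n; the weight matrices and the layer-by-layer pattern are untouched.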
These are applied to many traditional prediction problems, such as the classification and regression we've discussed. We have recurrent neural networks, including the classes of RNN (recurrent neural network) and LSTM (long short-term memory). These are useful for modeling sequences, so they're useful for time series, where each step along the way may depend on prior steps, or for sentence prediction, where each word may depend on prior words.

We have convolutional neural networks, or CNNs, which are very useful for feature and object recognition in visual data, as they take in all of the surrounding features as context. They are also used at times for forecasting, where they can take points on either side, or pick up on patterns within the data, in order to predict future values.

And then there are unsupervised pre-trained networks, with autoencoders, deep belief networks, and generative adversarial networks. These have many uses, including generating actual images, labeling outcomes, and dimensionality reduction using deep learning. We'll discuss many of these throughout this course.

That closes our introduction to neural networks. In the next video, we'll begin to discuss the optimization needed to come up with our weights, using gradient descent, which is a key part of learning each of our neural network models. Alright, I'll see you there.