Now, in this video, we're going to discuss the limits of working with just a single neuron. So far, in the past video, we saw how a single neuron can represent the AND gate, the OR gate, the NOT OR (NOR) gate, and the NOT AND (NAND) gate. Now we'll see its limits when we work with the XOR, or exclusive OR, gate. For those of you who have taken computer science courses, you may already be familiar with the XOR gate. For those who are not, the idea of the XOR gate is that it returns true only when exactly one of the inputs is true: if both are false, we return false, and if both are true, we also return false; only if exactly one of them is true do we return true. So can we create a set of weights such that a single neuron outputs the property we see here? It turns out that we can't. If we want to, we're going to need to create another layer. So we'll pass in our input values X1 and X2, as well as our intercept, and then create another layer, as we do with our feedforward neural networks, and see how, using two layers, we can come up with this XOR gate. The concept is this: to get XOR, we want one of the outputs in the first layer to be equivalent to the OR gate, and the other to be equivalent to the NAND gate. The idea is that if either X1 equals 1 or X2 equals 1, or both equal 1, then the OR gate returns 1; the only case where it won't return 1 is when both inputs are 0. The NAND gate, on the other hand, returns 1 for every combination except both inputs being 1. Then we can take the outputs of the OR gate and the NAND gate, and in the second layer apply an AND gate, and that will give us our XOR function. So if we think about it: if we start with 0, 0, the OR gate passes a 0, and when we take the AND gate at the second layer, if either of its inputs is 0 we automatically end up with 0. That is correct, in that both inputs are 0 and, as we see in the table, the XOR gate should output 0. Now, if one input is 1 and the other is 0, the OR gate returns a 1, and the NAND gate, which returns 1 for every combination except both inputs being 1, also returns a 1. At the second layer, the AND of 1 and 1 outputs a 1, so we get the correct value. Finally, we want a 0 if both X1 and X2 are equal to 1. Our NAND gate, given 1 and 1, passes through a 0. So even though the OR gate passes through a 1, the AND gate at the second layer, taking the AND of 1 and 0, ensures that the final output is 0, matching the last row of the XOR table. So again, the XOR gate is just a combination of an OR gate and a NAND gate, which we built just above, passed through an AND gate in the second layer, and that produces the correct output for the XOR gate. We see that in practice, where we define our XOR gate as the AND of the outputs of the OR gate and the NAND gate, and passing in our inputs returns the ones and zeros accordingly.
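To make that construction concrete, here is a minimal Python sketch, assuming a simple step-activation neuron of the form w0 + w1*x1 + w2*x2; the specific weight values for each gate are one illustrative choice, not necessarily the ones used in the notebook.

```python
import numpy as np

def neuron(x1, x2, w0, w1, w2):
    """Single neuron with a step activation: outputs 1 when the linear
    combination w0 + w1*x1 + w2*x2 is non-negative, otherwise 0."""
    return int(w0 + w1 * x1 + w2 * x2 >= 0)

# One illustrative set of weights per gate (intercept w0, then w1, w2)
def OR_gate(x1, x2):   return neuron(x1, x2, -0.5,  1,  1)
def NAND_gate(x1, x2): return neuron(x1, x2,  1.5, -1, -1)
def AND_gate(x1, x2):  return neuron(x1, x2, -1.5,  1,  1)

def XOR_gate(x1, x2):
    # First layer: OR and NAND of the inputs; second layer: AND of those outputs
    return AND_gate(OR_gate(x1, x2), NAND_gate(x1, x2))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, XOR_gate(x1, x2))   # prints 0, 1, 1, 0
```

The second layer is essential here because the XOR truth table is not linearly separable, so no single choice of w0, w1, w2 can reproduce it with one neuron.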
We add on that AND gate, and when we test it, we get the output we would expect: a 0 when both inputs are false, a 0 when both are true, and a 1 if exactly one of them is true. So that closes out our discussion of the XOR gate, adding on the extra layer, and seeing how we can come up with more complex boundaries once we move to multiple layers. Now, we discussed during lecture the actual weight matrices that take our input, transform it into the first hidden layer, then into the second hidden layer, and eventually into our output. What we're going to do here is make that more concrete by coming up with some random weights, as well as some random inputs, and seeing how these matrix sizes transform as we go from our input through to our output. So here, we're going to start with three weight matrices, W_1, W_2, and W_3, representing the weights in each layer. The convention for these matrices is that W_ij gives the weight from neuron i in the prior layer to neuron j in the next layer, so the weight for moving from i to j. A vector x_in is going to represent a single input, as we discussed during lecture; we discussed working with a single input as well as the full data set of inputs, and x_mat_in is going to represent a toy version of a full data set with just 7 rows. The goal of our exercise is, for the input x_in, to calculate the inputs and outputs to each layer as we move from our linear combination, which outputs some z value, to taking the sigmoid of that value, and then seeing how that's passed through to each of the different layers. We're going to write a function that does the entire neural network calculation for a single input, do that again for a matrix of inputs, and then test the functions we just created using x_in and x_mat_in, our toy data set. So let's look at W_1, W_2, and W_3, which will highlight for us how these actual weight matrices should look on the back end. Now, this isn't learning the optimal parameters for us; it's just showing us one step through the feedforward pass from the input all the way through to the output, and when we get back to lecture, we'll talk about how we can actually optimize these weights. So here, we start with a 3 by 4 matrix for W_1: three rows and four columns. W_2 is going to be 4 by 4, and W_3 is going to be 4 by 3, and while I say those numbers, you should be thinking of how we're transforming our 3-dimensional input vector X1, X2, X3. We take our 3 by 4 matrix to expand that into a length-4 vector, keep it as a length-4 vector by multiplying by the 4 by 4 matrix, and then multiply by the 4 by 3 matrix to ensure that we have three outputs. That's the idea of W_1, W_2, and W_3. Our input is just going to be the three values .5, .8, and .2, and our toy data set, which has 7 rows, is also going to have three columns, where the first row is the same entry we have for x_in, and we can see how this expands to six more rows, each row with the same number of columns. We're also defining the softmax for a single vector, which is just going to allow us to output probabilities, and then we do the same for a full matrix. So we run this, and we see the output as mentioned: our W_1 is a 3 by 4 matrix.
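As a rough sketch of the setup being described (the actual weight and input values in the notebook will differ, and soft_max_mat is an assumed name for the matrix version of the softmax):

```python
import numpy as np

np.random.seed(0)  # random weights purely for illustration; the notebook's values differ

# W_ij is the weight from neuron i in the prior layer to neuron j in the next layer
W_1 = np.random.rand(3, 4)   # 3 inputs -> 4 units in the first hidden layer
W_2 = np.random.rand(4, 4)   # 4 units  -> 4 units in the second hidden layer
W_3 = np.random.rand(4, 3)   # 4 units  -> 3 output scores

x_in = np.array([0.5, 0.8, 0.2])                    # single input (1 x 3)
x_mat_in = np.vstack([x_in, np.random.rand(6, 3)])  # toy data set: 7 rows, 3 columns

def soft_max_vec(vec):
    """Softmax of a single vector of scores -> probabilities that sum to 1."""
    return np.exp(vec) / np.sum(np.exp(vec))

def soft_max_mat(mat):
    """Row-wise softmax for a matrix of scores (one row per input)."""
    return np.exp(mat) / np.sum(np.exp(mat), axis=1, keepdims=True)
```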
That's going to be multiplied by a row vector: 1 by 3 times 3 by 4 outputs a 1 by 4 vector, which will be our first hidden layer. Then we see the matrix for our toy data set, which has 7 rows. If we imagine this as a 7 by 3 matrix and multiply it by W_1, we end up in our first hidden layer with outputs for every single one of the rows, so it'll be a 7 by 4 matrix, and we'll see this in just a second. So let's first pass in just x_in and take the dot product. Then, as mentioned, we get the linear combination. Here we're looking at z_2, which is just that linear combination, the dot product of x_in and our matrix W_1, and we end up with the length-4 vector as mentioned. We're then going to take the sigmoid of that output. So we have the linear combination; now we take the sigmoid of it, and we still have the same shape, but now it's the sigmoid of each one of those outputs, and that output will feed into the next layer. Again, W_2 is a 4 by 4 matrix, and here we have a 1 by 4 vector, so we'll end up with z_3 again being a 1 by 4 vector. Once we have that linear combination, we can again take the sigmoid, and that will be the input to the final layer. z_4 will be the dot product of a_3 and W_3, where W_3 is that 4 by 3 matrix, ensuring that it matches up with the 1 by 4 vector and outputs a 1 by 3 vector. Then we take that z_4 and call soft_max to see the probabilities for each of the different values, and we see that, of the different classes, the first class is predicted to be the most likely. That's the idea of feeding through this neural network up to that softmax to come up with a complex solution to our classification problem. Now, quickly, I want to show you what this looks like if we pass in the full matrix. We run x_mat_in, and instead of just one row, now we have all seven rows being passed through, and we get that 7 by 4 matrix. We can take the dot product of that output with the 4 by 4 matrix and we still have a 7 by 4 matrix; then we take the sigmoid of z_3, so it's still the same shape, but now we're taking the sigmoid of each of those values. We can then take the dot product with W_3 to get the linear combination for each row, now outputting only three values per row, and we can take the softmax to see the probabilities for each of the different values from the output of that original matrix. Now, just to see how this computes all the way through from beginning to end, we create a function that chains these steps for a single input: the dot product with W_1, the sigmoid, the dot product with W_2, the sigmoid, the dot product with W_3, and finally soft_max_vec. Then we do the same for a matrix of inputs; it's essentially the same function, and x will work just as well whether it's the full matrix or a single input, but we do create two different functions, and when we pass our data through, we see that we get the solutions we wanted. All right, that closes out our video here, working from beginning to end through a neural network. Once we get back into lecture, we'll discuss how to actually optimize this model, so that we're not just looking at random weights but eventually the optimal weights, using what we learned with gradient descent and then something called backpropagation.
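To tie that walkthrough together, here is a sketch of the step-by-step pass and the two wrapper functions, reusing W_1, W_2, W_3, x_in, x_mat_in, and the softmax helpers sketched above; the names nn_comp_vec and nn_comp_mat are assumptions, since the video doesn't name the two functions.

```python
def sigmoid(z):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

# Step-by-step forward pass for the single input x_in
z_2 = np.dot(x_in, W_1)    # (1 x 3) . (3 x 4) -> length-4 linear combination
a_2 = sigmoid(z_2)         # same shape, squashed through the sigmoid
z_3 = np.dot(a_2, W_2)     # (1 x 4) . (4 x 4) -> length-4
a_3 = sigmoid(z_3)
z_4 = np.dot(a_3, W_3)     # (1 x 4) . (4 x 3) -> length-3 output scores
y_out = soft_max_vec(z_4)  # class probabilities for the single input

# The same chain wrapped in functions; works for one input or the full toy matrix
def nn_comp_vec(x):
    return soft_max_vec(sigmoid(sigmoid(np.dot(x, W_1)).dot(W_2)).dot(W_3))

def nn_comp_mat(x):
    return soft_max_mat(sigmoid(sigmoid(np.dot(x, W_1)).dot(W_2)).dot(W_3))

print(nn_comp_vec(x_in))      # 3 probabilities for the single input
print(nn_comp_mat(x_mat_in))  # 7 x 3 matrix: one row of probabilities per input
```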
All right, I'll see you there.