We will now look at some examples to illustrate the training and performance of the neural network. Before a neural network is applied, a number of decisions need to be taken about the network topology. These include how many hidden layers to use, how many nodes should be used in each layer, how to use the output layer to represent the thematic classes, and what value to assign to the learning parameter. Generally, the first layer will have as many nodes as there are elements in the pixel vector. Often there will be as many output layer nodes as there are classes, unless some form of c [inaudible] to reduce their number. Generally, the number of nodes in the hidden layer should not be less than the number of output layer nodes. Note that the multi-layer perceptron has all the connections in place that we described earlier; that is, the output of a processing element, or node, in any layer is connected to every node in the next layer. That is called fully connected. Later, in the context of the convolutional neural network, we will not use all of those connections.
By way of illustration, we start with a very simple example involving two classes in a two-dimensional vector space. Note that the classes are not linearly separable, as seen in the figure. We have chosen a network with two nodes in the hidden layer and two in the output layer. The network equations are shown explicitly on the right-hand network diagram. Note also that we have chosen a zero threshold theta for the hidden layer processing elements, and we have also chosen b equals one in the activation function and eta equals two as the learning parameter. The network was initialized with the set of weights shown in the first row of this table. As seen, the error before the first iteration was 0.461. The network was then trained for 250 iterations, at which point the error had been reduced to 0.005.
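To make the training procedure just described a little more concrete, here is a minimal sketch of a network with two hidden nodes and two output nodes, trained by gradient descent with b equals one and eta equals two. Several things in it are assumptions rather than details from the lecture: the training pixels in `DATA` are invented, the targets use one output node per class, the output nodes are given a threshold term, and the error is taken as half the summed squared difference between targets and outputs. The starting weights are random, so the error values it prints will not match the 0.461 and 0.005 quoted above.

```python
import math
import random

B = 1.0    # activation constant b (lecture value)
ETA = 2.0  # learning parameter eta (lecture value)

# Hypothetical training pixels (not the lecture's data): class 1 lies between
# the two groups of class 2 pixels, so the classes are not linearly separable.
# Targets use one output node per class (an assumed encoding).
DATA = [((1.0, 1.0), (1.0, 0.0)), ((2.0, 2.0), (1.0, 0.0)),
        ((1.0, 3.0), (0.0, 1.0)), ((0.5, 2.0), (0.0, 1.0)),
        ((3.0, 1.0), (0.0, 1.0)), ((2.0, 0.5), (0.0, 1.0))]


def logistic(z):
    """Logistic activation with constant B, written to avoid overflow."""
    z = z / B
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def forward(x, net):
    """Return (hidden responses, output responses) for one pixel vector."""
    # Hidden layer: zero threshold, so the argument is a plain weighted sum.
    h = [logistic(sum(w * xi for w, xi in zip(row, x))) for row in net["w_hid"]]
    # Output layer: weighted sum plus a threshold term (an assumption here).
    o = [logistic(sum(w * hj for w, hj in zip(row, h)) + t)
         for row, t in zip(net["w_out"], net["t_out"])]
    return h, o


def total_error(data, net):
    """Half the summed squared difference between targets and outputs."""
    return 0.5 * sum((t - o) ** 2
                     for x, target in data
                     for t, o in zip(target, forward(x, net)[1]))


def train_step(data, net, eta=ETA):
    """One gradient-descent update using gradients summed over all pixels."""
    n_out, n_hid = len(net["w_out"]), len(net["w_hid"])
    n_in = len(net["w_hid"][0])
    g_hid = [[0.0] * n_in for _ in range(n_hid)]
    g_out = [[0.0] * n_hid for _ in range(n_out)]
    g_t = [0.0] * n_out
    for x, target in data:
        h, o = forward(x, net)
        # Error derivative with respect to each output-layer argument.
        d_out = [-(t - ok) * ok * (1.0 - ok) / B for t, ok in zip(target, o)]
        # Back-propagate to the hidden-layer arguments.
        d_hid = [sum(d_out[k] * net["w_out"][k][j] for k in range(n_out))
                 * h[j] * (1.0 - h[j]) / B for j in range(n_hid)]
        for k in range(n_out):
            g_t[k] += d_out[k]
            for j in range(n_hid):
                g_out[k][j] += d_out[k] * h[j]
        for j in range(n_hid):
            for i in range(n_in):
                g_hid[j][i] += d_hid[j] * x[i]
    # Apply the updates only after all gradients have been accumulated.
    for k in range(n_out):
        net["t_out"][k] -= eta * g_t[k]
        for j in range(n_hid):
            net["w_out"][k][j] -= eta * g_out[k][j]
    for j in range(n_hid):
        for i in range(n_in):
            net["w_hid"][j][i] -= eta * g_hid[j][i]


def random_net(n_in=2, n_hid=2, n_out=2, seed=0):
    """Small random starting weights, so that they are not all equal."""
    rng = random.Random(seed)
    return {"w_hid": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                      for _ in range(n_hid)],
            "w_out": [[rng.uniform(-0.5, 0.5) for _ in range(n_hid)]
                      for _ in range(n_out)],
            "t_out": [rng.uniform(-0.5, 0.5) for _ in range(n_out)]}


if __name__ == "__main__":
    net = random_net()
    print("error before training:", total_error(DATA, net))
    for _ in range(250):
        train_step(DATA, net)
    print("error after 250 iterations:", total_error(DATA, net))
```

Whether and how quickly the error falls in a run like this depends on the data and on the starting weights, so the 250 iterations used here simply mirror the lecture's example rather than being a recommended stopping point.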
At the same time, the weights can be seen to be converging to fixed, or final, values. We stop training at 250 iterations and use the parameter values at that point. On the right-hand axes, we have plotted the arguments of the two hidden layer PEs before the application of the activation functions. Effectively, when equated to zero, they implement linear separating surfaces. The activation function then places the response for a given pixel on one side or the other of those surfaces. Each surface therefore segments the data space into two regions. In the output layer PE, [inaudible] responses segment the full space into the two class regions shown. Effectively, it implements a logical OR operation. That is shown mathematically in the bottom table, which shows explicitly how the output layer functions for each of the four possibilities of patterns being placed on either side of the first two surfaces.
Having trained the network, we need to see how successful it is in separating sets of pixel vectors that it has not previously seen. In the table here, there are eight new pixels. They can also be seen in the vector space. Patterns A to D are in class one, while patterns E to H are in class two, as is evident on the diagram. The table shows the intervening calculations and the final classification by the network for each pixel. All testing pixels have been successfully labeled.
We now come to a real remote sensing example taken from a 1995 paper that is listed on this slide. The data set consisted of the six non-thermal bands of a 900 by 900 pixel segment of the Thematic Mapper scene recorded over Tucson, Arizona on 1 April 1987. There are 12 classes evident in the scene. They were chosen by the authors. The band 4 infrared image shown here does not make those classes easily seen, but the grid structure of Tucson streets is evident. In keeping with most [inaudible] exercises of the time involving neural networks, the authors chose a network with just one hidden layer.
Since there were six bands, the input layer consisted of six nodes, or processing elements. Those nodes also scaled the data to the range between zero and one. Since there were 12 classes, the output layer was chosen to have 12 nodes, with each representing a single class. The scale of the outputs was chosen such that, during training, an output of 0.9 on a node indicated the target class, while a value of 0.1 indicated that the class does not correspond to the training pixel being presented. The hidden layer was chosen to have 18 nodes. Since the authors decided to compare the neural network results against those obtained with a maximum likelihood classifier, the choice of the number of hidden layer nodes was based on having the same number of parameters to determine as for the maximum likelihood rule.
This slide shows the information classes and the numbers of training and testing pixels used by the authors. Although the network was allowed to run for 50,000 iterations, or epochs, the error had stabilized after about 15,000 iterations. Note that more than 96 percent of the training pixels are properly handled once the network has reached that number of iterations. It is because so many iterations are needed to train a neural network in practice that training time can be so excessive. The network performance using unseen testing pixels was a very good 93.4 percent accuracy. If training was stopped after 10,000 iterations, the network was still capable of achieving 92 percent accuracy. If stopped at 20,000 iterations, that improved marginally to 93 percent. A maximum likelihood classifier was also run on the same data set, although there is no indication as to whether it was optimized for the choice of sets of spectral classes to represent the specified information classes, which we will do in other examples in module 3. Nevertheless, the maximum likelihood classifier achieved 89.5 percent accuracy on the testing data, and it was 10 times faster to train.
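The network configuration described above for the Tucson example can be sketched in a few helper functions. This is only an illustration under stated assumptions: the data is assumed to be 8-bit (so scaling divides by 255), a pixel is labeled by the output node with the largest response, the network count includes connection weights only (no thresholds), and the maximum likelihood count assumes one Gaussian spectral class per information class, each with a mean vector and a full symmetric covariance matrix. None of those details is stated in the lecture.

```python
def scale_pixel(pixel, max_value=255):
    """Scale brightness values to the range 0..1 (8-bit data is assumed)."""
    return [value / max_value for value in pixel]


def target_vector(class_index, n_classes=12, on=0.9, off=0.1):
    """Training target: 0.9 on the node of the pixel's class, 0.1 elsewhere."""
    return [on if k == class_index else off for k in range(n_classes)]


def label_from_outputs(outputs):
    """Assumed decision rule: pick the output node with the largest response."""
    return max(range(len(outputs)), key=lambda k: outputs[k])


def mlp_weight_count(layer_sizes):
    """Connection weights in a fully connected network, thresholds excluded."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def max_likelihood_param_count(n_bands, n_classes):
    """Mean vector plus symmetric covariance matrix for each class."""
    per_class = n_bands + n_bands * (n_bands + 1) // 2
    return n_classes * per_class


if __name__ == "__main__":
    print("network weights (6-18-12):", mlp_weight_count([6, 18, 12]))
    print("maximum likelihood parameters:", max_likelihood_param_count(6, 12))
    print("target for the third class:", target_vector(2))
```

Under those assumptions both counts come to 324 (6 times 18 plus 18 times 12 for the network, and 12 times 27 for the maximum likelihood rule), which is consistent with the choice of 18 hidden nodes, although the paper should be consulted for how the authors actually did the count.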
This slide shows the thematic map produced by the neural network on the right-hand side, along with the key to the colors. The authors included two variations to the standard neural network training process to improve the learning rate. The first was to add a momentum term to the gradient descent rule used to adjust the weights. On the top of this slide, we summarize the standard gradient descent adjustment. On the bottom, in green, an additional term is added. It is chosen as a proportion of the previous weight adjustment, which forces a modification to follow the pattern of the previous iteration. Another parameter is introduced in this process, alpha, which controls the degree of momentum used. The second modification was to adjust the learning and momentum rates adaptively in order to improve convergence. That was done every fourth iteration according to the rule shown on the top of the slide. Note that the convergence and the ultimate result of neural network training can be affected by the initial choice of weights, and that the initial set cannot all be the same. Otherwise, of course, the network will not train. More details on this example will be found in the paper.
However, this original neural network approach is now rarely used. We introduced it here as preparation for the more recent development of the convolutional neural network, which we commence in the next lecture. When we come to the convolutional neural network, we will often talk about deep learning. Simply put, network depth is described by the number of hidden layers. A deeper network has more. The idea is that when there are more hidden layers, the network should be more powerful. The network is then more difficult, or time consuming, to train because of the vastly larger number of unknowns that have to be found. When we come to the convolutional neural network, we will find that increased network depth is possible because we don't use all the connections between the nodes.
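Returning briefly to the momentum term described above, the weight adjustment can be sketched as follows. The function name and argument layout are illustrative only; the form used, a gradient descent step plus alpha times the previous adjustment, is the common textbook version and is assumed to match the slide. The adaptive adjustment of the learning and momentum rates is not sketched, because its rule is not given in the lecture itself.

```python
def momentum_step(weights, grads, prev_deltas, eta, alpha):
    """Gradient-descent update with a momentum term.

    Each adjustment is the usual -eta * gradient plus alpha times the
    adjustment made to the same weight on the previous iteration.
    """
    deltas = [-eta * g + alpha * d for g, d in zip(grads, prev_deltas)]
    new_weights = [w + d for w, d in zip(weights, deltas)]
    return new_weights, deltas


if __name__ == "__main__":
    # Toy error surface E(w) = 0.5 * (w0**2 + 10 * w1**2), chosen only for
    # illustration; its gradient is (w0, 10 * w1).
    w = [1.0, 1.0]
    prev = [0.0, 0.0]  # no previous adjustment on the first iteration
    for step in range(5):
        grad = [w[0], 10.0 * w[1]]
        w, prev = momentum_step(w, grad, prev, eta=0.05, alpha=0.5)
        print(step, w)
```

Setting alpha to zero recovers the standard gradient descent adjustment, which is one way to check that the additional term is the only change.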
By reducing the number of connections substantially, we can have more layers and still train the network.
As a final comment on the operation of the layers in the neural network, this slide gives a different perspective on how the simple network of the first example operates. Earlier, we regarded the hidden layer processing elements as implementing two decisions, with the third layer acting on those decisions as a logical OR function. We could also view the hidden layer operation as in this slide, if we examine the data as it appears at the output of the first layer processing elements. Now represented by the variables J1 and J2, the data has been transformed into a linearly separable set, which the output layer now handles.
Again, this is a simple summary of what we've learned so far about neural networks. The second and third questions here will become important when we look at the convolutional neural network in the next series of lectures. What properties does it have to have in order that the back-propagation training algorithm can be made to work?
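The two views of the simple network described above (hidden nodes as two linear decisions combined by a logical OR, and the hidden layer as a transformation to a linearly separable set) can be illustrated with a small sketch. Everything numerical in it is hypothetical: the weights are chosen by hand rather than trained, the pixels are invented, and the names `h1` and `h2` simply stand for the hidden layer outputs, which are the transformed variables the lecture refers to.

```python
import math


def logistic(z):
    """Logistic activation with b = 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hand-picked hidden-layer weights (hypothetical, zero thresholds).
# Node 1 responds when x2 > 2 * x1, node 2 when x2 < x1 / 2.
W_HIDDEN = [(-10.0, 5.0), (5.0, -10.0)]

# Invented pixels: class 1 sits between the two groups of class 2, and
# (2, 2) is the midpoint of (1, 3) and (3, 1), so no single straight line
# separates the classes in the original space.
CLASS_1 = [(1.0, 1.0), (2.0, 2.0), (2.0, 1.5)]
CLASS_2 = [(1.0, 3.0), (0.5, 2.0), (3.0, 1.0), (2.0, 0.5)]


def hidden_outputs(x):
    """Transformed variables: the responses of the two hidden nodes."""
    return [logistic(w1 * x[0] + w2 * x[1]) for w1, w2 in W_HIDDEN]


def classify(x):
    """Output node acting as a soft logical OR of the hidden responses."""
    h1, h2 = hidden_outputs(x)
    out = logistic(10.0 * (h1 + h2 - 0.5))
    return 2 if out > 0.5 else 1


if __name__ == "__main__":
    for x in CLASS_1 + CLASS_2:
        h1, h2 = hidden_outputs(x)
        print(x, "-> hidden (%.3f, %.3f)" % (h1, h2), "-> class", classify(x))
```

In the transformed space the invented class 1 pixels all fall below the line h1 + h2 = 0.5 and the class 2 pixels above it, so a single linear surface in the output layer is enough, even though no such surface exists in the original two-dimensional space.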