In this lecture, we will provide an overview of some hot research topics in deep learning. Specifically, we will look at low-bit precision, generative adversarial networks, Deep Voice, and reinforcement learning. Keep in mind that much of the content in this lecture is evolving very quickly.

As a quick review, deep neural networks are trained by a process of forward propagation, calculating a cost, and backward propagation to update the network's weights. These updates use gradients of the cost with respect to the weights, which can be very small and generally require a high level of precision. Inference refers to only the first part of this process, forward propagation, and entails applying a pre-trained network to new data. Because the weights are not being updated, the level of precision needed for inference is much lower. Training a large-scale deep network is often constrained by the available computational resources, which has recently led to considerable interest in studying the effects of limited-precision data representation and computation on neural network training. That's what we will talk about next.

Most deep learning practitioners use 32-bit floating point precision to train and test deep networks. However, such high precision is often not required, especially when testing or deploying deep networks. There are five classes of numbers in deep networks: input numbers to a layer, output numbers of a layer, model numbers, gradient numbers, and communication numbers. Here, shown in parentheses, is the range of low-precision values that have been used for each class of numbers, and in brackets the recommended tradeoff. Practitioners have shown that both the input and output vectors typically only require eight bits of precision. For the model weights, eight bits is usually also sufficient. The gradients are often very small and require the most precision. Even so, others have successfully trained various models with gradients at only 16 bits of precision. So while the future will likely use only 16 bits for the gradients, the current standard is still 32 bits, in order to guarantee convergence across all models. Finally, in order to distribute training across multiple nodes, the gradients are often communicated, and 16 bits should be sufficient; Microsoft has even shown results communicating with only one bit. A small code sketch illustrating this kind of low-bit quantization appears below.

We will now explore one of the hottest topics in deep learning: generative adversarial networks. Ian Goodfellow created a detailed lecture on generative adversarial networks that I recommend you explore if you are interested in learning more about GANs. As we have seen, convolutional networks can take as input an image of a bus and output the category label bus. CNNs have gotten very good at image recognition and do not fail very often on real images. However, a group of researchers from Google, Facebook, and NYU specifically designed noise and added it to real images to cause the CNN to classify them as ostriches across all the images shown here. The figure in the center is the noise added to the image, magnified ten times for visualization purposes. To the human eye, the image on the right looks exactly like the image on the left, but to this CNN the image was classified as an ostrich. This raised a lot of interest in the vision community. Some said that adversarial examples with noise were unnatural images and should therefore not be relevant. Others argued that they were important, and many tried training neural networks with adversarial examples.
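Returning briefly to the precision discussion above, here is a minimal sketch of symmetric 8-bit quantization of a weight tensor for inference, using NumPy only. The scale computation and function names are illustrative assumptions, not part of any particular framework.

```python
import numpy as np

def quantize_int8(x):
    """Symmetrically quantize a float32 tensor to int8, returning the
    quantized values and the scale needed to dequantize them."""
    scale = np.max(np.abs(x)) / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor from the int8 values."""
    return q.astype(np.float32) * scale

# Example: weights kept in 8 bits for inference; gradients would stay in
# 16 or 32 bits during training (not shown).
w = np.random.randn(256, 128).astype(np.float32)
w_q, w_scale = quantize_int8(w)
print("max abs quantization error:", np.max(np.abs(dequantize(w_q, w_scale) - w)))
```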
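The noise in the ostrich example was crafted by an optimization procedure; a simpler and widely used way to construct such adversarial perturbations is the fast gradient sign method. The PyTorch sketch below assumes a pretrained classifier `model` and an input batch `x` with labels `y`; it is an illustration of the general idea, not the exact procedure from that paper.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps=0.007):
    """Fast gradient sign method: take a small step in the direction of the
    sign of the input gradient so the classifier's loss increases."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()        # small, structured noise
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in a valid range
```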
It was found that training networks with adversarial examples made them harder to fool, with only a minor impact on classification performance. Generative adversarial networks take this a step further: instead of adding noise to images, a network is used to generate adversarial examples. These two networks compete against each other, one trying to generate images to fool the CNN, and the other trying to distinguish real images from fake images. A surprising result is that as you improve your CNN, your adversarial network generates realistic images not previously found in the dataset. Yann LeCun, a pioneer who oversees AI research at Facebook, has called GANs the most interesting idea in the last 10 years in machine learning.

Most GANs today are at least loosely based on the DCGAN architecture, which stands for deep convolutional GAN. In the DCGAN architecture, most layers are batch normalized, and the model contains neither pooling nor unpooling layers. Instead, when the generator needs to increase the spatial dimensions of the representation, it uses transposed convolution with a stride greater than one. DCGANs have been able to generate high-quality images when trained on restricted domains of images, such as images of bedrooms when trained on the LSUN dataset. DCGANs have also demonstrated that GANs can do vector space arithmetic. For example, if you take the vector associated with the image of a man with glasses, subtract the vector for man, and add the vector for woman, the resulting vector maps to images that correspond to a woman with glasses.

Minibatch features, which allow the discriminator to compare an example to a minibatch of generated samples and a minibatch of real samples, have been quite successful. Minibatch GANs trained on the CIFAR-10 dataset obtain excellent results; as we can see, the produced samples on the right can generally be classified into specific CIFAR-10 classes, including cars, horses, and planes. Minibatch GANs have also been trained on 128 by 128 ImageNet images, which also produces images that are somewhat recognizable. One application of GANs is text-to-image synthesis, where the input to the model is a caption for an image and the output is an image matching that description. We can see that GANs generate images that correspond well to the description given. GANs have also been used for single-image super-resolution and have been found to produce excellent super-resolution results, as shown here. Because GANs allow for multiple correct answers, the network does not need to average over the many possible answers, producing better results. Adobe created an application called interactive generative adversarial networks, or iGAN, which allows a user to draw a rough sketch of an image, such as a black triangle and a few green lines, to produce an image of a mountain with a grassy field. Ultimately, GANs have been showing many promising results and are a very hot topic in deep learning at the moment.

We will now investigate Deep Voice. Text-to-speech, or TTS, allows human-computer interaction without requiring visual interfaces. Baidu's recent Deep Voice, a real-time neural text-to-speech system, consists of several building blocks, each one using a different deep network. The first network is a grapheme-to-phoneme model, which converts graphemes, that is, English text characters such as letters and punctuation, into phonemes, which are units of sound. For example, the input "Hello" returns the phonemes "HH", "EH", "L", "OW".
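As a toy illustration of this grapheme-to-phoneme step, the sketch below looks words up in a small pronunciation dictionary. The dictionary entries and fallback token are assumptions made only to show the input/output format; Deep Voice itself uses a trained neural network for this conversion, not a lookup table.

```python
# Toy grapheme-to-phoneme conversion via dictionary lookup (illustration only).
PRONUNCIATIONS = {
    "hello": ["HH", "EH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def graphemes_to_phonemes(text):
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")                       # drop punctuation
        phonemes.extend(PRONUNCIATIONS.get(word, ["<unk>"]))
    return phonemes

print(graphemes_to_phonemes("Hello, world!"))  # ['HH', 'EH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```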
The second network is a segmentation model, which performs phoneme-audio alignment: it identifies where in the audio each phoneme begins and ends. It is only used during training, when the ground-truth audio is present, and not during inference. The third network is a phoneme-duration and fundamental-frequency predictor. It predicts the duration of every phoneme in an utterance and the fundamental frequency, or F0, throughout the phoneme's duration. The fourth network is an audio-synthesis generator. This network combines the outputs of the grapheme-to-phoneme and segmentation models to synthesize audio at a high sampling rate. In many production-quality TTS applications, real-time inference is critical to avoid delays and provide a more natural user interaction. Intel CPUs were found to offer faster-than-real-time inference due to cache-friendly memory access and by avoiding recomputation.

We will end this lecture by discussing reinforcement learning, a branch of machine learning that allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize their performance. Reinforcement learning has a wide range of applications. It has been used to control robotic arms, for navigation, for assembly, and for logic games. Because decision processes are quite numerous, reinforcement learning has a lot of potential to solve a large number of problems in artificial intelligence.

The majority of this course focused on supervised learning, which uses training examples that are labeled. Measuring accuracy with this type of learning is quite easy, and supervised learning is usually task-driven, such as with classification. Unsupervised learning involves looking for patterns within data, as the data is not labeled. This can include clustering, and the evaluation of a model's success is usually indirect or qualitative. Finally, reinforcement learning is reward-driven: the environment provides rewards for actions, and these rewards guide the learning.

A deep reinforcement learning algorithm follows this process during training. An autonomous agent learns how to choose the optimal action in each state to achieve its goal. Every action taken by the agent in the environment produces a positive or negative reward. The goal of training is to optimize the network parameters theta to maximize the expected sum of rewards, modulated by a discount factor gamma, a value between zero and one. Note that the discounted reward R_t is a weighted sum of the reward at time t and all subsequent rewards, with rewards further in the future weighted exponentially less by the discount factor, for example gamma = 0.99: R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...

The figure below illustrates this concept. On the first row, we see the positive and negative rewards after each game; note, however, that the vast majority of the rewards are zero. On the second row of the bottom figure, we see their discounted values. The zero-valued rewards in the first row now have non-zero values, because discounting spreads each original reward backwards over the preceding time steps. The discounted reward R_t can now be used to maximize the log probability of actions that led to good outcomes and minimize the probability of those that did not, as shown here. From this, the model parameters theta are updated, taking into account the noisy gradients computed in an environment with high uncertainty and driven by a range of positive and negative rewards over time.
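To make the discounted-reward idea concrete, here is a minimal NumPy sketch of how the discounted returns R_t can be computed from a sequence of rewards and then used to weight the log-probabilities of the actions taken, in the style of a simple policy-gradient (REINFORCE-like) update. The reward values, probabilities, and normalization step are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    Computed backwards in a single pass over the episode."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Mostly-zero rewards, as in the figure: only the end of each game scores.
rewards = [0, 0, 0, 1, 0, 0, -1]
R = discounted_returns(rewards)

# Policy-gradient style loss: log-probabilities of the actions actually taken,
# weighted by the (normalized) discounted returns; minimizing this loss raises
# the probability of actions followed by good outcomes.
log_probs = np.log(np.array([0.6, 0.5, 0.7, 0.9, 0.4, 0.5, 0.3]))
advantages = (R - R.mean()) / (R.std() + 1e-8)   # common normalization
loss = -np.sum(log_probs * advantages)
print(R, loss)
```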
During today's lecture, we have covered some hot deep learning research topics: low-bit precision, GANs, Deep Voice, and reinforcement learning. In our next lecture, we will look at what Intel is doing for deep learning.