Hello and welcome to week 12, the final week of the course. As the expression goes, time flies when you're having fun. During this week we'll cover some of the more recent results in image and video processing. They are based on the simple notion of sparsity. A signal can be sparse in its native domain, for example an image of the stars in the night sky, or it can be sparse or compressible in some transform domain, such as the DCT domain. Our first reaction might be, what's the big deal? We know that an image is sparse in the DCT domain from week nine already. The big deal is that if we know that a solution to a problem is sparse, we can then solve problems for which we did not have meaningful solutions before. One such problem is the solution of an under-determined system of equations. Such a system appears in numerous applications.

In the first segment of this week, we will look at applications not just from the image and video processing domain, but also from machine learning, statistics, genetics, econometrics, neuroscience, and other domains. We will describe them at a high level in order to see under what circumstances we encounter sparse signals, or vectors in general. We will also discuss the formulation and even the solution approach. However, you will need the material we will cover in the subsequent segments before being able to fully appreciate the solution approaches we'll be following. The applications are presented here to hopefully excite you and increase your interest in the topic. We will revisit some of these applications at the end of the week. So let us proceed with this current and exciting material.

During this week, we will first talk about applications where sparsity plays a key role. These applications will be from signal and image and video processing, but also from machine learning, statistics, economics, and so on. We will then see what the roles of the l2, l0, and l1 norms are, and see, for example, that the l0 and l1 norms promote sparsity. We will then discuss specific solution approaches to optimization problems where the l0 and l1 norms are involved. When the l0 norm is involved, we'll talk about matching pursuit; this is a greedy iterative algorithm for finding a solution. When the l1 norm is involved, we'll talk about smooth reformulations of both the basis pursuit and the LASSO problems. And finally we'll go full circle and revisit some of the applications; we'll show specifically the formulation of the problem and the solution approach that is followed in obtaining a solution.

What is sparsity? As the name implies, a vector is sparse if it only has a few non-zero components while the rest are zero. Now this vector can represent a signal such as an image or a video, in which case the signal can be sparse in its native domain, if for example we have an image of the sky at night, so there are only some bright stars and the sky is dark. Or it can be made sparse in another domain; natural images, for example, are sparse in the discrete cosine transform (DCT) domain. A sparse vector, however, may originate in numerous other applications, as we will see. There is a large number of applications in which sparsity plays a key role. Of course, for the purpose of this course, image and video processing applications are the ones we are primarily interested in. But in the beginning we'll also talk about applications from machine learning in general, statistics, genetics, econometrics, neuroscience, and so forth.
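To make the norms we will keep referring to concrete, here is a minimal sketch (my own illustration, not part of the lecture slides; the vector and the smooth test signal are arbitrary choices) that computes the l0 "norm" (count of non-zeros), the l1 norm, and the l2 norm of a small sparse vector with NumPy, and checks that a smooth signal concentrates its energy in very few DCT coefficients:

    import numpy as np
    from scipy.fft import dct

    x = np.array([0.0, 3.0, 0.0, 0.0, -1.5, 0.0, 0.0, 0.0])   # a sparse vector

    l0 = np.count_nonzero(x)        # "l0 norm": number of non-zero entries -> 2
    l1 = np.sum(np.abs(x))          # l1 norm: sum of absolute values       -> 4.5
    l2 = np.sqrt(np.sum(x ** 2))    # l2 norm: Euclidean length             -> ~3.35
    print(l0, l1, l2)

    # A smooth "image row" is not sparse in the pixel domain, but its energy
    # concentrates in very few DCT coefficients (the compressibility of week 9).
    row = np.cos(2 * np.pi * np.arange(64) / 64)
    c = dct(row, norm='ortho')
    energy = np.cumsum(np.sort(c ** 2)[::-1]) / np.sum(c ** 2)
    print(np.searchsorted(energy, 0.99) + 1, "of 64 coefficients hold 99% of the energy")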
It is known that some diseases, like cystic fibrosis, are caused by alterations in a single gene or a small set of genes, and this knowledge can be utilized in creating gene-targeted therapies. However, the genetic basis of many leading causes of death in the world, such as heart disease and diabetes, is so far not clearly understood, and this limits the use of gene therapy in these situations. One promising direction is genome-wide association, that is, the association between billions of genetic loci from across the human genome and quantitative traits, such as cholesterol and glucose levels, which in turn can be used as strong predictors of the above-mentioned diseases. This problem, however, the genome-wide association, is extremely under-determined, because even for large studies there are typically a few thousand individuals but tens of thousands of genes.

We show here a linear modeling of the problem, an Ax = b formulation. A is the genotype matrix and contains genetic information about each of the individuals that took part in the study, together with their corresponding trait levels here. Each entry of this matrix shows the number of so-called alleles, the alternative forms of the same gene, for the i-th person at the j-th locus. So this matrix A is a wide matrix; for example, it can have 1,000 rows and 60,000 columns. In solving this problem, it is desired that the solution, x here, which is a long vector, is sparse. This way only a few genes are identified as giving rise to this particular trait. So for example, if this is a non-zero entry, then it is this column of the matrix that would be utilized towards the prediction of this particular trait. So this is a specific example where we are asked to solve an Ax = b problem, and the solution x is desired to be sparse.

Functional magnetic resonance imaging, fMRI, of the brain is becoming more and more popular in diagnosing disorders such as ADHD, attention deficit hyperactivity disorder. In this application, three-dimensional scans of the brain are acquired while the subject is asked to perform certain cognitive tasks. Only certain parts of the brain light up or fire up, the ones responsible for the specific task being performed. The fMRI scans can actually be combined with EEG, electroencephalography, signals. Finding which parts of the brain fire up is referred to as the source localization problem. So you see in these examples here which parts of the brain fire up while the subject, the person, is performing a rhyming task. Now, in processing such volumes, we typically divide them into small cubes which we refer to as voxels; these are volume pixels, or three-dimensional pixels. Scanning the volume, we can stack these voxels into vectors. And again, here is another application in which the vectors we deal with are sparse. So the problems we might be interested in solving are, for example, the source localization problem that I just mentioned. Or we might be interested in solving a classification problem, where the person is classified as having ADHD, for example, or not, in which case sparse logistic regression can be used. Or we might be solving a prediction problem, trying to predict if the person will develop ADHD in the future. So again, here is an example where the signals we are working with are sparse due to their nature.
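Both the genome-wide association problem and the fMRI problems above boil down to an under-determined Ax = b with a solution we expect to be sparse. As a foretaste of the solution approaches covered later this week, here is a toy sketch on entirely synthetic data; scikit-learn's Lasso is my own choice of l1-regularized solver here, not the method used in actual genetics studies, and the sizes and penalty weight are arbitrary:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)

    # Toy stand-in for the genotype matrix: far fewer "individuals" than "loci".
    n_individuals, n_loci = 100, 2000
    A = rng.normal(size=(n_individuals, n_loci))

    # Ground truth: only 5 loci actually influence the trait.
    x_true = np.zeros(n_loci)
    x_true[rng.choice(n_loci, size=5, replace=False)] = rng.normal(size=5)
    b = A @ x_true + 0.01 * rng.normal(size=n_individuals)

    # An l1-penalized fit returns a sparse x even though Ax = b is under-determined.
    model = Lasso(alpha=0.05).fit(A, b)
    print("non-zero coefficients found:", np.count_nonzero(model.coef_), "out of", n_loci)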
Econometrics allows economists to sift through huge amounts of data and extract simple and useful relationships. Here we see a map of the world, and for each country we see the GDP, the gross domestic product per capita. An interesting question is which factors correlate, in a positive or negative way, with the GDP of a country. Economists have considered hundreds of factors, such as the ones shown here: population density, size of economy, life expectancy, land area, and so on. So we put in a matrix A the actual values of these factors, and in a vector b the GDP of each country. Then we are interested in finding an x so that Ax = b. So we want to solve this linear system of equations under the constraint that the solution we obtain, x here, is sparse, similarly to the genetics problem, since we expect a small number of factors to have strong correlation, positive or negative, with the GDP, while the rest of them are irrelevant, or equal to zero. Actually, from studies, the factors shown here inside the red rectangles exhibit strong correlation, negative or positive, with the GDP. So again, we need to know how to solve an Ax = b system of linear equations under the constraint that x is a sparse solution.

Linear regression is a common problem in many application areas. Given a set of points, as shown here, we want to find a line that best describes, or best fits, these points. If we call the axes here a and b, we are given pairs of points (a_i, b_i), such as this one. And the model we use, the model of the line, is that a_i x is approximately equal to b_i. Our objective, therefore, is to find the best possible value of x to describe this data. A very common approach is the least squares approach, according to which we minimize with respect to x the sum of the squared errors. This minimization results in an optimal value x*, based on which, if a new point a_new is given to us, we can form a_new x*, which gives us the new value of b. So if here is a_new, we have found the equation of this line after we find the optimal x, and therefore we can use it to find the value b_new. We can write these equations in matrix-vector form. Our model takes the form Ax = b, where in the vector b we have stacked all the values b_i, and in the matrix A we stack the values a_i, which in general are not scalars but row vectors. Then the least squares objective is to minimize with respect to x the norm ||Ax - b||^2; this is referred to as an L2 norm, as we will see later in the class.

In a number of applications, there are sparse outliers in the data, like the ones shown here. These are clearly outliers; they don't look like the rest of the data, but they are sparse, there are just a few of them. Otherwise this would be the data. The question now is how well a line can describe this data if we follow the least squares approach that we described in the previous slide. Following a least squares approach in solving the linear regression problem, we obtain the blue line here as the optimal line for fitting the available data. Clearly, a least squares approach towards linear regression tends to fit through these outliers and therefore produces poor results. We can make linear regression robust by introducing a sparse error term that complements the linear model. That means we are now using the model Ax + e approximately equal to b, where e is the sparse error term, and the objective is to find the optimal x* to describe the data. So we need ways to solve this problem while taking into account the sparsity of e, and as you will see, we have means, through the use of the L0, L1, and Lp norms, to accomplish this. The resulting method is referred to as the least absolute deviations approach. In some other applications, we might be interested not just in robustifying the least squares problem, as we described here, but also in finding the sparse term e itself. Following the least absolute deviations approach with this toy example here, we obtain the red curve as the best curve to fit the data. So this is LAD, least absolute deviations, and this is least squares. Clearly we are now doing a much better job in describing the data and not fitting to the outliers.
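To illustrate the difference numerically, here is a small, self-contained sketch on synthetic data of my own; the least absolute deviations fit is approximated by a simple iteratively reweighted least squares loop, which is only one of several ways to compute it and is not an algorithm prescribed by the lecture:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    a = rng.uniform(0, 10, size=(n, 1))
    A = np.hstack([a, np.ones((n, 1))])          # columns: slope term, intercept term
    b = 2.0 * a[:, 0] + 1.0 + rng.normal(0, 0.3, size=n)
    b[:3] += 25.0                                 # a few sparse outliers

    # Ordinary least squares: minimizes ||Ax - b||_2^2 and is pulled toward the outliers.
    x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

    # Least absolute deviations (minimize ||Ax - b||_1), approximated by
    # iteratively reweighted least squares; it largely ignores the sparse outliers.
    x = x_ls.copy()
    for _ in range(50):
        w = 1.0 / np.maximum(np.abs(A @ x - b), 1e-6)   # weights ~ 1 / |residual|
        x, *_ = np.linalg.lstsq(np.sqrt(w)[:, None] * A, np.sqrt(w) * b, rcond=None)

    print("least squares         (slope, intercept):", np.round(x_ls, 2))
    print("least abs. deviations (slope, intercept):", np.round(x, 2))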
Matching consumers with the most appropriate products is key to enhancing user satisfaction and loyalty. Therefore, more and more retailers are interested in recommender systems: systems which predict the preference of a user for a product, based on which the retailer can then recommend, or not recommend, the product to the user. The version of the problem addressed by a recommender system is described here. It relates to the so-called Netflix problem, since it was posed by the provider of streaming movies here in the US, Netflix. So along this dimension we have the various movies, and along this one the various users. When a movie is rented and watched by a viewer, he or she is asked to rate the movie, let's say from zero to five, or from zero stars to five stars. So this person here watched this movie and gave it one star, while this person watched this movie and gave it four stars. Each user has watched and rated several movies, but of course there is a large number of other movies that he or she has not watched. So we don't know the preference of this user for this movie, and the retailer would like to predict this preference, so that the retailer can decide whether to recommend or not recommend this movie to this user.

So the retailer has an incomplete matrix; some of its entries are known, but very many others are not, and the retailer would like to complete this matrix. The problem therefore becomes a matrix completion problem. By the way, there are thousands of movies but millions of users, so this is really a very large matrix. The matrix completion problem appears in various other applications as well. Clearly, without additional information or additional constraints, we cannot provide a meaningful solution to this problem. The additional information that we have with the problem at hand is that there are typically a few types of people, or, to put it differently, similar people like similar movies. So if these two users have similar preferences in movies, then this column of the matrix and this column will be very similar, or they may even be identical. A matrix with many similar columns, or groups of similar columns, is of low rank. The matrix completion problem therefore becomes a rank minimization problem. A low-rank matrix has a number of zero singular values; therefore, the vector of singular values of the low-rank matrix is sparse. It is the nuclear norm of the matrix, which is equal to the sum of the singular values of the matrix, that is typically minimized, subject to the requirement that the fidelity to the data is preserved. So here is yet another example where we encounter a sparse vector, and this vector in this particular case contains the singular values of the matrix we try to complete, since this matrix is a low-rank matrix.
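As a quick illustration of why the singular value vector of such a matrix is sparse, here is a toy sketch (the rank-3 ratings-style matrix is my own invention) that builds a low-rank matrix and inspects its singular values and its nuclear norm:

    import numpy as np

    rng = np.random.default_rng(0)

    # A ratings-style matrix of rank 3: every user is a mix of three "taste types".
    n_movies, n_users, rank = 200, 500, 3
    M = rng.random((n_movies, rank)) @ rng.random((rank, n_users))

    s = np.linalg.svd(M, compute_uv=False)          # singular values, largest first
    print("singular values above 1e-8:", int(np.sum(s > 1e-8)))   # -> 3, a sparse vector
    print("nuclear norm (sum of singular values):", round(float(s.sum()), 2))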
Here is the denoising, or noise smoothing, problem again. We addressed this problem at various points during the course; for example, we talked about enhancement during week five, and also when we talked about recovery during weeks six and seven. We can look at the image denoising problem from a different angle as well. We know from the discussion of week nine, when we talked about compression, that we can transform this image to a different domain, and in that different domain only a few transform coefficients can do a good job of representing the image. Take, for example, the DCT transform. So if we consider a patch of this image, let's call it y_i, then we can write that y_i = A x_i. A here represents either the basis of the DCT transform, in which case it is a fixed dictionary, or it can be designed specifically for classes of images, in which case it can be an overcomplete dictionary, a wide matrix. The important point for the current discussion is that this vector x_i is a sparse vector, which means that only a few coefficients of this transform are needed to provide a good representation, or a good approximation, of this patch.

So, by and large, we are interested in solving problems of this nature: y_i is the observed noisy patch, A is the dictionary, either fixed or designed from the data, and we are interested in solving for x_i. We can obtain a solution to this problem by minimizing with respect to x_i the L2 norm ||y_i - A x_i||^2 plus lambda times the L1 norm of the coefficient vector. As we will see, the L1 norm promotes sparsity. Since noise is not sparse in the transform domain, by solving this problem we are going to reconstruct the smooth patch. Of course, we can solve this problem for each and every patch of the image, or we can combine all the patches and rewrite this equation in terms of matrices. So, if we follow such an approach, here is the resulting denoised image. We are going to talk about the details of this approach later in this lecture.

We talked about image inpainting when we covered image recovery during weeks six and seven. It is indeed a recovery problem: information is lost due to some scratches in the image, as depicted here, or due to the superposition of text on the image. We want to recover this lost information, that is, the pixel intensities, which I represent again here as black pixels. While there are a number of techniques published in the literature providing solutions to this problem, we want to also obtain solutions utilizing the sparsity machinery we are discussing this week. We will use ideas similar to the ones we used for denoising. So if we look at a patch of this image, denoted by y_i, we will use the model y_i = R A x_i to represent it. A is the dictionary again, which is either fixed or designed utilizing the data, x_i is the sparse vector that provides this approximation, and R is the mask which allows us to use this representation for the pixels that we know, but not for the pixels which we do not know, the black pixels, in other words, inside this patch. We are again going to obtain a sparse solution for x_i. We can accomplish that by obtaining x_i* here as the argument that minimizes with respect to x_i the L2 norm ||y_i - R A x_i||^2 plus lambda times the L1 norm of the representation vector, which again promotes sparsity. We can then utilize x_i* to obtain the inpainted image as A x_i*. Putting this machinery to work, we obtain these results here: we see that the information that was lost due to the scratches and the text superposition is fully recovered.
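Both the denoising and the inpainting formulations above have the form: minimize over x the quantity ||y - A x||_2^2 + lambda ||x||_1. One standard way to solve such problems, shown here purely as a preview and not necessarily the reformulation approach the later segments will present, is iterative soft-thresholding (ISTA). The dictionary, the sizes, and the value of lambda in this sketch are arbitrary choices of mine on synthetic data:

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def ista(A, y, lam, n_iter=300):
        """Minimize 0.5*||A x - y||_2^2 + lam*||x||_1 by iterative soft-thresholding."""
        step = 1.0 / np.linalg.norm(A, 2) ** 2       # 1 / Lipschitz constant of the gradient
        x = np.zeros(A.shape[1])
        for _ in range(n_iter):
            x = soft_threshold(x - step * (A.T @ (A @ x - y)), lam * step)
        return x

    rng = np.random.default_rng(0)
    A = rng.normal(size=(64, 256))                   # overcomplete "dictionary"
    x_true = np.zeros(256)
    x_true[rng.choice(256, size=5, replace=False)] = rng.normal(size=5)
    y = A @ x_true + 0.05 * rng.normal(size=64)      # noisy observed "patch"

    x_hat = ista(A, y, lam=0.1)
    print("non-zeros in the estimate:", np.count_nonzero(x_hat), "out of 256")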
We also mentioned super-resolution during weeks six and seven, when we talked about recovery. It is indeed a recovery problem: information is lost during the downsampling process. We observe an image like this one, and we want, for example, to increase its resolution by a factor of four in each dimension. There are many techniques in the literature providing solutions to this SR problem. We can also provide solutions utilizing dictionaries and the sparsity machinery we are discussing this week. We assume we have a set of low-resolution and corresponding high-resolution images. Utilizing this data, we train low-resolution and high-resolution dictionaries simultaneously, relating the low-resolution and high-resolution patches. The same sparse coefficient vector provides the representation of both the low-resolution patch and its corresponding high-resolution patch. So if we take this patch here and denote it by y_low, then, as we already mentioned, we can find a sparse x* to represent y_low as A_low x*, and also y_high is equal to A_high, the high-resolution dictionary, times the same sparse representation coefficient vector x*. So, utilizing this approach, we observe this image and we can reconstruct this one; with the image at hand here, we can obtain the higher-resolution image shown here. If we blow up this patch here in yellow by using pixel duplication we get this image, while if we blow up this part here we obtain this image. So clearly there is a lot of additional resolution that has been recovered by the method we discussed here.

A difficult problem in video surveillance is to separate the background from the foreground in a video. Here is an example of such a video. This is an old problem in video processing and computer vision that a lot of people have looked at. Using the sparsity machinery we are discussing this week, we can also obtain a solution to this problem. We put all the observed data into one matrix, and then decompose this matrix into a low-rank matrix, as was the case in matrix completion, plus a sparse matrix. The low-rank matrix represents the background; this is a video, but it looks like a still image because nothing changes in the background. The sparse matrix represents the foreground. So we see that we can obtain a very good separation of the foreground and the background using the sparsity machinery that we will cover this week. We will provide the details of the solution to this problem in the later lectures.
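For a flavor of how such a low-rank plus sparse decomposition can be computed, here is a heavily simplified, self-contained sketch: a toy alternating-shrinkage scheme on synthetic data of my own, not the exact principal component pursuit algorithm that will be detailed later, and with threshold values picked for this toy example only:

    import numpy as np

    def svt(M, tau):
        """Singular value thresholding: shrink the singular values of M by tau."""
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

    def soft(M, tau):
        """Entrywise soft thresholding: shrink each entry of M toward zero by tau."""
        return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

    def low_rank_plus_sparse(D, tau=1.0, lam=0.1, n_iter=100):
        """Alternately estimate L (low-rank background) and S (sparse foreground)."""
        L = np.zeros_like(D)
        S = np.zeros_like(D)
        for _ in range(n_iter):
            L = svt(D - S, tau)     # shrinking singular values promotes low rank
            S = soft(D - L, lam)    # shrinking entries promotes sparsity
        return L, S

    # Toy "video": each column is a frame = static background + a few bright pixels.
    rng = np.random.default_rng(0)
    background = np.outer(rng.random(400), np.ones(50))                  # rank-1
    foreground = (rng.random((400, 50)) < 0.02) * rng.random((400, 50))  # sparse
    L, S = low_rank_plus_sparse(background + foreground)
    print("leading singular values of L:",
          np.round(np.linalg.svd(L, compute_uv=False)[:4], 2))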
Another application where we deal with sparse vectors is robust face recognition. Given a database of faces, as shown here, we form a face dictionary, let's call it A; in this A, we stack each and every image of the database into a column of the matrix. Then, given a query image such as this one, if we call it b, b is approximately equal to Ax, where x here is a sparse vector indicating which entries in the database this b corresponds to. So x can have just one entry equal to one if there is a perfect match of this query image in the database, or the query image can be a combination of images in the database, in which case x will have more than one non-zero entry, but nevertheless it is sparse. Here is another query image. In both these cases we have abnormalities, such as a scratch or an occlusion. Therefore, the model we end up working with is that b is approximately equal to Ax + e, where e is an error term that takes the scratches or the occlusion into account, and this vector e is also sparse. Therefore, we need to be able to solve problems of this nature, where we constrain both x and e to be sparse when we solve for x, and thereby we try to identify a person that exists in the database.

Compressive sensing, or compressive sampling, is another important application area where sparsity plays a key role. We show here an N by N image; it therefore has N squared samples. We talked about Nyquist's theorem earlier in the course, according to which we need to sample with a sampling frequency at least twice as high as the highest frequency in the signal; for images we are talking about spatial frequencies in the two dimensions. According to compressive sampling, we can reconstruct the original image with very little or no error if we acquire fewer samples than those dictated by Nyquist's theorem. The sampling, however, is now not done with deltas, where we go and pick out the value of each and every pixel, but by projecting the image onto random bases. Projection means that we form the inner product of the image with a random matrix, and this inner product gives a scalar which represents one sample. We then end up with an under-determined system of equations, which we can solve efficiently, as we will see later in the class, while enforcing sparsity; the image is sparse either in its original domain or in a transform domain. So, by acquiring only 50% of the samples, in other words N squared over 2 measurements, and carrying out the reconstruction, we obtain this image. It is almost indistinguishable from the original image. Utilizing only 25% of the samples, so N squared over 4, we see this reconstruction. And if we go down all the way to 10%, N squared over 10, we see the reconstruction shown here. So we are going to talk about this application later in this course in considerably more detail.
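As a closing illustration, here is a one-dimensional compressive sensing toy of my own (real image experiments are much larger and use a sparsifying transform), in which a sparse signal is recovered from far fewer random projections than samples using orthogonal matching pursuit, a greedy method from the same family as the matching pursuit algorithm mentioned in the week overview:

    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    rng = np.random.default_rng(0)

    # A k-sparse signal of length n, observed through m < n random projections.
    n, m, k = 1024, 256, 12
    x_true = np.zeros(n)
    x_true[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)

    Phi = rng.normal(size=(m, n)) / np.sqrt(m)   # random measurement matrix
    y = Phi @ x_true                              # m inner products, not n pixel values

    # Greedy recovery exploiting the known sparsity of the signal.
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k).fit(Phi, y)
    print("relative reconstruction error:",
          np.linalg.norm(omp.coef_ - x_true) / np.linalg.norm(x_true))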