In this final section, we'll look at a few of the applications of sparsity. We have really just touched the tip of the iceberg; it is impressive how many different problems, both in image processing and machine learning, have been solved using the ideas we presented this week. We will discuss, for example, the noise smoothing (denoising), inpainting, and super-resolution problems, which are inverse recovery problems that we also discussed back in weeks six and seven. The problems are the same, but their formulation and solution approaches are different now. We will also discuss some newer problems, such as the foreground-background separation problem, solved through rank minimization and robust PCA. We will finally discuss the compressive sensing problem, which has given rise to a lot of activity in the area in the recent past. So let us have a closer look at these applications now. In forming the observation vector b, we typically take the image, like the one shown here, and break it into blocks or patches. So here is a patch out of that image. We then stack this patch, the sub-image, lexicographically into a vector, as shown here. So if this is, for example, an 8 by 8 patch, it becomes a 64 by 1 vector. In forming the dictionary matrix A, we first look at the case when the dictionary is fixed. So if the DCT is used as a dictionary, we show here the 8 by 8 bases of the DCT. We have looked at this image multiple times, for example when we studied image and video compression in weeks nine and ten. We take each of these two-dimensional bases and turn it into a vector, and this becomes a column of the dictionary matrix A; these columns are also called the atoms of the dictionary. In this case we have a square dictionary, 64 by 64 in this particular example. We demonstrate here with a simple experiment that natural images are sparsely represented over the DCT.
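As an illustration of this construction, here is a minimal numpy sketch (the function names and the random test patch are mine, not from the lecture) that builds a 64 by 64 DCT dictionary whose columns, the atoms, are the vectorized 8 by 8 DCT bases, and projects a lexicographically stacked patch onto it:

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal 1-D DCT-II basis as an n x n matrix (rows are basis vectors)."""
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n)) * np.sqrt(2.0 / n)
    C[0] /= np.sqrt(2)
    return C

def dct_dictionary(n=8):
    """(n*n) x (n*n) dictionary whose columns (atoms) are vectorized 2-D DCT bases."""
    C = dct_matrix(n)
    # Each 2-D basis is the outer product of two 1-D bases; the Kronecker
    # product stacks all of them lexicographically at once.
    return np.kron(C, C).T          # columns are the atoms

A = dct_dictionary(8)               # square 64 x 64 dictionary
patch = np.random.rand(8, 8)        # a toy 8 x 8 patch
b = patch.flatten()                 # lexicographic stacking: 8x8 -> 64x1
x = A.T @ b                         # orthonormal dictionary: coefficients by projection
```

Because the dictionary is orthonormal here, A x reproduces the patch exactly; the interest later is in keeping only a few large entries of x.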
This is, of course, something we know very well from weeks nine and ten, when we covered image and video compression. Given this image, we take its discrete cosine transform and order the coefficients in decreasing order; this is, by the way, a log scale. Now, if we keep only 25% of the largest coefficients, in other words we throw away everything above this line, then this is the reconstructed image. If we keep only 10% of the largest coefficients, everything above this line is now thrown away, and this is what we obtain. And if we keep only 1% of the largest coefficients, so what is below this blue line, this one goes away as well, and this is the reconstruction. The first two images look almost indistinguishable from the original one, so this demonstrates the energy compaction of the discrete cosine transform. The last one is also of good quality, although, when examined side by side with the original, one can see that smoothing has taken place. We show here an 8 by 8 patch of this image, which can be accurately represented by 14 out of the 64 bases of the DCT-based dictionary. We do the same experiment with an 8 by 8 patch from a noisy image, and we see that in this case we need at least 32 out of the 64 bases to obtain a reasonable representation of the patch. Another way to describe this is to say that noise is not compressible, which is something we know, since a noise signal does not have any correlations or structure in it that we can exploit in order to compress it. This property is useful to us when we deal with the denoising problem, simply because the signal can be sparsely represented utilizing a dictionary, such as the DCT, while the noise component cannot. So in seeking a sparse representation of the signal, we know that we will not be reconstructing the noise component, but just the signal component. Here is the formulation of the image denoising problem. B contains the patches of the input noisy image.
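The keep-the-largest-coefficients experiment can be sketched as follows. This is a toy version on a synthetic smooth image, not the lecture's data; the small orthonormal DCT matrix is built inline so the snippet is self-contained:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal 1-D DCT-II matrix (rows are basis vectors)."""
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n)) * np.sqrt(2.0 / n)
    C[0] /= np.sqrt(2)
    return C

def keep_largest(img, fraction):
    """Keep only the largest `fraction` of 2-D DCT coefficients, zero the rest, invert."""
    C = dct_matrix(img.shape[0])
    coeffs = C @ img @ C.T                        # separable 2-D DCT
    k = max(1, int(fraction * coeffs.size))
    thresh = np.sort(np.abs(coeffs).ravel())[-k]  # k-th largest magnitude
    coeffs = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)
    return C.T @ coeffs @ C                       # inverse 2-D DCT

img = np.add.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))  # smooth ramp image
approx = keep_largest(img, 0.01)                  # keep only 1% of the coefficients
err = np.linalg.norm(img - approx) / np.linalg.norm(img)
```

For this smooth image the relative error with 1% of the coefficients is tiny, which is exactly the energy compaction being described; a noise image run through the same function degrades badly.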
Each patch is a column in this matrix B, while A is the dictionary. If the dictionary is fixed, if we use, for example, the DCT, then we solve this lasso-type problem by minimizing with respect to X. However, we can use a dictionary which we learn from the data, in which case we minimize with respect to A as well. This is then the bi-convex dictionary learning problem, which we described earlier, and we alternate the optimization between A and X: with A fixed, we are solving a lasso problem; with X fixed, we solve a least-squares type of problem, for which we have a closed-form solution. Of course we can use the L0 norm as well, in which case we follow the MOD approach. So given the sparse representation X, AX*, the optimal solution, forms the recovered image. We show here an experimental result. This is the original image. This is the observed noisy image; the peak signal-to-noise ratio is 22 dB. Utilizing a fixed dictionary based on the DCT, here is the solution we obtain with the method shown in the previous slide. So this is the noise-free or denoised image, with a PSNR of 33.5 dB. We have introduced the image inpainting problem multiple times already. So B is the observed image, and certain pixels are missing. Which pixels are missing is indicated by this matrix R, the degradation matrix or degradation mask. So for the pixels that are not missing, whose value we know, we try to minimize this error. This is, again, the Frobenius norm, since all quantities here are in matrix form. So we solve this lasso problem with respect to X. Having found an optimal X*, we then form the recovered image as AX*. The formulation is shown for the case when the dictionary is fixed. We can, however, design the dictionary according to what we covered two slides ago, and carry out the minimization with respect to A as well. So here is an experimental result. This is the original image.
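For the fixed-dictionary case, the lasso step can be sketched with a small ISTA solver (iterative shrinkage-thresholding, one standard way to solve the lasso; the dictionary below is a random orthonormal stand-in rather than the DCT or a learned one):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, n_iter=200):
    """ISTA for the lasso: min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2        # step size from the gradient's Lipschitz constant
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
    return x

rng = np.random.default_rng(0)
A = np.linalg.qr(rng.standard_normal((64, 64)))[0]   # orthonormal stand-in dictionary
b = A[:, :5] @ rng.standard_normal(5) + 0.01 * rng.standard_normal(64)  # sparse patch + noise
x_star = ista(A, b, lam=0.1)
denoised = A @ x_star                                # A x* forms the recovered patch
```

For an orthonormal dictionary the lasso solution reduces to soft-thresholding the projection coefficients, which is why small coefficients, mostly noise, are discarded.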
And this is the degraded image, where 50% of the pixels are set to black, or missing. Utilizing the method I just described with a fixed, DCT-based dictionary, here is the resulting inpainted image. It is certainly a very satisfactory result. It should be mentioned here, however, that in carrying out the inpainting as described here, we assume that the location of the missing pixels, that is, the matrix R included in the formulation in the previous slide, is exactly known. In general this may not be the case, and one needs to first identify which pixels need to be inpainted before the inpainting process is applied. We have talked about the image super-resolution problem multiple times. We first introduced it in weeks six and seven, when we talked about image recovery. The problem is, given a low-resolution image, to find an estimate of a high-resolution version of it. We describe here a formulation of the problem and a solution approach when dictionaries and sparse representations are utilized. According to this approach, there is a training and a testing phase. During training we assume that we have pairs of images, each pair consisting of a low-resolution image and its corresponding high-resolution version. So during training, we design both a low-resolution dictionary and a high-resolution dictionary, and the sparse representation of the signal utilizing both the low- and the high-resolution dictionaries. The important point here is that this sparse representation is identical no matter which dictionary is used. So this term is the Frobenius error regarding the low-resolution data, this is the Frobenius norm of the error for the high-resolution data, and here is the requirement that the representation is sparse. I should mention here that we use a bilinear interpolation of the low-resolution image, so the sizes of the low- and high-resolution dictionaries are the same, and therefore this vector here is identical.
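A toy version of the masked-lasso inpainting step, under the same assumption as in the slide that the mask of known pixels is given; the dictionary and data are synthetic stand-ins:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def inpaint(A, b, known, lam=1e-3, n_iter=500):
    """Sparse-code using only the known pixels; A @ x then fills in the missing ones."""
    Ao, bo = A[known], b[known]          # keep only rows of observed pixels (the mask R)
    L = np.linalg.norm(Ao, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):              # ISTA on the masked lasso
        x = soft_threshold(x - Ao.T @ (Ao @ x - bo) / L, lam / L)
    return A @ x                         # recovered patch, missing pixels filled in

rng = np.random.default_rng(1)
A = np.linalg.qr(rng.standard_normal((64, 64)))[0]   # stand-in 64 x 64 dictionary
x_true = np.zeros(64)
x_true[:4] = rng.standard_normal(4)                  # patch sparse over the dictionary
b = A @ x_true                                       # full (partly unobserved) patch
known = rng.random(64) > 0.5                         # mask: about 50% of pixels observed
recovered = inpaint(A, b, known)
```

The key point is that the data term only touches the observed rows; the sparse code, and through it the dictionary, extrapolates the missing pixels.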
Then, during the reconstruction phase, a new low-resolution image is observed, and utilizing the low-resolution dictionary, we try to find the optimal sparse representation of this particular image. Then, having X* here, we utilize the high-resolution dictionary to obtain the high-resolution image. Here is a low-resolution image of which we want to find the high-resolution version. Utilizing the method I just described in the previous slide, here is the high-resolution version of it. The interpolation factor is two in each dimension. If I look at a specific block of the first image and the corresponding block in the high-resolution image and blow up these two blocks, this is how they look. So, clearly, interpolating the low-resolution version with a zero-order hold gives this image, while this one is obtained from the high-resolution image. Clearly, there are all these jagged edges here which are not present in the high-resolution version of the image. We come now to the robust face recognition problem. We assume that for each person we have multiple versions of their face, each image under different lighting conditions and slight pose variations. So each of these images forms a vector which becomes a column in the dictionary for the nth person here. All these individual face dictionaries are then concatenated, as shown here, to form the face dictionary for the whole database. Now we are given a query image, and we try to find whether this person can be found in the database. Since the query image has aberrations, such as scratches, the model we use is that b is approximately equal to Ax; so this is the representation according to the dictionary, plus a noise term. So x is sparse, because this face is one of the many faces in the database, and the error e is also sparse, since it only represents a small part of the whole image. The formulation we follow in solving this problem is shown here.
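The coupled-dictionary reconstruction phase, one sparse code shared by the low- and high-resolution dictionaries, can be sketched as below. Both dictionaries are random orthonormal stand-ins of equal size, matching the bilinear-interpolation setup above; nothing here is trained on real image pairs:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(A, b, lam=1e-3, n_iter=200):
    """ISTA lasso solver: find a sparse x with A x close to b."""
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
    return x

rng = np.random.default_rng(2)
A_low = np.linalg.qr(rng.standard_normal((64, 64)))[0]    # low-resolution dictionary
A_high = np.linalg.qr(rng.standard_normal((64, 64)))[0]   # high-resolution dictionary
x_true = np.zeros(64)
x_true[[3, 17, 40]] = [1.0, -0.5, 0.8]                    # shared sparse code

b_low = A_low @ x_true                   # observed low-resolution patch
x_hat = sparse_code(A_low, b_low)        # code it over the low-res dictionary
b_high = A_high @ x_hat                  # same code + high-res dictionary -> high-res patch
```

The design choice being illustrated: sparse coding happens only in the low-resolution domain; the high-resolution dictionary is used purely as a synthesizer for the shared code.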
We minimize the L1 norm of x, the sparse representation, plus lambda, a weight, times the L1 norm of the error; so both the error and x are sparse signals. And the constraint is that Ax plus the error equals the observation. To obtain a solution to this problem, we introduce z, this concatenated vector, and if the matrix F is equal to this, then we can write the problem as shown here: minimize the L1 norm of z under the constraint that Fz equals b, and this is clearly the basis pursuit problem. We looked at this problem earlier this week, and we know certain ways to obtain a solution. Utilizing the methodology presented in the previous slide, for this query image, here is the reconstructed error image, and here is the reconstructed face image. The sparse representation vector x* looks like this, so the reconstructed face image is a combination of a number of images in the database. Very often, after the face is reconstructed, we are interested in identifying the person. Recall that the overall dictionary A consists of the dictionaries that belong to individual persons. So we either find the largest values in the representation x*, or we find the reconstruction error per sub-dictionary, per dictionary belonging to an individual. If, for example, in this particular case we find that the largest value here belongs to the dictionary of the nth person, this means that the person reconstructed here is identified as the nth person in the database. As mentioned at the beginning of this week's lecture, we can use sparsity ideas to solve the surveillance type of problem, which consists of the separation of a video sequence into the background frames and the foreground, or moving-parts, frames. The basic idea here is that if we look at the background images, if we actually take each frame, form a vector, and put each vector as a column into a matrix L, then this matrix L, which consists of the background images, will have very similar columns, assuming the background is stationary.
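The reformulation with the concatenated vector z and the matrix F = [A I] can be sketched as follows. Here I approximate basis pursuit with a small-lambda lasso solved by iterative shrinkage, and the "face" dictionary is random, purely for illustration:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(3)
m, n = 50, 30
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)               # columns = normalized "face" vectors
F = np.hstack([A, np.eye(m)])                # F = [A  I], so F z = A x + e

x_true = np.zeros(n); x_true[[2, 7, 19]] = [1.0, 0.6, -0.8]   # sparse face code
e_true = np.zeros(m); e_true[[4, 11, 30, 41, 48]] = 0.5       # sparse occlusion error
b = A @ x_true + e_true

# Small-lambda lasso as a stand-in for basis pursuit: min ||z||_1  s.t.  F z = b
lam, L = 0.01, np.linalg.norm(F, 2) ** 2
z = np.zeros(n + m)
for _ in range(2000):
    z = soft_threshold(z - F.T @ (F @ z - b) / L, lam / L)
x_hat, e_hat = z[:n], z[n:]                  # split back into face code and error
```

Because the identity block is part of F, the observation can always be fitted, and the L1 penalty decides how to split it between a sparse combination of faces and a sparse error.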
Therefore, this is a low-rank matrix; indeed, if each frame were absolutely identical, it would be a rank-1 matrix. So the question is how to model this low-rank matrix. And regarding the moving part, or foreground: since it only occupies a small part of the frame, the matrix E, formed by these frames becoming its columns, is a sparse matrix. Of course, we know how to model sparsity, by utilizing either the L0 or the L1 norm. So let us see what it takes to solve this surveillance problem, the separation of a video sequence into background and moving parts, or foreground. In proceeding with the original problem, the foreground-background separation, the first step we have to take is to be able to approximate a matrix B by another matrix L so that the rank of L is less than or equal to a given rank k, where the approximation is in terms of this Frobenius norm. The solution to this problem can be found by looking at the singular value decomposition of B: the u_i are the left singular vectors, the v_i the right singular vectors, and the s_i the singular values. B here is rank r. So this is how the decomposition of B looks, with the singular values ordered from largest to smallest. Given the singular value decomposition of B, the best matrix L that approximates B, again in terms of this Frobenius norm, is found by keeping the k largest singular values of the decomposition. We show here the formulation of the background-foreground separation problem. B are the observed data. In L we represent the background, so this is the low-rank matrix, while in E we represent the foreground, so this is the sparse matrix. So we want to minimize the error between these terms, utilizing the Frobenius norm, while making sure that E is sparse and L is low rank; that is, the rank of L is below a given number k here. We will follow two different directions in solving this problem.
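The rank-k approximation step can be sketched directly with numpy's SVD; this is the Eckart-Young result the slide refers to. The "frames" below are synthetic near-identical columns:

```python
import numpy as np

def best_rank_k(B, k):
    """Best rank-k Frobenius-norm approximation of B: keep the k largest singular values."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

rng = np.random.default_rng(4)
frame = rng.random(100)                                  # one vectorized "background" frame
# 20 nearly identical columns: a stationary background plus tiny perturbations
B = np.outer(frame, np.ones(20)) + 0.01 * rng.standard_normal((100, 20))
Lk = best_rank_k(B, 1)                                   # near-identical columns -> rank 1
```

The approximation error is exactly the energy in the discarded singular values, which here is just the small perturbation.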
According to the first one, we find the singular value decomposition of L, so the matrix Sigma has the singular values in descending order. Then the problem is reformulated as shown here. The objective is the same; however, the constraint is now in terms of the matrix Sigma: we want the L0 norm of this matrix to be below k. A different formulation of the problem is shown here, utilizing the so-called nuclear norm. The nuclear norm is the sum of the singular values of the matrix, so in other words, it is the L1 norm of the matrix Sigma. The L1 being a convex norm, we can then form this unconstrained optimization problem. The second direction we can follow is to decompose the L matrix, the low-rank matrix, into the product of A and X. The rank of L is determined by the number of columns in the matrix A. So if, say, L is N by M, we make A an N by k matrix, and then X is k by M, where again k is the desired rank of the matrix. With this decomposition, we just minimize with respect to A, X and E this difference here, in the Frobenius norm, plus the L1 norm of E, so that we make sure that E is sparse. In other words, the constraint that the rank of L is less than k has been absorbed into this decomposition, into the number of columns of matrix A. So here is a result we showed at the beginning of this week's lecture. Here is the observed video B. Here is the low-rank L representing the background; it looks as if nothing moves, it is a stationary background. And here is the foreground, so this is the sparse matrix E. And of course we know how to obtain L and E based on the discussion of the previous slide. We show here the traditional way of acquiring a signal or an image. We use a sensor; we sample at Nyquist, acquiring N samples; then we exploit the redundancy or the structure in the data and compress the data, so that now K samples are used to represent the image, where K is much smaller than N.
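A minimal sketch of the nuclear-norm direction: alternating singular-value thresholding for L with soft thresholding for E. This is a simple alternating scheme in the spirit of robust PCA, not the exact algorithm from the slides, and the thresholds are hand-picked for the toy data:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def separate(B, tau_l=0.5, tau_e=0.1, n_iter=100):
    """Alternate exact minimizations: L absorbs the low-rank part, E the sparse part."""
    L, E = np.zeros_like(B), np.zeros_like(B)
    for _ in range(n_iter):
        L = svt(B - E, tau_l)       # proximal step for the nuclear-norm term
        E = soft(B - L, tau_e)      # proximal step for the L1 term
    return L, E

rng = np.random.default_rng(5)
background = np.outer(rng.random(60), np.ones(30))   # static scene: identical columns
foreground = np.zeros((60, 30))
foreground[10:15, 5] = 2.0                           # small moving object in one frame
B = background + foreground
L, E = separate(B)
```

After convergence, L holds the near-rank-1 background and E the few large foreground entries, mirroring the surveillance result described above.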
The compressed image is transmitted or stored, and then either received or retrieved, and from the K samples, using decompression, we end up with a reconstructed image utilizing N samples again. So, sampling here by delta functions is dictated by the Nyquist theorem, and the idea is to try to do better than that, in some sense, by combining the sampling and compression steps. So the idea of compressive sensing is to directly acquire the compressed data; we combine, again, sampling and compression. We acquire now M samples, where M is much less than what is dictated by Nyquist, and typically greater than what we can obtain by compressing data acquired according to Nyquist. The pattern is the same: the data is either transmitted or stored, received or retrieved, and then from the M samples, using a reconstruction procedure, we can obtain a representation of the original scene. The question here is how to obtain these M samples, and as you will see, we do so by using generalized measurements. Let us assume, without loss of generality, that the signal x we are sensing is sparse in its original or spatial domain. So it is K-sparse, meaning that K of its elements are non-zero, or it is sparse with respect to a basis or a dictionary. With the traditional sampling approach, the sampling matrix is an identity: deltas are used to sample the pixel locations, so these values transfer directly to the observed data. With compressed acquisition, we use a sampling matrix A here which is non-diagonal. It is a fat, or wide, matrix, because the number of acquired samples M is less than N, the dimension of the vector x. x is again K-sparse, and now each acquisition is formed by carrying out the inner product of the corresponding row of A with x. So the compressive acquisition directly acquires a condensed representation, with little or no information loss, through this dimensionality reduction. The important question is, what should the sampling matrices be equal to?
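The acquisition step itself takes only a few lines. The random Gaussian sampling matrix below is one standard choice, and the sizes are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(6)
N, M, K = 256, 64, 5                 # signal length, measurements, sparsity
x = np.zeros(N)
x[rng.choice(N, K, replace=False)] = rng.standard_normal(K)   # K-sparse signal
A = rng.standard_normal((M, N)) / np.sqrt(M)   # random fat sampling matrix, M << N
b = A @ x                            # each entry of b is one inner product <row of A, x>
```

The 1/sqrt(M) scaling keeps the energy of the measurements close to that of the signal, which is the norm-preservation property that makes random matrices work here.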
And the interesting, and maybe surprising, result is that the sampling matrices are random matrices. In acquiring each and every compressed acquisition, we carry out the inner product between the original image and one of these matrices. The inner, or dot, product gives a scalar, which is just one specific acquisition. In the previous slide, I treated this as a row of A multiplying x; that is the inner product of two vectors, which is exactly the same thing, since if I vectorize these images, the matrix inner product becomes a vector inner product. So this is a powerful result: we can use random matrices as the sampling matrices to obtain these compressed acquisitions. Suppose we have this one-dimensional signal. It has N samples, and we are allowed to use only a certain percentage, let us say 10%, of the samples in representing the signal. If we use the traditional, Nyquist, delta-based sampling and use impulses to pull out elements of the signal, most probably we are going to miss some of the important elements of the signal, which are these impulses. We do not know the locations of the impulses, and therefore random or equally spaced deltas are going to miss some of them. When we perform compressed acquisition, however, we carry out the inner product of the signal with these random sequences, and clearly each compressed sample is now going to contain information about all of these impulses. So the signal is local, but the measurements are global, and, as already mentioned, each compressed acquisition contains a little information about each and every component of the signal. Therefore, carrying out a number of them, we have multiple pieces of information about the deltas in the signal, and therefore we are able to reconstruct the original signal from these compressed acquisitions. We mentioned earlier that we assumed, without loss of generality, that the signal or image we are trying to acquire is sparse in its native domain.
However, this is not a strict constraint, because, as we know, we can take an image and, through a dictionary, a fixed dictionary or a designed dictionary, obtain a sparse representation of the signal x with respect to the dictionary D, where y is the sparse representation. So this is the more general model: the acquisition b is the result of A, the sampling matrix, multiplying the dictionary D, times the sparse representation y of the signal. And according to the theoretical results, random measurements can be used in this case as well. So the end result here is that we have a short vector equal to a fat matrix, the product of A with D, times the sparse, long vector y. So this is b, this here is AD, and this is y. We are given b, we are given AD, and we try to find y. So this is an inverse problem again, with an underdetermined system of equations, and throughout this week we have been looking at different ways to reconstruct sparse signals from an underdetermined system of equations. We can use the L0 norm or the L1 norm, and we can use the basis pursuit or the lasso formulation of the problem; so there is a plethora of techniques, like the ones we studied, for reconstructing the signal from its compressed acquisitions. We show here some experimental results of images reconstructed from compressed acquisitions. Here is the original image; it is actually the image that is reconstructed when the number of compressed acquisitions equals the number of pixels. It is very close to the original but not identical, because noise was also added to the compressed acquisitions. When the number of compressed acquisitions is half the number of pixels, here is the reconstruction; at 25% of the number of pixels, here is the reconstruction; and so on, all the way down to 5% of the number of pixels. Clearly, up to this level the reconstructions are very close to the original, and even the 5% reconstruction is surprisingly good, surprisingly close to the original.
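Putting the pieces together, here is a toy end-to-end sketch: random acquisitions b = ADy of a synthetic K-sparse representation y, recovered with orthogonal matching pursuit, one of the greedy L0-style methods from this week. All sizes, the random dictionary, and the coefficient values are illustrative choices of mine:

```python
import numpy as np

def omp(F, b, k):
    """Orthogonal matching pursuit: greedily build a k-sparse solution of F y ~= b."""
    residual, support = b.copy(), []
    coef = np.zeros(0)
    for _ in range(k):
        # pick the column most correlated with what is still unexplained
        support.append(int(np.argmax(np.abs(F.T @ residual))))
        coef, *_ = np.linalg.lstsq(F[:, support], b, rcond=None)
        residual = b - F[:, support] @ coef
    y = np.zeros(F.shape[1])
    y[support] = coef
    return y

rng = np.random.default_rng(7)
N, M, K = 256, 128, 3
D = np.linalg.qr(rng.standard_normal((N, N)))[0]   # stand-in sparsifying dictionary
A = rng.standard_normal((M, N)) / np.sqrt(M)       # random sampling matrix, M < N
y_true = np.zeros(N)
y_true[[10, 80, 200]] = [5.0, -4.0, 3.0]           # K-sparse representation
b = A @ (D @ y_true)                               # M compressed acquisitions
y_hat = omp(A @ D, b, K)                           # recover y from b and AD
```

With this many measurements relative to the sparsity, the greedy method typically finds the support exactly and the least-squares fit then recovers the coefficients.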
This, by the way, is the so-called Shepp-Logan phantom, which is utilized for studying and comparing reconstruction-from-projection algorithms. We show here the performance of two different algorithms. One uses the L1 norm for enforcing sparsity, so it is a lasso problem, while the other uses TV, total variation, as a regularizer; TV promotes piecewise smooth solutions. And we have two images: the Shepp-Logan phantom I showed you in the previous slide, and the second image, which I showed at the beginning of this week's lecture. We see that TV with the Shepp-Logan performs best, and L1 with the Shepp-Logan is the second best. So the Shepp-Logan is an easier image to reconstruct from projections, and here is the performance of TV and L1 with respect to the second image. For the image I showed in the previous slide with 5% acquisitions, the Shepp-Logan phantom, TV was used. Indeed the quality, the peak SNR, is 30 dB, which is, generally speaking, a decent-quality image. With 100% acquisitions, the peak SNR is approximately 55 dB, so it is high but not extremely high; in principle it should go to infinity if the reconstructed image were identical to the original one, but, as I mentioned, noise is added to the compressed acquisitions, and this is one reason perfect reconstruction is prevented. We were exposed this final week to some recent and hot developments in signal and image processing and machine learning. We saw how the simple notion of, and information about, the sparsity of a signal can change the way we solve certain problems. We learned how to solve problems which require that the signals are sparse, and we were also exposed to some of the interesting applications which resulted from such considerations. This week's material has close ties with the recovery material we covered during week six.
We talked quite a bit there about constrained optimization; the ideas and tools are very similar, but now the functions we optimize are convex but not differentiable. We also saw great connections with the material of weeks nine and ten, where we covered compression with the use of the DCT. It is great to see all these connections made. As a final note, it is hard to believe that it has been 12 weeks that we have been together. It has been a new experience for me, interacting with so many of you from all corners of the globe. Teaching this one class resulted in interactions with many more students than I have taught here at Northwestern over the last 30 years. It has been a rewarding experience for me, and I hope you share the same opinion. Best of luck putting together the information you learned here. I hope the material you learned will prove useful, and I am also certain that some of you will be the next generation of pioneers in the field. I am sure I will see some of you around.