0:00

We will review the reason why we can compress images,

namely the redundancy that is present in natural images.

And then we will look at the fundamental ingredients in an image coding system.

But first let's start with a very simple thought experiment that will convince you

that image compression is a necessity.

Consider an image of size 256 x 256, encoded at 8 bits per pixel.

The amount of storage that an image like that requires

is on the order of 500,000 bits.

Now, from a purely combinatorial point of view,

each bit in this image can be set independently to zero or to one.

So the total number of possible images of size 256 x 256

at 8bbp Is a staggering 2 to the power of 524,288.

To put that in perspective, we can change the base and

express this as 10 to the power of 157,000 and something.

Now, compare this to the number of atoms in the universe.

Which is estimated to be 10 to the power of 82.

This means that clearly, the subset of meaningful images of

size 256 times 256 must be very, very, very small.

Compared to the number of all possible images.

Let's take a different approach.

Assume that we can actually collect all the images that exist in the world.

Pictures, drawings and so on and so forth.

And suppose that we can list them in a massive encyclopedia of images.

Now that we have a list and a catalog of all possible images.

We can indicate each image in the encyclopedia just by giving the cardinal

number in the list.

Now, to put some numbers in this thought experiment

let's do this very simple back of the envelope calculation.

Current estimates say that on the internet we have about 50 billion images and

pictures that are floating around.

1:56

Let us assume to simplify matters that all of these images are encoded

at eight bits per pixel, and they have size 256 x 256.

Well, each of these images encoded in raw format requires in excess of 500,000 bits.

If we use the enumeration scheme from the encyclopedia of image.

Where we can identify each image by its index.

We just need to provide this number to specify an image.

And therefore the cost to encode an image,

is simply the log in base two of the number of images in the list.

So that's about 36 bits per image to encode all possible images in

the Internet today.

The downside of this encoding scheme is that it requires a lot of side

information.

In other words, I can send an image to you by sending you its index number.

But unless you have the full encyclopedia at your disposal,

you won't be able to see the image itself.

So on the one hand we have indication that there is an enormous redundancy in

the amount of space that we allocate to an image.

And on the other hand,

we have a sort of lower bound that tells us that the number of images that makes

sense in theory could be encoded with approximately 30 bits per image.

Neither approach gives us the solution to the problem.

So we need to analyze the physical properties of images in more detail.

The idea is to exploit the physical redundancy in images

to reduce the bit budget.

But, of course, we have to be careful and

define redundancy specifically in terms of the human visual system.

In order to allocate bits to features that matter,

we have to find out what these features are.

And thankfully researchers have worked a lot on this problem.

And ran lots of psychovisual experiments to find out the things

that matter the most.

With this knowledge at hand, we can go ahead and

define what is called a lossy compression scheme.

In other words, an image encoding system that will discard the information

that it deems irrelevant with respect to either the intelligibility or

the quality of an image.

So let's look at the key ingredients that will lead us to the JPEG

compression scheme.

The first Is compressing the image at block level.

We'll see this with an example in just a second.

The second ingredient is using a suitable transform to move

the blocks into a different domain.

So we do a change of basis that will make it easier for

us to decide which parts to keep and which parts to discard.

4:25

Smart quantization is the way we will go about discarding irrelevant parts.

By allocating very few or no bits at all to some of the transform

coefficients we will be able to reduce the bit rate selectively.

And finally,

entropy coding is a technique that we will borrow from information theory.

And it will allow us to compress even more the bit stream that we have obtained from

the previous smart quantization.

To understand the importance of block coding,

let's see what we can do if we decide to compress an image at pixel level.

The only thing we can do in that case is to reduce number of bits that we allocate

to each pixel.

This is equivalent to quantizing the image using fewer levels.

And in the limit, we cannot go below one bit per pixel.

So if we try and apply this to our centered image, we get something

that has completely distorted the content of the original image.

So pixel level quantization only takes us so

far before image quality is completely compromised.

And now by comparison,

let's look at a very simple compression strategy that operates at block level.

What we do here is we divide the image into blocks and

then we code the average value with eight bits.

We use three by three pixel blocks using eight bits for each block.

So we use a little bit less than one bit per pixel.

And the result is what you see here on the right.

If you look attentively at the details you might perceive some jagged edges around

for instance here.

5:52

These are the so-called block artifacts and

come from the fact that we have split the image into small blocks.

But nonetheless, the overall result is definitely much, much better than what

we would obtain using pixel level compression at the same bit rate.

The power of block level compression comes from the fact that in each block

we exploit the local correlation between neighboring pixels.

And at the same time, since blocks are independent,

we separate the coding for different parts of the image.

The blocks scheme we've seen just before is a very simple naive one,

we just take the average value of the pixel.

And the result is that we have noticeable block artifacts,

even though our blocks are very, very small, three by three.

Can we do better?

Well in order to do better we have to introduce a second ingredient which is

transform coding.

To give you an idea of how transform coding works.

Assume you have a simple one-dimensional discrete time signal.

And assume that you will encode each sample with R bits per sample.

So if the signal is like this one here,

storing the signal as is will require N times R bits.

You have N samples, R bits per sample.

7:06

it turns out that the DFT looks like this.

Well, what happened was that the signal was a sinusoid that corresponded to one

of the basic factors for the length of the signal.

But with this representation in theory instead of having NR bits to

code the signal.

We just need to code two DFT coefficients possibly together with their position.

And so the amount of bits that we spent to code the signal is definitely

less than NR.

So the lesson that we get from this simple example is that we would like a transform

that when transforming an image block captures the important features of

the block in just a few coefficients.

So that we can just encode those coefficients and discard all the rest.

Also, we would like to transform to be efficient to compute

because we're interested in an efficient compression algorithm.

The answer to these two requirements is a transform

called the discrete cosine transform, or DCT for short.

Mathematically the DCT can be expressed as such.

It is very similar to the DFT.

Except that the basis functions are based on

cosines rather than on complex exponentials.

The advantage of this formulation is that the transform is purely a real

transform when applied to real signals such as images.

Instead of studying the structure of this double summation,

we can just exploit our intuition about basis expansions.

And realize that each coefficient of the DCT is just the inner product,

namely a measure of similarity between the image and each of the basis functions.

Now remember we're operating at block level.

So if for instance we're going to split our image,

as is customary in JPG, into eight by eight pixel blocks.

It means that N here is equal to 8.

And the same in the other direction.

And so we will have 64 basis functions for the space of eight by eight images.

We can actually plot these basis functions like so, and

here we have increasing values of the index K1 and of the index K2.

And we can see that by computing the DCT, we're actually correlating, namely

measuring the similarity between an image block and each one of these patterns here.

Some patterns will capture an edge like behavior.

Some patterns will capture a textural like behavior.

But in general, there will be a pattern that will be able to capture most

of the information contained in the small block.

And this is really the philosophy behind transform coding in JPEG.

Now, when it comes to smart quantization.

What we need to do is to change a little bit the definition of quantization that we

have seen in module 7.

The idea is that we want to be able to discard as many values

of the transform as possible.

And by discarding them, we mean that we set them to zero.

And therefore, we don't need to specify their amplitude.

This is achieved by using the so-called deadzone quantizers that we will

see in just a second.

The second part of smart quantization is being able to change the quantization step

according to the importance of the coefficients that we are quantizing.

So key coefficients will require a fine grained resolution.

Therefore more bits will be allocated to them,

whereas less important coefficients can be quantized rather coarsely.

So let's go back to the quantization model that we saw in module seven.

Suppose that our input is bounded between -2 and 2,

and that we're using two bits to quantize the input.

Then we're dividing the input range into four intervals,

because we have two bits per input sample.

And we are associating a representative value to each interval,

which happens to be the midpoint.

Mathematically in this case we can say that the quantized

value is simply the floor of the input value plus 0.5.

You can see however that in this model, there is no zero value in the output.

If the input is even slightly positive, it will get associated to 0.5,

and if it's slightly negative to -0.5, but there is no zero.

A deadzone quantizer on the other hand uses intervals that have the same width as

the standard quantizer.

But centers an interval around 0.

So all values in this case, from 0.5 to -0.5 will be mapped to 0.

So small values will be quanitzed to 0.

And then the rest proceeds with the usual staircase, so

everything is shifted by 0.5.

From 0.5 to 1.5, we will have a representative value of one.

Mathematically, this is equivalent to saying that the quantization scheme

simply rounds the input value to the nearest integer.

You can notice now that we have an asymmetric characteristic for

the quantizer.

So in a sense, we're wasting half a bit, we can add the level to the right side or

to the left side if needs be.

But the advantage of having a deadzone largely offsets the loss of half a bit.

And finally, let's consider the last ingredient in the success of JPEG

which is entropy coding.

The idea behind entropy coding is to minimize the effort to encode a certain

amount of information.

12:14

And the way we do that is by associating short symbols

to values that are frequently used.

Now, this might sound familiar to you and the reason is, because it is familiar.

The Morse code that was invented in 1836 is a typical example of a coding

scheme where we use very short symbols for frequently used values.

So, for instance the most common letter in the English alphabet is the letter e.

And therefore the symbol that we use to encode e is the shortest possible symbol

in the Morse alphabet, the dot.

And then we can find for

instance which is the second most common letter in the English alphabet.

Turns out to be I which gets two dots and so on, so forth.

So Morse code was invented to minimize the time that it would take to send

a message over a telegraph line.

But we can use exactly the same principle to try and

minimize the amount of memory that we must use to encode a bit stream.

We will see this in more detail in the next module.