0:05

MP3 is shorthand for MPEG Audio Layer 3,

and MPEG is shorthand for the Moving Picture Experts Group.

And what this all means is that at one point in the 90s a lot of people,

a lot of experts got together and

agreed on a set of standards for video and audio compression and encoding.

0:27

MP3 turned out to be the most widely used digital format for

audio storage, streaming and playback.

And today a portable music device,

thanks to the MP3 encoding, can store up to 30,000 songs which

really means you can carry your entire music collection with you everywhere.

So in this video we will look at the technology behind the success of MP3 and

we will describe in detail how the MP3 encoder works.

You will see how all the tools that you have learned in our DSP

class, from the Fourier transform to filtering,

from sampling to quantization, all come together in this application.

So how does the encoding and decoding process take place?

Suppose you start with a discrete-time sound signal, x[n].

This is processed by the encoder and converted into a binary string.

The decoder will take that binary string and

convert it back into the sound signal y[n].

The goal of the encoding and

decoding chain is to reduce the memory required to store the sound waveform.

And the real achievement of MP3 is its ability to greatly reduce

the amount of data needed to encode a file.

And it does this at a very reasonable tradeoff in terms of sound quality degradation.

The data reduction is determined by looking at the amount of memory

that is necessary to store the output of the encoder and

by comparing this quantity to the amount of memory that we would

need to store the original signal in an uncompressed format.

And remember that an uncoded raw audio file will require quite a bit of storage.

For instance, if we sample at 48 kilohertz which is the DVD standard,

and we use 16 bits per sample,

we will need about 11.5 megabytes to store a single minute of audio in stereo.

On the other hand, a high quality MP3 will require just 1.5

megabytes, which represents almost an order of magnitude in data reduction.
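The arithmetic behind these figures can be checked with a short sketch. It assumes decimal megabytes (one million bytes) and takes the 1.5 megabytes per minute quoted in the lecture at face value; the function name is just for illustration.

```python
def raw_audio_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Size in bytes of uncompressed PCM audio."""
    return sample_rate_hz * (bits_per_sample // 8) * channels * seconds

# One minute of 48 kHz, 16-bit stereo audio
raw = raw_audio_bytes(48_000, 16, 2, 60)

# The MP3 figure quoted in the lecture: 1.5 megabytes per minute
mp3 = 1.5e6

bitrate_kbps = mp3 * 8 / 60 / 1000   # bit rate implied by 1.5 MB per minute
ratio = raw / mp3                    # data reduction factor

print(raw, bitrate_kbps, round(ratio, 2))
```

Under these assumptions the raw minute takes 11.52 million bytes, the MP3 corresponds to 200 kilobits per second, and the reduction factor is a bit less than 8.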

2:32

To achieve this performance, the coding has to be done in a very clever way.

And one of the key ingredients in MP3 is a model of the human auditory system.

So MP3 does not attempt to preserve the original waveform, but

rather it focuses on coding the elements of the waveform

that are most important to our way of listening to music and hearing sounds.

2:57

In particular the distortion introduced by the encoder,

the loss of information introduced by the encoding mechanism is

placed in parts of the spectrum of the original signal that we cannot hear.

We will see that in more detail in just a minute.

3:13

As we said, the origins of MP3 date back to the 90s,

when the Moving Picture Experts Group, MPEG for short,

was set up by the International Organization for Standardization to develop algorithms and

standards for audio and video compression.

3:35

MP3 had its origins in a set of compression algorithms that had been

developed in the 80s by the Fraunhofer Institute in Germany.

We see a photo of the team here in this picture.

The MP3 standard was quickly embraced by the industry, and

this widespread acceptance is what ultimately decreed its success.

3:58

Now let's try to understand how MP3 works using this simple block diagram.

Your input signal, x[n], enters a bank of subband filters.

There are 32 parallel filters that subdivide the input signal

into 32 independent channels that span the full spectral range of the input.

Each channel is then quantized independently using a very clever

method, and the quantized samples are then formatted and

encoded in a continuous bitstream.

4:35

The quantization scheme is clever because the number of bits allocated to each

subband is dependent on the perceptual importance of that

subband with respect to the overall quality of the audio waveform.

In other words, subbands that are deemed by the psychoacoustic model not to be

important, or that are difficult to perceive, are allocated very few bits or none at all,

whereas the most perceptually relevant subbands

are allocated the bulk of the entire bit budget.

The reason why we can safely allocate different amounts

of bits to the different subbands is to be found in the so-called masking effect

of the human auditory system.

Suppose you have a sound with a strong component, as in this picture here.

The blue line represents the spectrum of the sound.

And here, with the red dot, we indicate the strong sinusoidal component.

When your ear listens to a sound like this, a masking effect takes

place whereby frequency components in the vicinity of the dominant peak

are not heard unless they are louder than a given masking threshold.

In this figure, for example,

the masking threshold is indicated by the red dotted line, and what it indicates is

that anything in the spectrum that falls below the red line will not be heard and

therefore can be removed without any loss of perceptual quality.
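As a toy illustration of this idea, the sketch below discards every spectral component that falls under a masking threshold around one dominant peak. The triangular threshold shape, its 10 dB offset and its 8 dB-per-bin slope are made-up values chosen for the example; real thresholds come from listening tests and are not reproduced here.

```python
def masking_threshold_db(masker_bin, masker_level_db, k,
                         offset_db=10.0, slope_db_per_bin=8.0):
    """Hypothetical triangular threshold: highest next to the masker,
    dropping linearly (in dB) with the distance in bins."""
    return masker_level_db - offset_db - slope_db_per_bin * abs(k - masker_bin)

def audible_components(spectrum_db, masker_bin):
    """Indices of the components that rise above the masking threshold."""
    masker_level = spectrum_db[masker_bin]
    kept = []
    for k, level in enumerate(spectrum_db):
        threshold = masking_threshold_db(masker_bin, masker_level, k)
        if k == masker_bin or level > threshold:
            kept.append(k)
    return kept

# Levels in dB for eight frequency bins; bin 3 is the dominant peak
spectrum_db = [20, 35, 50, 80, 60, 45, 55, 30]
kept = audible_components(spectrum_db, masker_bin=3)
print(kept)
```

With these numbers only the peak itself and the component in bin 6, which is far enough from the peak to escape the threshold, survive; everything else could be dropped without being missed.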

The masking effect is something that we experience every day.

Imagine being in a perfectly quiet room like in your home at night,

you can even hear your wristwatch ticking.

6:11

But of course, you wouldn't be able to hear that noise in normal conditions

during the day when a lot of other auditory stimuli are reaching your ears.

Although if you were to record the audio environment and analyze its spectrum,

you would see that it still contains the information about your wristwatch ticking.

The shape of the masking threshold is a function of the loudness and

the frequency of the dominant tone, and it has been determined experimentally

by running a lot of listening tests with human subjects.

Masking in the human ear takes place within critical bands, and critical bands

are portions of the spectrum that are treated by the ear as a single unit.

Everything that happens within a critical band cannot be

further resolved by the ear.

So two different frequencies taking place in the same critical band

are perceived as a single tone.

There are approximately 24 critical bands in the human ear.

And here is a picture of their distribution in frequency.

As you can see, they get wider as we go up in frequency.

They follow a roughly logarithmic scale, which means that the resolution power of the ear

is stronger at low frequencies, whereas at high frequencies we're less discriminating.

And therefore, when we quantize things across critical bands,

we can probably fit more noise in the high frequencies than in the low frequencies.
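To get a feel for how critical bands widen with frequency, here is a small sketch based on the Zwicker-Terhardt approximation of the Bark scale, a commonly used closed-form fit that is not part of this lecture. The helper `band_width_hz` is an illustrative name, and it estimates the width of one band numerically from the slope of the curve.

```python
import math

def hz_to_bark(f_hz):
    """Zwicker-Terhardt approximation of the critical-band (Bark) scale."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def band_width_hz(f_hz, step_hz=1.0):
    """Rough width in Hz of one critical band around f_hz,
    taken as the inverse of the local slope in Bark per Hz."""
    slope = (hz_to_bark(f_hz + step_hz) - hz_to_bark(f_hz)) / step_hz
    return 1.0 / slope

for f in (100, 1000, 5000, 15000):
    print(f, round(hz_to_bark(f), 2), round(band_width_hz(f)))
```

With this approximation the audible range up to 20 kHz spans a little over 24 Bark, and one band is on the order of a hundred hertz wide at the low end but a couple of thousand hertz wide near 10 kHz.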

In the end, the purpose of the psychoacoustic model is to compute

the minimum number of bits that we need to use to quantize each of the 32 subband

filter outputs, so that the perceptual distortion is as small as possible.

In the end we're given a non-uniform bit allocation

which will allocate fewer bits to the bands where the masking is strongest.

8:00

Interestingly enough, the specifications of the psychoacoustic model are not part

of the MP3 standard, which means that manufacturers of MP3 encoders

can compete with better and better versions of their psycho-acoustic model.

In the end, the number of bits used for

each sub-band is sent along with the quantized data to the decoder, so

it doesn't really matter how this bit distribution has been generated.

From a technical point of view, as you can imagine there are a lot of fine details in

the inner workings of the psycho-acoustic model and the bit allocation procedure.

And we will not have time to examine all of this in this presentation.

But we can roughly sum up what happens inside a psychoacoustic model like so.

First of all, remember that all processing is performed on successive windows

of a given length, so the input signal comes in, and the stream of

input samples is cut into chunks of a given length, say 1024 samples.

9:11

For each subband, we try to distinguish between tonal and non-tonal components,

that is, components that have a strong sinusoidal shape and noise-like components.

We have looked at masking for tonal components, but

a similar type of masking takes place for non-tonal components.

And we will have to take that into account as well.

9:30

The individual masking effect for tonal and non-tonal components is computed for

each critical band.

And then these results are summed together

to obtain a global masking curve for the audio frame that we're analyzing.

This masking curve is mapped onto the 32 subbands, and the number of bits that

we will use for each subband is computed as a function of the signal-to-mask ratio,

that is, the power of the signal versus the masking power for each critical band.
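One simple way to turn signal-to-mask ratios into a bit allocation is a greedy loop, sketched below. This is an illustration under stated assumptions, not the procedure of any particular encoder: it assumes that each extra bit lowers the quantization noise by about 6.02 dB, and it keeps giving one bit to the subband whose noise is furthest above its mask until the budget runs out or all the noise is masked. The function name and the example values are hypothetical.

```python
def allocate_bits(smr_db, total_bits, max_bits=15, db_per_bit=6.02):
    """Greedy allocation: one bit at a time to the subband with the
    largest noise-to-mask ratio (SMR minus the SNR gained so far)."""
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        nmr = [smr - db_per_bit * b if b < max_bits else float("-inf")
               for smr, b in zip(smr_db, bits)]
        worst = max(range(len(nmr)), key=nmr.__getitem__)
        if nmr[worst] <= 0:
            break  # quantization noise is already below the mask everywhere
        bits[worst] += 1
    return bits

# Signal-to-mask ratios in dB for four subbands; the last one is fully masked
smr_db = [30.0, 18.0, 5.0, -4.0]
bits = allocate_bits(smr_db, total_bits=8)
print(bits)
```

With a budget of 8 bits the two most audible subbands take everything, and the subband whose signal sits below the mask receives no bits regardless of the budget.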

10:07

As we said, the input is split across a filter bank that

contains 32 filters isolating different parts of the spectrum.

These filters are implemented as 512-tap FIR filters, and they're

followed by a 32-times downsampler to produce the individual subband samples.

The filter prototype is a simple lowpass with a cutoff frequency of pi

over 64 and a total bandwidth of pi over 32.

The different subbands are obtained by modulating the base filter with a cosine

at odd multiples of pi over 64, and the resulting filter bank looks like this.

We're showing the positive half of the frequency axis.

This would be the first lowpass filter, with index zero.

This is the second one, the third, the fourth, and so

on covering the entire spectrum.
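The construction of the bank can be sketched in a few lines. The prototype used here is a Hann-windowed sinc with cutoff pi/64, which is only a stand-in: the actual prototype coefficients are tabulated in the standard and are not reproduced here. The modulation cos((pi/64)(2i + 1)(n - 16)) is the one used later in the lecture; the function names are illustrative.

```python
import cmath
import math

N_TAPS, N_BANDS = 512, 32

def prototype_lowpass():
    """Hann-windowed sinc with cutoff pi/64 (stand-in for the standard's table)."""
    center = (N_TAPS - 1) / 2
    h = []
    for n in range(N_TAPS):
        t = n - center  # never zero because N_TAPS is even
        sinc = math.sin(math.pi / 64 * t) / (math.pi * t)
        hann = 0.5 - 0.5 * math.cos(2 * math.pi * n / (N_TAPS - 1))
        h.append(sinc * hann)
    return h

def subband_filter(h, i):
    """Impulse response of branch i: prototype times a cosine at (2i+1)*pi/64."""
    return [h[n] * math.cos(math.pi / 64 * (2 * i + 1) * (n - 16))
            for n in range(N_TAPS)]

def magnitude(g, omega):
    """Magnitude of the frequency response of g at frequency omega."""
    return abs(sum(g[n] * cmath.exp(-1j * omega * n) for n in range(len(g))))

def strongest_band(g):
    """Index of the band center (2j+1)*pi/64 where g responds most strongly."""
    centers = [(2 * j + 1) * math.pi / 64 for j in range(N_BANDS)]
    return max(range(N_BANDS), key=lambda j: magnitude(g, centers[j]))

h = prototype_lowpass()
print([strongest_band(subband_filter(h, i)) for i in (0, 5, 31)])
```

Each modulated filter responds most strongly at its own band center, which is what tiles the positive frequency axis into 32 adjacent bands of width pi/32.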

11:04

Now let's go back to the implementation of the filter bank.

As you can see here, from this block diagram, each branch in the filter bank

comprises an FIR filter of length 512 and a 32-times downsampler here.

What this means of course is that 31 out of 32 output samples of this filter

are discarded, and so this is of course a very wasteful implementation.

Let's try and make this a little bit more efficient;

this is actually explained in the MP3 standard. We start with the equation

that expresses the output of subband number i as

a convolution of the impulse response of the filter for that branch with the input.

And here you see that the downsampling factor

translates to a factor of 32 in front of the input index.

We can now replace the expression for the impulse response of the filter with

the prototype impulse response times the modulating factor

that brings the filter to the proper position in the frequency band.

And then here we're going to apply a little trick,

we're going to express the index k as the sum of two indices.

Namely, we're going to say that k is equal to 64 times an index p, plus q,

12:23

where q ranges from 0 to 63 and

p ranges from 0 to 7, okay?

So with this split of the summation, we can write the previous line as a double

summation, for p that goes from 0 to 7 and for q that goes from 0 to

63, of the same terms: the modulation term that we have seen before,

the prototype impulse response, and the input,

where, again, we have made the substitution k = 64p + q.

Okay, so with this trick,

we can actually simplify the first term of this double summation.

Consider the cosine term.

We can write it as the cosine of (pi/64)

times (2i + 1) times 64p, plus some other term,

let's call it f(i, q), and we don't really care about that.

Now here, the 64 cancels out and we're left with the cosine of 2ip pi

13:31

+ p pi + this term.

And now, well, the first term is a multiple of 2 pi, so it doesn't influence the angle.

And here we have a multiple of pi, and we know that the cosine of

pi + alpha is equal to minus the cosine of alpha. And so

in the end what we can do is simplify this cosine as

the cosine of (pi/64) times (2i + 1) times (q - 16),

and add a term, minus one to the power of p,

that we can move over to the second summation.

And we have a simplified, quote unquote, expression that looks like so:

an outside sum here that only involves the cosine modulation, and an inner

sum here which is a pre-subsampled implementation of the filtering operation.
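Written out, the steps described above look as follows. This is a reconstruction from the spoken description, not a copy of the slides; the symbols s_i[n] for the output of subband i and h[k] for the prototype impulse response are naming assumptions.

$$
\begin{aligned}
s_i[n] &= \sum_{k=0}^{511} h_i[k]\, x[32n - k],
\qquad h_i[k] = h[k]\,\cos\!\Big(\frac{\pi}{64}(2i+1)(k-16)\Big) \\
&= \sum_{p=0}^{7}\sum_{q=0}^{63}
\cos\!\Big(\frac{\pi}{64}(2i+1)(64p + q - 16)\Big)\,
h[64p+q]\, x[32n - 64p - q] \\
\cos\!\Big(\frac{\pi}{64}(2i+1)(64p + q - 16)\Big)
&= \cos\!\Big(2\pi i p + \pi p + \frac{\pi}{64}(2i+1)(q-16)\Big)
= (-1)^p \cos\!\Big(\frac{\pi}{64}(2i+1)(q-16)\Big) \\
s_i[n] &= \sum_{q=0}^{63} \cos\!\Big(\frac{\pi}{64}(2i+1)(q-16)\Big)
\sum_{p=0}^{7} (-1)^p\, h[64p+q]\, x[32n - 64p - q]
\end{aligned}
$$

The inner sum does not depend on i, so it is computed once per block of 32 input samples and shared by all 32 subbands.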

14:26

If we work out the indices and

convert this to an algorithmic procedure, this is what we need to do.

We will use a 512-tap circular input buffer, and we will shift in

at each step 32 new input audio samples, starting from the newest.

So at any time the circular buffer is holding 512 input

samples in time-reversed order.

Then we take a new 512 point buffer and we fill it sample by sample with the product

between the prototype impulse response and the content of the circular buffer.

15:04

Next, we compute this intermediate quantity here, which is

the sum of the contents of this new buffer taken 64 samples apart.

We can do that for 64 different points, and if you do the math,

there are 8 points that we have summed together for each q index.

Finally, each subband output is given by this sum here, where we're taking

the intermediate quantity c[q] that we computed before, and we modulate it

with the cosines at the frequencies that we have defined in the beginning.
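The procedure can be sketched and checked against the wasteful filter-then-discard implementation. Several things here are assumptions made for the sketch: the prototype is a Hann-windowed sinc rather than the table from the standard, the sign (-1)^p is applied explicitly in the partial sums rather than folded into the window, and the buffer is a plain list that gets shifted instead of a true circular buffer.

```python
import math

N_TAPS, N_BANDS = 512, 32

def prototype_lowpass():
    """Hann-windowed sinc, cutoff pi/64: a stand-in for the standard's table."""
    center = (N_TAPS - 1) / 2
    h = []
    for n in range(N_TAPS):
        t = n - center  # never zero because N_TAPS is even
        hann = 0.5 - 0.5 * math.cos(2 * math.pi * n / (N_TAPS - 1))
        h.append(math.sin(math.pi / 64 * t) / (math.pi * t) * hann)
    return h

def analysis_step(buffer, new_samples, h):
    """Shift 32 new samples in and return the 32 subband outputs."""
    # buffer[k] holds the sample k steps in the past (newest sample first)
    buffer[:] = list(reversed(new_samples)) + buffer[:-N_BANDS]
    # window the buffer with the prototype impulse response
    z = [h[k] * buffer[k] for k in range(N_TAPS)]
    # 64 partial sums of 8 terms each, taken 64 samples apart
    c = [sum((-1) ** p * z[64 * p + q] for p in range(8)) for q in range(64)]
    # cosine modulation: one output sample per subband
    return [sum(math.cos(math.pi / 64 * (2 * i + 1) * (q - 16)) * c[q]
                for q in range(64)) for i in range(N_BANDS)]

def direct_output(x, m, h, i):
    """Reference: full 512-tap filter of subband i evaluated at time m."""
    return sum(h[k] * math.cos(math.pi / 64 * (2 * i + 1) * (k - 16)) * x[m - k]
               for k in range(N_TAPS) if 0 <= m - k < len(x))

h = prototype_lowpass()
x = [math.sin(0.05 * n) + 0.5 * math.sin(1.3 * n) for n in range(1024)]
buffer = [0.0] * N_TAPS
frames = [analysis_step(buffer, x[s:s + N_BANDS], h)
          for s in range(0, len(x), N_BANDS)]

# frames[j][i] is the output of subband i once sample 32*j + 31 has arrived
error = max(abs(frames[j][i] - direct_output(x, 32 * j + 31, h, i))
            for j in (0, 15, 31) for i in (0, 7, 31))
print(error)
```

Both routes give the same subband samples up to rounding error, but the fast version needs roughly 512 multiplications for the windowing plus 32 x 64 for the modulation per block, instead of 32 x 512 for each retained output instant.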

15:39

And finally quantization,

this is where the great bit rate savings are going to be achieved.

MP3 uses uniform quantization of subband samples.

And the number of bits per sample in each subband

is determined by the psychoacoustic model as we explained before.

We also said before that MP3 works on successive audio frames,

a frame being a window of input samples that is processed independently.

There are 36 samples per band and per frame in the MP3 standard.

And so,

since all 36 samples are going to be quantized by the same quantizer,

a rescaling is needed so that we're using the full range of the quantizer.

Remember how uniform quantization works: a quantizer maps

an input interval to a set of quantization levels.

16:32

Of course you have to make sure that the range of your input signal matches

the range of the quantizer.

For instance, this quantizer expects the input to range from -1 to 1. But

if your actual input only lives in this small sub-interval,

you will not be able to make use of the full quantization range.
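A quick sketch makes the problem concrete. It assumes a plain uniform quantizer over the interval from -1 to 1 with 2^b equal cells, which is a generic textbook quantizer and not the exact one of the standard, and it counts how many levels a low-amplitude signal actually touches before and after rescaling.

```python
def quantize(x, bits):
    """Uniform quantizer on [-1, 1): index of the cell that contains x."""
    levels = 2 ** bits
    index = int((x + 1.0) / 2.0 * levels)
    return min(max(index, 0), levels - 1)  # clamp out-of-range inputs

def dequantize(index, bits):
    """Reconstruction value: the midpoint of the cell."""
    levels = 2 ** bits
    return -1.0 + (index + 0.5) * 2.0 / levels

# A signal that only lives in a small part of the quantizer range
samples = [-0.1, -0.06, -0.02, 0.03, 0.07, 0.1]
used_raw = {quantize(s, 4) for s in samples}

# The same signal rescaled by its largest magnitude
peak = max(abs(s) for s in samples)
used_scaled = {quantize(s / peak, 4) for s in samples}

print(len(used_raw), len(used_scaled))
```

With 4 bits, the unscaled samples fall into only 2 of the 16 cells, so almost all of the resolution is wasted; after rescaling, each of the six samples lands in a cell of its own.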

So rescaling would normally imply a perfect renormalization of the 36 samples,

obtained by dividing the samples by the largest sample in magnitude.

Of course in order for

the decoder to then reconstruct the actual levels of the input,

we would have to send this normalization factor alongside the quantized data.

But this would require a lot of side information:

we would use 16 or 32 bits to encode the normalization factor. Instead,

the MPEG standard defines 16 predefined scale factors;

we choose the one that best matches the actual range of the input and

only use four bits to communicate this range to the decoder,

thanks to the fact that these predefined levels are set in stone.
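The selection step can be sketched as a table lookup. The table below, sixteen successive powers of one half, is purely hypothetical and is not the table defined in the standard; the point is only that encoder and decoder share the same fixed table, so that a 4-bit index is all the side information needed.

```python
import math

# Hypothetical table of 16 scale factors (NOT the values from the standard)
SCALE_FACTORS = [2.0 ** -k for k in range(16)]

def choose_scale_factor(samples):
    """Index of the smallest table entry that still covers the peak magnitude.
    If the peak exceeds the largest entry, index 0 is returned and the
    normalized samples would exceed the quantizer range."""
    peak = max(abs(s) for s in samples)
    index = 0
    for k, factor in enumerate(SCALE_FACTORS):  # table is in decreasing order
        if factor >= peak:
            index = k
    return index

# One frame of 36 subband samples with a small amplitude
frame = [0.1 * math.sin(0.4 * n) for n in range(36)]

index = choose_scale_factor(frame)
normalized = [s / SCALE_FACTORS[index] for s in frame]
print(index, max(abs(v) for v in normalized))
```

Here the encoder would transmit the index 3, and the decoder recovers the original levels by multiplying the dequantized samples by the same table entry. The normalization is no longer perfect (the peak ends up near 0.8 rather than 1), which is the price paid for the small side information.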

17:39

Finally, the actual quantization is performed according to this formula where

b is the number of bits as provided by the psychoacoustic model, and

Qa and Qb, which are functions of the number of bits,

are parameters that are specified inside the MP3 standard.

Finally, let's listen to some examples.

We all know that MP3 works very well, so

what we want to concentrate on here is the importance of the variable

bit allocation across the sub-bands as performed by the psychoacoustic model.

So for a fixed bit budget we could choose to allocate the same

number of bits to all sub-bands.

This would be uniform bit allocation or we could use a psychoacoustic model and

allocate the bits smartly across sub-bands.

And so here are the examples starting with the original signal

[MUSIC]

Now let's listen to the same signal encoded with uniform bit allocation

[MUSIC]

And finally this is the result of a full-fledged MP3 implementation

with psychoacoustically based bit allocation.

[MUSIC]

Of course, in both the uniform and non-uniform bit allocation encoding schemes,

the target bit rate was very, very low in order to exacerbate the defects

of quantization.

But the principle holds for all bit rates.