2:11
Then p star maximizes H(p) over all probability
distributions p on S, subject to the constraints in (1).
Note that p star of x, because of the exponential form,
is always strictly positive,
and so the support of p star is equal to S.
If we let q_i equal e to the power minus lambda_i, then we can
write p star of x as e to the power minus lambda_0, times e to the
power minus lambda_1 times r_1(x), all the way to e to the power minus lambda_m
times r_m(x),
which is equal to q_0 times q_1
to the power r_1(x), all the way to q_m to the power r_m(x).
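As a quick numerical sketch of this parameterization (an illustration added here, not part of the lecture): assuming a toy alphabet S = {0, 1, 2, 3}, a single constraint function r_1(x) = x, and an assumed constraint value a_1 = 1.2, the following snippet solves for q_1 so that the mean constraint holds; normalization then plays the role of q_0.

```python
# Illustration only: the alphabet, the constraint r_1(x) = x, and a_1 = 1.2 are assumptions.
import numpy as np
from scipy.optimize import brentq

S = np.arange(4)             # toy support {0, 1, 2, 3}
target_mean = 1.2            # assumed constraint value a_1

def mean_of(q1):
    w = q1 ** S              # unnormalized p*(x) = q_1^x
    p = w / w.sum()          # normalization fixes q_0 = e^{-lambda_0}
    return p @ S

q1 = brentq(lambda q: mean_of(q) - target_mean, 1e-6, 1e6)
w = q1 ** S
p_star = w / w.sum()
print(p_star, p_star @ S)    # maximum-entropy pmf and its mean (approximately 1.2)
```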
7:52
Then p star maximizes the entropy of p over all probability
distributions p defined on S, subject to the constraints
summation over x in S_p of p(x) times r_i(x) equals summation
over x in S of p star of x times r_i(x), for all i between 1 and m.
The proof goes as follows.
Let summation over x in S of p star of x times r_i(x) be a_i for all i.
Then, by construction, the parameters lambda_0, lambda_1 up to lambda_m
in p star of x are such that p star satisfies the constraints
summation over x in S_p of p(x) times r_i(x) equals a_i for all i.
Then the corollary is implied by theorem 2.50.
Example 2.52 is an illustration of an application of theorem 2.50.
Let S be finite and let the set of constraints be empty.
Then p star of x is equal to e to
the power of minus lambda_0 which does not depend on x.
Therefore, p star is simply the uniform distribution over S, that
is p star of x is equal to the reciprocal of the size of S for all x in S.
This is consistent with our previous result, that over a
finite alphabet, the entropy is maximized by the uniform distribution.
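As a quick check of this statement (an illustration, not from the lecture), the entropy of the uniform pmf on a 4-letter alphabet is log2 of 4 = 2 bits, while any other pmf on the same alphabet has strictly smaller entropy; the alphabet size and the second pmf below are arbitrary choices.

```python
# Illustration only: the alphabet size and the second pmf are arbitrary choices.
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log2(p[p > 0]))   # entropy in bits, with 0 log 0 = 0

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits = log2 |S|
print(entropy_bits([0.4, 0.3, 0.2, 0.1]))       # about 1.85 bits, strictly smaller
```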
The next example is a little bit more elaborate.
Let S be the set of nonnegative integers 0, 1, 2, and so on,
and let the set of constraints consist of the single constraint
summation over x of p(x) times x equals a,
where a is greater than or equal to 0.
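The excerpt moves on before giving the answer, but by theorem 2.50 the maximizing distribution has the form p star of x equals e to the power minus lambda_0 times e to the power minus lambda_1 x, which is a geometric distribution; matching the mean constraint gives p star of x equals 1 over (a+1) times (a over (a+1)) to the power x. A quick numerical check (the value a = 2.5 and the truncation of the infinite sum are choices made here for illustration):

```python
# Illustration only: a = 2.5 is an assumed value of the mean constraint.
import numpy as np

a = 2.5
x = np.arange(2000)                              # truncate the infinite support for the check
p_star = (1.0 / (a + 1.0)) * (a / (a + 1.0)) ** x
print(p_star.sum())                              # approximately 1, so p* is a valid pmf
print(p_star @ x)                                # approximately 2.5, the mean constraint
```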
13:55
We consider maximizing the differential entropy of f subject to the constraint
that the integral of x square f(x) dx, that is,
the expectation of X square, equals kappa.
Then by theorem 10.41, f star of x has the form a times e to the power
minus b x square, which is the Gaussian distribution with 0 mean.
In order to satisfy the second moment constraint, the only choices are a equals
1 over square root 2 pi kappa, and b equals 1 over 2 kappa.
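As a numerical sanity check (an illustration, not from the lecture; kappa = 3 is an arbitrary choice), these values of a and b indeed give a density that integrates to 1 and has second moment kappa.

```python
# Illustration only: kappa is an assumed value of the second-moment constraint.
import numpy as np
from scipy.integrate import quad

kappa = 3.0
a = 1.0 / np.sqrt(2 * np.pi * kappa)
b = 1.0 / (2 * kappa)
f_star = lambda x: a * np.exp(-b * x * x)

print(quad(f_star, -np.inf, np.inf)[0])                         # approximately 1 (normalization)
print(quad(lambda x: x * x * f_star(x), -np.inf, np.inf)[0])    # approximately kappa
```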
Next we are going to illustrate an application of corollary 10.42.
We start with the pdf of a Gaussian
random variable with mean 0 and variance sigma square,
and call this density function f star of x, which is equal to 1 over square
root of 2 pi sigma square times e to the power minus x square over 2 sigma square.
The question is: of what entropy maximization problem is f star the solution?
15:04
To answer this question, we write f star of x equals e to the
power minus lambda_0, times e to the power minus lambda_1 times x square,
where lambda_0 is equal to one half log 2 pi sigma
square, and lambda_1 is equal to 1 over 2 sigma square.
Then according to Corollary 10.42, f star maximizes the differential entropy
over all density functions subject to the second moment being equal to sigma square.
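As a small check of this rewriting (an illustration, not from the lecture; sigma square = 2 is an arbitrary choice), e to the power minus lambda_0 minus lambda_1 x square reproduces the N(0, sigma square) density exactly.

```python
# Illustration only: sigma^2 = 2 is an assumed variance.
import numpy as np
from scipy.stats import norm

sigma2 = 2.0
lam0 = 0.5 * np.log(2 * np.pi * sigma2)
lam1 = 1.0 / (2 * sigma2)

x = np.linspace(-3, 3, 7)
print(np.exp(-lam0 - lam1 * x**2))                  # e^{-lambda_0 - lambda_1 x^2}
print(norm.pdf(x, loc=0, scale=np.sqrt(sigma2)))    # N(0, sigma^2) density: same values
```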
The next theorem says that, for a continuous random variable X with mean
mu and variance sigma square, the
differential entropy is upper bounded by one half
log 2 pi e sigma square, with equality if and only if
X is a Gaussian random variable with mean mu and variance sigma square.
The proof goes as follows.
Let X prime equal X minus mu.
16:36
Then by theorem 10.43, the differential entropy of X
prime is less than or equal to one half log 2 pi e sigma square,
because the second moment of X prime is equal to sigma square.
Since differential entropy is invariant under translation, the differential
entropy of X is equal to that of X prime, and so it satisfies the same upper bound.
Equality holds if and only if X prime
is a Gaussian random variable with mean 0 and
variance sigma square, or equivalently, X is a Gaussian
random variable with mean mu and variance sigma square.
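As a numerical illustration of this theorem (added here, not part of the lecture; the common variance and the Laplace comparison density are choices made for illustration), the Gaussian attains the bound one half log 2 pi e sigma square, while a Laplace density with the same variance has strictly smaller differential entropy.

```python
# Illustration only: sigma^2 = 4 is assumed; the Laplace density is an arbitrary competitor.
import numpy as np

sigma2 = 4.0
h_gaussian = 0.5 * np.log(2 * np.pi * np.e * sigma2)   # differential entropy of N(mu, sigma^2), in nats

b = np.sqrt(sigma2 / 2)                                # Laplace scale parameter giving variance sigma^2
h_laplace = 1 + np.log(2 * b)                          # closed-form Laplace differential entropy, in nats

print(h_gaussian, h_laplace)                           # the Gaussian value is the larger one
```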
The following remark is somewhat subtle.
Theorem 10.43 says that, under the constraint
that the second moment is equal to kappa,
the differential entropy is maximized by the
Gaussian distribution with mean 0 and variance kappa.
If we impose the additional constraint that the mean is equal to 0, then both
the variance of X and the second moment of X are equal to kappa.
By theorem 10.44, the differential entropy is still maximized
by the Gaussian distribution with 0 mean and variance kappa.
Now we discuss a relation between differential
entropy and the spread of the distribution.
From theorem 10.44, we have that the differential entropy of
a random variable X is less than or equal
to one half log 2 pi e sigma square, where
sigma square is the variance of X.
18:30
This upper bound can be written as log sigma, which is a measure of the spread
of the distribution, plus a constant, namely one half log 2 pi e.
In particular, as sigma, the standard deviation, tends
to 0, the differential entropy tends to minus infinity.
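A quick numerical illustration of this behavior (added here, not from the lecture; the values of sigma are arbitrary):

```python
# Illustration only: a few arbitrary values of the standard deviation sigma.
import numpy as np

for sigma in [1.0, 0.1, 0.01, 0.001]:
    bound = np.log(sigma) + 0.5 * np.log(2 * np.pi * np.e)   # upper bound on h(X), in nats
    print(sigma, bound)                                      # decreases without bound as sigma -> 0
```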
The next two theorems are the vector
generalizations of theorems 10.43 and 10.44, respectively.
Let X be a vector of n
continuous random variables with correlation matrix K tilde.
Then the differential entropy of X is upper bounded by one half log of 2 pi e
to the power n times the determinant of the correlation matrix K tilde,
with equality if and only if X is a Gaussian
vector with mean 0 and covariance matrix K tilde.
Theorem 10.46 says that, for a random vector X with mean mu and
covariance matrix K, the differential entropy is upper bounded by one half log of 2 pi e
to the power n times the determinant of K, with equality if and
only if X is a Gaussian vector with mean mu and covariance matrix K.
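As a numerical check of these bounds (an illustration, not from the lecture; the 2 by 2 covariance matrix below is an arbitrary choice), the differential entropy of a Gaussian vector with covariance matrix K equals one half log of (2 pi e) to the power n times the determinant of K, which is exactly the upper bound.

```python
# Illustration only: K is an assumed 2x2 positive definite covariance matrix.
import numpy as np
from scipy.stats import multivariate_normal

K = np.array([[2.0, 0.5],
              [0.5, 1.0]])
n = K.shape[0]

bound = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(K))
print(bound)                                                    # upper bound from the theorem
print(multivariate_normal(mean=np.zeros(n), cov=K).entropy())   # Gaussian attains it exactly
```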
We now prove theorem 10.45.
Define the function r_{ij}(x) to be x_i times x_j, and let
the (i,j)-th element of the matrix K tilde be k tilde ij.
Then the constraints on the pdf of the random vector X,
namely the requirement that the correlation matrix is equal to K
tilde, are equivalent to setting the integral of r_{ij}(x)
f(x)dx over the support of f to k tilde ij.
This is because r_{ij}(x) is equal to x_i times x_j, and so this
integral is equal to the expectation of X_i times X_j,
that is, the correlation between X_i and X_j, and this holds for all i, j between 1 and n.
Now by theorem 10.41, the joint pdf that
maximizes the differential entropy has the form
f star of x equals e to the power of minus lambda_0
minus the summation over all i and j of lambda_{ij} times x_i times x_j,
where x_i times x_j is r_{ij}(x).
Here, the summation over all i and j of lambda_{ij} times x_i
times x_j can be written as x transpose L x,
where L is an n by n matrix with the (i,j)-th element equal to lambda_{ij}.
Thus, f star is the joint pdf
of a multivariate Gaussian distribution with 0 mean.
To see this, we only need to compare the form of f
star of x and the pdf of a Gaussian distribution with mean 0.
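To make this comparison concrete (an illustration, not from the lecture; the matrix standing in for K tilde below is an arbitrary choice), taking L equal to one half times the inverse of K tilde and e to the power minus lambda_0 equal to the usual normalizing constant reproduces the N(0, K tilde) density.

```python
# Illustration only: K is an assumed 2x2 positive definite matrix standing in for K tilde.
import numpy as np
from scipy.stats import multivariate_normal

K = np.array([[2.0, 0.5],
              [0.5, 1.0]])
n = K.shape[0]

L = 0.5 * np.linalg.inv(K)                                   # the matrix of the lambda_ij
lam0 = 0.5 * np.log((2 * np.pi) ** n * np.linalg.det(K))     # so that e^{-lambda_0} normalizes f*

x = np.array([0.3, -1.2])
print(np.exp(-lam0 - x @ L @ x))                             # f*(x) in the exponential form
print(multivariate_normal(mean=np.zeros(n), cov=K).pdf(x))   # N(0, K) density: same value
```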
Then for all i, j between 1 and n, the covariance between X_i and X_j is
equal to the expectation of X_i times X_j minus
the expectation of X_i times the expectation of X_j,
where the expectation of X_i and the expectation of X_j are both equal to 0.
23:31
Hence the covariance between X_i and X_j is equal to the expectation of X_i times X_j,
which is k tilde ij by the constraints, and so f star is the joint pdf of the
Gaussian distribution with mean 0 and covariance matrix K tilde.
Hence, by theorem 10.20, which gives the differential entropy
of a Gaussian distribution, we have proved the upper
bound on the differential entropy of the random vector
X in terms of the correlation matrix K tilde,
with equality if and only if X is a
Gaussian vector with mean 0 and covariance matrix K tilde.