In the previous lecture, we showed the equivalence of Gaussian dropout with a special kind of variational Bayesian inference.
So we proved that Gaussian dropout actually optimizes the following ELBO, and in this ELBO the second term doesn't
depend on theta, so it can be ignored if we optimize only with respect to theta.
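For reference, the ELBO in question can be written as follows (a sketch in standard notation; the exact symbols on the slides may differ):

```latex
\mathcal{L}(\theta, \alpha)
  = \underbrace{\mathbb{E}_{q(W \mid \theta, \alpha)}\,\log p(\mathcal{D} \mid W)}_{\text{data term}}
  \;-\; \underbrace{\mathrm{KL}\!\left(q(W \mid \theta, \alpha)\,\|\,p(W)\right)}_{\text{regularizer}}
```

With the log-uniform prior used here, the KL term depends only on alpha, which is exactly why it can be dropped when alpha is fixed and we optimize over theta alone.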
And now the question is: why not optimize both with respect to theta and alpha? Remember that
our variational approximation, our q of w, depends on theta and also on alpha.
And remember that the more variational parameters we have,
the closer we can fit the true posterior distribution,
so our approximation will only get better and better.
So then why not optimize the ELBO with respect to both theta and alpha?
It's important to note that this wasn't possible until we came up with the Bayesian interpretation of Gaussian dropout.
Indeed, if we tried to optimize just the first term,
the data term, with respect to both theta and alpha,
we would quickly end up with zero values of alpha. Why so?
Because we know that the maximum value of the first term is achieved when
our distribution is a delta function at the maximum-likelihood weights,
and a delta function means zero variance, and zero variance means zero alpha.
So we may obtain some non-zero values of alpha only if we optimize both terms,
the data term and our regularizer.
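In formulas, the issue is that the data term alone is maximized by a degenerate posterior (a sketch; W_ML below denotes the maximum-likelihood weights, my notation, not the lecture's):

```latex
\max_{q}\; \mathbb{E}_{q(W)}\,\log p(\mathcal{D} \mid W)
  \quad\text{is attained at}\quad
  q(W) = \delta\!\left(W - W_{\mathrm{ML}}\right),
  \quad\text{i.e. } \theta = W_{\mathrm{ML}},\; \alpha\,\theta^{2} = 0 \;\Rightarrow\; \alpha = 0 .
```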
So now our variational approximation looks as follows: it is a fully factorized Gaussian distribution over all weights
w_ij, with mean theta_ij and variance alpha times theta_ij squared.
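Written out, this is the family that corresponds to plain Gaussian dropout with a single, shared alpha:

```latex
q(W \mid \theta, \alpha) \;=\; \prod_{i,j}\, \mathcal{N}\!\left(w_{ij} \,\middle|\, \theta_{ij},\; \alpha\,\theta_{ij}^{2}\right)
```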
But we may go even further.
Why not assign an individual dropout rate to each weight?
Why not say that our variational approximation looks as follows:
a fully factorized Gaussian distribution over w_ij with
mean theta_ij and variance alpha_ij times theta_ij squared?
So we may now assign an individual dropout rate,
an individual alpha, to each of the weights.
And again, this will make our approximation only tighter;
we can only come closer to the true posterior distribution.
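As a minimal sketch of what sampling from this per-weight family looks like in code (assuming the usual multiplicative-noise reparameterization w_ij = theta_ij * (1 + sqrt(alpha_ij) * eps_ij); the function and variable names here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_weights(theta, alpha):
    """One sample W ~ q(W | theta, alpha): each w_ij is Gaussian with mean
    theta_ij and variance alpha_ij * theta_ij**2, written in the equivalent
    multiplicative-noise form w_ij = theta_ij * (1 + sqrt(alpha_ij) * eps_ij)."""
    eps = rng.standard_normal(theta.shape)
    return theta * (1.0 + np.sqrt(alpha) * eps)

# Illustrative usage: a 3x4 weight matrix with an individual alpha per weight.
theta = rng.standard_normal((3, 4))
alpha = np.full((3, 4), 0.5)
w = sample_weights(theta, alpha)
```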
But before we proceed, let us examine the behavior of our regularizer as a function of alpha.
Remember that we may approximate it with a smooth differentiable function, and we see
that the maximum value of this regularizer is achieved when alpha goes to plus infinity.
This means that the second term of our ELBO encourages larger values of alpha.
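The smooth approximation meant here is presumably the one from the sparse variational dropout paper (Molchanov et al., 2017); a per-weight sketch of it, with the constants as I recall them from that paper (worth verifying against the original):

```python
import numpy as np

# Approximate negative KL(q(w_ij) || log-uniform prior) as a function of
# log(alpha_ij). Constants k1, k2, k3 are the ones reported in
# Molchanov et al. (2017); treat them as an assumption here.
K1, K2, K3 = 0.63576, 1.87320, 1.48695

def neg_kl_approx(log_alpha):
    sigmoid = 1.0 / (1.0 + np.exp(-(K2 + K3 * log_alpha)))
    return K1 * sigmoid - 0.5 * np.log1p(np.exp(-log_alpha)) - K1

# The approximation increases with alpha and saturates at 0 as alpha -> +inf,
# which is the sense in which the regularizer encourages large alphas.
print(neg_kl_approx(np.array([-5.0, 0.0, 5.0, 20.0])))
```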
And that's quite interesting, because we may easily
prove that if alpha_ij goes to plus infinity,
then the corresponding theta_ij, which is the mean
in our variational approximation, converges to zero,
in such a way that alpha_ij times theta_ij squared also converges to zero.
What this means is that our variational approximation,
our q of w_ij, becomes a delta function centered at zero
when alpha_ij goes to plus infinity.
And a delta function centered at zero means that the corresponding w_ij
is exactly zero, and this means that we may simply skip this connection,
simply remove the corresponding weight from
our neural network, thus effectively sparsifying it.
So the whole procedure, which is known as sparse variational dropout, looks as follows.
First, we assign a log-uniform prior distribution over the weights,
which is a fully factorized prior distribution.
Then we fix the variational family of distributions q of w given theta and alpha.
And again, this is a fully factorized distribution over all weights w_ij, with
mean theta_ij and variance given by alpha_ij times theta_ij squared.
And finally, we perform stochastic variational inference, trying to
optimize our ELBO both with respect to all thetas and with respect to all alphas.
And in the end, we remove all weights
whose alphas exceed some predefined large threshold.
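To make the whole procedure concrete, here is a minimal PyTorch-style sketch of one sparse variational dropout layer (illustrative code of my own, not the lecture's; it assumes the local reparameterization trick, the KL approximation above, and a pruning threshold on log alpha, commonly taken around 3):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseVDLinear(nn.Module):
    """Linear layer with posterior N(w_ij | theta_ij, alpha_ij * theta_ij^2)
    over each weight; an illustrative sketch of sparse variational dropout."""

    def __init__(self, in_features, out_features, threshold=3.0):
        super().__init__()
        self.theta = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.log_sigma2 = nn.Parameter(torch.full((out_features, in_features), -10.0))
        self.threshold = threshold  # prune weights whose log alpha exceeds this

    @property
    def log_alpha(self):
        # alpha_ij = sigma_ij^2 / theta_ij^2
        return self.log_sigma2 - 2.0 * torch.log(torch.abs(self.theta) + 1e-8)

    def forward(self, x):
        if self.training:
            # Local reparameterization: sample the pre-activations directly.
            mean = F.linear(x, self.theta)
            var = F.linear(x.pow(2), self.log_sigma2.exp())
            return mean + (var + 1e-8).sqrt() * torch.randn_like(mean)
        # At test time use the means, zeroing out the pruned weights.
        mask = (self.log_alpha < self.threshold).float()
        return F.linear(x, self.theta * mask)

    def kl(self):
        # Approximate KL(q || log-uniform prior), summed over all weights
        # (same approximation and constants as in the snippet above).
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        la = self.log_alpha
        neg_kl = k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * F.softplus(-la) - k1
        return -neg_kl.sum()
```

A training step would then minimize something like `data_loss * (N / batch_size) + sum(layer.kl() for layer in vd_layers)`, a stochastic estimate of the negative ELBO; after training, the weights masked out in `forward` are exactly the ones we remove.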
And surprisingly, this procedure works quite well.
So in this picture you see the behavior of convolution kernels from
the convolutional layers and fragments of the weight matrix from the fully connected layers.
You see that as training progresses, more and more weights
and more and more coefficients in the convolution kernels converge to zero.
The compression rate in fact exceeds 200, and pay attention that the accuracy doesn't decrease:
we keep the same accuracy as the baseline while effectively
compressing the whole network by hundreds of times.
This only became possible due to this Bayesian dropout.
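And for completeness, a quick sketch of how such a compression rate could be computed from the learned per-weight log alphas (illustrative; the threshold of 3 is an assumption, and the exact number of course depends on the model):

```python
import torch

def compression_rate(log_alphas, threshold=3.0):
    """log_alphas: list of tensors with per-weight log alpha values,
    one tensor per layer. Returns total weights / weights kept."""
    total = sum(la.numel() for la in log_alphas)
    kept = sum((la < threshold).sum().item() for la in log_alphas)
    return total / max(kept, 1)
```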
So to conclude: it is known that modern deep architectures are very redundant,
but it is quite problematic to remove this redundancy, and one of
the most successful ways to do it is
Bayesian dropout, or sparse variational dropout.
Variational Bayesian inference is a highly scalable procedure that allows
us to optimize millions of variational parameters, and this
is just one of
many examples of successful combinations of Bayesian methods and deep learning.
You may find more examples of successful applications of Bayesian methods to DNNs
in the additional reading materials.