Variational Bayesian methods
Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:
- To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
- To derive a lower bound for the marginal likelihood of the observed data. This is typically used for performing model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data.
Variational Bayes can be seen as an extension of the EM algorithm from maximum a posteriori estimation of the single most probable value of each parameter to fully Bayesian estimation which computes the entire posterior distribution of the parameters and latent variables. As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as does EM, based on a set of interlocked equations that cannot be solved analytically.
For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed. However, deriving the set of equations used to update the parameters iteratively often requires a large amount of work compared with deriving the comparable Gibbs sampling equations. This is the case even for many models that are conceptually quite simple, as is demonstrated below in the case of a basic non-hierarchical model with only two parameters and no latent variables.
Mathematical derivation
Problem
In variational inference, the posterior distribution over a set of unobserved variables given some data is approximated by a so-called variational distribution, :The distribution is restricted to belong to a family of distributions of simpler form than, selected with the intention of making similar to the true posterior,.
The similarity is measured in terms of a dissimilarity function and hence inference is performed by selecting the distribution that minimizes.
KL divergence
The most common type of variational Bayes uses the Kullback–Leibler divergence of P from Q as the choice of dissimilarity function. This choice makes this minimization tractable. The KL-divergence is defined asNote that Q and P are reversed from what one might expect. This use of reversed KL-divergence is conceptually similar to the expectation-maximization algorithm.
Intractability
Variational techniques are typically used to form an approximation for:The marginalization over to calculate in the denominator is typically intractable, because, for example, the search space of is combinatorially large. Therefore, we seek an approximation, using.
Evidence lower bound
Given that, the KL-divergence above can also be written asBecause is a constant with respect to and because is a distribution, we have
which, according to the definition of expected value, can be written as follows
which can be rearranged to become
As the log evidence is fixed with respect to, maximizing the final term minimizes the KL divergence of from. By appropriate choice of, becomes tractable to compute and to maximize. Hence we have both an analytical approximation for the posterior, and a lower bound for the evidence .
The lower bound is known as the variational free energy in analogy with thermodynamic free energy because it can also be expressed as a negative "energy" plus the entropy of. The term is also known as Evidence Lower BOund, abbreviated as ELBO, to emphasize that it is a lower bound on the evidence of the data.
Proofs
By generalized Pythagorean theorem of Bregman divergence, of which KL-divergence is a special case, it can be shown that :.
where is a convex set and the equality holds if:
In this case, the global minimizer with can be found as follows :
in which the normalizing constant is:
The term is often called the evidence lower bound in practice, since, as shown above.
By interchanging the roles of and we can iteratively compute the approximated and of the true model's marginals and respectively. Although this iterative scheme is guaranteed to converge monotonically, the converged is only a local minimizer of.
If the constrained space is confined within independent space, i.e. the above iterative scheme will become the so-called mean field approximation as shown below.
Mean field approximation
The variational distribution is usually assumed to factorize over some partition of the latent variables, i.e. for some partition of the latent variables into,It can be shown using the calculus of variations that the "best" distribution for each of the factors can be expressed as:
where is the expectation of the logarithm of the joint probability of the data and latent variables, taken over all variables not in the partition.
In practice, we usually work in terms of logarithms, i.e.:
The constant in the above expression is related to the normalizing constant and is usually reinstated by inspection, as the rest of the expression can usually be recognized as being a known type of distribution.
Using the properties of expectations, the expression can usually be simplified into a function of the fixed hyperparameters of the prior distributions over the latent variables and of expectations of latent variables not in the current partition. This creates circular dependencies between the parameters of the distributions over variables in one partition and the expectations of variables in the other partitions. This naturally suggests an iterative algorithm, much like EM, in which the expectations of the latent variables are initialized in some fashion, and then the parameters of each distribution are computed in turn using the current values of the expectations, after which the expectation of the newly computed distribution is set appropriately according to the computed parameters. An algorithm of this sort is guaranteed to converge.
In other words, for each of the partitions of variables, by simplifying the expression for the distribution over the partition's variables and examining the distribution's functional dependency on the variables in question, the family of the distribution can usually be determined. The formula for the distribution's parameters will be expressed in terms of the prior distributions' hyperparameters, but also in terms of expectations of functions of variables in other partitions. Usually these expectations can be simplified into functions of expectations of the variables themselves ; sometimes expectations of squared variables, or expectations of higher powers also appear. In most cases, the other variables' distributions will be from known families, and the formulas for the relevant expectations can be looked up. However, those formulas depend on those distributions' parameters, which depend in turn on the expectations about other variables. The result is that the formulas for the parameters of each variable's distributions can be expressed as a series of equations with mutual, nonlinear dependencies among the variables. Usually, it is not possible to solve this system of equations directly. However, as described above, the dependencies suggest a simple iterative algorithm, which in most cases is guaranteed to converge. An example will make this process clearer.
A basic example
Consider a simple non-hierarchical Bayesian model consisting of a set of i.i.d. observations from a Gaussian distribution, with unknown mean and variance. In the following, we work through this model in great detail to illustrate the workings of the variational Bayes method.For mathematical convenience, in the following example we work in terms of the precision — i.e. the reciprocal of the variance — rather than the variance itself.
The mathematical model
We place conjugate prior distributions on the unknown mean and precision, i.e. the mean also follows a Gaussian distribution while the precision follows a gamma distribution. In other words:The hyperparameters and in the prior distributions are fixed, given values. They can be set to small positive numbers to give broad prior distributions indicating ignorance about the prior distributions of and.
We are given data points and our goal is to infer the posterior distribution of the parameters and
The joint probability
The joint probability of all variables can be rewritten aswhere the individual factors are
where
Factorized approximation
Assume that, i.e. that the posterior distribution factorizes into independent factors for and. This type of assumption underlies the variational Bayesian method. The true posterior distribution does not in fact factor this way, and hence the result we obtain will be an approximation.Derivation of
ThenIn the above derivation,, and refer to values that are constant with respect to. Note that the term is not a function of and will have the same value regardless of the value of. Hence in line 3 we can absorb it into the constant term at the end. We do the same thing in line 7.
The last line is simply a quadratic polynomial in. Since this is the logarithm of, we can see that itself is a Gaussian distribution.
With a certain amount of tedious math, we can derive the parameters of the Gaussian distribution:
Note that all of the above steps can be shortened by using the formula for the sum of two quadratics.
In other words:
Derivation of
The derivation of is similar to above, although we omit some of the details for the sake of brevity.Exponentiating both sides, we can see that is a gamma distribution. Specifically:
Algorithm for computing the parameters
Let us recap the conclusions from the previous sections:and
In each case, the parameters for the distribution over one of the variables depend on expectations taken with respect to the other variable. We can expand the expectations, using the standard formulas for the expectations of moments of the Gaussian and gamma distributions:
Applying these formulas to the above equations is trivial in most cases, but the equation for takes more work:
We can then write the parameter equations as follows, without any expectations:
Note that there are circular dependencies among the formulas for and. This naturally suggests an EM-like algorithm:
- Compute and Use these values to compute and
- Initialize to some arbitrary value.
- Use the current value of along with the known values of the other parameters, to compute.
- Use the current value of along with the known values of the other parameters, to compute.
- Repeat the last two steps until convergence.
It can be shown that this algorithm is guaranteed to converge to a local maximum.
Note also that the posterior distributions have the same form as the corresponding prior distributions. We did not assume this; the only assumption we made was that the distributions factorize, and the form of the distributions followed naturally. It turns out that the fact that the posterior distributions have the same form as the prior distributions is not a coincidence, but a general result whenever the prior distributions are members of the exponential family, which is the case for most of the standard distributions.
Further discussion
Step-by-step recipe
The above example shows the method by which the variational-Bayesian approximation to a posterior probability density in a given Bayesian network is derived:- Describe the network with a graphical model, identifying the observed variables and unobserved variables and their conditional probability distributions. Variational Bayes will then construct an approximation to the posterior probability. The approximation has the basic property that it is a factorized distribution, i.e. a product of two or more independent distributions over disjoint subsets of the unobserved variables.
- Partition the unobserved variables into two or more subsets, over which the independent factors will be derived. There is no universal procedure for doing this; creating too many subsets yields a poor approximation, while creating too few makes the entire variational Bayes procedure intractable. Typically, the first split is to separate the parameters and latent variables; often, this is enough by itself to produce a tractable result. Assume that the partitions are called.
- For a given partition, write down the formula for the best approximating distribution using the basic equation .
- Fill in the formula for the joint probability distribution using the graphical model. Any component conditional distributions that don't involve any of the variables in can be ignored; they will be folded into the constant term.
- Simplify the formula and apply the expectation operator, following the above example. Ideally, this should simplify into expectations of basic functions of variables not in . In order for the variational Bayes procedure to work well, these expectations should generally be expressible analytically as functions of the parameters and/or hyperparameters of the distributions of these variables. In all cases, these expectation terms are constants with respect to the variables in the current partition.
- The functional form of the formula with respect to the variables in the current partition indicates the type of distribution. In particular, exponentiating the formula generates the probability density function of the distribution. In order for the overall method to be tractable, it should be possible to recognize the functional form as belonging to a known distribution. Significant mathematical manipulation may be required to convert the formula into a form that matches the PDF of a known distribution. When this can be done, the normalization constant can be reinstated by definition, and equations for the parameters of the known distribution can be derived by extracting the appropriate parts of the formula.
- When all expectations can be replaced analytically with functions of variables not in the current partition, and the PDF put into a form that allows identification with a known distribution, the result is a set of equations expressing the values of the optimum parameters as functions of the parameters of variables in other partitions.
- When this procedure can be applied to all partitions, the result is a set of mutually linked equations specifying the optimum values of all parameters.
- An expectation maximization type procedure is then applied, picking an initial value for each parameter and the iterating through a series of steps, where at each step we cycle through the equations, updating each parameter in turn. This is guaranteed to converge.
Most important points
- The idea of variational Bayes is to construct an analytical approximation to the posterior probability of the set of unobserved variables, given the data. This means that the form of the solution is similar to other Bayesian inference methods, such as Gibbs sampling — i.e. a distribution that seeks to describe everything that is known about the variables. As in other Bayesian methods — but unlike e.g. in expectation maximization or other maximum likelihood methods — both types of unobserved variables are treated the same, i.e. as random variables. Estimates for the variables can then be derived in the standard Bayesian ways, e.g. calculating the mean of the distribution to get a single point estimate or deriving a credible interval, highest density region, etc.
- "Analytical approximation" means that a formula can be written down for the posterior distribution. The formula generally consists of a product of well-known probability distributions, each of which factorizes over a set of unobserved variables. This formula is not the true posterior distribution, but an approximation to it; in particular, it will generally agree fairly closely in the lowest moments of the unobserved variables, e.g. the mean and variance.
- The result of all of the mathematical manipulations is the identity of the probability distributions making up the factors, and mutually dependent formulas for the parameters of these distributions. The actual values of these parameters are computed numerically, through an alternating iterative procedure much like EM.
Compared with expectation maximization (EM)
However, there are a number of differences. Most important is what is being computed.
- EM computes point estimates of posterior distribution of those random variables that can be categorized as "parameters", but only estimates of the actual posterior distributions of the latent variables. The point estimates computed are the modes of these parameters; no other information is available.
- VB, on the other hand, computes estimates of the actual posterior distribution of all variables, both parameters and latent variables. When point estimates need to be derived, generally the mean is used rather than the mode, as is normal in Bayesian inference. Concomitant with this, the parameters computed in VB do not have the same significance as those in EM. EM computes optimum values of the parameters of the Bayes network itself. VB computes optimum values of the parameters of the distributions used to approximate the parameters and latent variables of the Bayes network. For example, a typical Gaussian mixture model will have parameters for the mean and variance of each of the mixture components. EM would directly estimate optimum values for these parameters. VB, however, would first fit a distribution to these parameters — typically in the form of a prior distribution, e.g. a normal-scaled inverse gamma distribution — and would then compute values for the parameters of this prior distribution, i.e. essentially hyperparameters. In this case, VB would compute optimum estimates of the four parameters of the normal-scaled inverse gamma distribution that describes the joint distribution of the mean and variance of the component.
A more complex example
Note:
- SymDir is the symmetric Dirichlet distribution of dimension, with the hyperparameter for each component set to. The Dirichlet distribution is the conjugate prior of the categorical distribution or multinomial distribution.
- is the Wishart distribution, which is the conjugate prior of the precision matrix for a multivariate Gaussian distribution.
- Mult is a multinomial distribution over a single observation. The state space is a "one-of-K" representation, i.e. a -dimensional vector in which one of the elements is 1 and all other elements are 0.
- is the Gaussian distribution, in this case specifically the multivariate Gaussian distribution.
- is the set of data points, each of which is a -dimensional vector distributed according to a multivariate Gaussian distribution.
- is a set of latent variables, one per data point, specifying which mixture component the corresponding data point belongs to, using a "one-of-K" vector representation with components for, as described above.
- is the mixing proportions for the mixture components.
- and specify the parameters associated with each mixture component.
where the individual factors are
where
Assume that.
Then
where we have defined
Exponentiating both sides of the formula for yields
Requiring that this be normalized ends up requiring that the sum to 1 over all values of, yielding
where
In other words, is a product of single-observation multinomial distributions, and factors over each individual, which is distributed as a single-observation multinomial distribution with parameters for.
Furthermore, we note that
which is a standard result for categorical distributions.
Now, considering the factor, note that it automatically factors into due to the structure of the graphical model defining our Gaussian mixture model, which is specified above.
Then,
Taking the exponential of both sides, we recognize as a Dirichlet distribution
where
where
Finally
Grouping and reading off terms involving and, the result is a Gaussian-Wishart distribution given by
given the definitions
Finally, notice that these functions require the values of, which make use of, which is defined in turn based on,, and. Now that we have determined the distributions over which these expectations are taken, we can derive formulas for them:
These results lead to
These can be converted from proportional to absolute values by normalizing over so that the corresponding values sum to 1.
Note that:
- The update equations for the parameters,, and of the variables and depend on the statistics,, and, and these statistics in turn depend on.
- The update equations for the parameters of the variable depend on the statistic, which depends in turn on.
- The update equation for has a direct circular dependence on,, and as well as an indirect circular dependence on, and through and.
- An E-step that computes the value of using the current values of all the other parameters.
- An M-step that uses the new value of to compute new values of all the other parameters.