Least mean squares filter


Least mean squares (LMS) algorithms are a class of adaptive filters used to mimic a desired filter by finding the filter coefficients that minimize the mean square of the error signal (the difference between the desired and the actual signal). LMS is a stochastic gradient descent method in that the filter is adapted based only on the error at the current time. It was invented in 1960 by Stanford University professor Bernard Widrow and his first Ph.D. student, Ted Hoff.

Problem formulation

Relationship to the Wiener filter

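The realization of the causal Wiener filter looks a lot like the solution to the least squares estimate, except in the signal processing domain. The least squares solution, for input matrix $\mathbf{X}$ and output vector $\mathbf{y}$, is
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}.$$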
The FIR least mean squares filter is related to the Wiener filter, but minimizing the error criterion of the former does not rely on cross-correlations or auto-correlations. Its solution converges to the Wiener filter solution.
Most linear adaptive filtering problems can be formulated using the block diagram above. That is, an unknown system $\mathbf{h}(n)$ is to be identified, and the adaptive filter $\hat{\mathbf{h}}(n)$ is adapted to make it as close as possible to $\mathbf{h}(n)$ while using only the observable signals $x(n)$, $d(n)$ and $e(n)$; the signals $y(n)$, $v(n)$ and the system $\mathbf{h}(n)$ are not directly observable. Its solution is closely related to the Wiener filter.

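Definition of symbols

$n$ is the index of the current input sample
$p$ is the number of filter taps
$\{\cdot\}^{H}$ denotes the Hermitian (conjugate) transpose
$\mathbf{x}(n) = [x(n), x(n-1), \dots, x(n-p+1)]^{T}$ is the input vector
$\mathbf{h}(n)$ is the unknown system and $\hat{\mathbf{h}}(n)$ is the estimated filter after $n$ samples
$y(n) = \mathbf{h}^{H}(n)\,\mathbf{x}(n)$ is the output of the unknown system
$d(n) = y(n) + v(n)$ is the desired signal, where $v(n)$ is the interference
$e(n) = d(n) - \hat{y}(n) = d(n) - \hat{\mathbf{h}}^{H}(n)\,\mathbf{x}(n)$ is the error signal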

Idea

The basic idea behind the LMS filter is to approach the optimum filter weights by repeatedly updating the current weights so that they converge to the optimum. This is based on the gradient descent algorithm. The algorithm starts by assuming small weights (zero in most cases) and, at each step, updates the weights using the gradient of the mean square error (MSE).
If the MSE gradient is positive, the error would keep increasing if the same weights were used for further iterations, so the weights need to be reduced. In the same way, if the gradient is negative, the weights need to be increased. The weight update equation is
$$W_{n+1} = W_{n} - \mu \nabla \varepsilon[n],$$
where $\varepsilon$ represents the mean-square error and $\mu$ is a convergence coefficient.
The negative sign shows that we go down the slope of the error to find the filter weights that minimize the error.
The mean-square error as a function of the filter weights is a quadratic function, which means it has only one extremum. That extremum minimizes the mean-square error and corresponds to the optimal weights. The LMS algorithm thus approaches the optimal weights by descending the curve of mean-square error versus filter weight.

Derivation

The idea behind LMS filters is to use steepest descent to find filter weights which minimize a cost function.
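We start by defining the cost function as
$$C(n) = E\left\{|e(n)|^{2}\right\},$$
where $e(n) = d(n) - \hat{\mathbf{h}}^{H}(n)\,\mathbf{x}(n)$ is the error at the current sample $n$ and $E\{\cdot\}$ denotes the expected value.
This cost function is the mean square error, and it is minimized by the LMS. This is where the LMS gets its name. Applying steepest descent means taking the partial derivatives with respect to the individual entries of the filter coefficient vector
$$\nabla_{\hat{\mathbf{h}}^{H}} C(n) = \nabla_{\hat{\mathbf{h}}^{H}} E\left\{e(n)\,e^{*}(n)\right\} = 2E\left\{\nabla_{\hat{\mathbf{h}}^{H}}\big(e(n)\big)\,e^{*}(n)\right\},$$
where $\nabla$ is the gradient operator. Since
$$\nabla_{\hat{\mathbf{h}}^{H}}\big(e(n)\big) = \nabla_{\hat{\mathbf{h}}^{H}}\left(d(n) - \hat{\mathbf{h}}^{H}(n)\,\mathbf{x}(n)\right) = -\mathbf{x}(n),$$
we obtain
$$\nabla C(n) = -2E\left\{\mathbf{x}(n)\,e^{*}(n)\right\}.$$
Now, $\nabla C(n)$ is a vector which points towards the steepest ascent of the cost function. To find the minimum of the cost function we need to take a step in the opposite direction of $\nabla C(n)$. In mathematical terms,
$$\hat{\mathbf{h}}(n+1) = \hat{\mathbf{h}}(n) - \frac{\mu}{2}\nabla C(n) = \hat{\mathbf{h}}(n) + \mu\,E\left\{\mathbf{x}(n)\,e^{*}(n)\right\},$$
where $\frac{\mu}{2}$ is the step size. That means we have found a sequential update algorithm which minimizes the cost function. Unfortunately, this algorithm is not realizable until we know $E\left\{\mathbf{x}(n)\,e^{*}(n)\right\}$.
Generally, the expectation above is not computed. Instead, to run the LMS in an online environment, we use an instantaneous estimate of that expectation, as described in the next section.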
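The steepest-descent recursion with exactly known statistics can be sketched as follows. This is only an illustration for real-valued signals, where $E\{\mathbf{x}(n)e(n)\}$ equals the cross-correlation vector minus $R\,\hat{\mathbf{h}}$; the matrix `R`, the weights `h_true` and the name `r_xd` are assumed values and names chosen for the example, not taken from the text.

```python
def steepest_descent(R, r_xd, mu, num_iters):
    """Iterate h <- h + mu * (r_xd - R h) for a real-valued filter."""
    p = len(r_xd)
    h = [0.0] * p
    for _ in range(num_iters):
        # r_xd - R h equals E{x(n) e(n)} when the statistics are known exactly
        expected_xe = [r_xd[i] - sum(R[i][j] * h[j] for j in range(p))
                       for i in range(p)]
        h = [h[i] + mu * expected_xe[i] for i in range(p)]
    return h

R = [[1.0, 0.5], [0.5, 1.0]]   # assumed input autocorrelation matrix
h_true = [0.5, -0.3]           # assumed optimal (Wiener) weights
# Cross-correlation vector consistent with h_true: r_xd = R h_true
r_xd = [sum(R[i][j] * h_true[j] for j in range(2)) for i in range(2)]

h_sd = steepest_descent(R, r_xd, mu=0.5, num_iters=200)
print(h_sd)  # close to h_true
```

Because the true expectation is used, the iterates settle on the Wiener solution without the random fluctuation that the instantaneous LMS estimate introduces.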

Simplifications

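For most systems the expectation $E\left\{\mathbf{x}(n)\,e^{*}(n)\right\}$ must be approximated. This can be done with the following unbiased estimator
$$\hat{E}\left\{\mathbf{x}(n)\,e^{*}(n)\right\} = \frac{1}{N}\sum_{i=0}^{N-1}\mathbf{x}(n-i)\,e^{*}(n-i),$$
where $N$ indicates the number of samples we use for that estimate. The simplest case is $N = 1$:
$$\hat{E}\left\{\mathbf{x}(n)\,e^{*}(n)\right\} = \mathbf{x}(n)\,e^{*}(n).$$
For that simple case the update algorithm follows as
$$\hat{\mathbf{h}}(n+1) = \hat{\mathbf{h}}(n) + \mu\,\mathbf{x}(n)\,e^{*}(n).$$
Indeed, this constitutes the update algorithm for the LMS filter.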

LMS algorithm summary

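The LMS algorithm for a $p$th order filter can be summarized as

Parameters: $p$ = filter order, $\mu$ = step size
Initialisation: $\hat{\mathbf{h}}(0) = \operatorname{zeros}(p)$
Computation: for $n = 0, 1, 2, \dots$
$\mathbf{x}(n) = [x(n), x(n-1), \dots, x(n-p+1)]^{T}$
$e(n) = d(n) - \hat{\mathbf{h}}^{H}(n)\,\mathbf{x}(n)$
$\hat{\mathbf{h}}(n+1) = \hat{\mathbf{h}}(n) + \mu\,e^{*}(n)\,\mathbf{x}(n)$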
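A minimal sketch of the LMS recursion in Python, assuming real-valued signals (so the complex conjugate drops out) and a noise-free system-identification setup. The function name `lms_identify`, the helper `fir`, the test system `h_true` and the step size are hypothetical choices made for this example.

```python
import random

def lms_identify(x, d, num_taps, mu):
    """Run real-valued LMS; return (final weights, error sequence)."""
    h_hat = [0.0] * num_taps       # initialisation: all-zero weights
    buf = [0.0] * num_taps         # buf[k] holds x(n - k)
    errors = []
    for x_n, d_n in zip(x, d):
        buf = [x_n] + buf[:-1]     # shift the newest sample into the tap vector
        y_hat = sum(w * s for w, s in zip(h_hat, buf))
        e = d_n - y_hat            # error e(n) = d(n) - h_hat^T x(n)
        h_hat = [w + mu * e * s for w, s in zip(h_hat, buf)]
        errors.append(e)
    return h_hat, errors

def fir(h, x):
    """Output of an FIR system h driven by x (zero initial state)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

rng = random.Random(0)
h_true = [0.5, -0.3, 0.1]          # hypothetical unknown system
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # white, unit-variance input
d = fir(h_true, x)                 # desired signal without interference

h_hat, errors = lms_identify(x, d, num_taps=3, mu=0.05)
print(h_hat)                       # close to h_true
```

With no interference added to `d`, the error decays towards zero and the estimated weights settle on `h_true`; adding noise to `d` would leave the weights fluctuating around that solution instead.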

Convergence and stability in the mean

As the LMS algorithm does not use the exact values of the expectations, the weights never reach the optimal weights in the absolute sense, but convergence in the mean is possible. That is, even though the weights may change by small amounts, they fluctuate about the optimal weights. However, if the variance of these fluctuations is large, convergence in the mean would be misleading. This problem may occur if the step size $\mu$ is not chosen properly.
If $\mu$ is chosen to be large, the amount by which the weights change depends heavily on the gradient estimate, so the weights may change by so much that a gradient which was negative at one instant becomes positive at the next. The weights may then change in the opposite direction by a large amount, and would thus keep oscillating with a large variance about the optimal weights. On the other hand, if $\mu$ is chosen to be too small, the time to converge to the optimal weights will be too long.
Thus, an upper bound on $\mu$ is needed, which is given as
$$0 < \mu < \frac{2}{\lambda_{\max}},$$
where $\lambda_{\max}$ is the greatest eigenvalue of the autocorrelation matrix $R = E\{\mathbf{x}(n)\,\mathbf{x}^{H}(n)\}$. If this condition is not fulfilled, the algorithm becomes unstable and $\hat{\mathbf{h}}(n)$ diverges.
Maximum convergence speed is achieved when
$$\mu = \frac{2}{\lambda_{\max} + \lambda_{\min}},$$
where $\lambda_{\min}$ is the smallest eigenvalue of $R$.
Given that $\mu$ is less than or equal to this optimum, the convergence speed is determined by $\lambda_{\min}$, with a larger value yielding faster convergence. This means that faster convergence can be achieved when $\lambda_{\max}$ is close to $\lambda_{\min}$; that is, the maximum achievable convergence speed depends on the eigenvalue spread of $R$.
A white noise signal has autocorrelation matrix $R = \sigma^{2} I$, where $\sigma^{2}$ is the variance of the signal. In this case all eigenvalues are equal, and the eigenvalue spread is the minimum over all possible matrices.
The common interpretation of this result is therefore that the LMS converges quickly for white input signals, and slowly for colored input signals, such as processes with low-pass or high-pass characteristics.
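The effect of input colouring on the eigenvalue spread can be illustrated for a two-tap filter with real, stationary input, where the autocorrelation matrix is [[r0, r1], [r1, r0]] and its eigenvalues are r0 + |r1| and r0 - |r1|. The following is a sketch under those assumptions; the function names and the correlation value 0.9 are chosen for illustration only.

```python
def two_tap_eigenvalues(r0, r1):
    """Eigenvalues (largest, smallest) of R = [[r0, r1], [r1, r0]]."""
    return r0 + abs(r1), r0 - abs(r1)

def eigenvalue_spread(r0, r1):
    """Ratio of largest to smallest eigenvalue of the 2x2 matrix R."""
    lam_max, lam_min = two_tap_eigenvalues(r0, r1)
    return lam_max / lam_min

# White input: adjacent samples are uncorrelated (r1 = 0)
white_spread = eigenvalue_spread(1.0, 0.0)

# Strongly low-pass input: adjacent samples have correlation 0.9 (assumed value)
coloured_spread = eigenvalue_spread(1.0, 0.9)

print(white_spread, coloured_spread)
```

The white input gives a spread of 1, while the correlated input gives a spread of roughly 19, which is why the slowest mode of the LMS converges much more slowly in the second case.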
It is important to note that the above upper bound on $\mu$ only enforces stability in the mean; the coefficients of $\hat{\mathbf{h}}(n)$ can still grow infinitely large, i.e. divergence of the coefficients is still possible. A more practical bound is
$$0 < \mu < \frac{2}{\operatorname{tr}[R]},$$
where $\operatorname{tr}[R]$ denotes the trace of $R$. This bound guarantees that the coefficients of $\hat{\mathbf{h}}(n)$ do not diverge.
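The trace bound is convenient because it needs no eigenvalue decomposition. For a real, wide-sense stationary input every diagonal entry of the autocorrelation matrix equals the input power, so the trace is the number of taps times that power. A small sketch under that assumption follows; the function name and the sample-mean power estimate are illustrative choices.

```python
def lms_step_size_bound(x, num_taps):
    """Upper bound 2 / tr(R), with tr(R) estimated as num_taps * mean(x^2)."""
    power = sum(s * s for s in x) / len(x)   # sample estimate of E[x(n)^2]
    trace_r = num_taps * power               # trace of R for stationary input
    return 2.0 / trace_r

# Unit-power test signal and a 4-tap filter: tr(R) = 4, so the bound is 0.5
bound = lms_step_size_bound([1.0, -1.0, 1.0, -1.0], num_taps=4)
print(bound)
```

Since the trace is the sum of all eigenvalues, it is never smaller than the largest eigenvalue, so a step size satisfying this bound also satisfies the eigenvalue-based bound.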

Normalized least mean squares filter (NLMS)

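The main drawback of the "pure" LMS algorithm is that it is sensitive to the scaling of its input $x(n)$. This makes it very hard (if not impossible) to choose a learning rate $\mu$ that guarantees stability of the algorithm. The normalised least mean squares filter (NLMS) is a variant of the LMS algorithm that solves this problem by normalising with the power of the input. The NLMS algorithm can be summarised as:

Parameters: $p$ = filter order, $\mu$ = step size
Initialisation: $\hat{\mathbf{h}}(0) = \operatorname{zeros}(p)$
Computation: for $n = 0, 1, 2, \dots$
$\mathbf{x}(n) = [x(n), x(n-1), \dots, x(n-p+1)]^{T}$
$e(n) = d(n) - \hat{\mathbf{h}}^{H}(n)\,\mathbf{x}(n)$
$\hat{\mathbf{h}}(n+1) = \hat{\mathbf{h}}(n) + \dfrac{\mu\,e^{*}(n)\,\mathbf{x}(n)}{\mathbf{x}^{H}(n)\,\mathbf{x}(n)}$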
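A minimal NLMS sketch for real-valued signals is given below. The small constant `eps` added to the input energy is an assumption of this example (a common guard against division by zero, not part of the algorithm as stated here), and the names `nlms_identify`, `fir` and `h_true` are hypothetical.

```python
import random

def nlms_identify(x, d, num_taps, mu, eps=1e-12):
    """Run real-valued NLMS and return the final weight estimate."""
    h_hat = [0.0] * num_taps
    buf = [0.0] * num_taps                      # buf[k] holds x(n - k)
    for x_n, d_n in zip(x, d):
        buf = [x_n] + buf[:-1]
        e = d_n - sum(w * s for w, s in zip(h_hat, buf))
        energy = sum(s * s for s in buf) + eps  # x^T x, guarded by eps (assumption)
        h_hat = [w + mu * e * s / energy for w, s in zip(h_hat, buf)]
    return h_hat

def fir(h, x):
    """Output of an FIR system h driven by x (zero initial state)."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

rng = random.Random(1)
h_true = [0.5, -0.3, 0.1]                       # hypothetical unknown system
x = [rng.gauss(0.0, 1.0) for _ in range(3000)]

results = {}
for scale in (1.0, 100.0):                      # same input at two amplitudes
    xs = [scale * s for s in x]
    results[scale] = nlms_identify(xs, fir(h_true, xs), num_taps=3, mu=0.5)
print(results)
```

The same step size works for both input amplitudes because the update is divided by the instantaneous input energy; a plain LMS update with a fixed step size would need retuning when the input is rescaled.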

Optimal learning rate

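It can be shown that if there is no interference ($v(n) = 0$), then the optimal learning rate for the NLMS algorithm is
$$\mu_{opt} = 1$$
and is independent of the input $x(n)$ and the real (unknown) impulse response $\mathbf{h}(n)$. In the general case with interference ($v(n) \neq 0$), the optimal learning rate is
$$\mu_{opt} = \frac{E\left[|y(n) - \hat{y}(n)|^{2}\right]}{E\left[|e(n)|^{2}\right]}.$$
The results above assume that the signals $v(n)$ and $x(n)$ are uncorrelated with each other, which is generally the case in practice.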

Proof

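Let the filter misalignment be defined as $\Lambda(n) = \left|\mathbf{h}(n) - \hat{\mathbf{h}}(n)\right|^{2}$. We can derive the expected misalignment for the next sample as:
$$E\left[\Lambda(n+1)\right] = E\left[\left|\hat{\mathbf{h}}(n) + \frac{\mu\,e^{*}(n)\,\mathbf{x}(n)}{\mathbf{x}^{H}(n)\,\mathbf{x}(n)} - \mathbf{h}(n)\right|^{2}\right]$$
$$E\left[\Lambda(n+1)\right] = E\left[\left|\hat{\mathbf{h}}(n) + \frac{\mu\left(v^{*}(n) + y^{*}(n) - \hat{y}^{*}(n)\right)\mathbf{x}(n)}{\mathbf{x}^{H}(n)\,\mathbf{x}(n)} - \mathbf{h}(n)\right|^{2}\right]$$
Let $\boldsymbol{\delta}(n) = \hat{\mathbf{h}}(n) - \mathbf{h}(n)$ and $r(n) = \hat{y}(n) - y(n)$, so that $e(n) = v(n) - r(n)$ and $r(n) = \boldsymbol{\delta}^{H}(n)\,\mathbf{x}(n)$:
$$E\left[\Lambda(n+1)\right] = E\left[\left|\boldsymbol{\delta}(n) - \frac{\mu\left(r(n) - v(n)\right)^{*}\mathbf{x}(n)}{\mathbf{x}^{H}(n)\,\mathbf{x}(n)}\right|^{2}\right]$$
Assuming independence, and using the fact that $v(n)$ is uncorrelated with $r(n)$, we have:
$$E\left[\Lambda(n+1)\right] = \Lambda(n) + \frac{\mu^{2}\,E\left[|e(n)|^{2}\right]}{\mathbf{x}^{H}(n)\,\mathbf{x}(n)} - \frac{2\mu\,E\left[|r(n)|^{2}\right]}{\mathbf{x}^{H}(n)\,\mathbf{x}(n)}$$
The optimal learning rate is found at $\dfrac{dE\left[\Lambda(n+1)\right]}{d\mu} = 0$, which leads to:
$$2\mu\,E\left[|e(n)|^{2}\right] - 2E\left[|r(n)|^{2}\right] = 0$$
$$\mu = \frac{E\left[|r(n)|^{2}\right]}{E\left[|e(n)|^{2}\right]}$$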
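As a quick consistency check between the general result and the interference-free case, write $r(n) = \hat{y}(n) - y(n)$, so that $e(n) = v(n) - r(n)$ and the general optimal rate reads $\mu_{opt} = E[|r(n)|^{2}] / E[|e(n)|^{2}]$. The step below is a sketch that additionally assumes $v(n)$ is zero-mean and uncorrelated with $r(n)$:

```latex
E\left[|e(n)|^{2}\right] = E\left[|r(n)|^{2}\right] + E\left[|v(n)|^{2}\right]
\quad\Longrightarrow\quad
\mu_{opt} = \frac{E\left[|r(n)|^{2}\right]}{E\left[|r(n)|^{2}\right] + E\left[|v(n)|^{2}\right]} \le 1
```

With no interference, $E[|v(n)|^{2}] = 0$ and the expression reduces to $\mu_{opt} = 1$. As the interference power grows relative to the residual modelling error, the optimal learning rate shrinks towards zero.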