Newton and Gradient Descent

Newton's method and gradient descent can both be used to find roots (or, more generally, in optimization), so how do they compare?

Newton’s method

For a function $f$, first take an initial guess $x_0$, then iterate the following until a sufficiently precise value is reached.

$$ x_{n+1}=x_n-\frac{f\left(x_n\right)}{f^{\prime}\left(x_n\right)} $$
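A minimal sketch of the iteration in Python (the target function, derivative, starting point, and tolerance below are illustrative choices, not prescribed by the method):

def newton(func, func_prime, x0, tol=1e-10, max_iter=100):
    """Find a root of func via Newton's method, starting from x0."""
    x = x0
    for _ in range(max_iter):
        fx = func(x)
        if abs(fx) < tol:            # sufficiently precise value reached
            return x
        x = x - fx / func_prime(x)   # Newton update
    return x

# Example: the root of x^2 - 2 (i.e. sqrt(2)), starting from x0 = 1
root = newton(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)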

Failures

Newton's method is only guaranteed to converge if certain conditions are satisfied: when the assumptions made in the proof of quadratic convergence are met, the method converges. Otherwise, several failure modes are possible.

Bad starting point

  1. $f(x)=1-x^2$. When $x_0=0$, the derivative is $0$ and the iteration cannot proceed.
  2. $f(x)=x^3-2 x+2$. When $x_0=0$, then $x_1=1$, and then $x_2=0$ again, which forms a cycle (a numerical check is sketched after this list).
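A quick numerical check of case 2 (a small illustrative script, not part of the original argument):

f = lambda x: x**3 - 2*x + 2
f_prime = lambda x: 3*x**2 - 2

x = 0.0
for i in range(6):
    print(i, x)                  # prints 0, 1, 0, 1, ... : a 2-cycle, never a root
    x = x - f(x) / f_prime(x)    # Newton update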

Derivative issues

  1. $f(x)=\sqrt[3]{x}$. This causes the iterates to diverge. Why? $x_{n+1}=x_n-\frac{f\left(x_n\right)}{f^{\prime}\left(x_n\right)}=x_n-\frac{x_n^{\frac{1}{3}}}{\frac{1}{3} x_n^{-\frac{2}{3}}}=x_n-3 x_n=-2 x_n$, so each step doubles the distance from the root (see the sketch below).
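A small numerical illustration of this divergence (the real cube root is handled explicitly, since x**(1/3) is complex for negative x in Python):

def cbrt(x):
    # real-valued cube root
    return x ** (1 / 3) if x >= 0 else -((-x) ** (1 / 3))

def cbrt_prime(x):
    # derivative of the real cube root: (1/3) |x|^(-2/3)
    return (1 / 3) * abs(x) ** (-2 / 3)

x = 0.1
for i in range(6):
    print(i, x)                        # 0.1, -0.2, 0.4, -0.8, ... : diverges
    x = x - cbrt(x) / cbrt_prime(x)    # Newton update, equal to -2 * x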

Non-quadratic convergence

  1. If the first derivative is zero at the root, then convergence will not be quadratic. Consider $f(x)=x^2$: then $x-\frac{f(x)}{f^{\prime}(x)}=\frac{x}{2}$, so the error only halves at each step (linear convergence; compare the sketch after this list).
  2. If the second derivative does not exist at the root, then convergence may fail to be quadratic.
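To see the difference numerically, here is a small side-by-side sketch (illustrative, not from the original post):

# Newton on f(x) = x^2 (double root at 0): the error only halves each step
x = 1.0
for i in range(5):
    print(i, x)                    # 1, 0.5, 0.25, 0.125, ... : linear convergence
    x = x - (x**2) / (2 * x)       # simplifies to x / 2

# Newton on f(x) = x^2 - 2 (simple root): correct digits roughly double each step
y = 1.0
for i in range(5):
    print(i, y)                    # 1, 1.5, 1.4167, 1.41422, ... : quadratic convergence
    y = y - (y**2 - 2) / (2 * y)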

Gradient Descent

Compared with Newton's method, gradient descent is a parametric algorithm: we need to choose a learning rate $\gamma$.

$$ \mathbf{a}_{n+1}=\mathbf{a}_n-\gamma \nabla F\left(\mathbf{a}_n\right) $$

Example

Here is an example of finding a root of $x^N-a$. Instead of minimizing the absolute loss, we minimize the squared error $\frac{1}{2}(x^N-a)^2$. The tolerance is set to $10^{-6}$: since the error itself is squared inside the loss, a tighter tolerance is needed to compensate.

def f(N, a, lr, tol=1e-6):
    """Find a root of x^N - a = 0 by gradient descent on the squared error.

    Args:
        N (int): the power of x
        a (float): a constant
        lr (float): learning rate
        tol (float, optional): tolerance on the squared error. Defaults to 1e-6.
    """
    def cost_function(x):
        return 0.5 * (x**N - a)**2

    def f_prime(x):
        # derivative of the cost: (x^N - a) * N * x^(N-1)
        return (x**N - a) * N * x**(N - 1)

    # initial guess of x
    x = 1
    while cost_function(x) > tol:
        x = x - lr * f_prime(x)
        # print(x, cost_function(x))
    return x
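For instance, the square root of 9 can be approximated as below (the learning rate is an ad hoc choice; a value that is too large makes the iteration diverge):

print(f(N=2, a=9, lr=0.01))    # converges to roughly 3.0 from the initial guess x = 1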

Convergence Rate

For a first-order differentiable function $f$ on a set $\mathcal{D}$, we call $f$ $L$-smooth if, for every $x, y \in \mathcal{D}$:

$$ f(y) \leq f(x)+\langle\nabla f(x), y-x\rangle+\frac{L}{2}\|y-x\|_2^2 $$

The right-hand side is a quadratic upper bound. Recall that a convex function also has a linear lower bound: $f(y) \geq f(x)+\langle\nabla f(x), y-x\rangle$.

A (twice differentiable) function over a convex set $\mathcal{D}$ is $L$-smooth if and only if, for every unit vector $v$ and every $x \in \mathcal{D}$:

$$ v^{\top} \nabla^2 f(x) v \leq L $$
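As a standard concrete example (not from the original text): for a quadratic $f(x)=\frac{1}{2} x^{\top} A x$ with symmetric $A$, the Hessian is $A$ itself, so for every unit vector $v$

$$ v^{\top} \nabla^2 f(x) v=v^{\top} A v \leq \lambda_{\max }(A), $$

and $f$ is $L$-smooth with $L=\lambda_{\max}(A)$. In one dimension, $f(x)=x^2$ has $f^{\prime\prime}(x)=2$ everywhere, so it is $2$-smooth.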

Gradient Descent Lemma: when $\eta \leq \frac{1}{L}$,

$$ f\left(x_{t+1}\right) \leq f\left(x_t\right)-\frac{\eta}{2}\left\|\nabla f\left(x_t\right)\right\|_2^2 $$
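The lemma follows directly from the smoothness inequality with $x=x_t$ and $y=x_{t+1}=x_t-\eta \nabla f\left(x_t\right)$:

$$ f\left(x_{t+1}\right) \leq f\left(x_t\right)-\eta\left\|\nabla f\left(x_t\right)\right\|_2^2+\frac{L \eta^2}{2}\left\|\nabla f\left(x_t\right)\right\|_2^2=f\left(x_t\right)-\eta\left(1-\frac{L \eta}{2}\right)\left\|\nabla f\left(x_t\right)\right\|_2^2, $$

and when $\eta \leq \frac{1}{L}$ we have $1-\frac{L \eta}{2} \geq \frac{1}{2}$, which gives the bound.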

Whether or not the function is convex, the Gradient Descent Lemma guarantees that there exists a learning rate $\eta$ under which the gradient converges to zero (not necessarily to the minimum function value, but to a stationary point). The smoother the function (i.e., the smaller $L$), the faster the convergence.

If, in addition, the function is convex, then the zero-gradient point is the global minimum.

But in practice, even though the iteration will eventually converge to the global minimum, a very small gradient yields only a tiny decrease in the function value per step, and the process may take forever. So we set a tolerance to stop the iteration, for example $\left\|\nabla f\left(x_T\right)\right\|_2^2 \leq \varepsilon$.

Some observations:

  • Gradient descent only converges in gradient, not in function value. To reach the minimum, the function additionally needs to be convex.
  • In practice, an $\varepsilon$ is chosen to stop the iteration. BUT, a small gradient does not imply a small function value. Example: $f(x)=\varepsilon^2 x^2$ for $\varepsilon>0$; at $x=\frac{1}{\varepsilon}$, $\|\nabla f(x)\|=2 \varepsilon$ but $f(x)=1$.

Comparison in Machine Learning Optimization Problem

Newton's method is a root-finding algorithm. In a convex optimization setup, we convert the problem to root finding: find $x$ such that $f^{\prime}(x)=0$. This requires $f(x)$ to be twice differentiable, which is often tricky in practice.
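As a sketch of what that looks like in one dimension (an illustrative example, assuming the first and second derivatives are available in closed form):

def newton_minimize(f_prime, f_double_prime, x0, tol=1e-10, max_iter=100):
    """Minimize a (locally convex) function by finding a root of its derivative."""
    x = x0
    for _ in range(max_iter):
        g = f_prime(x)
        if abs(g) < tol:                  # stationary point reached
            return x
        x = x - g / f_double_prime(x)     # Newton step applied to f'
    return x

# Example: minimize f(x) = x^4 - 3x^2 + 2 near x0 = 1 (local minimum at sqrt(3/2))
x_star = newton_minimize(lambda x: 4*x**3 - 6*x, lambda x: 12*x**2 - 6, x0=1.0)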

[Figure: Newton's method (red) vs. gradient descent (green), from Wikipedia]

The figure, from Wikipedia, illustrates a red path (Newton's method) and a green path (gradient descent). Newton's method also uses curvature information (via second derivatives).

But the caveats of Newton's method are also clear, which explains why gradient descent is preferred in many cases:

  1. We need the Hessian and its inverse (or at least a linear solve with it); see the sketch below.
  2. Convergence is not guaranteed.
  3. Even if it converges, it might converge to a saddle point.
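To make caveat 1 concrete, here is a minimal numpy sketch contrasting one gradient descent step with one Newton step on a small quadratic (the matrix and vector are made up for illustration; in practice we solve the linear system $Hp=\nabla f$ rather than forming $H^{-1}$ explicitly):

import numpy as np

# f(x) = 0.5 * x^T A x - b^T x, so grad f(x) = A x - b and the Hessian is A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

x = np.zeros(2)
grad = A @ x - b
hess = A

gd_step = x - 0.1 * grad                         # gradient descent: gradient only
newton_step = x - np.linalg.solve(hess, grad)    # Newton: solve H p = grad

For this quadratic, the Newton step lands on the exact minimizer in a single iteration, but it requires solving a $d \times d$ linear system (roughly cubic cost in the dimension), whereas the gradient step needs only the gradient.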