Adam vs Steepest descent: Interactive Visualization

Introduction

Machine learning often involves iteratively minimizing a complex error function by using its current value and gradient to predict a new set of parameters, and repeat the process until the function converges close enough to a minimum. A simple way (Steepest Descent) to predict the next set of parameters is to shift them along the opposite direction to the local slope of the function. However, error functions can exhibit pathological features like local minima or stiff valleys that make this approach impractical or even unusable. This has motivated the adoption of more advanced optimization methods like the popular Adam optimizer.

In this article, our aim is to iteratively minimize the scalar function Equation by tuning its parameter vector Equation. The gradient of the function with respect to its parameters is written: Equation Different optimizers are defined and simple javascript implementations are provided. Their behavior is then compared on different optimization problems through interactive javascript simulations.

Binary Descent

As an introduction, let's start with the simplest approach: Binary Descent. The idea is to look at the slope (gradient) of the function for each parameter: if the gradient is positive, reduce the corresponding parameter by a fixed value Equation, and increase it the same way if the gradient is negative. Each parameter update follows: Equation

Here is a simple javascript implementation of Binary Descent: /* Binary Descent optimizer javascript implementation Code written by Damir Vodenicarevic (https://damip.net) in march 2017. This work is licensed under a Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0/ */ function binary_descent_optimizer(nparams) { var self= this; self.nparams= nparams; self.get_update= function(alpha, g) { // alpha is the learning rate, g is the gradient var updates= []; for(var i= 0 ; i < self.nparams ; ++i) { updates[i]= alpha * (g[i] >= 0 ? -1 : +1); // Return parameter updates } return updates; }; }; And how to use this optimizer (and all the other optimizers presented here): /* Binary Descent optimizer javascript example usage to optimize a Sphere function Code written by Damir Vodenicarevic (https://damip.net) in march 2017. This work is licensed under a Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0/ */ var theta= [1,1]; // Initialize parameters var sphere_function= function(theta) { var res= 0; for(var i= 0 ; i < theta.length ; ++i) { res += theta[i]*theta[i]; } return res; } var sphere_function_gradient= function(theta) { var res= []; for(var i= 0 ; i < theta.length ; ++i) { res[i]= 2*theta[i]; } return res; } var optimizer= new binary_descent_optimizer(theta.length); // Choose optimizer here var alpha= 0.05; // Learning rate var nsteps= 100; // Number of optimization steps for(var step= 0; step < nsteps ; ++step) { var update= optimizer.get_update(alpha, sphere_function_gradient(theta)); // get the update from the optimizer for(var i= 0 ; i < theta.length ; ++i) { theta[i] += update[i]; //apply the update } var value= sphere_function(theta); console.log('Current function value: ' + square_error); }

Let's see how Binary Descent behaves on optimizing a simple 2D Sphere function with a global minimum at Equation: EquationIn the simulation below, Equation is a vector in the represented 2D space, the background color shows the value of Equation with dark color representing lower values. The black line represents the trajectory of the Equation coordinates through the optimization process. Click/tap somewhere on the simulation to reset Equation to the corresponding location.

[Javascript loading if enabled...]

The simulation shows that Binary descent solves the sphere function minimization problem. However, because every parameter update can only be Equation, the descent can only go diagonally, which makes it zig-zag when trying to follow axis-aligned slopes. Moreover, since it can only move on a diagonal grid of spacing Equation, it never reaches the global minimum if it is not on the grid. One solution would be to reduce Equation in time (annealing), but even then it could still follow only diagonal descent directions, and can even get stuck.

A better approach would be to allow it to move non-diagonally, by updating each parameter by a value proportional to the amplitude of the gradient for this parameter. This is called Steepest Descent.

Steepest Descent

Steepest descent shifts the parameters "downhill", that is in the direction opposite to the gradient, proportionally to its magnitude and to a constant learning rate factor Equation : Equation The stiffer the slope, the bigger the update amplitude is. And the direction vector in which it updates the parameters vector is actually the direction having the biggest downhill slope.

Here is a simple javascript implementation of Steepest Descent: /* Steepest Descent optimizer javascript implementation Code written by Damir Vodenicarevic (https://damip.net) in march 2017. This work is licensed under a Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0/ */ function steepest_descent_optimizer(nparams) { var self= this; self.nparams= nparams; self.get_update= function(alpha, g) { // alpha is the learning rate, g is the gradient var updates= []; for(var i= 0 ; i < self.nparams ; ++i) { updates[i]= - alpha * g[i]; // Return parameter updates } return updates; }; };

Let's see how Steepest Descent behaves on optimizing a simple 2D Sphere function :

[Javascript loading if enabled...]

Steepest Descent solves this problem pretty efficiently ! But that is because this is an extremely easy problem.

Now let's try with a function that has local minima. For this purpose, I have designed a radial function with a global minimum at Equation and local minima at Equation. I call it the Locmin function: Equation

[Javascript loading if enabled...]

Steepest Descent easily gets stuck into the local minimum circle (not always, sometimes it can just jump over it because of big step sizes). At a local minimum, the gradient is zero, therefore Steepest Gradient stops updating the parameters. Local minima are a real problem for optimization, and no magical solution exists to solve it entirely.

Finally, let's try on the painful Rosenbrock function, which is used to benchmark optimization algorithms and the following variant admits a global minimum at Equation: EquationThis function admits a crescent-shaped valley with stiff walls. The valley itself decreases very slowly towards the global minimum.

[Javascript loading if enabled...]

The two very different slopes (walls ane valley) make it impossible to choose a single good Equation value. Either Equation is too low for descending the valley in a reasonable amount of time, or too big and makes Steepest Descent diverge on the valley walls. Moreover, the shape of the function makes Steepest Descent zig-zag inside the valley, which slows it down further. Let's see how we can improve the situation with a better optimizer.

A more competitive algorithm: Adam

The Adam optimizer adds multiple tricks to avoid the previously mentionned problems. In relies on inertia to overcome local minima. Think about a ball rolling on a road downhill: if the ball is heavy enough, it will less likely get stuck in locally flat or uphill road areas but just cross through them thanks to its inertia. Having inertia is equivalent to predicting the slope at the next point by assuming that it should not differ too much from the recently encountered slopes. This is called first moment prediction. But adam goes further and also predicts the second moment using the square of the slope.

Here is a simple javascript implementation of Adam (Based on the original article: PDF from arXiv): /* Adam optimizer javascript implementation based on the original article: https://arxiv.org/pdf/1412.6980.pdf Code written by Damir Vodenicarevic (https://damip.net) in march 2017. This work is licensed under a Creative Commons Attribution 4.0 International License. http://creativecommons.org/licenses/by/4.0/ */ function adam_optimizer(nparams, beta_1=0.9, beta_2=0.999, epsilon=1e-8) { var self= this; self.nparams= nparams; self.beta_1= beta_1; self.beta_2= beta_2; self.epsilon= epsilon; self.m= []; self.v= []; for(var i= 0 ; i < self.nparams ; ++i) { self.m[i]= 0; // Initialize 1st moment vector self.v[i]= 0; // Initialize 2nd moment vector } self.t= 0; //Initialize timestep self.get_update= function(alpha, g) { // alpha is the learning rate, g is the gradient self.t += 1; var updates= []; for(var i= 0 ; i < self.nparams ; ++i) { self.m[i]= self.beta_1*self.m[i] + (1-self.beta_1)*g[i]; // Update biased first moment estimate self.v[i]= self.beta_2*self.v[i] + (1-self.beta_2)*g[i]*g[i]; // Update biased second raw moment estimate var mub= self.m[i]/(1 - Math.pow(self.beta_1,t)); // Compute bias-corrected first moment estimate var vub= self.v[i]/(1 - Math.pow(self.beta_2,t)); // Compute bias-corrected second raw moment estimate updates[i]= - alpha * mub / ( Math.sqrt(vub) + self.epsilon ); // Return parameter updates } return updates; }; };

To get some intuition, let's see how it behaves on the Sphere function:

[Javascript loading if enabled...]

As expected from a heavy ball thrown in a bowl, it does not simply go to the minimum and stop there, but oscillates a bit before stabilizing at the minimum. This makes it less efficient at solving the simple Sphere function optimization problem, but that's the price to pay to have an otherwise more robust algorithm.

Inertia helps with local minima. Here is an Adam optimization of the Locmin function:

[Javascript loading if enabled...]

Adam does not always reach the global minimum, but it is way more likely to power through local minima than Steepest Descent.

Finally, let's have a look a the Rosenbrock function:

[Javascript loading if enabled...]

Here again, Adam is way more robust for optimizing this hard-to-optimize function.

Conclusion

This article presents only a small subset of the existing optimization functions. Different optimization methods are adapted to different cases. However, the most interesting problems in Machine Learning for example present very tough error functions with many valleys, saddle points, local minima and so on. This is also why Adam has become an important tool for Machine Learning optimization problems.

Thanks to Adrien Vincent for taking the time to improve the color map. Thanks dude.