Introduction to pyTorch #1 : The gradient descent algorithm

January 25, 2018 nschaetti

Introduction to pyTorch

Deep Learning has gone from a mysterious breakthrough field to a well-known and widely applied technology. In recent years (or months), several frameworks, based mainly on Python, were created to simplify Deep Learning and make it available to the general public of software engineers. In this battle to become the future framework of reference, some stand out, such as Theano, Keras and especially Google’s TensorFlow and Facebook’s pyTorch. This article is the first of a series of tutorials on pyTorch that will go from the basic gradient descent algorithm to very advanced concepts and complex models. The goal of this article is to give you a general but useful view of the gradient descent algorithm used in all Deep Learning frameworks.

Do not miss the next two articles :

  1. Introduction to pyTorch #2 : The linear regression;
  2. Introduction to pyTorch #3 : Image classification with CNN;

The gradient descent algorithm

The purpose of the gradient descent algorithm is to find the minimum of a function. As you will see, all the training algorithms in machine learning consist in finding the minimum of a function which represents the difference between what we have (the output of a mathematical model) and what we want (the target output to be learned). We can start with a simple function with two variables and try to find its minimum with an iterative algorithm. The main idea behind GDA is to evaluate the gradient of a function at a certain point, and to move in the direction opposite to the gradient, just like a stone would roll down a slope. More complex algorithms are more efficient, but they are all based on this principle.

We will then start with a simple quadratic equation with two variables. We want to find the global minimum of this equation, global meaning here the smallest value of the function over its entire range. Of course this is not always possible, as some functions are non-convex. In mathematics, a function is called convex if the line segment between any two points on the graph of the function lies above or on the graph. Non-convex functions have local minima, also called relative minima, which are minima within some neighbourhood that need not be global minima. For example, f(x) = x^2 is convex, whereas f(x) = x^4 - 2x^2 + 0.5x has a local minimum near x = 0.9 that is higher than its global minimum near x = -1. You can see below an example of a convex and a non-convex function.

We want here to find the minimum of the following function.

(1)   \begin{equation*} z = \frac{1}{2}x^2 + x + \frac{1}{4}y^2 - 2 \end{equation*}

To find a minimum, the gradient descent algorithm works as follows (a minimal code sketch is given after the list) :

  1. We take an initial point x_0 \in \mathbb{R}^n;
  2. We compute the gradient at the current position x_k : \nabla f(x_k);
  3. We compute the next point : x_{k+1} = x_k - \alpha_k\nabla f(x_k), where \alpha_k is the learning rate;
  4. Repeat 2 and 3 for a specified number of iterations or until a stop condition is fulfilled;
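
To make these steps concrete, here is a minimal sketch of the algorithm in Python. The gradient_descent function and its parameters are ours, for illustration only; we will build the article's complete script step by step below.

# A generic gradient descent sketch (illustrative)
def gradient_descent(grad, x0, learning_rate, iterations):
    """
    Minimise a function given its gradient.
    :param grad: function returning the gradient at a point
    :param x0: initial point
    :param learning_rate: step size alpha
    :param iterations: number of iterations
    :return: final position
    """
    xk = x0
    for k in range(iterations):
        xk = xk - learning_rate * grad(xk)
    # end for
    return xk
# end gradient_descent

For example, gradient_descent(lambda v: v + 1, 1.0, 0.1, 100) converges towards -1, the minimum of 0.5*v**2 + v.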

We start by importing some packages for argument parsing, plots, math and obviously numpy.

import argparse
import matplotlib.pyplot as plt
import math
import numpy as np
from mpl_toolkits.mplot3d import axes3d, Axes3D
from matplotlib import cm

We implement the function we want to minimise.

# Function to minimise
def func(xv, yv):
    """
    Evaluate the function at a point.
    :param xv: x position
    :param yv: y position
    :return: z value at (xv, yv)
    """
    return 0.5 * math.pow(xv, 2) + xv + 0.25 * math.pow(yv, 2) - 2
# end func
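
For this simple function, the minimum can actually be found analytically: setting both partial derivatives (derived below) to zero gives x = -1 and y = 0, where z = -2.5. This gives us a quick sanity check of our implementation; the check itself is our addition, not part of the article's script.

# Sanity check: the analytical minimum is at (-1, 0)
print(func(-1.0, 0.0))  # -2.5, the global minimum
print(func(0.0, 0.0))   # -2.0, any other point is higher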

We add some arguments to our program :

    • The learning rate;
    • The number of iterations;
    • The x position of the starting point;
    • The y position of the starting point;

# Arguments
parser = argparse.ArgumentParser(prog=u"function_gradient")
parser.add_argument(u"--learning-rate", type=float, required=True)
parser.add_argument(u"--iterations", type=int, required=True)
parser.add_argument(u"--x", type=float, default=1.0)
parser.add_argument(u"--y", type=float, default=1.0)
args = parser.parse_args()

# Create two variables
x = args.x
y = args.y
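
Assuming the script is saved as function_gradient.py (the article does not give a file name, so this one is ours), it can then be run from the command line like this:

# Example invocation
python function_gradient.py --learning-rate 0.1 --iterations 10 --x 1.0 --y 1.0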

The learning rate sets the size of the step taken at each iteration: with a learning rate of 0.001, we move in the opposite direction of the gradient, multiplied by this learning rate. If the learning rate is too small, the convergence to the minimum will be too slow; on the contrary, if the learning rate is too big, an iteration can overshoot the minimum point and end up at a point higher than before, leading the algorithm to diverge from the minimum.
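
We can observe this behaviour on the x part of our function alone, whose partial derivative is x + 1 (derived below). The following small experiment is our addition, not part of the article's script; it compares a small and a too-large learning rate:

# Effect of the learning rate on the update x <- x - lr * (x + 1)
for lr in [0.1, 2.5]:
    xv = 1.0
    for k in range(5):
        xv -= lr * (xv + 1.0)
    # end for
    print(lr, xv)  # lr=0.1 moves slowly towards -1, lr=2.5 diverges
# end for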

We create three lists to store the results to display at the end.

# List of positions
x_values = list([x])
y_values = list([y])
z_values = list([func(x, y)])

The main loop will iterate the number of times we specified in our command-line parameters.

# Do the iterations
for i in range(args.iterations):

To compute the gradient, we must find the partial derivative for each variable. If you do not know how to compute a derivative, find a course on the internet, as it is mandatory to understand the underlying concepts of deep learning and machine learning. To compute the partial derivative with respect to a variable, we consider the other variable as a constant. For the x variable, we differentiate the terms containing x and consider y as a constant; we then get the following result.

(2)   \begin{equation*} \frac{\partial z}{\partial x} = x + 1 \end{equation*}

We do the same thing for the y variable and we consider this time the x variable as a constant. We have our second partial derivative.

(3)   \begin{equation*} \frac{\partial z}{\partial y} = \frac{1}{2}y \end{equation*}
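
If you want to double-check these derivatives, a central finite difference around any point should match them closely. This verification is our addition, not part of the article's script:

# Numerical check of the partial derivatives at the point (1, 1)
eps = 1e-6
dx = (func(1.0 + eps, 1.0) - func(1.0 - eps, 1.0)) / (2 * eps)
dy = (func(1.0, 1.0 + eps) - func(1.0, 1.0 - eps)) / (2 * eps)
print(dx, dy)  # close to 2.0 (= 1 + 1) and 0.5 (= 0.5 * 1)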

With this, we can compute the gradient component for each variable in our Python code. These lines go inside the loop body.

    # Compute gradient
    x_grad = x + 1
    y_grad = 0.5 * y

The main idea of the gradient descent algorithm is to update the position in the direction opposite to the gradient. The learning rate is here to modulate the size of the displacement.

    # Update each parameter
    x -= x_grad * args.learning_rate
    y -= y_grad * args.learning_rate

We add the new position to our list and the corresponding z value of the function at this point.

    # Save the new position and value
    x_values.append(x)
    y_values.append(y)
    z_values.append(func(x, y))

That’s all for the iteration loop. When the iterations are finished, we add a 3D subplot to a matplotlib figure.

# Plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

We want to display the function surface and the path followed by the gradient descent algorithm. To create a 3D surface with matplotlib, we need each point on the x, y and z axes. We will set the range as -2.5 to 2.5 with a step of 0.1. We can use Numpy’s arange function for the X positions.

# X position
X = np.zeros((51, 51))
X[:, ] = np.arange(-2.5, 2.6, 0.1)

We have an X matrix of size 51 by 51, with values going from -2.5 to 2.5 along one axis. We now want a Y matrix of the same size with values varying along the other axis.

# Y position
Y = np.zeros((51, 51))
for i in range(51):
    Y[i, :] = X[0, i]
# end for
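
As a side note, numpy can build both grids in a single call with np.meshgrid; the following line produces the same X and Y matrices as the two blocks above:

# Equivalent grid construction with meshgrid
X, Y = np.meshgrid(np.arange(-2.5, 2.6, 0.1), np.arange(-2.5, 2.6, 0.1))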

We then compute the corresponding z value for each position.

# Compute Z
Z = np.zeros((51, 51))
for j in range(51):
    for i in range(51):
        x_pos = X[j, i]
        y_pos = Y[j, i]
        Z[j, i] = func(x_pos, y_pos)
    # end for
# end for
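
Since func relies on math.pow, which only accepts scalars, we loop over every grid cell. The same Z matrix can also be obtained with a single vectorised numpy expression by writing the formula directly:

# Equivalent vectorised computation of Z
Z = 0.5 * X ** 2 + X + 0.25 * Y ** 2 - 2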

We add labels to the figure with axis names.

# Labels
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')

We can now plot the 3D surface with our three matrices X, Y and Z and the plot_surface() function.

# Plot a basic surface.
ax.plot_surface(X, Y, Z, cmap=cm.hot, linewidth=0, antialiased=True, alpha=1.0)

And we use the plot() function with the x, y and z values saved during the iterations.

# Plot the descent path
ax.plot(x_values, y_values, z_values, label='Learning curve', color='lightblue')
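
Finally, we display the resulting figure with show(); we also call legend() so that the label given to plot() is actually shown (the legend call is our addition).

# Display the figure
ax.legend()
plt.show()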

The following plot shows the path followed by the gradient descent algorithm with a learning rate of 0.1 and 10 iterations.

With this learning rate we cannot reach the minimum point in 10 iterations; let’s try with a learning rate of 1.

The algorithm jumps directly near the minimum point, then slowly reaches it. Let’s see what happens with a learning rate of 2.

At the first step, the algorithm goes way too far, misses the minimum point, and ends up higher than the preceding point.

Nils Schaetti is a doctoral researcher in Switzerland specialised in machine learning and artificial intelligence.
