Reservoir Computing #1

The theory behind Echo State Networks

Introduction to Reservoir Computing (RNN) #1
January 22, 2018 nschaetti

Introduction

Figure: a feed-forward neural network.

Recurrent Neural Networks (RNNs) are networks of artificial neurons with one or several connection cycles. They form a discrete-time system whose state shows a dynamical behaviour, giving them an internal memory that allows them to process sequences of data of arbitrary length. RNNs are therefore dynamical systems, whereas feed-forward networks are closer to plain functions. The first kind of RNN was developed in the early 80s: every neuron is connected to every other unit of the network. Some neurons are designated as inputs, others as outputs, and the rest as hidden units. All other RNNs are special cases of this architecture.

Hopfield networks, developed by John Hopfield1 in 1982, have symmetric connections and are used not to process sequences of patterns but independent inputs. They are trained with Hebbian learning and their convergence is guaranteed. Elman networks2 are another kind of RNN, structured in three layers like classical feed-forward neural networks, but with additional context units connected to these layers which maintain a copy of the previous layer values. These networks can solve some time-series prediction tasks that are out of reach of classical feed-forward networks.

Figure: a recurrent neural network.

The classical methods to train RNNs are derived from the gradient descent algorithm. The unit weights are adapted in an iterative process according to the gradient \frac{dE}{dW} so as to minimise an error function E(\hat{y},y). A variant often used to train RNNs is the BackPropagation Through Time algorithm (BPTT)3, which can be used to train Elman networks. It consists of unfolding the network through time steps in order to apply classical back-propagation. This method has a time complexity of O(N_x^2) for each weight update for one output. The Real-Time Recurrent Learning algorithm (RTRL)4 is another method of this kind, where the gradient estimate is updated iteratively going forward in time; its time complexity is O(N_x^4).

A new approach was proposed in 5, the Atiya-Parlos Recurrent Learning algorithm (APRL). It converges faster than the preceding methods by updating the network's parameters according to an estimate of the gradient \frac{dE}{dx} with respect to the neuron activations, so as to push them in the right direction. A review of gradient descent methods to train RNNs can be found in [8]. RNNs remain promising and fascinating, because they can approximate any dynamical system and, having cycles, are similar to biological networks.

Reservoir Computing

Classical RNNs are difficult to train with methods based on gradient descent for several reasons:

  1. A high time complexity that slows down the training process;
  2. Instability and bifurcations;
  3. Vanishing gradients through time and topology;
  4. Local minima.

It is in this context of slow and painful progress that a new approach, later called Reservoir Computing, was discovered by researchers in machine learning under the name Echo State Network6 and by neuroscientists as the Liquid State Machine7. The concept of the ESN was introduced in 2001, where a first definition was given. The idea comes from the observation that training only the output layer of a randomly constructed RNN is often enough to get excellent performances in practice. Training an ESN is thus easier, not only because it focuses on the output layer alone, but also because it reduces to solving a system of linear equations. The key concept is to separate the part where the computation is done from the output layer where the learning is done.

The Reservoir Computing paradigm was discovered thanks to the observation that, in Atiya-Parlos Recurrent Learning, the output weights \stackrel{out}{W} change much more rapidly than the ones in the hidden layer. It is from this observation that the BackPropagation-DeCorrelation (BPDC) algorithm, which consists of applying APRL only to the output weights, was introduced. The advantage is a shorter training time, with, however, a bias toward the recent input history and a fast forgetting of older input data.

The paper “Reservoir computing approaches to recurrent neural network training”8 gives the following summary of Reservoir Computing history, from the difficulties of training RNNs to the full ESN model:

  1. 1990-2000. Gradient-descent-based methods such as BackPropagation Through Time (BPTT) are used to train RNNs, but with a slow and costly training phase, and are limited to small networks;
  2. 2000s. The Atiya-Parlos Recurrent Learning (APRL) algorithm significantly accelerates the training phase by using different update dynamics for the hidden layer and the output layer;
  3. 2001. Applying simple linear methods to a recurrent layer allows a fast and powerful training phase and leads to the Echo State Network (ESN)9;
  4. 2002. Neuroscientists discover recurrent structures linked to output layers in the micro-columns of the cortex, leading to the Liquid State Machine (LSM);
  5. 2004. The BackPropagation-DeCorrelation (BPDC) algorithm, derived from the preceding observation, allows a faster training phase by applying APRL to the output weights only;
  6. 2007. The field is united under the name Reservoir Computing.

ESNs have been applied to a large range of scientific domains, from astrophysics to robotic motor control and interaction 10, 11, 12, 13, and to time-series forecasting and classification in finance and weather forecasting 14, 15, 16, 17. Finally, they have been applied to image classification on the MNIST dataset 18, 19 with stacked architectures inspired by deep learning 20.

Formalism

A temporal task consists of learning a function f from a discrete input signal u_t with t = 1, \dots, T, where f is defined by:

(1)   \begin{equation*} y_t = f(\dots, u_{t-1}, u_t) \end{equation*}

The goal is always to minimise E(\hat{y},y). The temporality lies in the functional dependency between observations and in the importance of the input history. We can add noise v to the function f,

(2)   \begin{equation*} y_t = f(\dots,u_{t-1},u_t) + v_t \end{equation*}

The presence of noise limits the precision of the learning. Most tasks in machine learning (ML), temporal or not, cannot be effectively solved by a simple linear relation like the following, as the error E(\hat{y},y) stays large for all parameter values.

(3)   \begin{equation*} y_t = W u_t + w_0\hspace*{30pt}W \in \mathbb{R}^{N_y \times N_u} \end{equation*}

The main idea is to expand the input signal u_t non-linearly into a vector of higher dimension x_t \in \mathbb{R}^{N_x} and to feed it to a linear method. This is the idea behind several machine learning methods such as Support Vector Machines, Radial Basis Functions, Slow Feature Analysis and feed-forward neural networks, where the hidden layer is a higher-dimensional expansion of the inputs. This expansion, related to kernel methods, is traditionally written as a function K(u_t),

(4)   \begin{equation*} x_t = K(u_t) \end{equation*}
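As an illustration of this expansion-plus-linear-readout idea, here is a minimal sketch where K is a fixed random tanh feature map, one possible (memoryless) choice among many; the sizes, weight scales and toy target are assumptions, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
N_u, N_x, T = 1, 100, 500            # input dim, expansion dim, nb of samples

U = rng.uniform(-1, 1, (N_u, T))     # input signal u_t, one column per step
Y = np.sin(3 * U) + 0.1 * rng.normal(size=(1, T))  # toy (noisy) target y_t

W_k = rng.normal(size=(N_x, N_u))    # random expansion weights (assumption)
X = np.tanh(W_k @ U)                 # x_t = K(u_t), equation (4)

# Linear readout fitted by least squares on the expanded signal
W_out = Y @ np.linalg.pinv(X)
print("training MSE:", np.mean((W_out @ X - Y) ** 2))
```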

The choice of the expansion function is done manually, by trial and error, and requires some experience. Since the function to learn depends on the history of the input signal u_t, K must be a memory function which returns an expansion of the input signal and of its potentially infinite history,

(5)   \begin{equation*} x_t = K(\dots, u_{t-1}, u_t) \end{equation*}

K having a potentially infinite number of parameters, we then define it recursively,

(6)   \begin{equation*} x_t = K(x_{t-1}, u_t) \end{equation*}

The vector x_t can be seen as a spatial transformation of the history of the input signal u_t. We can then use x_t with a linear method to get an approximation of y_t:

(7)   \begin{equation*} \hat{y}_t = \stackrel{out}{W}x_t = \stackrel{out}{W}K(x_{t-1}, u_t) \end{equation*}

A commonly used method, especially in signal generation tasks, is to expand the input signal together with the last output of the target function. The model then becomes,

(8)   \begin{equation*} x_t = K(x_{t-1},u_t, y_{t-1})\hspace*{20pt}\hat{y}_t = \hspace*{4pt}\stackrel{out}{W}x_t =\hspace*{4pt}\stackrel{out}{W}K(x_{t-1},u_t,y_{t-1}) \end{equation*}

The goal of the training phase is to find the weights of the output matrix \stackrel{out}{W} which minimise the error E(\hat{y},y). It is interesting to observe that the function K of equations 6 and 7 defines a non-autonomous dynamical system driven by the input signal u, which can be implemented by any system having these characteristics, be it a natural neural network, an artificial one or a physical system.

Definition

The main kind of network used in this section comes directly from the preceding equations; the high-dimensional non-linear vector x_t is defined by

(9)   \begin{equation*} x_t = g(\stackrel{in}{W}u_t + Wx_{t-1}),\hspace*{15pt}t = 1,...,T \end{equation*}

where x_t \in \mathbb{R}^{N_x}, with N_x the number of neurons in the reservoir, is its activation vector at time t. The matrix \stackrel{in}{W} \in \mathbb{R}^{N_x \times N_u}, with N_u the dimension of the input signal, contains the weights applied to the inputs u_t, and W \in \mathbb{R}^{N_x \times N_x} is the matrix of internal weights. We usually start with a null state x_0 = 0 as the initial vector. The concept of the Echo State Network was introduced in 2001 by Herbert Jaeger in 21, where the first definition is given:

An Echo State Network (ESN) is a recurrent, randomly generated network of artificial sigmoid neurons, driven by a one- or multi-dimensional temporal signal, whose neuron activations are combined by a linear method. The reservoir acts like a complex non-linear dynamical filter which transforms the input signals using a high-dimensional temporal map22.

This idea comes from the observation that training only the output layer of a randomly generated RNN with specific characteristics is sometimes enough to get very good performances in practice. The main idea consists of using a reservoir of neurons as a temporal non-linear expansion function with a memory-less output to approximate y_t; the expansion function acts as a transformation of the input space into a feature space. The Reservoir Computing method therefore contains:

  1. x_t : the expansion vector of the input signal and of its history \dots, u_{t-1}, u_t;
  2. \hat{y}_t : a combination of the neuron activations x_t to approximate the target signal y_t.

A. Reservoir, where green arrows represent the links between the outputs and the reservoir, and red ones the links between the reservoir and the outputs. The dashed lines represent optional direct connections between the inputs and the outputs. B. Links between the different components of the ESN.

The main idea behind Reservoir Computing can be expressed as follows: separate the part doing the computation from the output layer where the training is done. The goal of the expansion vector x_t is to contain a rich set of features that makes the training possible. However, good features for one task are not necessarily good for another. The famous no-free-lunch (NFL) principle in machine learning, which expresses the fact that an algorithm with good performance on one set of tasks will have degraded performance on another set, applies here. The network outputs \hat{y} are then defined from equation 7, this time using the state vector,

(10)   \begin{equation*} \hat{y}_t = g(\stackrel{out}{W}x_t) \end{equation*}

where \stackrel{out}{W} \in \mathbb{R}^{N_y \times N_x}, with N_y the number of outputs, is the matrix of output weights between the reservoir and the outputs. We can then make two observations. First, the separation between the reservoir and the output layer where the training is done allows additional outputs to be added while avoiding catastrophic interference, where a new training invalidates the preceding ones. This separation of the computing device and the outputs also allows a true parallelism, where the transformation of the input and of its history is done only once, while more information is extracted from these states by multiplying the outputs. Secondly, the reservoir has the interesting characteristic of being at the same time a computing unit and a working memory. This gives ESNs a biological plausibility, confirmed by the independent discovery of Liquid State Machines in the field of neuroscience.
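To make this concrete, here is a minimal sketch of equations 9 and 10 with NumPy. The reservoir size, the weight scaling and the choice of tanh for g are illustrative assumptions, and the readout matrix is left random here since its training is only covered later.

```python
import numpy as np

rng = np.random.default_rng(42)
N_u, N_x, N_y, T = 1, 200, 1, 1000          # illustrative sizes

W_in = rng.uniform(-0.5, 0.5, (N_x, N_u))   # input weights  (W^in)
W = rng.uniform(-0.5, 0.5, (N_x, N_x))      # internal weights (W)
W_out = rng.uniform(-0.5, 0.5, (N_y, N_x))  # output weights (untrained here)

u = rng.uniform(-1, 1, (N_u, T))            # input signal u_t

x = np.zeros(N_x)                           # null initial state x_0 = 0
states, outputs = [], []
for t in range(T):
    # Equation (9): x_t = g(W^in u_t + W x_{t-1}), with g = tanh
    x = np.tanh(W_in @ u[:, t] + W @ x)
    states.append(x.copy())
    # Equation (10): y_hat_t = g(W^out x_t); an identity g is also common
    outputs.append(np.tanh(W_out @ x))

X = np.column_stack(states)                 # N_x x T matrix of collected states
```

Because the states X are collected once, several readouts can later be fitted on the same matrix, which is the parallelism mentioned above.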

Extended model

We can now take up equation 9 again and extend the model by integrating connections between the outputs and the reservoir,

(11)   \begin{equation*} x_t = g(\stackrel{in}{W}u_t + Wx_{t-1} + \stackrel{ofb}{W}y_{t-1}),\hspace*{15pt}t = 1,...,T \end{equation*}

where \stackrel{ofb}{W} \in \mathbb{R}^{N_x \times N_y} is the optional matrix of feedback weights between the outputs and the reservoir. Feedback connections are used in signal generation tasks, where a reservoir is trained to generate data, for example a sinusoidal signal, figures or robot arm movements. A classical benchmark in this field is the figure-eight task, where the system learns to generate a figure of eight according to a series of parameters while remaining stable.

In this model, the training uses the target signal y to drive the reservoir for a defined period, named teacher forcing, and then lets the system generate the future signal during a period named free-run. As for prediction and classification, the performance can be evaluated with an error function E(\hat{y},y) and by its stability through time. Finally, as is classical for neural networks, we can add a bias to the units of equation 11 to get the complete ESN model:

(12)   \begin{equation*} x_t = g(\stackrel{in}{W}u_t + Wx_{t-1} + \stackrel{ofb}{W}y_{t-1} + \stackrel{bias}{W}),\hspace*{15pt}t = 1,...,T \end{equation*}

Regarding the output layer, optional connections can exist directly between the inputs and the outputs; the outputs are then defined as

(13)   \begin{equation*} \hat{y}_t = g(\stackrel{out}{W}[x_t;u_t]) \end{equation*}

The output weight matrix is then \stackrel{out}{W} \in \mathbb{R}^{N_y \times (N_x + N_u)}, where the operation [\cdot;\cdot] denotes the vertical concatenation of two vectors.
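Below is one possible sketch of the extended model (equations 12 and 13), including output feedback, the bias term and the concatenated readout, run first in teacher-forced mode and then in free-run. The weight values, the toy target and the schedule are illustrative assumptions; in a real task \stackrel{out}{W} would be trained on the teacher-forced period rather than left random.

```python
import numpy as np

rng = np.random.default_rng(0)
N_u, N_x, N_y, T_teach, T_free = 1, 200, 1, 300, 100

W_in   = rng.uniform(-0.5, 0.5, (N_x, N_u))
W      = rng.uniform(-0.5, 0.5, (N_x, N_x))
W_ofb  = rng.uniform(-0.5, 0.5, (N_x, N_y))        # feedback weights (W^ofb)
W_bias = rng.uniform(-0.5, 0.5, N_x)               # bias term (W^bias)
W_out  = rng.uniform(-0.5, 0.5, (N_y, N_x + N_u))  # readout on [x; u] (untrained)

u = rng.uniform(-1, 1, (N_u, T_teach + T_free))
y_target = np.sin(0.1 * np.arange(T_teach + T_free)).reshape(N_y, -1)  # toy target

x = np.zeros(N_x)
y = np.zeros(N_y)
for t in range(T_teach + T_free):
    # Teacher forcing: feed back the target y_{t-1} during the first phase,
    # then the network's own output during free-run.
    fb = y_target[:, t - 1] if (0 < t < T_teach) else y
    # Equation (12): x_t = g(W^in u_t + W x_{t-1} + W^ofb y_{t-1} + W^bias)
    x = np.tanh(W_in @ u[:, t] + W @ x + W_ofb @ fb + W_bias)
    # Equation (13): y_hat_t = g(W^out [x_t; u_t])
    y = np.tanh(W_out @ np.concatenate([x, u[:, t]]))
```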

Leaky-Integrator ESNs (Li-ESNs)

The idea behind Li-ESNs is to adapt the network's dynamics to the temporal characteristics of the task to be learned. Formally, the dynamics of a Li-ESN in continuous time is defined by Jaeger et al. (2007) as,

(14)   \begin{equation*} \dot{x} = \frac{1}{c}(-ax + f(\stackrel{in}{W}u + Wx + \stackrel{ofb}{W}y)) \end{equation*}

(15)   \begin{equation*} y = g(\stackrel{out}{W}[x;u]) \end{equation*}

where c > 0 is a time constant common to the whole network, a > 0 is the leaky rate of the neurons, and f is the sigmoid function (\tanh(\cdot)). When discretised with a step \delta and a sampled input signal u_{t\delta}, the system is defined by

(16)   \begin{equation*} x_{t+1} = (1 - \frac{a\delta}{c})x_t + \frac{\delta}{c}f(\stackrel{in}{W}u_{(t+1)\delta} + Wx_t + \stackrel{ofb}{W}y_t) \end{equation*}

(17)   \begin{equation*} y_t = g(\stackrel{out}{W}[x_t;u_{t\delta}]) \end{equation*}

This new definition allows the network's dynamics to be adapted to those of the task to be learned by modifying the step \delta. An alternative consists of sub- or over-sampling the input signal; the choice depends on the available computing resources and on the possible loss of information resulting from the resampling. If \delta = c, we get:

(18)   \begin{equation*} x_{t+1} = (1 - a)x_t + af(\stackrel{in}{W}u_{t+1} + Wx_t + \stackrel{ofb}{W}y_t) \end{equation*}

(19)   \begin{equation*} y_t = g(\stackrel{out}{W}[x_t;u_{t}]) \end{equation*}

where a is the leaky rate. A Li-ESN with a = 1 reduces to a simple ESN, which is therefore a particular case of the Li-ESN.
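As a sketch, the discretised leaky-integrator update of equation 18 can be written as follows, assuming tanh activations and omitting the feedback term for brevity; the sizes and the leaky rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N_u, N_x, a = 1, 200, 0.3            # a is the leaky rate, 0 < a <= 1

W_in = rng.uniform(-0.5, 0.5, (N_x, N_u))
W = rng.uniform(-0.5, 0.5, (N_x, N_x))

u = rng.uniform(-1, 1, (N_u, 500))

x = np.zeros(N_x)
for t in range(u.shape[1]):
    # Equation (18) without feedback: x_{t+1} = (1 - a) x_t + a f(W^in u_{t+1} + W x_t)
    x = (1 - a) * x + a * np.tanh(W_in @ u[:, t] + W @ x)
# With a = 1 this reduces to the plain ESN update of equation (9).
```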

Reservoir construction

The original Reservoir Computing method, introduced with ESNs in “The “echo state” approach to analysing and training recurrent neural networks-with an erratum note”23, is the following (a code sketch is given after the list):

  1. Generate a random reservoir (\stackrel{in}{W}, W);
  2. Run the reservoir on the input signal u_t and collect the resulting states;
  3. Compute the output weight matrix \stackrel{out}{W} from the states of step 2 using a linear method, minimising the mean squared error E(\hat{y},y) between the output signal and the target signal;
  4. Use the trained network on a new signal u_t by computing the output \hat{y}_t with the output weights found in step 3.
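Here is a compact sketch of these four steps on a toy next-value prediction task. The hyper-parameters (reservoir size, spectral radius, ridge factor) and the short washout period used to discard the initial transient are illustrative assumptions, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_u, N_x, N_y, T, washout = 1, 300, 1, 2000, 100

# 1. Generate a random reservoir (W^in, W), rescaled to a spectral radius < 1
W_in = rng.uniform(-1, 1, (N_x, N_u))
W = rng.uniform(-0.5, 0.5, (N_x, N_x))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

# Toy task: predict u_{t+1} from u_t on a sine wave (illustrative)
s = np.sin(0.2 * np.arange(T + 1))
U, Y = s[:-1].reshape(N_u, T), s[1:].reshape(N_y, T)

# 2. Run the reservoir on the input and collect the states
x, states = np.zeros(N_x), []
for t in range(T):
    x = np.tanh(W_in @ U[:, t] + W @ x)
    states.append(x.copy())
X = np.column_stack(states)[:, washout:]    # discard the initial transient
Y_t = Y[:, washout:]

# 3. Compute W^out with a linear method (here ridge regression, see below)
lam = 1e-6
W_out = Y_t @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(N_x))

# 4. Use the trained readout: y_hat_t = W^out x_t
y_hat = W_out @ X
print("train MSE:", np.mean((y_hat - Y_t) ** 2))
```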

Many parameters have to be taken into consideration while creating the reservoir. In particular, it is necessary to ensure the presence of the Echo State Property, which guarantees that the effects of the inputs vanish slowly through time and are not amplified.

Dynamics and reservoir’s properties

A reservoir is a dynamical system and can therefore display chaotic behaviours; in this case the input signal loses all influence compared to the initial conditions, leading to a drop in performance.

The Echo State Property24 states that the effect of a state x_t and of its preceding inputs u_t on a future state x_{t+k} must vanish gradually through time (k \rightarrow \infty) and must not be amplified.

It is then necessary to establish conditions for this property so as to guarantee its presence during the reservoir construction phase. This point deserves particular attention, because the absence of the property results in sub-optimal performances. The first measure introduced to establish such a set of conditions is the spectral radius.

The spectral radius of a matrix W, denoted \rho(W), is its largest absolute eigenvalue.

In most publications, a spectral radius below one (\rho(W) < 1) is considered necessary and sufficient for the Echo State Property. However, \rho(W) < 1 is only necessary if the network's inputs are null (u_t = 0), and therefore applies only to autonomous systems. This error led to the wide-spread but misguided practice of rescaling the hidden layer weights so that the spectral radius of W is below 1. This practice can lead to under-exploiting the ESN's performance since, in practice, the condition \rho(W) < 1 does not always ensure the Echo State Property, and \rho(W) > 1 does not destroy the input signal if it is sufficiently strong. \rho(W) < 1 is thus neither a sufficient nor a necessary condition.

Furthermore, in “Re-visiting the echo state property”25, the authors show that a spectral radius below one can lead to a set of bifurcations where transitions display an oscillatory behaviour, and they give examples of systems with a spectral radius above 1, driven by a non-null input signal, that have the Echo State Property. They introduce a new sufficient condition for the Echo State Property, valid for all input signals u_t, and give a method to apply during the reservoir construction phase:

  1. Generate a random matrix W with non-negative values, w_{ij} \geq 0;
  2. Multiply W by a factor such that \rho(W) < 1;
  3. Change the signs of a desired number of weights of W to obtain negative weights as well.

In practice, the condition \rho(W) < 1 is used most of the time, even though it does not guarantee the Echo State Property, because it is less strict and because the reservoirs it produces have the property most of the time. Both the common rescaling practice and the construction above are sketched below.
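The sketch below shows how the spectral radius can be computed and controlled: first the common rescaling practice, then the non-negative construction with sign flips described above. The target radius and the weight ranges are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_x = 300

def spectral_radius(M):
    # rho(W): the largest absolute eigenvalue
    return np.max(np.abs(np.linalg.eigvals(M)))

# Common practice: generate W and rescale it to a target spectral radius < 1
W = rng.uniform(-0.5, 0.5, (N_x, N_x))
W *= 0.95 / spectral_radius(W)

# Construction of Yildiz et al.: (1) non-negative weights, (2) scale so that
# rho(W) < 1, (3) flip the sign of some weights afterwards
W2 = rng.uniform(0.0, 1.0, (N_x, N_x))            # w_ij >= 0
W2 *= 0.95 / spectral_radius(W2)
signs = np.where(rng.random((N_x, N_x)) < 0.5, -1.0, 1.0)
W2 *= signs                                        # sign flips keep the sufficient condition
```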

Training

Linear regression

The training consists of solving a system of linear equations so as to minimise the error function E(Y,\stackrel{out}{W}X). The matrix \stackrel{out}{W} can be found by linear regression. The training can be defined by the following equation,

(20)   \begin{equation*} \stackrel{out}{W}X = Y \end{equation*}

where \stackrel{out}{W} \in \mathbb{R}^{N_y \times N_x} is the output weight matrix, X \in \mathbb{R}^{N_x \times T} is the matrix containing the reservoir states collected during the training phase, and Y \in \mathbb{R}^{N_y \times T} is the matrix containing the target outputs through time, with N_x < T. In normal form, we get

(21)   \begin{equation*} \stackrel{out}{W}XX^T = YX^T \end{equation*}

A naive solution to find \stackrel{out}{W} is then

(22)   \begin{equation*} \stackrel{out}{W} = (YX^T)(XX^T)^{-1} \end{equation*}

In “Reservoir computing approaches to recurrent neural network training”26, the authors note that the memory required by equation 22 does not depend on the training set size, since the matrices XX^T and YX^T keep a fixed size and can be accumulated incrementally while running the reservoir.
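A sketch of this normal-equation solution, assuming the state matrix X (N_x x T) and the target matrix Y (N_y x T) have already been collected; np.linalg.solve is used rather than an explicit matrix inverse for numerical reasons.

```python
import numpy as np

def train_readout_normal_eq(X, Y):
    """W_out = (Y X^T)(X X^T)^{-1}, equation (22)."""
    # Solve (X X^T) W_out^T = X Y^T instead of inverting X X^T explicitly
    return np.linalg.solve(X @ X.T, X @ Y.T).T

# Example with random placeholders for X and Y (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))   # N_x x T reservoir states
Y = rng.normal(size=(1, 1000))     # N_y x T targets
W_out = train_readout_normal_eq(X, Y)
print(W_out.shape)                 # (N_y, N_x)
```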

Moore-Penrose pseudo-inverse

Another direct method to solve equation 20 is to use the Moore-Penrose pseudo-inverse; \stackrel{out}{W} is then defined as

(23)   \begin{equation*} \stackrel{out}{W} = YX^+ \end{equation*}

This approach has a better numerical stability than equation 22 but is more memory-hungry for large X matrices [20], which puts constraints on the reservoir size and on the training set size. This better numerical stability also allows the pseudo-inverse to be used in equation 22 to improve its stability, which becomes

(24)   \begin{equation*} \stackrel{out}{W} = YX^T(XX^T)^+ \end{equation*}
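The same readout obtained with the Moore-Penrose pseudo-inverse, as in equations 23 and 24; NumPy's np.linalg.pinv is one way to compute it, and the X and Y matrices below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))   # N_x x T reservoir states (placeholder)
Y = rng.normal(size=(1, 1000))     # N_y x T targets (placeholder)

# Equation (23): W_out = Y X^+  (memory-hungry for large X, but more stable)
W_out_23 = Y @ np.linalg.pinv(X)

# Equation (24): pseudo-inverse applied to the correlation matrix X X^T
W_out_24 = Y @ X.T @ np.linalg.pinv(X @ X.T)
```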

Ridge Regression

Ridge Regression (RR) is often used to train ESNs. This method consists of adding an additional cost to the optimisation function in order to minimise the amplitude of the output weight matrix \stackrel{out}{W}, reducing its sensitivity to noise and to over-fitting. These properties make Ridge Regression a method of choice for training ESNs. Starting from equation 22, we introduce the regularisation parameter \lambda,

(25)   \begin{equation*} \stackrel{out}{W} = YX^T(XX^T + \lambda I)^{-1} \end{equation*}

where I \in \mathbb{R}^{N_x \times N_x} is the identity matrix and \lambda is the regularisation factor. The parameter \lambda has no absolute meaning and should be optimised for each reservoir, as its effect depends on the correlation matrix XX^T.
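A sketch of equation 25; the value of \lambda is an illustrative assumption and should be tuned for each reservoir, for example by validation on held-out data.

```python
import numpy as np

def ridge_readout(X, Y, lam=1e-6):
    """W_out = Y X^T (X X^T + lambda I)^{-1}, equation (25)."""
    N_x = X.shape[0]
    A = X @ X.T + lam * np.eye(N_x)
    # Solve A W_out^T = X Y^T (A is symmetric), avoiding an explicit inverse
    return np.linalg.solve(A, X @ Y.T).T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))   # placeholder reservoir states
Y = rng.normal(size=(1, 1000))     # placeholder targets
W_out = ridge_readout(X, Y, lam=1e-3)
```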

ESN with Python

In the next article, you will learn how to use a Python module named Oger to predict and classify time series.

References

  1. Neural networks and physical systems with emergent collective computational abilities, Hopfield, John J, Proceedings of the national academy of sciences
  2. Finding structure in time, Elman, Jeffrey L, Cognitive science
  3. Backpropagation through time: what it does and how to do it, Werbos, Paul J, Proceedings of the IEEE
  4. A learning algorithm for continually running fully recurrent neural networks, Williams, Ronald J and Zipser, David, Neural computation
  5. New results on recurrent network training: unifying the algorithms and accelerating convergence, Atiya, Amir F and Parlos, Alexander G, IEEE Transactions on Neural Networks
  6. Reservoir computing approaches to recurrent neural network training, Lukoševičius, Mantas and Jaeger, Herbert, Computer Science Review
  7. Real-time computing without stable states: A new framework for neural computation based on perturbations, Maass, Wolfgang and Natschläger, Thomas and Markram, Henry, Neural computation
  8. Reservoir computing approaches to recurrent neural network training, Lukoševičius, Mantas and Jaeger, Herbert, Computer Science Review
  9. The “echo state” approach to analysing and training recurrent neural networks-with an erratum note, Jaeger, Herbert
  10. Timing-based control via echo state network for soft robotic arm, Kuwabara, Junichi and Nakajima, Kohei and Kang, Rongjie and Branson, David T and Guglielmino, Emanuele and Caldwell, Darwin G and Pfeifer, Rolf
  11. Echo state networks for mobile robot modeling and control, Plöger, Paul G and Arghir, Adriana and Günther, Tobias and Hosseiny, Ramin
  12. Event detection and localization in mobile robot navigation using reservoir computing, Antonelo, Eric A and Schrauwen, Benjamin and Dutoit, Xavier and Stroobandt, Dirk and Nuttin, Marnix, Artificial Neural Networks–ICANN 2007
  13. Mobile robot control in the road sign problem using reservoir computing networks, Antonelo, Eric and Schrauwen, Benjamin and Stroobandt, Dirk, IEEE International Conference on Robotics and Automation (ICRA 2008)
  14. A comparative study of reservoir computing strategies for monthly time series prediction, Wyffels, Francis and Schrauwen, Benjamin, Neurocomputing
  15. Reservoir computing approach to Great Lakes water level forecasting, Coulibaly, Paulin, Journal of hydrology
  16. Ferreira, A.A. and Ludermir, T.B. and de Aquino, R.R.B. and Lira, M.M.S. and Neto, O.N., IEEE International Joint Conference on Neural Networks (IJCNN 2008, IEEE World Congress on Computational Intelligence)
  17. Short-term stock price prediction based on echo state networks, Lin, Xiaowei and Yang, Zehong and Song, Yixu, Expert Systems with Applications
  18. Reservoir Computing: Étude théorique et pratique en reconnaissance de chiffres manuscrits, Schaetti, Nils and Couturier, Raphaël and Salomon, Michel, Master's thesis
  19. Echo State Networks-based Reservoir Computing for MNIST Handwritten Digits Recognition, Schaetti, Nils and Salomon, Michel and Couturier, Raphaël, 19th IEEE International Conference on Computational Science and Engineering (CSE 2016)
  20. Real-time reservoir computing network-based systems for detection tasks on visual contents, Jalalvand, Azarakhsh and Van Wallendael, Glenn and Van de Walle, Rik, 7th International Conference on Computational Intelligence, Communication Systems and Networks (CICSyN 2015)
  21. The “echo state” approach to analysing and training recurrent neural networks-with an erratum note, Jaeger, Herbert, Bonn, Germany: German National Research Center for Information Technology GMD Technical Report
  22. The “echo state” approach to analysing and training recurrent neural networks-with an erratum note, Jaeger, Herbert, Bonn, Germany: German National Research Center for Information Technology GMD Technical Report
  23. The “echo state” approach to analysing and training recurrent neural networks-with an erratum note, Jaeger, Herbert, Bonn, Germany: German National Research Center for Information Technology GMD Technical Report
  24. Reservoir computing approaches to recurrent neural network training, Lukoševičius, Mantas and Jaeger, Herbert, Computer Science Review
  25. Re-visiting the echo state property, Yildiz, Izzet B and Jaeger, Herbert and Kiebel, Stefan J, Neural networks
  26. Reservoir computing approaches to recurrent neural network training, Lukoševičius, Mantas and Jaeger, Herbert, Computer Science Review
Nils Schaetti is a doctoral researcher in Switzerland specialised in machine learning and artificial intelligence.
