# Introduction

Recurrent Neural Networks (RNNs) are networks of artificial neurons with one or more connection cycles. They form discrete-time systems whose state exhibits a dynamical behaviour, giving them an internal memory that allows them to process sequences of data of arbitrary length. RNNs are thus dynamical systems, whereas feed-forward networks are closer to plain functions. The first kind of RNN was developed in the early 1980s: every neuron is connected to every other unit of the network, with some neurons designated as inputs, others as outputs, and the rest as hidden. All other RNNs are special cases of this architecture.

Hopfield networks, developed by John Hopfield^{1} in 1982, have symmetric connections and are used to process independent inputs rather than sequences of patterns. The network is trained with Hebbian learning and its convergence is guaranteed. The Elman network^{2} is another kind of RNN, structured in three layers like a classical feed-forward neural network, but with additional context units that maintain a copy of the previous hidden-layer values. These networks can solve time-series prediction tasks that are out of reach of classical feed-forward networks.

The classical method to train RNNs is derived from the classical *gradient descent algorithm*: the weights are adapted iteratively, following the gradient, to minimise an error function. A variant often used to train RNNs is the *BackPropagation Through Time* (BPTT) algorithm^{3}, which can be used to train Elman networks. It consists of unfolding the network through its temporal steps so that classical back-propagation can be applied; for a network of N units, its time complexity is O(N²) per time step. The *Real-Time Recurrent Learning* (RTRL) algorithm^{4} is another example of this kind of method, where the gradient estimate is updated iteratively while moving forward in time, with a time complexity of O(N⁴) per time step.

A new approach was proposed in ^{5}: the *Atiya-Parlos Recurrent Learning* (APRL) algorithm. It converges faster than the preceding methods: it estimates the gradient with respect to the neuron activations and updates the network's parameters to push those activations in the right direction. A review of *gradient descent methods* for training RNNs can be found in ^{8}. RNNs remain promising and fascinating, because they can approximate any dynamical system and, having cycles, resemble biological neural networks.

# Reservoir Computing

Classical RNNs are difficult to train by methods based on gradient descent for several reasons:

- A high temporal complexity slowing down the training process;
- Instability and bifurcations;
- Gradient vanishing through time and topology;
- Local minima.

It is in this context of slow and painful progress that a new approach, later called Reservoir Computing, was discovered independently by machine-learning researchers under the name Echo State Network^{6} and by neuroscientists under the name Liquid State Machine^{7}. The concept of the ESN was introduced in 2001, where a first definition was given. The idea comes from the observation that training only the output layer of a randomly constructed RNN is often enough to obtain excellent performance in practice. Training an ESN is thus easier not only because it is restricted to the output layer, but also because it reduces to solving a system of linear equations. The key concept is to separate the part where the computation is done from the output layer where the learning is done.

The *Reservoir Computing* paradigm was discovered thanks to the observation that, in *Atiya-Parlos Recurrent Learning*, the output weights change much more rapidly than the ones in the hidden layer. It is from this observation that the *BackPropagation-Decorrelation* (BPDC) algorithm, which consists of applying APRL to the output weights only, was introduced. The advantage is a shorter training time, at the cost of a bias toward the recent input history and a rapid forgetting of earlier inputs.

The paper “Reservoir computing approaches to recurrent neural network training”^{8} gives the following summary of the history of *Reservoir Computing*, from the difficulties of training RNNs to the full ESN model:

- **1990-2000.** Gradient-descent methods such as *BackPropagation Through Time* (BPTT) are used to train RNNs, but with a slow and costly training phase, limited to small networks;
- **2000s.** The *Atiya-Parlos Recurrent Learning* (APRL) algorithm significantly accelerates the training phase, using different update dynamics between the hidden layer and the output layer;
- **2001.** Applying simple linear methods to a recurrent layer allows a fast and powerful training phase and leads to the *Echo State Network* (ESN)^{9};
- **2002.** Neuroscientists discover recurrent structures linked to output layers in the micro-columns of the cortex, leading to the *Liquid State Machine* (LSM);
- **2004.** The *BackPropagation-Decorrelation* (BPDC) algorithm, derived from the preceding observation, allows a faster training phase by applying APRL to the output weights only;
- **2007.** The field is united under the name *Reservoir Computing*.

ESNs have been applied to a wide range of scientific domains, from astrophysics to robot motor control and interaction^{10}^{11}^{12}^{13}, and to time-series forecasting and classification in finance and weather forecasting^{14}^{15}^{16}^{17}. Finally, they have been applied to image classification on the MNIST dataset^{18}^{19}, with stacked architectures inspired by deep learning^{20}.

# Formalism

A *temporal task* consists of learning a function $F$ from a discrete input signal $u(t) \in \mathbb{R}^{N_u}$, with $t = 1, \dots, T$, where the target signal $y^{target}$ is defined by:

$$y^{target}(t) = F(u(t), u(t-1), \dots, u(1)) \qquad (1)$$

The goal is always to minimise an error function $E(y, y^{target})$ between the network output $y(t)$ and the target. The temporality lies in the functional dependency between observations and in the importance of the input history. We can add a noise term $\nu(t)$ to the function $F$:

$$y^{target}(t) = F(u(t), u(t-1), \dots, u(1)) + \nu(t) \qquad (2)$$

The presence of noise limits the achievable precision of learning. Most tasks in machine learning (ML), temporal or not, cannot be solved effectively by a simple linear relation, as the error stays large for all choices of the weight matrix $W$:

$$y(t) = W u(t) \qquad (3)$$

The main idea is to expand the input signal non-linearly into a vector of higher dimension and to apply a linear method to it. This is the idea behind several machine-learning methods such as Support Vector Machines, Radial Basis Functions, Slow Feature Analysis and feed-forward neural networks, where the hidden layer is a higher-dimensional expansion of the inputs. Such an expansion, known in kernel methods as a *kernel function* or *expansion function*, is traditionally defined as:

$$x(t) = g(u(t)) \qquad (4)$$
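As a toy illustration of this idea (not from the original text: the task, the sizes and the random tanh-feature expansion are all arbitrary choices), a non-linear expansion followed by the same linear method solves a task a purely linear model cannot:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: approximate y = sin(3u), which a single linear weight fits poorly.
u = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * u)

# Purely linear model y = W u, solved by least squares: the error stays large.
w_lin, *_ = np.linalg.lstsq(u, y, rcond=None)
err_lin = np.mean((u @ w_lin - y) ** 2)

# Random non-linear expansion x = g(u) into a 100-dimensional feature space,
# followed by the same linear method applied to the expanded vector.
W_exp = rng.normal(size=(1, 100))
b = rng.normal(size=100)
x = np.tanh(u @ W_exp + b)
w_out, *_ = np.linalg.lstsq(x, y, rcond=None)
err_exp = np.mean((x @ w_out - y) ** 2)
```

The linear method is unchanged; only the space it operates in is richer.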

The choice of expansion function is usually made manually, by trial and error, and requires some experience. Since the function $F$ to learn depends on the history of the input signal, $g$ must be a memory function, returning an expansion of the input signal and of its potentially infinite history:

$$x(t) = g(u(t), u(t-1), \dots, u(1)) \qquad (5)$$

Since $g$ would then have a potentially infinite number of parameters, we define it recursively:

$$x(t) = g(x(t-1), u(t)) \qquad (6)$$

The vector $x(t)$ can be seen as a spatial transformation of the input signal history. We can then use $x(t)$ and a linear method to obtain an approximation of $y^{target}$:

$$y(t) = W^{out} x(t) \qquad (7)$$

A commonly used method, especially for signal-generation tasks, is to expand both the input signal and the last output of the target function. The model then becomes:

$$x(t) = g(x(t-1), u(t), y^{target}(t-1)) \qquad (8)$$

The goal of the training phase is to find the weights of the output matrix $W^{out}$ which minimise the error $E(y, y^{target})$. It is interesting to observe that the function $g$ of equations 6 and 7 defines a non-autonomous dynamical system driven by the input signal, which can be implemented by any system having these characteristics: a natural neural network, an artificial one, or a physical system.
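A minimal sketch of this recursive expansion (the sizes, weight scales and the tanh non-linearity are illustrative assumptions): perturbing one past input changes the current state, showing that $x(t)$ is indeed a transformation of the input history.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                                    # expansion dimension (illustrative)
W = rng.normal(scale=0.1, size=(N, N))    # recurrent mixing weights
W_in = rng.normal(size=(N, 1))            # input weights

def g(x_prev, u_t):
    # x(t) = g(x(t-1), u(t)): each step folds the current input into
    # the previous state, so x(t) depends on the whole input history.
    return np.tanh(W @ x_prev + W_in @ u_t)

def run(inputs):
    x = np.zeros(N)
    for u_t in inputs:
        x = g(x, u_t)
    return x

u = rng.normal(size=(30, 1))
x_a = run(u)

u_b = u.copy()
u_b[25] += 1.0          # perturb one past input
x_b = run(u_b)          # the final state differs: x(t) remembers the history
```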

# Definition

The main kind of network used in this section comes directly from the preceding equations: the highly non-linear, high-dimensional vector $x(t)$ is defined by

$$x(t+1) = f(W^{in} u(t+1) + W x(t)) \qquad (9)$$

where $x(t) \in \mathbb{R}^{N_x}$, with $N_x$ the number of neurons in the reservoir, is its activation vector at time $t$, and $f$ is the activation function, usually a sigmoid such as $\tanh$. The matrix $W^{in} \in \mathbb{R}^{N_x \times N_u}$, with $N_u$ the dimension of the input signal, holds the weights applied to the inputs $u(t)$, and $W \in \mathbb{R}^{N_x \times N_x}$ is the matrix of internal weights. The initial state vector is usually the null state. The concept of the *Echo State Network* was introduced in 2001 by Herbert Jaeger in ^{21}, where the first definition is given:

> An *Echo State Network* (ESN) is a recurrent, randomly generated network of artificial sigmoid neurons, driven by a uni- or multidimensional temporal signal, whose neuron activations are processed with a linear method. The reservoir acts as a complex non-linear dynamical filter which transforms the input signals through a high-dimensional temporal map^{22}.

This idea comes from the observation that training only the output layer of a randomly generated RNN with specific characteristics is sometimes enough to get very good performance in practice. The main idea consists of using a reservoir of neurons as a temporal non-linear expansion function, with a memory-less output layer, to approximate $y^{target}$; the expansion function acts as a transformation of an *input space* into a *feature space*. The *Reservoir Computing* method therefore comprises:

- $x(t)$: an expansion vector of the input signal and of its history;
- $y(t)$: a linear combination of the neuron activations approximating the target signal $y^{target}(t)$;

The main idea behind *Reservoir Computing* can be expressed as follows: separate the part doing the computation from the output where the training is done. The goal of the expansion vector is to contain a set of features rich enough to make the training possible. However, good features for one task are not necessarily good features for another. The famous machine-learning principle known as the *no-free-lunch* (NFL) theorem, which expresses the fact that an algorithm performing well on one set of tasks will have degraded performance on another set, applies here. The network's outputs are then defined from equation 7, this time using the state vector:

$$y(t) = W^{out} x(t) \qquad (10)$$

where $y(t) \in \mathbb{R}^{N_y}$, with $N_y$ the number of outputs, and $W^{out} \in \mathbb{R}^{N_y \times N_x}$ is the matrix of output weights between the reservoir and the outputs. We can then make two observations. First, the separation between the reservoir and the output where the training is done makes it possible to add outputs while avoiding *catastrophic interference*, where a new training invalidates the preceding ones. This separation of computing device and outputs also allows true parallelism: the transformation of the input and its history is done only once, while more information is extracted from these states by multiplying the outputs. Secondly, the reservoir has the interesting characteristic of being at the same time a computing unit and a working memory. This gives ESNs a biological plausibility, confirmed by the independent discovery of *Liquid State Machines* in the field of neuroscience.
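Equations 9 and 10 can be sketched as follows (the sizes, weight ranges and spectral-radius rescaling are illustrative assumptions, with $f = \tanh$ and a random, untrained readout):

```python
import numpy as np

rng = np.random.default_rng(0)
N_x, N_u, N_y, T = 100, 1, 1, 200         # reservoir, input, output sizes

W_in = rng.uniform(-0.5, 0.5, size=(N_x, N_u))     # input weights W^in
W = rng.normal(size=(N_x, N_x))                    # internal weights W
W *= 0.9 / max(abs(np.linalg.eigvals(W)))          # rescale rho(W) to 0.9

u = rng.uniform(-1, 1, size=(T, N_u))
x = np.zeros(N_x)                                  # null initial state
X = np.zeros((T, N_x))
for t in range(T):
    x = np.tanh(W_in @ u[t] + W @ x)               # eq. (9)
    X[t] = x

W_out = rng.normal(size=(N_y, N_x))                # readout (untrained here)
y = X @ W_out.T                                    # eq. (10): y(t) = W^out x(t)
```

Only `W_out` is trained later; `W_in` and `W` stay fixed after generation.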

# Extended model

We can now take equation 9 again and extend the model by integrating connections between the outputs and the reservoir:

$$x(t+1) = f(W^{in} u(t+1) + W x(t) + W^{fb} y(t)) \qquad (11)$$

where $W^{fb} \in \mathbb{R}^{N_x \times N_y}$ is the optional feedback weight matrix between the outputs and the reservoir. Feedback connections are used in signal-generation tasks, where a reservoir is trained to generate data: for example sinusoidal signals, figures or robot-arm movements. A classical benchmark in this field is the *figure 8* task, where the system learns to generate a figure-eight trajectory according to a series of parameters while staying stable.

In this model, training uses the target signal as input for a defined period, called *teacher forcing*, and then lets the system generate the future signal during a period called *free-run*. As for prediction and classification, the performance can be evaluated with an error function, together with its stability through time. Finally, as is classical for neural networks, we can add a bias $b$ to the units of equation 11 to obtain the complete ESN model:

$$x(t+1) = f(W^{in} u(t+1) + W x(t) + W^{fb} y(t) + b) \qquad (12)$$

Regarding the output layer, optional connections can exist directly between the inputs and the outputs; the outputs are then defined as

$$y(t) = W^{out} \, [x(t); u(t)] \qquad (13)$$

The output weight matrix is then $W^{out} \in \mathbb{R}^{N_y \times (N_x + N_u)}$ and the operation $[\cdot;\cdot]$ is defined as the vertical concatenation of two vectors.
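A sketch of the complete model of equations 12 and 13 (all sizes and weight ranges are illustrative; $f = \tanh$ and a random readout are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
N_x, N_u, N_y, T = 80, 1, 1, 100

W_in = rng.uniform(-0.5, 0.5, size=(N_x, N_u))
W = rng.normal(size=(N_x, N_x))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))
W_fb = rng.uniform(-0.5, 0.5, size=(N_x, N_y))        # feedback weights W^fb
b = rng.uniform(-0.1, 0.1, size=N_x)                  # unit biases
W_out = rng.normal(scale=0.1, size=(N_y, N_x + N_u))  # readout over [x; u]

u = rng.uniform(-1, 1, size=(T, N_u))
x = np.zeros(N_x)
y = np.zeros(N_y)
for t in range(T):
    # eq. (12): state update with feedback from the previous output and a bias
    x = np.tanh(W_in @ u[t] + W @ x + W_fb @ y + b)
    # eq. (13): readout on the vertical concatenation [x(t); u(t)]
    y = W_out @ np.concatenate([x, u[t]])
```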

# Leaky-Integrator ESNs (Li-ESNs)

The idea behind *Li-ESNs* is to adapt the network's dynamics to the temporal characteristics of the task to be learned. Formally, the dynamics of a *Li-ESN* in continuous time is defined by (Jaeger et al., 2007):

$$\dot{x} = \frac{1}{c}\left(-a\,x + f(W^{in} u(t) + W x + W^{fb} y)\right) \qquad (14)$$

$$y(t) = W^{out}\,[x(t); u(t)] \qquad (15)$$

where $c$ is the time constant common to the whole network, $a$ is the *leaky rate* of the neurons, and $f$ is the sigmoid function (here $\tanh$). When discretised with a step $\delta$ and a sampled input signal $u(t\delta)$, the state $x(t)$ is defined by

$$x(t+1) = \left(1 - \frac{a\delta}{c}\right) x(t) + \frac{\delta}{c}\, f(W^{in} u((t+1)\delta) + W x(t) + W^{fb} y(t)) \qquad (16)$$

$$y(t) = W^{out}\,[x(t); u(t\delta)] \qquad (17)$$

This new definition makes it possible to adapt the network's dynamics to those of the task to be learned by modifying the step $\delta$. An alternative consists of sub- or over-sampling the input signal, which trades off computing resources against the possible destruction of information caused by resampling. If $\delta = c$, we get:

$$x(t+1) = (1 - a)\, x(t) + f(W^{in} u(t+1) + W x(t) + W^{fb} y(t)) \qquad (18)$$

$$y(t) = W^{out}\,[x(t); u(t)] \qquad (19)$$

where $a$ is the *leaky rate*. A *Li-ESN* with $a = 1$ is a simple ESN, which is thus a particular case of the *Li-ESN*.
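The discretised update with $\delta = c$ can be sketched as follows (illustrative sizes, $f = \tanh$, no feedback term), including a check that $a = 1$ reduces to the plain ESN update:

```python
import numpy as np

rng = np.random.default_rng(0)
N_x, N_u, T = 60, 1, 100
a = 0.3                                            # leaky rate

W_in = rng.uniform(-0.5, 0.5, size=(N_x, N_u))
W = rng.normal(size=(N_x, N_x))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))

u = rng.uniform(-1, 1, size=(T, N_u))
x = np.zeros(N_x)
for t in range(T):
    # leaky update: x(t+1) = (1 - a) x(t) + f(W^in u(t+1) + W x(t))
    x = (1 - a) * x + np.tanh(W_in @ u[t] + W @ x)

# With a = 1 the leak term vanishes and the update is the plain ESN one.
x_leaky1 = np.zeros(N_x)
x_plain = np.zeros(N_x)
for t in range(T):
    x_leaky1 = (1 - 1.0) * x_leaky1 + np.tanh(W_in @ u[t] + W @ x_leaky1)
    x_plain = np.tanh(W_in @ u[t] + W @ x_plain)
```

Smaller values of `a` slow the state dynamics, matching slower tasks.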

# Reservoir construction

The original *Reservoir Computing* method, introduced with ESNs in “The “echo state” approach to analysing and training recurrent neural networks-with an erratum note”^{23}, is:

1. Generate a random reservoir ($W^{in}$, $W$);
2. Run the reservoir on the input signal $u(t)$ and collect the resulting states $x(t)$;
3. Compute the output weight matrix $W^{out}$ from the states collected in step 2, using a linear method that minimises the mean squared error between the output signal and the target signal;
4. Use the trained network on a new signal by computing the outputs $y(t)$ with the output weights found in step 3.
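The four steps can be sketched end-to-end on a toy task (all sizes, the 0.9 spectral radius, the washout length and the delay-recall task are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
N_x, washout, T = 100, 50, 500

# 1. Generate a random reservoir (W_in, W).
W_in = rng.uniform(-0.5, 0.5, size=(N_x, 1))
W = rng.normal(size=(N_x, N_x))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))

# 2. Run the reservoir on the input and collect the states.
u = rng.uniform(-1, 1, size=(T, 1))
y_target = np.roll(u, 1)          # toy task: recall the previous input
x = np.zeros(N_x)
X = np.zeros((T, N_x))
for t in range(T):
    x = np.tanh(W_in @ u[t] + W @ x)
    X[t] = x

# 3. Train W_out by linear regression on the collected states,
#    discarding an initial washout period.
W_out, *_ = np.linalg.lstsq(X[washout:], y_target[washout:], rcond=None)

# 4. Use the trained readout on the states.
y = X @ W_out
mse = np.mean((y[washout:] - y_target[washout:]) ** 2)
```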

Many parameters must be taken into consideration when creating the reservoir. In particular, it is necessary to ensure the presence of the *Echo State Property*, which guarantees that the influence of the inputs vanishes slowly through time and is not amplified.

# Dynamics and reservoir’s properties

A reservoir is a dynamical system and can therefore display chaotic behaviour; in that case the input signal loses all influence compared to the initial conditions, leading to a drop in performance.

> The *Echo State Property*^{24} states that the effect of preceding states and preceding inputs on future states must vanish gradually through time ($t \to \infty$) and must not be amplified.

It is then necessary to establish conditions for this property, to guarantee its presence during the reservoir construction phase. This point deserves particular attention, because its absence results in sub-optimal performance. The first measure introduced to establish a set of conditions is the *spectral radius*.

> The *spectral radius* of a matrix $W$, written $\rho(W)$, is its largest eigenvalue in absolute value.

In most publications, a *spectral radius* below unity ($\rho(W) < 1$) is considered necessary and sufficient for the *Echo State Property*. However, it is necessary only if the network inputs are null ($u(t) = 0$), and it thus applies only to autonomous systems. This error led to the widespread but mistaken practice of rescaling the hidden-layer weights so that the *spectral radius* of $W$ is below 1. This practice can lead to an under-realisation of an ESN's performance: in practice the condition does not always ensure the *Echo State Property*, and a sufficiently strong input signal is not destroyed by a reservoir with $\rho(W) \geq 1$. The condition $\rho(W) < 1$ is therefore neither sufficient nor necessary.

Furthermore, in “Re-visiting the echo state property”^{25}, the authors show that a *spectral radius* below unity can lead to a set of bifurcations where transitions display an oscillatory behaviour, and they give examples of systems with a *spectral radius* above 1, driven by a non-null input signal, that have the *Echo State Property*. They introduce a new sufficient condition for the *Echo State Property*, valid for every input signal, and give a method to apply during the reservoir construction phase:

- Generate a random matrix $W$ with non-negative values, $w_{ij} \geq 0$;
- Multiply $W$ by a factor such that $\rho(W) < 1$;
- Change the signs of a desired number of weights to also obtain negative weights.

In practice, the condition $\rho(W) < 1$ is used most of the time, even though it does not guarantee the *Echo State Property*: it is less strict, and the reservoirs it produces display the property most of the time.
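The construction recipe above can be sketched as follows (the size and the target radius of 0.9 are arbitrary); the sufficient condition bears on the matrix of absolute values, which the sign flips leave unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100

# 1. Random matrix with non-negative entries.
W = rng.uniform(0.0, 1.0, size=(N, N))

# 2. Multiply by a factor so that its spectral radius is below 1.
W *= 0.9 / max(abs(np.linalg.eigvals(W)))

# 3. Flip the signs of a desired number of weights; |W| is unchanged.
signs = np.where(rng.random((N, N)) < 0.5, -1.0, 1.0)
W_signed = W * signs

rho_abs = max(abs(np.linalg.eigvals(np.abs(W_signed))))   # radius of |W|
rho = max(abs(np.linalg.eigvals(W_signed)))               # radius of W itself
```

Note that $\rho(W) \leq \rho(|W|)$ always holds, so the signed matrix is at least as contractive as the non-negative one it was built from.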

# Training

## Linear regression

Training consists of solving a system of linear equations to minimise the error function $E(y, y^{target})$. The matrix $W^{out}$ can be found by linear regression. The training is defined by the following equation:

$$W^{out} X = Y^{target} \qquad (20)$$

where $W^{out} \in \mathbb{R}^{N_y \times N_x}$ is the output weight matrix, $X \in \mathbb{R}^{N_x \times T}$ is the matrix of reservoir states collected during the training phase, and $Y^{target} \in \mathbb{R}^{N_y \times T}$ is the matrix of target outputs through time, with $T$ the number of training steps. In normal form, we get

$$W^{out} X X^T = Y^{target} X^T \qquad (21)$$

A naive solution to find $W^{out}$ is then

$$W^{out} = Y^{target} X^T (X X^T)^{-1} \qquad (22)$$

In “Reservoir computing approaches to recurrent neural network training”^{26}, the authors soundly note that the matrix $X X^T$ inverted in equation 22 has size $N_x \times N_x$, so the cost of that inversion does not depend on the training set size $T$.
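A sketch of the naive solution (random states and a synthetic noiseless target, so the true weights are recoverable exactly; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N_x, N_y, T = 50, 2, 400

X = rng.normal(size=(N_x, T))            # collected reservoir states
W_true = rng.normal(size=(N_y, N_x))
Y = W_true @ X                           # noiseless target for the sketch

# eq. (22): W_out = Y X^T (X X^T)^{-1}
# The inverted matrix is N_x x N_x: its size is independent of T.
W_out = Y @ X.T @ np.linalg.inv(X @ X.T)
```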

## Moore-Penrose pseudo-inverse

Another direct method to solve equation 20 is to use the Moore-Penrose pseudo-inverse; $W^{out}$ is then defined as

$$W^{out} = Y^{target} X^{+} \qquad (23)$$

This approach has better numerical stability than equation 22, but is more memory-hungry for big matrices^{20}, which puts constraints on the reservoir size and on the training set size. This better numerical stability also allows the pseudo-inverse to be used inside equation 22 to improve its stability, which then becomes

$$W^{out} = Y^{target} X^T (X X^T)^{+} \qquad (24)$$
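Both pseudo-inverse variants (equations 23 and 24) can be sketched on the same synthetic data (the sizes and the noiseless target are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N_x, N_y, T = 50, 2, 400
X = rng.normal(size=(N_x, T))            # collected reservoir states
W_true = rng.normal(size=(N_y, N_x))
Y = W_true @ X                           # noiseless target

# eq. (23): W_out = Y X^+   (Moore-Penrose pseudo-inverse of X)
W_pinv = Y @ np.linalg.pinv(X)

# eq. (24): pseudo-inverse of the correlation matrix, for stability
W_corr = Y @ X.T @ np.linalg.pinv(X @ X.T)
```

On full-rank data both variants coincide with the naive solution; they differ when $X X^T$ is singular or badly conditioned.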

## Ridge Regression

*Ridge Regression* (RR) is often used to train ESNs. The method consists of adding a penalty term to the optimisation function in order to minimise the amplitude of the output weight matrix $W^{out}$, reducing its sensitivity to noise and to over-fitting. These points make *Ridge Regression* a method of choice for training ESNs. Resuming equation 22, we introduce a regularisation parameter $\lambda$:

$$W^{out} = Y^{target} X^T \left(X X^T + \lambda I\right)^{-1} \qquad (25)$$

where $I$ is the identity matrix and $\lambda$ is the regularisation factor. The parameter $\lambda$ has no absolute meaning and should be optimised for each reservoir, as it depends on the correlation matrix $X X^T$.
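A sketch of equation 25 on noisy synthetic targets (the sizes, noise level and λ are arbitrary), showing that the penalty shrinks the output weights relative to the unregularised solution:

```python
import numpy as np

rng = np.random.default_rng(2)
N_x, N_y, T = 50, 1, 400
X = rng.normal(size=(N_x, T))                       # collected states
W_true = rng.normal(size=(N_y, N_x))
Y = W_true @ X + 0.1 * rng.normal(size=(N_y, T))    # noisy targets

lam = 10.0                                          # regularisation factor
I = np.eye(N_x)
# eq. (25): W_out = Y X^T (X X^T + lambda I)^{-1}
W_ridge = Y @ X.T @ np.linalg.inv(X @ X.T + lam * I)
W_naive = Y @ X.T @ np.linalg.inv(X @ X.T)          # eq. (22), for comparison
```

In practice λ is chosen by validation, e.g. over a logarithmic grid, since its effect depends on the eigenvalues of $X X^T$.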

# ESN with Python

In the next article, you will learn how to use a Python module named Oger to predict and classify time series.

# References

- Neural networks and physical systems with emergent collective computational abilities, Hopfield, John J, Proceedings of the national academy of sciences
- Finding structure in time, Elman, Jeffrey L, Cognitive science
- Backpropagation through time: what it does and how to do it, Werbos, Paul J, Proceedings of the IEEE
- A learning algorithm for continually running fully recurrent neural networks, Williams, Ronald J and Zipser, David, Neural computation
- New results on recurrent network training: unifying the algorithms and accelerating convergence, Atiya, Amir F and Parlos, Alexander G, Neural Networks, IEEE Transactions on
- Reservoir computing approaches to recurrent neural network training, Lukoševičius, Mantas and Jaeger, Herbert, Computer Science Review
- Real-time computing without stable states: A new framework for neural computation based on perturbations, Maass, Wolfgang and Natschläger, Thomas and Markram, Henry, Neural computation
- Reservoir computing approaches to recurrent neural network training, Lukoševičius, Mantas and Jaeger, Herbert, Computer Science Review
- The “echo state” approach to analysing and training recurrent neural networks-with an erratum note, Jaeger, Herbert
- Timing-based control via echo state network for soft robotic arm, Kuwabara, Junichi and Nakajima, Kohei and Kang, Rongjie and Branson, David T and Guglielmino, Emanuele and Caldwell, Darwin G and Pfeifer, Rolf
- Echo state networks for mobile robot modeling and control, Plöger, Paul G and Arghir, Adriana and Günther, Tobias and Hosseiny, Ramin
- Event detection and localization in mobile robot navigation using reservoir computing, Antonelo, Eric A and Schrauwen, Benjamin and Dutoit, Xavier and Stroobandt, Dirk and Nuttin, Marnix, Artificial Neural Networks–ICANN 2007
- Mobile robot control in the road sign problem using reservoir computing networks, Antonelo, Eric and Schrauwen, Benjamin and Stroobandt, Dirk, Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on
- A comparative study of reservoir computing strategies for monthly time series prediction, Wyffels, Francis and Schrauwen, Benjamin, Neurocomputing
- Reservoir computing approach to Great Lakes water level forecasting, Coulibaly, Paulin, Journal of hydrology
- Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on, Ferreira, A.A. and Ludermir, T.B. and de Aquino, R.R.B. and Lira, M.M.S. and Neto, O.N.
- Short-term stock price prediction based on echo state networks, Xiaowei Lin and Zehong Yang and Yixu Song, Expert Systems with Applications
- Reservoir Computing: Étude théorique et pratique en reconnaissance de chiffres manuscrits, Schaetti, Nils and Couturier, Raphaël and Salomon, Michel, Master's thesis
- Echo State Networks-based Reservoir Computing for MNIST Handwritten Digits Recognition, Schaetti, Nils and Salomon, Michel and Couturier, Raphaël, 19th IEEE International Conference on Computational Science and Engineering (CSE 2016)
- Real-time reservoir computing network-based systems for detection tasks on visual contents, Jalalvand, Azarakhsh and Van Wallendael, Glenn and Van de Walle, Rik, Computational Intelligence, Communication Systems and Networks (CICSyN), 2015 7th International Conference on
- The “echo state” approach to analysing and training recurrent neural networks-with an erratum note, Jaeger, Herbert, Bonn, Germany: German National Research Center for Information Technology GMD Technical Report
- The “echo state” approach to analysing and training recurrent neural networks-with an erratum note, Jaeger, Herbert, Bonn, Germany: German National Research Center for Information Technology GMD Technical Report
- The “echo state” approach to analysing and training recurrent neural networks-with an erratum note, Jaeger, Herbert, Bonn, Germany: German National Research Center for Information Technology GMD Technical Report
- Reservoir computing approaches to recurrent neural network training, Lukoševičius, Mantas and Jaeger, Herbert, Computer Science Review
- Re-visiting the echo state property, Yildiz, Izzet B and Jaeger, Herbert and Kiebel, Stefan J, Neural networks
- Reservoir computing approaches to recurrent neural network training, Lukoševičius, Mantas and Jaeger, Herbert