## Contents

Each year, the field of computer science becomes more sophisticated as new technologies reach the market. Despite that, the problem of developing intelligent agents that precisely simulate human brain activity remains unsolved. One of the most prominent models of intelligent agents built in computer memory is the neural network (NN). This article introduces the basics of NNs, along with a prediction pattern that can be used in different types of "smart" applications. Specifically, a financial predictor based on neural networks will be explored.

During my intellectual trip into the world of artificial intelligence, I was fascinated by how "magically" a correctly constructed artificial neural network (specifically, a feed-forward network) can predict values from those supplied at its input. This "forecasting" capability makes such networks a good fit for several types of applications:

- Function interpolation and approximation
- Prediction of trends in numerical data
- Prediction of movements in financial markets

All of these examples are very similar because, in mathematical terms, you are trying to define a prediction function $F(X_1, X_2, \dots, X_n)$ which, given the input data (the vector $[X_1, X_2, \dots, X_n]$), "guesses" (interpolates) the output $Y$. The most exciting domain of prediction lies in the financial markets. An investment strategy based on computer intelligence sounds like a very promising and interesting field of study. Next, I'm going to describe a relatively simple program which attempts to predict the *S&P500*, *DOW*, and *NASDAQ Composite* indexes and the *Prime Interest Rate* from the input data described shortly. Before going into details, I would like to warn you that the entire article is written for **educational purposes**, so the described application should not be used in a real-world scenario.

The data fed to the neural network at the input is the historical data of the *S&P500*, *DOW*, *NASDAQ Composite* and *Prime Interest Rate*. In general terms, these are leading indicators of stock market activity, which share a common fluctuation pattern.

The *S&P500* is a free-float capitalization-weighted index, published since 1957, of the prices of 500 large-cap common stocks actively traded in the United States. The stocks included in the *S&P500* are those of large publicly held companies that trade on either of the two largest American stock market companies: the *NYSE Euronext* and the *NASDAQ OMX*. The *S&P500* is one of the most widely followed indexes of large-cap American stocks. It is considered a bellwether for the American economy, and is included in the Index of Leading Indicators. *S&P500* fluctuations depend on many factors, so the entire prediction pattern is very complex. In this application, the input data consists only of historical values of 4 important economic indicators. If you want a better predictor, you should feed your neural network with more indicators that are relevant to the interpolation.

As you can see in Figure 1, the value of the *S&P500* has generally increased over time, with a significant decrease between 2000 and 2005.

*The Dow Jones Industrial Average (DJIA)*, also referred to as the Industrial Average, the Dow Jones, the Dow 30, or simply the Dow, is a stock market index, and one of several indices created by Wall Street Journal editor and Dow Jones & Company co-founder Charles Dow. It is an index that shows how 30 large, publicly owned companies based in the United States have traded during a standard trading session in the stock market. Along with the *NASDAQ Composite*, the *S&P500* Index, and the *Russell 2000* Index, the *Dow* is among the most closely watched benchmark indices tracking targeted stock market activity. To calculate the DJIA, the sum of the prices of all 30 stocks is divided by a Divisor, the Dow Divisor. The divisor is adjusted in case of stock splits, spinoffs or similar structural changes, to ensure that such events do not in themselves alter the numerical value of the DJIA.
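The divisor mechanics described above can be illustrated with a toy price-weighted index. This is only a sketch in C#: the three-stock index, its prices and the `PriceWeightedIndex` class are made up for illustration and are not real DJIA data or the official calculation procedure.

```csharp
using System;
using System.Linq;

public static class PriceWeightedIndex
{
    // Index value = sum of component prices / divisor.
    public static double Value(double[] prices, double divisor) =>
        prices.Sum() / divisor;

    // After a structural change (e.g. a stock split), choose the divisor that
    // keeps the index value unchanged: newDivisor = newPriceSum / oldIndexValue.
    public static double AdjustDivisor(double[] oldPrices, double oldDivisor, double[] newPrices) =>
        newPrices.Sum() / Value(oldPrices, oldDivisor);

    public static void Main()
    {
        // Hypothetical three-stock index (illustrative numbers only).
        double[] before = { 100.0, 50.0, 30.0 };
        double divisor = 3.0;
        double indexBefore = Value(before, divisor);   // 180 / 3

        // The first stock splits 2-for-1, so its price halves.
        double[] after = { 50.0, 50.0, 30.0 };
        double newDivisor = AdjustDivisor(before, divisor, after);
        double indexAfter = Value(after, newDivisor);

        Console.WriteLine($"{indexBefore} -> {indexAfter} (divisor {newDivisor:F4})");
    }
}
```

Without the adjustment, the split alone would drag the index down even though no company changed in value; the smaller divisor compensates for it.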

The *NASDAQ Composite* is a stock market index of the common stocks and similar securities listed on the *NASDAQ* stock market, meaning that it has over 3,000 components. It is highly followed in the U.S. as an indicator of the performance of stocks of technology companies and growth companies. Since both U.S. and non-U.S. companies are listed on the *NASDAQ* stock market, the index is not exclusively a U.S. index.

Prime rate, or Prime Lending Rate, is a term applied in many countries to a reference interest rate used by banks. The term originally indicated the rate of interest at which banks lent to favored customers, i.e., those with high credibility, though this is no longer always the case. Some variable interest rates may be expressed as a percentage above or below prime rate. Generally, the prime interest rate is a significant determinant in the world of financial markets. This is because monetary policy is aimed at influencing domestic interest rates, which drive currency rates relative to other currencies with different interest rates. Domestic interest rates also influence overall economic activity, with lower interest rates typically stimulating borrowing, investment, and consumption, while higher interest rates tend to reduce borrowing and favor saving over consumption. The Federal Funds Rate history graph is shown below. This data will be used in the current application.

Neural networks have been used with computers since the 1950s. Through the years, many different models have been presented. The perceptron is one of the earliest neural networks. It was an attempt to understand human memory, learning and cognitive processes. To construct a computer capable of "human-like thought", researchers used the only working model they had available: the human brain. However, the human brain as a whole is far too complex to model, so the individual cells that make up the brain are studied instead. The schema of the most widely used artificial neural network is introduced next.

For the task of predicting the indexes, we'll be using the so-called *multilayer feed-forward network*, which is well suited to this type of application. In a feed-forward neural network, neurons are only connected forward. Each layer of the neural network contains connections to the next layer, but there are no connections back. Typically, the network consists of a set of sensory units (source nodes) that constitute the input layer, one or more hidden layers of computation nodes, and an output layer of computation nodes. In common use, most neural networks have one hidden layer, and it's very rare for a neural network to have more than two hidden layers. The input signal propagates through the network in a forward direction, on a layer-by-layer basis. These neural networks are commonly referred to as multilayer perceptrons (MLPs). Shown below is a simple MLP with 4 inputs, 1 output, and 1 hidden layer.

The **input** layer is the conduit through which the external environment presents a pattern to the neural network. Once a pattern is presented to the input layer, the output layer will produce another pattern. In essence, this is all the neural network does: it matches the input pattern to the one which best fits the training's output. It is important to remember that the inputs to the neural network are floating point numbers, represented by the C# `double` type (most of the time you'll be limited to this type).

The **output** layer of the neural network is what actually presents a pattern to the external environment (the result of the computation). The number of output neurons should be directly related to the type of work that the neural network is to perform.

There are really two decisions that must be made regarding the **hidden** layers: how many hidden layers to actually have in the network, and how many neurons will be in each of these layers. Problems that require two hidden layers are rarely encountered. There is currently no theoretical reason to use neural networks with any more than two hidden layers, and almost all current problems solved by neural networks are fine with just one hidden layer. Even though the hidden layers do not directly interact with the external environment, they have a tremendous influence on the final output, so you should carefully choose the number of neurons within them. Using too few neurons in the hidden layers results in so-called "under-fitting", which occurs when the hidden layers are not able to adequately detect the signals in a complicated data set. The "over-fitting" problem can occur when the neural network has so much information processing capacity that the limited amount of information contained in the training set is not enough to train all of the neurons in the hidden layers. There are many rule-of-thumb methods for determining the number of neurons to use in the hidden layers. Here are just a few of them:

- The number of hidden neurons should be between the size of the input layer and the size of the output layer.
- The number of hidden neurons should be 2/3 the size of the input layer plus the size of the output layer.
- The number of hidden neurons should be less than twice the size of the input layer.
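The three rules of thumb above are simple enough to express as code. The sketch below is a hypothetical helper (`HiddenLayerRules` is not part of any library), and rounding the 2/3 rule to the nearest integer is an assumption, since the rule itself does not say how to round.

```csharp
using System;

public static class HiddenLayerRules
{
    // Rule 2: two thirds of the input size plus the output size
    // (rounded to the nearest integer, which is an assumption).
    public static int TwoThirdsRule(int inputs, int outputs) =>
        (int)Math.Round(2.0 * inputs / 3.0 + outputs);

    // Rule 1: between the output and input sizes.
    // Rule 3: less than twice the input size.
    public static bool SatisfiesRules(int hidden, int inputs, int outputs)
    {
        int low = Math.Min(inputs, outputs);
        int high = Math.Max(inputs, outputs);
        bool betweenLayers = hidden >= low && hidden <= high;
        bool belowTwiceInput = hidden < 2 * inputs;
        return betweenLayers && belowTwiceInput;
    }

    public static void Main()
    {
        // Sizes of the predictor described in this article: 40 inputs, 4 outputs.
        int inputs = 40, outputs = 4;
        int suggested = TwoThirdsRule(inputs, outputs);
        Console.WriteLine($"2/3 rule suggests {suggested} hidden neurons");
        Console.WriteLine($"41 hidden neurons pass all rules: {SatisfiesRules(41, inputs, outputs)}");
    }
}
```

For 40 inputs and 4 outputs the 2/3 rule suggests 31 neurons, while the 41 units per hidden layer used later in this article sit just outside the first rule. That is a reminder that these are heuristics for a starting point, not hard constraints.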

Multilayer perceptrons have been applied successfully to solve difficult and diverse problems by training them in a supervised manner with a highly popular algorithm known as the *error back-propagation algorithm* (described further). Please note that in our application we will be using the Resilient Propagation algorithm, which is very similar to back-propagation. The neural network itself is composed of neurons (the main information-processing units, like neurons within a human brain) of the same kind, placed within different layers. They exhibit the same characteristics; hence, if you understand how one neuron is designed, you will have no trouble understanding how the entire network works. Generally, the model of a neuron can be summarized in the following block diagram:

One can see that there are 3 basic elements of a neuronal model:

- A set of synapses or connecting links, each characterized by a *weight* or strength of its own: the inputs $X_1, X_2, \dots, X_m$ have corresponding weights $W_{k1}, W_{k2}, \dots, W_{km}$. As you will see further, the weights represent the "knowledge" that the neural network holds about specific training data. Their values directly affect the output of the neural network.
- An adder for summing the input signals, weighted by the respective synapses of the neuron: $V_k = \sum_{j=1}^{m} W_{kj} X_j + b_k$, where $k \in [1, r]$ ($r$ = number of neurons), $j \in [1, m]$ ($m$ = number of input synapses), and $b_k$ is the bias of neuron $k$ (equivalently, a weight $W_{k0}$ on a constant input $X_0 = 1$). Simply speaking, each input signal $X_j$ is multiplied by its weight $W_{kj}$ and summed in the adder with all the other items. The result of this summation, $V_k$, goes to the input of the activation function.
- An activation function for limiting the output of a neuron: $Y_k = \Phi(V_k)$. The activation function has an important role in the schema of a neuron. It generates the output according to the summed input signals calculated in the adder.

Summarized, the output signal of each neuron can be defined as follows: $Y_k = \Phi\left(\sum_{j=1}^{m} W_{kj} X_j + b_k\right)$.

It is important to emphasize that if you want to use the *Back Propagation* learning algorithm for training, then you should make sure that your activation function is differentiable. This requirement comes from the fact that the method computes the gradient of the error function at each iteration step, so we must guarantee the continuity and differentiability of the error function. A commonly used non-linearity that satisfies this requirement is the *sigmoid* non-linearity defined by the logistic function: $\Phi(v) = \frac{1}{1 + \exp(-\alpha v)}$, where $\alpha$ is the slope parameter of the sigmoid function. By varying the parameter $\alpha$, we obtain sigmoid functions of different slopes, as illustrated in the following figure (3 different $\alpha$ values):
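The neuron model above (weighted sum plus bias, passed through the logistic function) can be sketched in a few lines of C#. The `Neuron` class, the sample weights and the bias are illustrative assumptions; this is not how Encog represents neurons internally.

```csharp
using System;

public static class Neuron
{
    // Logistic sigmoid with slope parameter alpha.
    public static double Logistic(double v, double alpha = 1.0) =>
        1.0 / (1.0 + Math.Exp(-alpha * v));

    // Adder: V = sum_j W_j * X_j + b
    public static double WeightedSum(double[] weights, double[] inputs, double bias)
    {
        if (weights.Length != inputs.Length)
            throw new ArgumentException("weights and inputs must have equal length");
        double v = bias;
        for (int j = 0; j < inputs.Length; j++)
            v += weights[j] * inputs[j];
        return v;
    }

    // Output: Y = Phi(V)
    public static double Output(double[] weights, double[] inputs, double bias, double alpha = 1.0) =>
        Logistic(WeightedSum(weights, inputs, bias), alpha);

    public static void Main()
    {
        double[] x = { 1.0, 0.0 };
        double[] w = { 0.5, -0.3 };   // illustrative weights
        double b = -0.5;              // illustrative bias

        // V = 0.5*1 + (-0.3)*0 - 0.5 = 0, so the output is Logistic(0) = 0.5.
        Console.WriteLine(Output(w, x, b));

        // A larger alpha gives a steeper sigmoid around v = 0.
        foreach (double alpha in new[] { 0.5, 1.0, 2.0 })
            Console.WriteLine($"alpha={alpha}: {Logistic(1.0, alpha):F4}");
    }
}
```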

Training is the means by which the weights and threshold values of a neural network are adjusted to give desirable outputs, thus making the network adjust the response to the value which best fits the training data. *Propagation Training* is a form of supervised training, where the expected output is given to the training algorithm. Propagation training can be a very effective form of training for feed-forward, simple recurrent and other types of neural networks. There are several forms of propagation training. We will analyze 2 of them.

The *Back Propagation* algorithm is one of the most commonly used learning algorithms. It is a supervised learning method, and is a generalization of the delta rule. It requires a teacher that knows, or can calculate, the desired output for any input in the training set.

Generally, it can be summarized in the following main steps:

1. Present a training sample to the neural network.
2. Compare the network's output to the desired output from that sample. Calculate the error in each output neuron.
3. For each neuron, calculate what the output should have been, and a *scaling factor*: how much lower or higher the output must be adjusted to match the desired output. This is the local error.
4. Adjust the weights of each neuron to lower the local error.
5. Assign *blame* for the local error to neurons at the previous level, giving greater responsibility to neurons connected by stronger weights.
6. Repeat from step 3 on the neurons at the previous level, using each one's *blame* as its error.

The figure below visualizes the process by which the neural network is trained to work as an XOR logical gate.

The XOR problem is generally considered the "Hello World" application in this field of science. The purpose is very straightforward: we will make our neural network "smart enough" to solve the XOR problem.

Truth table:

| $X_1$ | $X_2$ | Result |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

The structure of the neural network is very simple: the input layer consists of 2 elements (an XOR gate needs 2 Boolean values as input parameters, so the input is of size 2). The hidden layer contains 3 neurons, and the output layer has one neuron, which represents the result of the XOR operation. At its initial stage (`Iteration 0`), the weights between the neurons are assigned random values, so the network does not yet contain any valuable information. Once we start using the *Back Propagation* algorithm (`Iteration 1 - 59`), the weights between the neurons are adjusted in a manner that decreases the error rate and generates the output we expect. By `Iteration 59` we achieve an acceptable error rate, so the training process ends, and we can proudly say that the network contains enough "knowledge" to solve the XOR problem. By visualizing the way the values change, you can observe that during the initial iterations they fluctuate dramatically on each step (mathematically speaking, the algorithm tries to find the steepest descent for the error function). Once the error value starts decreasing significantly (`Iteration 30-59`), the *weights* of the neural network are adjusted in a more granular fashion. The network was trained with all 4 combinations of the XOR gate. Because of the 2D limitation, the figure itself contains an example of only 1 training sample (`True - True`, encoded as 1), which ultimately should generate `False` at the output (encoded as 0). If you are interested in more details related to this algorithm, please consult any available material on it. I will not discuss the mathematics behind the *Back Propagation* algorithm, because we'll use a framework which already has this algorithm implemented (the `Encog framework`).
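For readers who want to see the steps summarized earlier in code, here is a minimal hand-written sketch of plain back-propagation on the 2-3-1 XOR network. It is an illustration only: the `XorNetwork` class, the logistic activation, the learning rate of 0.5, the seed and the epoch count are all assumptions, and it is not the Encog implementation used by the application.

```csharp
using System;

public sealed class XorNetwork
{
    const int Inputs = 2, Hidden = 3;
    readonly double[,] _wHidden = new double[Hidden, Inputs]; // input -> hidden weights
    readonly double[] _bHidden = new double[Hidden];
    readonly double[] _wOut = new double[Hidden];             // hidden -> output weights
    double _bOut;
    readonly double[] _h = new double[Hidden];                // last hidden activations

    public XorNetwork(int seed)
    {
        // Iteration 0: random weights in [-1, 1).
        var rng = new Random(seed);
        for (int i = 0; i < Hidden; i++)
        {
            for (int j = 0; j < Inputs; j++) _wHidden[i, j] = rng.NextDouble() * 2 - 1;
            _bHidden[i] = rng.NextDouble() * 2 - 1;
            _wOut[i] = rng.NextDouble() * 2 - 1;
        }
        _bOut = rng.NextDouble() * 2 - 1;
    }

    static double Sigmoid(double v) => 1.0 / (1.0 + Math.Exp(-v));

    public double Forward(double[] x)
    {
        double sum = _bOut;
        for (int i = 0; i < Hidden; i++)
        {
            double v = _bHidden[i];
            for (int j = 0; j < Inputs; j++) v += _wHidden[i, j] * x[j];
            _h[i] = Sigmoid(v);
            sum += _wOut[i] * _h[i];
        }
        return Sigmoid(sum);
    }

    // One back-propagation step for one sample; returns the squared error before the update.
    public double Train(double[] x, double target, double rate)
    {
        double y = Forward(x);
        double error = target - y;
        double deltaOut = error * y * (1 - y);   // local error of the output neuron
        for (int i = 0; i < Hidden; i++)
        {
            // Blame for hidden neuron i, computed with the output weight before it changes.
            double deltaHidden = deltaOut * _wOut[i] * _h[i] * (1 - _h[i]);
            _wOut[i] += rate * deltaOut * _h[i];
            for (int j = 0; j < Inputs; j++) _wHidden[i, j] += rate * deltaHidden * x[j];
            _bHidden[i] += rate * deltaHidden;
        }
        _bOut += rate * deltaOut;
        return error * error;
    }

    public static void Main()
    {
        double[][] samples = { new[] { 0.0, 0.0 }, new[] { 0.0, 1.0 }, new[] { 1.0, 0.0 }, new[] { 1.0, 1.0 } };
        double[] targets = { 0, 1, 1, 0 };
        var net = new XorNetwork(seed: 1);
        for (int epoch = 1; epoch <= 20000; epoch++)
        {
            double sse = 0;
            for (int s = 0; s < samples.Length; s++) sse += net.Train(samples[s], targets[s], 0.5);
            if (epoch % 5000 == 0) Console.WriteLine($"epoch {epoch}: error {sse:F4}");
        }
        for (int s = 0; s < samples.Length; s++)
            Console.WriteLine($"{samples[s][0]} XOR {samples[s][1]} -> {net.Forward(samples[s]):F3}");
    }
}
```

Convergence is not guaranteed for every random start; if the printed error stalls, a different seed or more epochs may be needed.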

One of the problems with the *Back Propagation* training algorithm is the degree to which the weights are changed. To better understand the way the error decreases, consider the following error surface:

Our initial point resides in a place where the error value is highest. The goal of any training algorithm is to minimize the error function. In an ideal case, the algorithm will choose the path (from an infinite number of paths) to the *global minimum*, thus achieving the best possible adjustment for the *weight* components. Unfortunately, the *Back Propagation* algorithm doesn't handle well the scenarios in which the error surface contains local minima. There is a high probability that the path chosen will lead the error to decrease in the direction of a *local minimum*. Once it reaches the point where the error cannot decrease anymore (getting stuck in the hollow), it stops looking for new paths (simply speaking, it won't be able to "jump" out of the local minimum "hole"). To search for the *global minimum* in a "smarter" way, the *Resilient Propagation* algorithm has been introduced. As the *Back Propagation* algorithm can often apply too large a change to the weight matrix (the delta parameter being too big, which may significantly alter the path chosen in the direction of error decrease), the *Resilient Propagation* training algorithms use only the sign of the gradient and not the value itself (which reduces the chance of falling into the local minimum trap). With the magnitude discarded, only whether the gradient is positive, negative or near zero matters. The *Resilient Propagation* training (RPROP) algorithm is usually the most efficient training algorithm provided by Encog (the framework used in this application) for supervised feed-forward neural networks. One particular advantage of the RPROP algorithm is that it requires no setting of parameters before using it. There are no learning rates, momentum values or update constants that need to be determined. This is good because it can be difficult to determine the exact learning rate that might be optimal.
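To make the "sign of the gradient only" idea concrete, here is a sketch of a sign-based update for a single weight, in the style of the iRPROP- variant. The `RpropWeight` class, the toy error function and the constants (increase factor 1.2, decrease factor 0.5, step limits) are commonly cited defaults used here as assumptions; this is not a copy of Encog's internal implementation.

```csharp
using System;

public sealed class RpropWeight
{
    const double Increase = 1.2, Decrease = 0.5;   // commonly cited defaults (assumed)
    const double MinStep = 1e-6, MaxStep = 50.0;

    public double Value;
    double _step = 0.1;            // per-weight update value
    double _previousGradient;

    public RpropWeight(double initial) { Value = initial; }

    public void Update(double gradient)
    {
        double product = gradient * _previousGradient;
        if (product > 0)
        {
            // Same sign as last time: we are still heading downhill, so accelerate.
            _step = Math.Min(_step * Increase, MaxStep);
        }
        else if (product < 0)
        {
            // Sign flipped: the last step jumped over a minimum, so shrink the step
            // and skip the move for this iteration.
            _step = Math.Max(_step * Decrease, MinStep);
            gradient = 0;
        }
        // Only the sign of the gradient is used, never its magnitude.
        Value -= Math.Sign(gradient) * _step;
        _previousGradient = gradient;
    }

    public static void Main()
    {
        // Minimize the toy error E(w) = (w - 3)^2, whose gradient is 2(w - 3).
        var w = new RpropWeight(-10.0);
        for (int i = 0; i < 100; i++)
            w.Update(2 * (w.Value - 3));
        Console.WriteLine(w.Value);   // ends up close to 3
    }
}
```

Note that no learning rate appears anywhere: the step size adapts per weight, which is what makes the algorithm convenient to use without tuning.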

As stated earlier, we are going to feed the neural network with the historical data of the indexes described above. One of the important things about the input data is that it is formed of `10` consecutive values (sorted by date) of each of the `4` parameters (*S&P500, NASDAQ Composite, DOW, Prime Interest Rate*), `40` input values in total, corresponding to a granularity of `10` business days. The network will try to predict the `11th` value of each of the indexes, corresponding to the next day in the row (`4` output values). Speaking mathematically, the `10` previous points are used to interpolate the next coordinate through which the functions of the *NASDAQ Composite, Dow, S&P500 and Prime Interest Rate* will pass.

Pairs used in prediction:

| # | 1 | 2 | ... | 10 | 11 |
|---|---|---|---|---|---|
| NASDAQ | 2288.55 | 2301.66 | ... | 2231.65 | ? |
| DOW | 12376.72 | 12319.73 | ... | 12350.61 | ? |
| S&P500 | 1110.88 | 1112.92 | ... | 1099.5 | ? |
| PIR | 3.25 | 3.25 | ... | 3.25 | ? |
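The windowing scheme described above (10 days of 4 values in, the 11th day out) can be sketched as follows. The `TrainingWindows` helper and the day-by-day ordering of values inside the 40-element input are assumptions made for illustration; the application's own `CreateTrainingSets` method is not shown in this article and may lay the data out differently.

```csharp
using System;
using System.Collections.Generic;

public static class TrainingWindows
{
    // series[d][k] is the (normalized) value of index k on business day d.
    // Returns pairs of (windowSize * indexCount inputs, indexCount ideal outputs).
    public static List<(double[] Input, double[] Ideal)> Build(double[][] series, int windowSize)
    {
        var pairs = new List<(double[] Input, double[] Ideal)>();
        if (series.Length <= windowSize) return pairs;   // not enough history

        int indexCount = series[0].Length;
        for (int start = 0; start + windowSize < series.Length; start++)
        {
            // Flatten windowSize consecutive days into one input vector.
            var input = new double[windowSize * indexCount];
            for (int d = 0; d < windowSize; d++)
                Array.Copy(series[start + d], 0, input, d * indexCount, indexCount);

            // The day right after the window is the value to predict.
            var ideal = (double[])series[start + windowSize].Clone();
            pairs.Add((input, ideal));
        }
        return pairs;
    }

    public static void Main()
    {
        // 12 days of 4 made-up values each (placeholders, not market data).
        var series = new double[12][];
        for (int d = 0; d < series.Length; d++)
            series[d] = new double[] { d, d + 0.1, d + 0.2, d + 0.3 };

        var pairs = Build(series, 10);
        Console.WriteLine($"{pairs.Count} pairs, {pairs[0].Input.Length} inputs, {pairs[0].Ideal.Length} outputs");
    }
}
```

With 12 days of history and a window of 10, the sketch yields 2 training pairs, each with 40 inputs and 4 ideal outputs, matching the sizes used by the network in this article.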

One of the important heuristics for making the neural network perform better relates to input normalization. Each input variable should be preprocessed so that its mean value, averaged over the entire training set, is close to zero, or else is small compared to its standard deviation. The ranges of the indexes vary widely, as their domains are totally different. In order to bring each index into a common range, we are going to use the following simple min-max formula: $\mathrm{Index}(x) = \frac{\mathrm{Index}(x) - \mathrm{Min}(\mathrm{Index})}{\mathrm{Max}(\mathrm{Index}) - \mathrm{Min}(\mathrm{Index})}$. As written, this formula maps each value into `[0, 1]`; rescaling to `[-1, 1]` additionally requires multiplying the result by 2 and subtracting 1. Either way, each of the input variables will lie in the same range. Take a look at the regression plot of the training set. Each of the figures corresponds to a specific target from the output array. As all the `R` parameters are very close to `1`, the correlation between the outputs and the targets is very high (the regression plot can be produced using the Neural Network Toolbox from MATLAB).

It is hard to overestimate the importance of normalization. Without it, you will most probably fail to train your network, because the weights won't be able to adjust accordingly.
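A minimal sketch of min-max normalization is shown below. The quoted formula, as written, maps values into `[0, 1]`; the optional `low`/`high` parameters are an addition of this sketch that allows rescaling to `[-1, 1]`. The `MinMax` class and the `Denormalize` helper are hypothetical names, not part of the application's code.

```csharp
using System;
using System.Linq;

public static class MinMax
{
    // Maps values linearly so that the minimum becomes `low` and the maximum becomes `high`.
    public static double[] Normalize(double[] values, double low = 0.0, double high = 1.0)
    {
        double min = values.Min(), max = values.Max();
        if (max == min)
            throw new ArgumentException("a constant series cannot be scaled");
        return values.Select(v => low + (v - min) / (max - min) * (high - low)).ToArray();
    }

    // Inverse mapping, needed to turn a network output back into an index value.
    public static double Denormalize(double scaled, double min, double max,
                                     double low = 0.0, double high = 1.0) =>
        min + (scaled - low) / (high - low) * (max - min);

    public static void Main()
    {
        // Three NASDAQ values taken from the sample table above.
        double[] nasdaq = { 2288.55, 2301.66, 2231.65 };

        Console.WriteLine(string.Join(", ", Normalize(nasdaq).Select(v => v.ToString("F3"))));
        Console.WriteLine(string.Join(", ", Normalize(nasdaq, -1, 1).Select(v => v.ToString("F3"))));
    }
}
```

The minimum and maximum must be taken from the training data and stored, because the same values are needed later to convert the network's predictions back to index points.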

The `4` outputs correspond to each of the indexes on the input (*S&P500*, *DOW*, *NASDAQ Composite*, and *Prime Interest Rate*). The neural network's job is to find hidden patterns in the input data which influence the overall output. After training the network using a `40-41-41-4` topology (`40` input units, `2` hidden layers with `41` units each, `4` outputs) and trying to predict the values, the following results have been obtained:

As one can see, the network is able to interpolate the results fairly well. The error rate summed over the entire training session decreased to a value of `~0.008`. Of course, you cannot treat this data as an input to your investment strategy, since past information does not really indicate future returns (a more granular approach should be developed, as the fluctuations depend on many other strategic data), but for academic purposes we can consider this a good result.

Neural networks are very capable of finding hidden patterns and trends if they are provided in the training session with a reasonable amount of input data and desired output. As the number of relevant input parameters increases, the quality of prediction tends to increase as well. Thus, for a better index predictor, you would want to use more parameters than just the prime interest rate and the historical index data. As the purpose of this article is simplified, however, we'll feed the neural network only with the parameters specified above. During application development, the Encog framework was used to build and train the neural network. Personally, I consider this library to be the best choice for applications that use the `Java` or `.NET` platform and require AI constructs. It has a lot of ready-made algorithms, so it can greatly help you in developing applications of this kind.

Next, you can see the training algorithm:

    private void TrainNetwork(DateTime trainFrom, DateTime trainTo, TrainingStatus status)
    {
        // Build the input/ideal arrays lazily on the first call.
        if (_input == null || _ideal == null)
            CreateTrainingSets(trainFrom, trainTo);

        // Remember the thread so the session can be aborted from outside.
        _trainThread = Thread.CurrentThread;
        int epoch = 1;
        ITrain train = null;
        try
        {
            var trainSet = new BasicNeuralDataSet(_input, _ideal);
            train = new ResilientPropagation(_network, trainSet);
            double error;
            do
            {
                train.Iteration();
                error = train.Error;
                // Report progress to the caller, if a callback was supplied.
                if (status != null)
                    status.Invoke(epoch, error, TrainingAlgorithm.Resilient);
                epoch++;
            } while (error > MaxError);
        }
        catch (ThreadAbortException) { _trainThread = null; }
        finally
        {
            // train is still null if an exception was thrown before it was created.
            if (train != null)
                train.FinishTraining();
        }
        _trainThread = null;
    }

As you can see, the training procedure runs until the error rate becomes less than or equal to the `MaxError` constant. In any case, one can abort the training session if such a need appears.

The network creation method is very straightforward. You specify the number of hidden units and layers at the input and get a new `BasicNetwork` created. Each layer added with an activation function within the newly created network uses the hyperbolic tangent, `ActivationTANH`. You can view the `CreateNetwork` method below:

    private void CreateNetwork(int hiddenUnits, int hiddenLayers)
    {
        _network = new BasicNetwork {Name = "Financial Predictor",
            Description = "Network for prediction analysis"};
        // Input layer: 10 days x 4 indexes = 40 units.
        _network.AddLayer(new BasicLayer(INPUT_TUPLES * INDEXES_TO_CONSIDER));
        // Hidden layers, each with a hyperbolic tangent activation and a bias.
        for (int i = 0; i < hiddenLayers; i++)
            _network.AddLayer(new BasicLayer
                (new ActivationTANH(), true, hiddenUnits));
        // Output layer: one unit per predicted index.
        _network.AddLayer(new BasicLayer
            (new ActivationTANH(), true, OUTPUT_SIZE));
        _network.Structure.FinalizeStructure();
        // Randomize the weights before training.
        _network.Reset();
    }

Constants used in this method can be inferred from the article's description: `INPUT_TUPLES = 10` (number of pairs used in prediction, corresponding to 10 business days), `INDEXES_TO_CONSIDER = 4`, and `OUTPUT_SIZE = 4` (4 indexes at the input/output).

In this article, the topic of neural networks and their prediction capabilities has been analyzed. Feed-forward neural networks proved to be a workable solution for applications that need to predict numerical values. Generally speaking, function interpolation is one of the major fields of study in the stock market environment. A strategy based upon technical indicators can help you in achieving good trading results. Of course, the application presented in this article cannot be used in a real-world environment, because normally you would need not only an almost precise prediction, but also a program that performs the market analysis in short bursts (every 15-30 seconds), as opposed to the values predicted in this application (closing stock values). In order to achieve better results, you would want to combine a classical trading strategy with one based upon real-time technical indicators. As for the studying purposes, the main objective has been achieved. It is important to mention that the Encog framework for neural networks was used while developing the application. In my opinion, it is the best choice one can make when choosing an API for NNs. Thanks for reading.

- 01 April 2011
- 03 April 2011
  - Minor bug fixes implemented
  - `CreateNetwork` method added in the code description

Victoria Moga - my beloved