Machine learning concepts. Network training and evaluation

1. Building a network model according to the problem being solved

The neural network model consists of two layers – an LSTM layer and an output Dense layer. An LSTM layer is chosen because the task requires processing sequences of time-related data and finding correlations within them. These operations call for a layer with memory, such as the LSTM layer, which is capable of detecting long-term dependencies. The Dense layer limits the number of output parameters to one (corresponding to the closing price) by applying an activation function to the outputs of the previous layer. A linear activation function is chosen for the Dense layer so that the neural network can predict values higher than those it was trained with. This cannot be achieved with a hyperbolic tangent or a logistic sigmoid activation function, since their outputs are bounded.

The input layer of the neural network has 4 neurons – one for each of the input parameters; the hidden LSTM layer has 128; and the output Dense layer has 1, which is the number of outputs (remember that this is a regression task). The number of neurons in the LSTM layer can be changed in order to achieve better results; selecting it is part of hyperparameter tuning (a trial-and-error process).
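A minimal sketch of this architecture, assuming the Keras API. The look-back window length (TIMESTEPS) and the choice of optimizer are assumptions for illustration, since neither is specified above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

TIMESTEPS = 50   # assumed look-back window length (not given in the text)
N_FEATURES = 4   # one input neuron per input parameter

model = Sequential([
    Input(shape=(TIMESTEPS, N_FEATURES)),
    LSTM(128),                      # hidden layer with memory
    Dense(1, activation="linear"),  # single regression output (closing price)
])
model.compile(optimizer="adam", loss="mean_squared_error")
```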

2. Setting up network hyperparameters

Choosing the right hyperparameters is essential for successful network training. Poorly selected hyperparameters may result in slow learning or no learning at all.

2.1. Batch size

Batch size is the number of examples propagated through the network in a single forward/backward pass. With a value of 100 for this hyperparameter, the algorithm divides the training data into groups of 100 records and trains the network on each group in turn. A good starting point when tuning the batch size is 32; other common choices are 64 and 128.

2.2. Number of epochs and number of iterations

Each epoch represents one complete pass of all training data through the neural network. Too few epochs are not enough for the network to learn good values for its weight matrices. Too many epochs lead to a problem known as overfitting, in which the neural network predicts the training data very accurately but makes poor predictions on data it has never seen. The reason for this is that the network memorizes the training data instead of learning the correlations between its items. The “Early stopping” technique can be used to select a suitable number of epochs: the network stops learning at the point where the error function stops decreasing. Using this technique mitigates the overfitting problem.
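Early stopping can be sketched with a Keras callback. The dummy data and the tiny model below are placeholders for illustration only; the monitored quantity and patience are typical choices, not values from the text.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Dummy sequences standing in for the real training set.
X_train = np.random.rand(200, 10, 4).astype("float32")
y_train = np.random.rand(200, 1).astype("float32")

model = Sequential([Input(shape=(10, 4)), LSTM(16), Dense(1)])
model.compile(optimizer="adam", loss="mean_squared_error")

# Stop once the validation loss has not improved for 3 consecutive epochs,
# and roll the weights back to the best epoch seen so far.
stopper = EarlyStopping(monitor="val_loss", patience=3,
                        restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=20, batch_size=32,
                    validation_split=0.2, callbacks=[stopper], verbose=0)
```

Training ends either at the epoch limit or as soon as the patience window is exhausted, whichever comes first.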
The number of iterations per epoch is equal to the number of training examples divided by the batch size. If the data has 1000 examples and the batch size is 250, the number of iterations is 4, i.e., one epoch is completed in 4 iterations.
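The arithmetic above can be written as a one-line helper; ceiling division covers the case where the batch size does not divide the data evenly.

```python
def iterations_per_epoch(n_examples, batch_size):
    # Ceiling division: a final, smaller batch still counts as an iteration.
    return -(-n_examples // batch_size)

print(iterations_per_epoch(1000, 250))  # → 4
```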

2.3. Learning rate

The learning rate is one of the most important hyperparameters. Values that are too small or too large may lead to very poor, very slow, or no training at all. Typical values range from 0.1 down to 1×10⁻⁶; 1×10⁻³ is a good starting point to experiment with.
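A toy illustration (not from the text) of why this range matters: plain gradient descent on f(x) = x², whose gradient is 2x. A moderate rate converges, while a rate above 1 makes each update overshoot the minimum and diverge.

```python
def descend(lr, steps=50, x=1.0):
    """Run `steps` gradient-descent updates on f(x) = x**2."""
    for _ in range(steps):
        x -= lr * 2 * x  # gradient of x**2 is 2x
    return abs(x)

print(descend(1e-3))  # small rate: slow but steady progress toward 0
print(descend(0.4))   # moderate rate: converges quickly
print(descend(1.1))   # too large: the iterate blows up
```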

2.4. Activation functions

The LSTM layer uses a sigmoid activation function to control the LSTM gates, because the sigmoid function produces output values ranging from 0 to 1. A linear activation function is selected for the Dense layer.

2.5. Loss function

The mean squared error (MSE) function is used for solving this regression problem. The error produced by the neural network is measured as the arithmetic mean of the squared differences between the predictions and the actual observations:

MSE = (1/n) · Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

where n is the number of examples, yᵢ are the actual observations and ŷᵢ are the predictions.
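The same error can be computed directly with NumPy; the function below is a straightforward transcription of the MSE formula.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences between observations and predictions.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))  # one error of 1 over 3 examples
```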

3. Network training

After creating and configuring the neural network model, the training process takes place, where the network is trained on the training data. The following graph shows the process of training the neural network. The blue line represents the correct outputs for each example, and the orange one – the predictions made by the network.

4. Network testing

Testing determines how satisfactory the predictions made by the neural network are. The network predicts on data it has not seen, and the predicted values are compared to the correct outputs: the smaller the deviations between the two, the better the network's predictions. The following graph shows the process of testing the neural network. The blue line represents the correct outputs for each example, and the orange one – the predictions made by the network.

5. Comparing the results

Comparing the two graphs, it is easy to see that the errors on the test data are greater than those on the training data, which is the expected result. Looking at the test graph, we can conclude that the predictions are satisfactory: the neural network has managed to predict the trend of the cryptocurrency’s market closing price. The next step is to try other network configurations or to fine-tune the network’s hyperparameters.

6. Tuning the hyperparameters in order to achieve more satisfactory results

What follows is a series of experiments with different combinations of hyperparameter values, aimed at achieving more accurate predictions. The results of the experiments are presented in the following figures. Each combination is trained using the “Early stopping” method.
Training graph. Learning rate: 10⁻³, batch size: 32, number of neurons in the hidden layer: 64, loss function error: 0.0040
Testing graph. Learning rate: 10⁻³, batch size: 32, number of neurons in the hidden layer: 64, loss function error: 0.0042
Training graph. Learning rate: 10⁻³, batch size: 32, number of neurons in the hidden layer: 128, loss function error: 0.0036
Testing graph. Learning rate: 10⁻³, batch size: 32, number of neurons in the hidden layer: 128, loss function error: 0.0038
Training graph. Learning rate: 10⁻³, batch size: 32, number of neurons in the hidden layer: 256, loss function error: 0.0037
Testing graph. Learning rate: 10⁻³, batch size: 32, number of neurons in the hidden layer: 256, loss function error: 0.0037
Training graph. Learning rate: 10⁻³, batch size: 32, number of neurons in the hidden layer: 512, loss function error: 0.0034
Testing graph. Learning rate: 10⁻³, batch size: 32, number of neurons in the hidden layer: 512, loss function error: 0.0037
Training graph. Learning rate: 10⁻³, batch size: 64, number of neurons in the hidden layer: 512, loss function error: 0.0036
Testing graph. Learning rate: 10⁻³, batch size: 64, number of neurons in the hidden layer: 512, loss function error: 0.0036
Training graph. Learning rate: 10⁻³, batch size: 128, number of neurons in the hidden layer: 512, loss function error: 0.0029
Testing graph. Learning rate: 10⁻³, batch size: 128, number of neurons in the hidden layer: 512, loss function error: 0.0029
After tuning the hyperparameters, the best configuration found for the neural network is:
  • learning rate: 1×10⁻³
  • batch size: 128
  • number of neurons in the hidden layer: 512

7. Persisting the trained model. Exporting the model for further use and loading in other environments

Machine learning models are often trained in a Python environment and then used in production in another environment. The neural network for predicting cryptocurrency prices is modelled in Python and then loaded in a Java environment.
To save the trained Python model, use the following expression:
model.save('model.h5')
The model is exported to a file named “model” with the .h5 extension. Besides the architecture of the neural network, the weights resulting from its training are also recorded, so the model can be used directly for forecasting.
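A quick round-trip check (an assumption, not shown above) is to load the saved file back in Python and confirm the restored model behaves like the original before handing the file to Java. The small model here is a placeholder.

```python
import numpy as np
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Input, LSTM, Dense

model = Sequential([Input(shape=(5, 4)), LSTM(8), Dense(1)])
model.compile(optimizer="adam", loss="mean_squared_error")
model.save("model.h5")             # architecture + weights in one HDF5 file

restored = load_model("model.h5")  # should produce identical predictions
x = np.random.rand(1, 5, 4).astype("float32")
```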
Loading the model in a Java environment is done using the importKerasSequentialModelAndWeights() method of the KerasModelImport class, which is part of the DL4J (Deeplearning4j) machine learning library.
MultiLayerNetwork model = KerasModelImport.importKerasSequentialModelAndWeights(modelPath, true);