Machine learning concepts. Data preparation

Every machine learning workflow built on neural networks consists of at least two parts. The first part covers loading the data and preparing it for training; this is known as ETL (extract, transform, load). The second part covers the actual training of the network. The overall process can be divided into the following parts and steps.

I. Extract, Transform, Load

1. Selection of input variables according to the problem being solved;

2. Structuring the data in a format suitable for loading it in the software environment;

3. Loading data into the environment;

4. Conversion of the data into an appropriate form;

5. Data separation - training and test data;

6. Data normalization.

II. Network training and evaluation

1. Building a network model according to the problem being solved;

2. Setting up network hyperparameters;

3. Network training;

4. Network testing;

5. Comparing the results;

6. Tuning the hyperparameters in order to achieve more satisfactory results;

7. Persisting the trained model. Exporting the model for further use and loading in other environments.

In the following paragraphs we look at each of these steps in more detail and apply them to our problem: predicting the close price of a cryptocurrency one day ahead.

I. Extract, Transform, Load

1. Selection of input variables according to the problem being solved

An appropriate set of input variables must be chosen so that the network can accurately predict the closing price of the cryptocurrency for the next day. Forecasting market prices and, in particular, forecasting a cryptocurrency trend is not a trivial problem. A neural network won’t be able to predict future prices if it is given only the previous trend of the cryptocurrency, because no cyclicality can be observed in cryptocurrency trends. Without additional parameters that hint at similar upcoming events, the network cannot predict jumps or dips it has not seen before. The graph below shows the Bitcoin trend since the cryptocurrency was created. The X axis shows time, and the Y axis shows Bitcoin’s price in thousands of dollars.
To help the network train, three additional numerical parameters are chosen. Each of them is a sentiment score based on news classification and expresses the expected development of the cryptocurrency. Scores are on a scale of 1 to 10; the higher the score, the better the expectations for the development of the cryptocurrency. The news reflects global events such as wars, financial crises and disasters, which have a strong impact on crypto trading and, in turn, on crypto prices. These scores provide additional information without which the neural network cannot make predictions that accurately reflect reality. The network training parameters are as follows:
  • price_close
  • news_sentiment
  • twitter_sentiment
  • reddit_sentiment

2. Structuring the data in a format suitable for loading it in the software environment

The input data is stored in a four-column text file; each column contains the values for the corresponding parameter. The text file has the following structure.
close_price news_sentiment twitter_sentiment reddit_sentiment
7725.43; 7.167166666666668; 5.851458333333333; 5.006708333333333;
7603.99; 6.361833333333332; 4.2411666666666665; 4.29825;
7533.92; 6.479833333333335; 5.355999999999999; 3.1264583333333333;
7414.08; 6.669958333333335; 4.936625; 3.6385000000000014;
7009.99; 6.775500000000003; 3.5362916666666657; 4.339458333333334;
Input data for the machine learning algorithm
Each row of the text file holds the values for one day. An unbroken sequence of days is important for solving the regression problem successfully: any gap in the data breaks the continuity of the sequence and leads to inaccuracies in training and forecasting. The semicolon is used as the separator in the text file. The extra spacing in the example data is only for clarity.
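The file layout described above can be read with a few lines of plain Python. This is only a minimal sketch, not the loader used for the article: the function name parse_rows is hypothetical, and an in-memory string stands in for the real file. It assumes one header line followed by semicolon-separated rows that may end with a trailing semicolon.

```python
import io

# Column names taken from the header of the example file.
COLUMNS = ["close_price", "news_sentiment", "twitter_sentiment", "reddit_sentiment"]

def parse_rows(text_file):
    """Return one list of four floats per day, in file order (hypothetical helper)."""
    rows = []
    for line_number, line in enumerate(text_file, start=1):
        line = line.strip()
        if not line or line.startswith(COLUMNS[0]):
            continue  # skip blank lines and the header line
        # Split on the semicolon separator and drop empty fields,
        # such as the one produced by the trailing ';'.
        fields = [field.strip() for field in line.split(";") if field.strip()]
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(COLUMNS)} values, got {len(fields)}"
            )
        rows.append([float(field) for field in fields])
    return rows

# In-memory stand-in for the real file (first two rows of the example data).
sample = io.StringIO(
    "close_price news_sentiment twitter_sentiment reddit_sentiment\n"
    "7725.43; 7.167166666666668; 5.851458333333333; 5.006708333333333;\n"
    "7603.99; 6.361833333333332; 4.2411666666666665; 4.29825;\n"
)
rows = parse_rows(sample)
print(len(rows), rows[0][0])
```

A row with a missing value has fewer than four fields, so the sketch raises an error instead of silently producing a gap in the sequence of days.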

3. Loading data into the environment

Data is almost always loaded into the software environment with libraries that ease the process. Most of these tools offer additional options for visualizing the input data, which helps in understanding it better. Missing or incorrect input values can be discovered while reviewing the data.
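Part of that review can be automated. The sketch below uses only the standard library rather than any particular loading tool, and the function name find_suspicious_rows is hypothetical. It relies on the 1 to 10 sentiment scale described earlier; the check for a positive close price is our own assumption, and the two faulty rows are made up for illustration.

```python
import math

def find_suspicious_rows(rows):
    """Return (row_index, reason) pairs for rows that look wrong (hypothetical helper)."""
    problems = []
    for index, row in enumerate(rows):
        if len(row) != 4:
            problems.append((index, "wrong number of values"))
            continue
        if any(math.isnan(value) for value in row):
            problems.append((index, "missing value (NaN)"))
            continue
        close_price, *sentiments = row
        if close_price <= 0:  # assumption: a valid close price is positive
            problems.append((index, "non-positive close price"))
        elif any(not 1 <= score <= 10 for score in sentiments):
            problems.append((index, "sentiment outside the 1 to 10 scale"))
    return problems

rows = [
    [7725.43, 7.16717, 5.85146, 5.00671],       # valid day from the example data
    [7603.99, 6.36183, float("nan"), 4.29825],  # made-up gap
    [7533.92, 64.7983, 5.356, 3.12646],         # made-up misplaced decimal point
]
for index, reason in find_suspicious_rows(rows):
    print(index, reason)
```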

4. Conversion of the data into an appropriate form

To train the neural network, the data needs to be transformed appropriately. The problem being solved is a supervised learning problem: for each set of parameters describing an example, the expected output value for that example is also given. For each example, the neural network compares its prediction with the true output value and minimizes its error function with a technique such as gradient descent, adjusting the coefficients of its weight matrices.
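To make the idea of minimizing an error function concrete, here is a toy sketch of gradient descent. It is not the network from this article: it fits a single weight w in the model y = w * x by repeatedly stepping against the gradient of the mean squared error. The function name fit_weight, the learning rate, the number of steps and the data points are all assumptions chosen for illustration.

```python
def fit_weight(xs, ys, learning_rate=0.05, steps=200):
    """Fit y = w * x by gradient descent on the mean squared error (toy example)."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of mean((w * x - y) ** 2) with respect to w.
        gradient = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # Move the weight a small step in the direction that lowers the error.
        w -= learning_rate * gradient
    return w

# Made-up data generated by y = 2 * x, so the fitted weight should approach 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w = fit_weight(xs, ys)
print(round(w, 4))
```

A real network applies the same kind of update to every coefficient of its weight matrices, with the gradients computed by backpropagation.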
The loaded data contains the entire sequence of days for which the close price and the news-based sentiment scores are known, but it is not yet in the proper form for machine learning. Each row must be matched with a value stating what the correct output of the prediction should be. In this case the correct output is the closing price for the next day. After the transformation, the data has the form shown below. An example representing one day has the values 7725.43, 7.16717, 5.85146 and 5.00671. The correct output is 7603.99, the close price for the next day.
close_price news_sentiment twitter_sentiment reddit_sentiment
7725.43; 7.16717; 5.85146; 5.00671;
7603.99; 6.36183; 4.24117; 4.29825;
7533.92; 6.47983; 5.356; 3.12646;
7414.08; 6.66996; 4.93663; 3.6385;
7009.99; 6.7755; 3.53629; 4.33946;
Input features for the machine learning algorithm
close_price
7603.99;
7533.92;
7414.08;
7009.99;
Output features for the machine learning algorithm
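The pairing shown in the two tables above can be sketched as a shift by one day. The function name make_supervised_pairs is hypothetical, and the sketch assumes the rows are already in chronological order with close_price in the first column. The last loaded day is left out of the inputs because its next-day price is not yet known.

```python
def make_supervised_pairs(rows):
    """Pair each day's four values with the next day's close price (hypothetical helper)."""
    inputs = rows[:-1]  # the last day has no known next-day price
    targets = [next_day[0] for next_day in rows[1:]]  # column 0 is close_price
    return inputs, targets

# The five example days, rounded as in the table above.
rows = [
    [7725.43, 7.16717, 5.85146, 5.00671],
    [7603.99, 6.36183, 4.24117, 4.29825],
    [7533.92, 6.47983, 5.356, 3.12646],
    [7414.08, 6.66996, 4.93663, 3.6385],
    [7009.99, 6.7755, 3.53629, 4.33946],
]
inputs, targets = make_supervised_pairs(rows)
print(inputs[0], targets[0])
```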

5. Data separation - training and test data

Machine learning data must be divided into two sets. The first is used in the training process, through which the neural network adjusts the weight coefficients of its layers. The second contains test data that has not taken part in training. Test data is used to evaluate the accuracy of the predictions; it shows how the network responds to new, unseen data. The ratio between the training and test sets varies. In the algorithm for predicting cryptocurrency prices it is 8:2: 80% of the data is used for training the network and the remaining 20% for evaluating the accuracy of the predictions.
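An 8:2 split can be sketched as follows. The function name split_train_test is hypothetical, and the placeholder values are made up. We also assume here that the split is chronological, with the earliest 80% of days used for training and the most recent 20% for testing; the text above does not say how the split is drawn, but a chronological split keeps the day sequence intact.

```python
def split_train_test(inputs, targets, train_fraction=0.8):
    """Split paired data chronologically into training and test parts (hypothetical helper)."""
    split_index = int(len(inputs) * train_fraction)
    train = (inputs[:split_index], targets[:split_index])
    test = (inputs[split_index:], targets[split_index:])
    return train, test

# Made-up placeholder data: ten days, one feature each, target is the next day's value.
inputs = [[float(day)] for day in range(10)]
targets = [float(day + 1) for day in range(10)]

(train_x, train_y), (test_x, test_y) = split_train_test(inputs, targets)
print(len(train_x), len(test_x))
```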

6. Data normalization

Data normalization is a technique often used when preparing data for machine learning. Its purpose is to convert the values of the input variables so that they all belong to the same numerical range. If the variables belong to different numerical ranges, those with larger values will have a greater impact on the output. In our case, the news-based sentiment scores have values in the range [1; 10], while the market closing price varies within the range [3000; 10000]. Without normalization, the close price would therefore dominate the other variables. The four input variables have the same importance for the problem we are solving, which is why data normalization is needed. Another reason for normalization is that the neural network is trained by the gradient descent optimization algorithm and its activation functions have an active range between -1 and 1. After data normalization is applied, all features have values in the range [-1; 1].
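One common way to bring every feature into the range [-1; 1] is min-max scaling, x' = 2 * (x - min) / (max - min) - 1, applied per column. The text above does not name the normalization method, so treat this as one possible sketch; the function names fit_min_max and scale_to_range are hypothetical.

```python
def fit_min_max(train_rows):
    """Return per-column minimums and maximums of the training rows."""
    columns = list(zip(*train_rows))
    return [min(column) for column in columns], [max(column) for column in columns]

def scale_to_range(rows, minimums, maximums):
    """Map each column linearly so that its minimum becomes -1 and its maximum 1."""
    scaled = []
    for row in rows:
        scaled_row = []
        for value, low, high in zip(row, minimums, maximums):
            if high == low:
                scaled_row.append(0.0)  # constant column: no spread to scale
            else:
                scaled_row.append(2.0 * (value - low) / (high - low) - 1.0)
        scaled.append(scaled_row)
    return scaled

# First four example days, rounded to five decimals.
train_rows = [
    [7725.43, 7.16717, 5.85146, 5.00671],
    [7603.99, 6.36183, 4.24117, 4.29825],
    [7533.92, 6.47983, 5.356, 3.12646],
    [7414.08, 6.66996, 4.93663, 3.6385],
]
minimums, maximums = fit_min_max(train_rows)
scaled = scale_to_range(train_rows, minimums, maximums)
print(scaled[0])
```

In this sketch the minimums and maximums are computed from the training rows only and then reused for the test rows, so that no information from the test set leaks into training. Test values outside the training range will then fall slightly outside [-1; 1].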
In the next article we will go through the process of building a neural network model for predicting crypto prices, tuning the network’s hyperparameters and evaluating its prediction accuracy.