The LSTM model serves as the primary forecasting tool, leveraging its ability to capture long-term dependencies in sequential data. However, recognizing that even sophisticated models like LSTM can have prediction biases, an ARIMA model is employed to estimate and correct these errors. By doing so, the system harnesses the strengths of both models: LSTM's deep learning capabilities for handling complex patterns and ARIMA's effectiveness in modeling time series data.
The repository includes a detailed script that outlines the entire process, from data loading and preprocessing to model training and evaluation. The data_loader function sets the stage, preparing the dataset for analysis. It's followed by a series of plotting functions that visualize various aspects of the data, such as raw time series, training versus testing sets, and prediction errors.
The LSTM model's architecture is defined with several layers, including LSTM and Dense layers, and the model is trained using the historical closing prices of financial assets. After training, the model's predictions are plotted against the actual values to visualize the performance.
The ARIMA model then steps in to calculate the error of the LSTM's predictions. These error estimates are subsequently used to adjust the LSTM predictions, resulting in a final, corrected output. This final prediction is believed to be more accurate and is visualized alongside the actual data for evaluation.
Performance metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) are calculated to quantify the accuracy of the models. The repository captures these metrics in a structured format, allowing for clear interpretation of the model's effectiveness.
Time-series data is a sequential collection of data points recorded at specific time intervals. In financial markets, time-series data primarily consists of stock prices, trading volumes, and various financial indicators gathered at regular time intervals. The significance of time-series data lies in its chronological order, a fundamental aspect that enables the identification of trends, cycles, and patterns critical for forecasting.
The inception of Long Short-Term Memory (LSTM) networks marked a pivotal advancement in the field of sequential data analysis. These networks, a specialised evolution of Recurrent Neural Network (RNN) architectures, emerged to address the challenge of preserving information over extended sequences – a hurdle where traditional RNNs faltered due to the vanishing gradient dilemma. LSTMs were ingeniously crafted to retain critical data across long intervals, ensuring that pivotal past information influences future decisions.
In my program, I have utilised TensorFlow to construct and train an LSTM-based model for time series forecasting. Let's break down how each of the LSTM components corresponds to my program:
In my program, the memory cell is represented implicitly by the LSTM layer I've added using tf.keras.layers.LSTM(number_nodes, input_shape=(n, 1)). This LSTM layer acts as the memory cell of the network. The memory cell's purpose in my program is to capture and retain information over extended sequences. It is responsible for learning and remembering patterns and dependencies in the input time series data (middle_data) over time.
The input gate is a crucial part of an LSTM unit that regulates what information should be added to the memory cell. It uses a sigmoid function to control the flow of input information and employs a hyperbolic tangent (tanh) function to create a vector of values ranging from -1 to +1.
In my program, the input gate is implicitly implemented by the LSTM layer (tf.keras.layers.LSTM) within TensorFlow. The LSTM layer manages the flow of input information, determines what information should be stored in its cell state, and applies appropriate weightings using sigmoid and tanh functions.
The forget gate is responsible for deciding which information in the memory cell should be discarded. It employs a sigmoid function to assess the importance of each piece of information in the current memory state. In my program, the forget gate's functionality is automatically handled by the LSTM layer. It learns to decide which information from the previous memory state should be forgotten or retained based on the patterns and dependencies it identifies in the input data.
The output gate extracts valuable information from the memory cell to produce the final output. It combines the content of the memory cell with the input data, employing both tanh and sigmoid functions to regulate and filter the information before presenting it as the output. In my program, the output gate's operations are also encapsulated within the LSTM layer. It takes the current memory state and the input data to produce an output that is used for making predictions.
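To make this mapping concrete, the sketch below shows how such a layer is declared in Keras. The parameter values are illustrative placeholders; my program passes `input_shape=(n, 1)` directly to the LSTM layer, which is equivalent to the explicit `Input` layer used here. All of the gate mechanics described above live inside `tf.keras.layers.LSTM` and are learned during training rather than coded by hand.

```python
import tensorflow as tf

# Illustrative values: n lagged time steps per sample, number_nodes LSTM units
n, number_nodes = 30, 64

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n, 1)),   # one feature observed over n time steps
    # The LSTM layer internally implements the memory cell together with the
    # input, forget, and output gates (sigmoid/tanh activations included).
    tf.keras.layers.LSTM(number_nodes),
    tf.keras.layers.Dense(1),       # single-value forecast head
])
model.summary()
```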
The Autoregressive Integrated Moving Average (ARIMA) model stands as a fundamental pillar within the realm of statistical time-series analysis. Its inception by Box and Jenkins in the early 1970s brought forth a powerful framework that amalgamates autoregressive (AR) and moving average (MA) elements, all while incorporating differencing to stabilise the time-series (the "I" in ARIMA). ARIMA models are celebrated for their simplicity and efficacy in modelling an extensive array of time-series data, notably for their proficiency in capturing linear relationships.
- Error Mining with ARIMA: After the LSTM's predictions, the program calls on ARIMA to refine these forecasts. The `Error_Evaluation` function comes into play here, extracting the difference between the predicted and actual prices, essentially capturing the LSTM's predictive shortcomings.
- ARIMA's Calibration: With the error data in hand, the `ARIMA_Model` function is invoked, wielding the ARIMA model as a fine brush to paint over the imperfections of the LSTM's initial output. The ARIMA model is trained on these residuals, learning to anticipate the LSTM's prediction patterns and, more importantly, its prediction errors.
- Synthesis of Predictions: The `Final_Predictions` function represents the culmination of the program's operations. It does not merely output raw predictions but synthesises the LSTM's foresight with ARIMA's insights, producing a final prediction that encapsulates the strengths of both models.
The integration of LSTM and ARIMA models presents a compelling hybrid approach to time-series forecasting. This methodology draws on the strengths of both models: LSTMs are capable of capturing complex non-linear patterns, while ARIMA excels at modelling the linear aspects of a time-series. By combining these two, one can potentially mitigate their individual weaknesses and enhance the overall predictive power.
Upon integrating LSTM and ARIMA, the model becomes robust against the volatility and unpredictability of financial time-series data. The predictions from the LSTM can be refined by the ARIMA model's error correction mechanism, which adds another layer of sophistication to the forecasts.
Comparing the predictions from the LSTM, the hybrid LSTM+ARIMA model, and the actual values, several insights emerge. The LSTM model may capture the momentum and direction of stock prices effectively, but it might struggle with precision due to its sensitivity to recent data. The ARIMA model, conversely, may lag in capturing sudden market shifts but provides a smoothed forecast that averages out noise.
The hybrid model aims to balance these aspects. The LSTM component may anticipate a trend based on recent patterns, and the ARIMA part can adjust this forecast by considering the broader historical context. The final predictions, ideally, are more aligned with the actual values than either model could achieve on its own.
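In code terms the hybrid correction is simply additive: each LSTM forecast is shifted by the ARIMA model's forecast of the LSTM's error at that step. A one-line illustration with hypothetical placeholder values:

```python
# Hypothetical values, purely for illustration of the additive correction
lstm_forecast = [101.2, 102.5, 103.1]        # raw LSTM predictions
arima_error_forecast = [0.4, -0.2, 0.1]      # ARIMA's forecast of the LSTM's errors
corrected = [p + e for p, e in zip(lstm_forecast, arima_error_forecast)]
print(corrected)
```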
Implementation of the Program:
Purpose:
The `data_loader` function is designed to load financial time-series data from a CSV file and prepare it as a DataFrame formatted for time series analysis.
Input:
The function takes no parameters but relies on a globally defined `Filename_address` variable that contains the path to the CSV file.
Processing Elements:
- Pandas Library: Utilized for its powerful data manipulation capabilities, particularly for reading CSV files and handling time series data.
- Global Variables: It uses the `Filename_address` to locate the CSV file.
- DataFrame Operations: `pd.read_csv` reads the CSV file into a DataFrame, with the 'Date' column set as the index and parsed as datetime objects for time series analysis; `dropna` removes any rows with missing values to ensure the integrity of the time series data.
Output:
The function returns a `DataFrame` object containing the clean, time-indexed financial data.
Function data_loader
Define column names as ["Open", "High", "Low", "Close", "Adj_Close", "Volume"]
Load CSV file from 'Filename_address' into a DataFrame with 'Date' as index
Set DataFrame columns to the defined column names
Drop any rows with missing values
Print the shape of the DataFrame
Print the first few rows of the DataFrame
Return the cleaned DataFrame
EndFunction
- Initialize the column names for the financial data.
- Use the Pandas function `read_csv` to read the data from the CSV file specified by the `Filename_address`.
- Set the index of the DataFrame to the 'Date' column, which is parsed as datetime.
- Assign the predefined column names to the DataFrame to maintain consistency.
- Remove any rows with missing data to ensure the data quality for subsequent analysis.
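A minimal sketch of how `data_loader` might look in code, assuming the global `Filename_address` points at a CSV whose data columns match the six names above (the path shown is a placeholder):

```python
import pandas as pd

Filename_address = "data/stock_prices.csv"  # assumed global: path to the input CSV

def data_loader():
    """Load the CSV into a date-indexed DataFrame and drop incomplete rows."""
    column_names = ["Open", "High", "Low", "Close", "Adj_Close", "Volume"]
    df = pd.read_csv(Filename_address, index_col="Date", parse_dates=True)
    df.columns = column_names   # enforce the predefined column names
    df = df.dropna()            # discard rows with missing values
    print(df.shape)
    print(df.head())
    return df
```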
Purpose:
The `plot_predictions` function is designed to visualize the actual vs. predicted financial time-series data. It generates a plot that overlays the predicted values over the actual values, allowing for a visual comparison.
Input:
- `train`: A pandas Series or DataFrame containing the actual values indexed by date.
- `predictions`: A pandas Series or DataFrame containing the predicted values, expected to be of the same length and with the same index as `train`.
- `title`: A string representing the title of the plot, which will also be used in naming the saved plot file.
Processing Elements:
- Matplotlib Library: Used for creating visualizations.
- Global Variables: Utilizes `Output_address` to determine the save path for the plot image.
Output:
- The function saves a .jpg image file of the plot to the location specified by `Output_address`, with the given `title` as its name.
- No value is returned by the function.
Function plot_predictions with parameters: train, predictions, title
Initialize a new figure with specified dimensions (10x5 inches)
Plot the 'train' data with the index on the x-axis and values on the y-axis, labeled as 'Actual'
Plot the 'predictions' data on the same axes, labeled as 'Predicted' in red color
Set the title of the plot
Set the x-axis label as 'Date'
Set the y-axis label as 'Close-Price'
Concatenate the `Output_address` with the `title` and ".jpg" to form the file path
Save the figure to the file path
EndFunction
- Start by creating a new figure with the defined size.
- Plot the actual values (`train`) against their date index, labeling this line as 'Actual'.
- Plot the predicted values (`predictions`) on the same plot, using a different color and labeling it 'Predicted'.
- Assign the provided `title` to the plot.
- Label the x-axis as 'Date' and the y-axis as 'Close-Price' to indicate what the axes represent.
- Combine the `Output_address` directory path with the `title` of the plot to create the full file path for saving.
- Save the figure as a .jpg file at the determined file path.
- The plot is now saved to the local file system, and the function terminates without returning any value.
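A sketch of `plot_predictions` along these lines, assuming a global `Output_address` directory (the path shown is a placeholder):

```python
import matplotlib.pyplot as plt

Output_address = "output/"  # assumed global: directory for saved figures

def plot_predictions(train, predictions, title):
    """Overlay predicted values on the actual series and save the figure as <title>.jpg."""
    plt.figure(figsize=(10, 5))
    plt.plot(train.index, train, label="Actual")
    plt.plot(train.index, predictions, label="Predicted", color="red")
    plt.title(title)
    plt.xlabel("Date")
    plt.ylabel("Close-Price")
    plt.legend()
    plt.savefig(Output_address + title + ".jpg")
    plt.close()
```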
Purpose:
The `plot_train_test` function generates a plot to visualize the partition of financial time-series data into training and testing sets. This visual aid is important to verify the partitioning and observe the continuity and potential discrepancies between the train and test sets.
Input:
- `train`: A pandas Series or DataFrame containing the training set data, indexed by date.
- `test`: A pandas Series or DataFrame containing the testing set data, indexed by date.
Processing Elements:
- Matplotlib Library: Used for creating and saving the plot.
- Global Variables: The function uses `Output_address` for determining where to save the output image.
Output:
- The function outputs a plot saved as a .jpg file to the location specified by `Output_address`. The plot displays the training and testing data series.
Function plot_train_test with parameters: train, test
Initialize a new figure with a size of 10x5 inches
Plot the 'train' series against its index with a label 'Train Set'
Plot the 'test' series against its index with a label 'Test Set' and set the color to orange
Set the title of the plot to 'Train and Test Data'
Set the x-axis label to 'Date'
Set the y-axis label to 'Close Price'
Concatenate `Output_address` with the filename ' Train and Test Data .jpg'
Save the figure to the specified address
EndFunction
- Begin by initiating a new figure for plotting with specified dimensions (10x5 inches).
- Plot the training dataset (`train`) on the figure, with dates on the x-axis and training data values on the y-axis, labeling it as 'Train Set'.
- Plot the testing dataset (`test`) on the same figure, with dates on the x-axis and testing data values on the y-axis, labeling it as 'Test Set' and using a distinct orange color for differentiation.
- Title the plot 'Train and Test Data' to describe the plotted data.
- Label the x-axis as 'Date' to indicate the time component and the y-axis as 'Close Price' to denote the financial metric plotted.
- Construct the file path for saving the plot by combining `Output_address` with the designated file name ' Train and Test Data .jpg'.
- Save the plot to the constructed file path.
- The function concludes after saving the plot, and it does not return any values.
Purpose:
The `plot_prediction_errors` function is used to visualize the errors over time between actual and predicted values in a time series forecasting model. This can help in identifying patterns or biases in the prediction errors.
Input:
- `errors`: A list or pandas Series containing the prediction errors, typically calculated as the difference between actual and predicted values.
Processing Elements:
- Matplotlib Library: This function utilizes Matplotlib to create and save a visualization plot of the prediction errors.
- Global Variables: `Output_address` is used to determine where the plot image will be saved.
Output:
- The function saves a .jpg file of the error plot to the directory specified by `Output_address`.
Function plot_prediction_errors with parameter: errors
Initialize a new figure with a size of 10x5 inches
Plot 'errors' with labeling as 'Prediction Errors'
Set the title of the plot to 'Prediction Errors over Time'
Set the x-axis label to 'Time Step'
Set the y-axis label to 'Error'
Create a legend for the plot
Form the save address by concatenating `Output_address` with ' Prediction Errors over Time .jpg'
Save the figure to the address
EndFunction
- Initiate a new figure with the specified dimensions for the plot.
- Plot the errors provided by the `errors` parameter against their corresponding time step.
- Title the plot 'Prediction Errors over Time' to accurately reflect the data being visualized.
- Label the x-axis as 'Time Step' to represent the sequential nature of the data points.
- Label the y-axis as 'Error' to represent the magnitude of the prediction errors.
- Add a legend to the plot for clarity, which describes the data series plotted.
- Construct the full file path where the plot will be saved by appending ' Prediction Errors over Time .jpg' to the `Output_address`.
- Save the plot to the specified file path.
- The function completes its execution after the plot is saved, without returning any values.
Purpose:
`plot_final_predictions` is designed to create a visualization comparing the actual values from the test dataset with the final corrected predictions. This helps to assess the accuracy and effectiveness of the error correction applied to the predictive model.
Input:
- `test`: A pandas Series or DataFrame containing the test set data, indexed by date.
- `final_predictions`: A pandas Series or DataFrame of the same length and with the same index as `test`, containing the final predictions after error correction.
Processing Elements:
- Matplotlib Library: It is utilized for plotting and saving the comparison plot.
- Global Variables: The function requires `Output_address` to define the path where the plot image will be saved.
Output:
- The function outputs a plot saved as a .jpg file to the location determined by `Output_address`. The plot displays the actual values and the corrected predictions.
Function plot_final_predictions with parameters: test, final_predictions
Initialize a new figure with a size of 10x5 inches
Plot the 'test' series against its index with a label 'Actual'
Plot the 'final_predictions' series against the same index with a label 'Corrected Prediction' in green color
Set the title of the plot to 'Final Predictions with Error Correction'
Set the x-axis label to 'Date'
Set the y-axis label to 'Close Price'
Create a legend for the plot
Form the save address by concatenating `Output_address` with the file name ' Final Predictions with Error Correction .jpg'
Save the figure to the constructed address
EndFunction
- Begin by initiating a new plotting figure with the given dimensions.
- Plot the actual test data (`test`) with the date index on the x-axis and close prices on the y-axis, labeled as 'Actual'.
- Plot the final corrected predictions (`final_predictions`) on the same axes, labeling it as 'Corrected Prediction' and using green color for distinction.
- Title the plot 'Final Predictions with Error Correction' to describe its purpose.
- Label the x-axis 'Date' and the y-axis 'Close Price' to indicate what the plot represents.
- Add a legend to the plot to identify the data series.
- Construct the file path for saving the plot by combining `Output_address` with the file name ' Final Predictions with Error Correction .jpg'.
- Save the plot to the determined file path.
- The function concludes after saving the plot, and it does not return any value.
Purpose:
The `plot_accuracy` function generates a bar chart to visually represent the accuracy metrics of a predictive model. These metrics typically include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE).
Input:
- `mse`: A numerical value representing the Mean Squared Error.
- `rmse`: A numerical value representing the Root Mean Squared Error.
- `mae`: A numerical value representing the Mean Absolute Error.
Processing Elements:
- Matplotlib Library: Used for plotting and saving the accuracy metrics as a bar chart.
- Global Variables: The function uses `Output_address` to determine the directory path where the plot image will be saved.
Output:
- The function outputs a bar chart saved as a .jpg file to the directory specified by `Output_address`.
Function plot_accuracy with parameters: mse, rmse, mae
Define a list 'metrics' with the values 'MSE', 'RMSE', 'MAE'
Define a list 'values' with the input parameters mse, rmse, mae
Initialize a new figure with a size of 10x5 inches
Plot a bar chart with 'metrics' as the x-axis and 'values' as the heights of the bars
Assign different colors to each bar for distinction
Set the title of the plot to 'Model Accuracy Metrics'
Form the save address by concatenating `Output_address` with the file name ' Model Accuracy Metrics .jpg'
Save the figure to the specified address
EndFunction
- Define the names of the metrics to be plotted (MSE, RMSE, MAE) in a list.
- Gather the provided accuracy metric values into a list corresponding to the metric names.
- Initialize a new plotting figure with predetermined dimensions (10x5 inches).
- Create a bar chart with the metric names on the x-axis and their corresponding values as the heights of the bars, with each bar colored differently for easy distinction.
- Title the plot 'Model Accuracy Metrics' to clearly indicate what the chart represents.
- Determine the file path for saving the plot by appending ' Model Accuracy Metrics .jpg' to the `Output_address`.
- Save the bar chart to the constructed file path.
- The function ends after the bar chart is saved and does not return any values.
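A sketch of `plot_accuracy` under the same assumption about a global `Output_address` directory:

```python
import matplotlib.pyplot as plt

Output_address = "output/"  # assumed global: directory for saved figures

def plot_accuracy(mse, rmse, mae):
    """Render the three error metrics as a bar chart and save it."""
    metrics = ["MSE", "RMSE", "MAE"]
    values = [mse, rmse, mae]
    plt.figure(figsize=(10, 5))
    plt.bar(metrics, values, color=["blue", "orange", "green"])  # one colour per bar
    plt.title("Model Accuracy Metrics")
    plt.savefig(Output_address + " Model Accuracy Metrics .jpg")
    plt.close()
```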
Purpose:
The `plot_arima_accuracy` function visualizes the accuracy metrics specific to an ARIMA model using a bar chart. This visualization assists in the evaluation of the model's performance by representing Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) as bar heights.
Input:
- `mse`: A numeric value indicating the Mean Squared Error.
- `rmse`: A numeric value indicating the Root Mean Squared Error.
- `mae`: A numeric value indicating the Mean Absolute Error.
Processing Elements:
- Matplotlib Library: Employs matplotlib to create and save a bar chart.
- Global Variables: The function utilizes `Output_address` for the path where the bar chart will be saved.
Output:
- This function outputs a bar chart saved as a .jpg file in the directory specified by `Output_address`.
Function plot_arima_accuracy with parameters: mse, rmse, mae
Define a list 'metrics' with elements 'MSE', 'RMSE', 'MAE'
Define a list 'values' with the input parameters mse, rmse, mae
Initialize a new figure with dimensions of 10 by 5 inches
Create a bar chart with 'metrics' on the x-axis and 'values' as the bar heights
Assign specific colors to each bar (blue for MSE, orange for RMSE, green for MAE)
Set the chart title to 'ARIMA Model Accuracy Metrics'
Determine the save address by concatenating `Output_address` with ' Model Accuracy Metrics .jpg'
Save the figure to the defined address
EndFunction
- Initialize a list called `metrics` with the names of the accuracy metrics to be displayed.
- Create a list called `values` containing the values of MSE, RMSE, and MAE passed to the function.
- Begin a new plot with a figure size set to 10x5 inches.
- Plot a bar chart where the x-axis contains the metric names from `metrics` and the y-axis corresponds to their respective values from `values`.
- Assign a distinct color to each bar to visually differentiate between the metrics.
- Title the plot 'ARIMA Model Accuracy Metrics' to clearly convey the plot's focus.
- Formulate the full file path for saving the chart by appending ' Model Accuracy Metrics .jpg' to the `Output_address`.
- Save the bar chart to the file path that was created.
- The function terminates after the plot is saved, without returning any value.
Purpose:
The `data_allocation` function is tasked with partitioning a given dataset into training and testing sets for model development and evaluation. This split is essential for assessing the model's performance on unseen data.
Input:
- `data`: A pandas DataFrame that contains the time series data with one of the columns being `close`, representing the closing price, which is typically used in financial time series forecasting.
Processing Elements:
- Global Variables:
  - `days`: The number of entries from the end of the dataset to be allocated to the test set.
  - `close`: A string that denotes the column name for the closing prices in the `data` DataFrame.
Output:
- `train`: A pandas Series or DataFrame containing the training set data.
- `test`: A pandas Series or DataFrame containing the testing set data.
Function data_allocation with parameter: data
Calculate train_len_val by subtracting the number of days (global variable) from the length of the data
Split the 'data' into 'train' and 'test' sets by slicing:
'train' contains all entries from start up to train_len_val
'test' contains all entries from train_len_val to the end
Print the training set and its size
Print the testing set and its size
Return the 'train' and 'test' sets
EndFunction
- Determine the length of the training set by subtracting the global variable `days` from the total length of the dataset.
- Allocate the first segment of the dataset up to the determined length to the training set.
- Allocate the remaining segment from the determined length to the end of the dataset to the testing set.
- Print a descriptive message followed by the training set and its size to provide an immediate visual confirmation of the data partitioning.
- Print a descriptive message followed by the testing set and its size for the same reasons as above.
- Return both the training set and the testing set to be used in subsequent stages of the model development and evaluation process.
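A sketch of `data_allocation`, assuming the globals `days` (test-set length) and `close` (closing-price column name); the values shown for both are placeholders:

```python
days = 30        # assumed global: number of trailing observations reserved for testing
close = "Close"  # assumed global: name of the closing-price column

def data_allocation(data):
    """Split the closing-price series into training and testing segments."""
    train_len_val = len(data) - days
    train = data[close][:train_len_val]   # everything up to the split point
    test = data[close][train_len_val:]    # the final 'days' observations
    print("Training set:\n", train, "\nSize:", len(train))
    print("Testing set:\n", test, "\nSize:", len(test))
    return train, test
```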
Purpose:
The `apply_transform` function is designed to transform time series data into a format suitable for training LSTM (Long Short-Term Memory) networks. The transformation involves creating sequences of `n` previous data points (lags) to predict the next value.
Input:
- `data`: A pandas Series or numpy array containing the time series data.
- `n`: An integer that defines the number of lags, i.e., the size of the input sequence for the LSTM model.
Processing Elements:
- NumPy Library: Used for numerical operations and to transform the list of sequences into a numpy array suitable for the LSTM input.
- List Comprehension: Constructs the sequences of lags (input data) and the target values (what the model will learn to predict).
Output:
- `middle_data`: A numpy array of shape `(number of sequences, n, 1)`, where each sequence is a sliding window of `n` lagged values from the `data`.
- `target_data`: A numpy array containing the target values corresponding to each sequence in `middle_data`.
Function apply_transform with parameters: data, n
Initialize an empty list called 'middle_data'
Initialize an empty list called 'target_data'
Loop over the data starting from index n to the end of the data:
Extract a sequence of 'n' values from 'data' ending at the current index
Append the sequence to 'middle_data'
Append the current value of 'data' to 'target_data'
Convert 'middle_data' into a numpy array and reshape it to (len(middle_data), n, 1)
Convert 'target_data' into a numpy array
Return 'middle_data' and 'target_data'
EndFunction
- Initialize two empty lists: `middle_data` for storing the input sequences and `target_data` for the corresponding target values.
- Iterate over the `data` series starting from the `n`th element to the end.
- For each iteration, extract a sequence of `n` values from the `data` series leading up to the current index and append this sequence to `middle_data`.
- Append the value at the current index of the `data` series to `target_data` as the target value for the previously extracted sequence.
- After the loop, convert `middle_data` into a numpy array and reshape it to have the dimensions suitable for LSTM input, which is `(number of sequences, n, 1)`.
- Convert `target_data` into a numpy array without reshaping since it represents the target values.
- Return the `middle_data` and `target_data` arrays for use in training the LSTM model.
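A compact sketch of `apply_transform` as described:

```python
import numpy as np

def apply_transform(data, n):
    """Build sliding windows of n lagged values plus the value that follows each window."""
    values = np.asarray(data)
    middle_data, target_data = [], []
    for i in range(n, len(values)):
        middle_data.append(values[i - n:i])  # the n values preceding index i
        target_data.append(values[i])        # the value at index i (the target)
    middle_data = np.array(middle_data).reshape(len(middle_data), n, 1)
    target_data = np.array(target_data)
    return middle_data, target_data
```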
Purpose:
The `LSTM` function builds, compiles, and trains a Long Short-Term Memory (LSTM) neural network model using the provided time series training data. The model aims to predict future values in the series based on the input sequences of historical data.
Input:
- `train`: A pandas Series or numpy array containing the time series training data.
- `n`: An integer defining the number of lagged data points to use as input for the LSTM model.
- `number_nodes`: The number of neurons in each LSTM and Dense layer of the neural network.
- `learning_rate`: The learning rate for the optimizer during training.
- `epochs`: The number of epochs to train the model.
- `batch_size`: The number of samples per gradient update during training.
Processing Elements:
- TensorFlow and Keras: Utilized for creating the LSTM model, compiling it, and fitting it to the training data.
- apply_transform Function: Called to transform the training data into sequences suitable for LSTM input.
- Sequential Model API: Used for stacking layers to build the LSTM model.
- Adam Optimizer: An algorithm for first-order gradient-based optimization of stochastic objective functions.
Output:
- `model`: The trained Keras Sequential LSTM model.
- `history`: A record of training loss and metric values at successive epochs.
- `full_predictions`: The model's predictions for the input data used during training.
Function LSTM with parameters: train, n, number_nodes, learning_rate, epochs, batch_size
Transform 'train' data into sequences and targets using apply_transform function
Initialize a Sequential LSTM model
Add Input layer with shape (n,1)
Add LSTM layer with 'number_nodes' neurons
Add two Dense layers each with 'number_nodes' neurons and 'relu' activation
Add a Dense output layer with a single neuron
Compile the model with 'mse' loss function, Adam optimizer with 'learning_rate', and 'mean_absolute_error' metric
Fit the model to 'middle_data' and 'target_data' for 'epochs' with 'batch_size', without verbosity
Predict on 'middle_data' to obtain full predictions
Return the model, training history, and full predictions
EndFunction
- Call `apply_transform` with the training data `train` and the lag value `n` to prepare the input and target data for the LSTM.
- Define the LSTM model architecture using the Sequential API from Keras with an input layer, LSTM layer, two dense layers, and an output layer.
- Compile the LSTM model with the mean squared error loss function, Adam optimizer with the specified learning rate, and mean absolute error as a performance metric.
- Train the model on the transformed data for the given number of epochs and batch size.
- After training, use the model to predict on the input data to get the full set of predictions.
- Output the trained model, the history of its performance over the epochs, and the full predictions array.
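A sketch of the `LSTM` function assembled from the steps above; layer sizes, activations, and training settings follow the pseudocode, and anything beyond that is an assumption:

```python
import tensorflow as tf

def LSTM(train, n, number_nodes, learning_rate, epochs, batch_size):
    """Build, train, and run an LSTM regressor on lagged windows of the training series."""
    middle_data, target_data = apply_transform(train, n)
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n, 1)),
        tf.keras.layers.LSTM(number_nodes),
        tf.keras.layers.Dense(number_nodes, activation="relu"),
        tf.keras.layers.Dense(number_nodes, activation="relu"),
        tf.keras.layers.Dense(1),  # single forecast value
    ])
    model.compile(
        loss="mse",
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        metrics=["mean_absolute_error"],
    )
    history = model.fit(middle_data, target_data,
                        epochs=epochs, batch_size=batch_size, verbose=0)
    full_predictions = model.predict(middle_data, verbose=0)
    return model, history, full_predictions
```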
Purpose:
The function `calculate_accuracy` computes common statistical accuracy metrics to evaluate the performance of regression models, specifically Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE).
Input:
- `true_values`: An array-like structure, typically a numpy array or pandas Series, that contains the actual observed values.
- `predictions`: An array-like structure with the predicted values, expected to be of the same length as `true_values`.
Processing Elements:
- Mean Squared Error (MSE): This metric measures the average of the squares of the errors, i.e., the average squared difference between the estimated values and the actual value.
- Root Mean Squared Error (RMSE): It is the square root of the MSE and measures the standard deviation of the residuals.
- Mean Absolute Error (MAE): This metric measures the average magnitude of the errors in a set of predictions, without considering their direction.
Output:
- `mse`: A float representing the Mean Squared Error.
- `rmse`: A float representing the Root Mean Squared Error.
- `mae`: A float representing the Mean Absolute Error.
Function calculate_accuracy with parameters: true_values, predictions
Calculate MSE by taking the mean of the squared differences between true_values and predictions
Calculate RMSE by taking the square root of MSE
Calculate MAE by taking the mean of the absolute differences between true_values and predictions
Return mse, rmse, mae
EndFunction
- Utilize the `mean_squared_error` function from sklearn.metrics to calculate the MSE between the `true_values` and `predictions`.
- Compute the RMSE by taking the square root of the MSE using numpy's `sqrt` function.
- Calculate the MAE using the `mean_absolute_error` function from sklearn.metrics.
- Return the computed values of MSE, RMSE, and MAE to be used as accuracy metrics for the model evaluation.
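A sketch of `calculate_accuracy` using scikit-learn and NumPy as described:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

def calculate_accuracy(true_values, predictions):
    """Return MSE, RMSE, and MAE for predictions against observed values."""
    mse = mean_squared_error(true_values, predictions)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(true_values, predictions)
    return mse, rmse, mae
```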
Purpose:
The `Error_Evaluation` function is designed to calculate the errors between the actual training data and the predictions made by the LSTM model. This can be used for further analysis of the model's performance and error correction.
Input:
- `train_data`: A pandas Series or numpy array containing the actual observed training values.
- `predict_train_data`: A pandas Series or numpy array containing the predicted values obtained from the LSTM model, expected to be of the same length as `train_data` after accounting for the lag `n`.
- `n`: An integer representing the number of lagged observations used in the LSTM model (the size of the input sequence).
Processing Elements:
- List Comprehension: Iterates through the predicted data to compute the difference with the actual data, point by point.
Output:
- `errors`: A list of error values representing the difference between the actual and predicted values.
Function Error_Evaluation with parameters: train_data, predict_train_data, n
Initialize an empty list called 'errors'
Loop through the indices of predict_train_data:
Calculate the error at each point as the difference between the actual value (train_data at index n+i) and the predicted value (predict_train_data at index i)
Append the error to the 'errors' list
Return the 'errors' list
EndFunction
- Initialize an empty list to store the error values.
- Iterate over the predicted training data.
- For each predicted value, calculate the error by subtracting the predicted value from the actual value (considering the lag `n`).
- Store each error value in the list.
- Return the complete list of errors after the iteration is finished. This list can be used to analyze the distribution and pattern of errors made by the model during training.
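A sketch of `Error_Evaluation`; flattening both inputs to 1-D arrays for positional indexing is my own simplification:

```python
import numpy as np

def Error_Evaluation(train_data, predict_train_data, n):
    """Point-by-point differences between actual values and the LSTM's in-sample predictions."""
    actual = np.asarray(train_data).ravel()
    predicted = np.asarray(predict_train_data).ravel()
    errors = []
    for i in range(len(predicted)):
        # the actual value is offset by n: the first n observations were only used as lags
        errors.append(actual[n + i] - predicted[i])
    return errors
```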
Purpose:
The `Parameter_calculation` function aims to determine the optimal parameters for an ARIMA (Autoregressive Integrated Moving Average) model using the given time series data. It also generates plots for the Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF), which are helpful for identifying the ARIMA model's parameters.
Input:
- `data`: A pandas Series or numpy array containing the time series data.
Processing Elements:
- auto_arima from pmdarima: This is a function that automates the process of ARIMA modeling, including the selection of optimal parameters.
- plot_acf from statsmodels: Generates an ACF plot, which is used to identify the number of MA (Moving Average) terms.
- plot_pacf from statsmodels: Generates a PACF plot, which is used to identify the number of AR (Autoregressive) terms.
- Global Variables:
  - `lag`: Used to set the number of lags in the ACF and PACF plots.
  - `Output_address`: Used to specify the directory path where the ACF and PACF plot images will be saved.
Output:
- `ord`: A tuple representing the order of the ARIMA model, which consists of (p, d, q) parameters, where 'p' is the number of AR terms, 'd' is the degree of differencing, and 'q' is the number of MA terms.
Function Parameter_calculation with parameter: data
Run auto_arima on 'data' with tracing enabled to find optimal parameters
Plot the ACF of 'data' using the global 'lag' variable
Save the ACF plot to the 'Output_address' directory with the filename "ACF.jpg"
Plot the PACF of 'data' using the global 'lag' variable
Save the PACF plot to the 'Output_address' directory with the filename "PACF.jpg"
Extract the order (p, d, q) of the ARIMA model from the findings of auto_arima
Return the order of the ARIMA model
EndFunction
- Execute the `auto_arima` function on the input `data` to automatically determine the best-fitting ARIMA model parameters while printing the trace of the fitting process.
- Plot the ACF for the given `data` up to the number of lags specified by `lag`.
- Save the ACF plot to the specified `Output_address` directory with the appropriate filename.
- Plot the PACF for the given `data` up to the number of lags specified by `lag`.
- Save the PACF plot to the specified `Output_address` directory with the appropriate filename.
- Retrieve the order of the ARIMA model (p, d, q) from the results of the `auto_arima` function.
- Return the ARIMA model order for use in subsequent model fitting.
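A sketch of `Parameter_calculation`, assuming the globals `lag` and `Output_address` (the values shown are placeholders):

```python
import matplotlib.pyplot as plt
from pmdarima import auto_arima
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

lag = 40                    # assumed global: number of lags for the ACF/PACF plots
Output_address = "output/"  # assumed global: directory for saved figures

def Parameter_calculation(data):
    """Select an ARIMA order automatically and save ACF/PACF diagnostic plots."""
    finding = auto_arima(data, trace=True)   # stepwise search, printing its trace
    plot_acf(data, lags=lag)
    plt.savefig(Output_address + "ACF.jpg")
    plot_pacf(data, lags=lag)
    plt.savefig(Output_address + "PACF.jpg")
    ord = finding.order                      # the selected (p, d, q) tuple
    return ord
```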
Purpose:
The `ARIMA_Model` function fits an ARIMA model to the training data and uses it to make predictions. The primary use in this context is to forecast the potential errors from an LSTM model, which can then be used for error correction in the LSTM's predictions.
Input:
- `train`: A pandas Series or numpy array containing the training set data used to fit the ARIMA model.
- `len_test`: An integer representing the length of the test dataset, which dictates how many future steps to predict.
- `ord`: A tuple indicating the order of the ARIMA model, typically obtained from the `Parameter_calculation` function, which consists of (p, d, q) parameters.
Processing Elements:
- ARIMA from statsmodels: A class that represents an ARIMA model, used here for time series forecasting.
- Fitting the Model: The ARIMA model is fitted to the training data using the provided order parameters.
- Predictions: The model is used to make predictions for the specified future time steps.
Output:
- `model`: The fitted ARIMA model object.
- `predictions`: The forecasts from the model starting from the end of the training set to the length of the test set.
- `full_predictions`: The full set of in-sample predictions for the training data.
Function ARIMA_Model with parameters: train, len_test, ord
Initialize an ARIMA model with 'train' data and 'ord' order
Fit the ARIMA model to the 'train' data
Make predictions from the end of 'train' data up to the length of the test set plus one
Make full in-sample predictions for the 'train' data
Return the fitted model, out-of-sample predictions, and in-sample predictions
EndFunction
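A sketch of this function in Python using statsmodels; the exact start and end indices passed to `predict` are an assumption based on the pseudocode above:

```python
from statsmodels.tsa.arima.model import ARIMA

def ARIMA_Model(train, len_test, ord):
    """Fit an ARIMA(p, d, q) model and produce in-sample and out-of-sample predictions."""
    model = ARIMA(train, order=ord).fit()
    # out-of-sample forecast covering the test horizon
    predictions = model.predict(start=len(train), end=len(train) + len_test - 1)
    # in-sample predictions over the training range
    full_predictions = model.predict(start=0, end=len(train) - 1)
    return model, predictions, full_predictions
```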
- Instantiate an ARIMA model with the training data `train` and the order parameters `ord`.
- Fit the model to the training data using the `fit()` method.
- Use the `predict` method of the fitted model to forecast future values for a range starting at the end of the training set and extending `len_test` steps into the future.
- Also, generate a full set of in-sample predictions for the training data, which covers the entire range of the training set.
- Return the fitted ARIMA model, the out-of-sample predictions for error correction, and the in-sample predictions for evaluation purposes.
Purpose:
The `Final_Predictions` function calculates the final forecasted values by adjusting the LSTM model predictions with the ARIMA model-predicted errors. This technique is often used in hybrid models to correct predictions from one model using insights from another.
Input:
- `predictions_errors`: A list or pandas Series containing the errors between the actual values and the LSTM model's predictions, as forecasted by the ARIMA model.
- `predictions`: A list or pandas Series containing the LSTM model's predictions.
Processing Elements:
- List Iteration: A loop that runs through the number of `days` (a globally set variable), combining the predictions from the LSTM model and the errors predicted by the ARIMA model.
Output:
- `final_values`: A list of the corrected predictions after accounting for the ARIMA-predicted errors.
Function Final_Predictions with parameters: predictions_errors, predictions
Initialize an empty list 'final_values'
Loop over the range of 'days' (global variable):
Calculate the final value by adding the prediction error to the LSTM prediction at each index
Append the final value to 'final_values'
Return 'final_values'
EndFunction
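A sketch of this logic in Python, assuming the global `days` equals the test horizon; converting the inputs to plain lists for positional access is my own simplification:

```python
days = 30  # assumed global: length of the test horizon

def Final_Predictions(predictions_errors, predictions):
    """Correct each LSTM forecast by adding the ARIMA-predicted error for the same step."""
    errors = list(predictions_errors)  # positional access regardless of the original index
    preds = list(predictions)
    final_values = []
    for i in range(days):
        final_values.append(preds[i] + errors[i])
    return final_values
```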
- Start by creating an empty list `final_values` to store the adjusted predictions.
- Loop through a range of indices defined by the global variable `days`, which determines how many final predictions to calculate.
- At each iteration, add the corresponding prediction error from `predictions_errors` to the LSTM prediction from `predictions` and append the result to `final_values`.
- After the loop completes, return `final_values`, which contains the final adjusted predictions.
Purpose:
The `main` function orchestrates the entire process of loading data, preparing it, training the LSTM model, making predictions, evaluating errors, and generating various plots and outputs. It serves as the entry point for running the time series forecasting program.
Input:
There are no direct inputs to the `main` function; it relies on global variables and the functions it calls to operate on the data.
Processing Elements:
- Data loading and plotting functions: `data_loader`, `plot_raw_data`
- Data partitioning function: `data_allocation`
- Model training and prediction functions: `LSTM`, `Error_Evaluation`, `Parameter_calculation`, `ARIMA_Model`, `Final_Predictions`
- Accuracy calculation function: `calculate_accuracy`
- Plotting accuracy and errors: `plot_train_test`, `plot_predictions`, `plot_prediction_errors`, `plot_final_predictions`, `plot_accuracy`, `plot_arima_accuracy`
- File writing: Outputs model summaries and predictions to a text file.
Output:
The `main` function does not return any value. Its outputs are:
- Plots saved as images in the specified output directory.
- Console prints of model summaries and accuracy metrics.
- A text file saved with detailed model information and predictions.
Function main
Load data using data_loader function
Plot raw data using plot_raw_data function
Partition data into training and testing sets using data_allocation function
Plot training and testing data using plot_train_test function
Start timing the LSTM model process
Train LSTM model using LSTM function
Plot LSTM predictions using plot_predictions function
Make new predictions using the trained LSTM model
Evaluate errors in LSTM predictions using Error_Evaluation function
Plot prediction errors using plot_prediction_errors function
Calculate accuracy of LSTM predictions using calculate_accuracy function
Plot LSTM accuracy using plot_accuracy function
Determine ARIMA model parameters using Parameter_calculation function
Fit ARIMA model and make predictions on errors using ARIMA_Model function
Calculate ARIMA model accuracy and plot it using plot_arima_accuracy function
Calculate final predictions by combining LSTM predictions and ARIMA predicted errors using Final_Predictions function
Plot final predictions using plot_final_predictions function
Write LSTM and ARIMA model details, predictions, and accuracies to an output text file
Print the time taken for the entire process
EndFunction
Call main function if the script is the main program
- Call `data_loader` to load the dataset.
- Call `plot_raw_data` to visualize the raw dataset.
- Call `data_allocation` to split the data into training and testing sets.
- Call `plot_train_test` to visualize the training and testing datasets.
- Train the LSTM model by calling `LSTM` and plot its predictions.
- Generate predictions for the test set using the trained LSTM model.
- Evaluate prediction errors with `Error_Evaluation` and visualize them.
- Calculate and print the LSTM model's accuracy, plotting the results.
- Determine the best ARIMA model parameters and fit the ARIMA model to predict errors.
- Plot ARIMA model accuracy.
- Combine LSTM predictions with ARIMA-predicted errors using `Final_Predictions` and visualize the final predictions.
- Write all relevant outputs, including model summaries, accuracies, and predictions, to a text file.
- Print the total time taken for the process.
- Execute the `main` function if the script is run as the main program.