This is a simple multi-factor model to estimate the market price over the next 1-3 days using neural networks.
It uses 43 factors, plus a sentiment analysis bias.
The price estimator models are implemented using TensorFlow, while the sentiment analysis relies on
a pretrained DistilBERT fine-tuned on the FinGPT dataset from HuggingFace.
fig.1
There are two basic models based on Dense layers, which use few factors:
- Sequential Model 1 (price) Stock: uses only prices and volumes.
- Sequential Model 1 (price) Stock and Rates: uses prices, volumes and rates.
These are just internal and not really interesting; they are simple multilayer perceptrons. Then there are two other models that can be used for actual price estimation:
- Sequential 1 (price) Stock Multifactor, plus a 3-price-estimate version: this is the model described in the opening fig. 1.
- Transformer 1 (price) Stock Multifactor: a model based on transformers with the same input as the previous one.
Model #3 can also be run using an LSTM in place of the RNN, but tests show the RNN produces results closer to the actuals.
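The multifactor architecture can be sketched as a two-branch Keras model: an RNN over the historical factor window, merged with the sentiment bias through a Dense head. This is a minimal illustration, not the repository's actual layer sizes or window length (both are assumptions here); swapping `SimpleRNN` for `LSTM` reproduces the comparison mentioned above.

```python
import tensorflow as tf

# Illustrative two-branch model: an RNN over the historical factor window,
# merged with a scalar sentiment bias. Sizes and window length are assumptions.
N_DAYS, N_FACTORS = 30, 43

hist_in = tf.keras.Input(shape=(N_DAYS, N_FACTORS), name="factors")
sent_in = tf.keras.Input(shape=(1,), name="sentiment")

x = tf.keras.layers.SimpleRNN(64)(hist_in)            # swap for LSTM(64) to compare
x = tf.keras.layers.Concatenate()([x, sent_in])
x = tf.keras.layers.Dense(32, activation="relu")(x)
out = tf.keras.layers.Dense(1, name="next_close")(x)  # 3 units for the 3-price variant

model = tf.keras.Model([hist_in, sent_in], out)
model.compile(optimizer="adam", loss="mse")
model.summary()
```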
The data can be fetched through a Python script that takes care of:
- Stocks last close price (SPX, NASDAQ and NYSE) - Yahoo Finance
- US Treasury yield curve tenors: '1 Mo', '2 Mo', '3 Mo', '4 Mo', '6 Mo', '1 Yr', '2 Yr', '3 Yr', '5 Yr', '7 Yr', '10 Yr', '20 Yr', '30 Yr' - treasury.gov
- Fear & Greed index value - dataviz.cnn.io
- GICS subsector index prices - Yahoo Finance
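Once fetched, the series have to be aligned on a common date index before training. A minimal sketch of that assembly with pandas (column names and values are made up for illustration; the real script defines its own schema):

```python
import pandas as pd

# Hypothetical assembly of the daily factor table from the fetched series.
dates = pd.to_datetime(["2024-01-10", "2024-01-11", "2024-01-12"])
close = pd.DataFrame({"SPX": [4783.4, 4780.2, 4783.8]}, index=dates)
yields = pd.DataFrame({"10 Yr": [4.03, 3.98, 3.96]}, index=dates)
fear_greed = pd.DataFrame({"FGI": [70, 72, 71]}, index=dates)

# Outer-join on the date index, then forward-fill gaps (e.g. bond holidays).
factors = close.join([yields, fear_greed], how="outer").ffill()
print(factors.shape)  # (3, 3)
```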
Sentiment data, i.e. news, are fetched through web scrapers using BeautifulSoup and Selenium. There are two sources:
- news from Yahoo Finance
- news from Investing.com
The news items are fetched and interpreted by the sentiment model engine, which produces a score stored in a file that must be copied into the
same data folder where the price and the other market data are stored.
See next section on how to configure the folders.
The folder configuration is set in the config.json file:
{
"FOLDER_MARKET_DATA": "/Volumes/data/",
"FOLDER_REPORD_PDF": "/Volumes/reports/"
}
These entries specify where to read the data and where to store the output report. The data location must be accessible from every Dask worker (e.g. a shared folder on a local network).
The report folder only has to be accessible from the node where the task is launched.
Make sure to copy static_tickers.list.csv into your FOLDER_MARKET_DATA.
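Reading the configuration boils down to a plain `json.load`. A small self-contained sketch (it writes the example config to a temp file first, so the paths are the made-up ones from the example above):

```python
import json
import os
import tempfile

# Write the example config to a temp file, then read it back.
cfg_text = '{"FOLDER_MARKET_DATA": "/Volumes/data/", "FOLDER_REPORD_PDF": "/Volumes/reports/"}'
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as fh:
    fh.write(cfg_text)

with open(path) as fh:
    cfg = json.load(fh)

data_dir = cfg["FOLDER_MARKET_DATA"]   # must be reachable by every Dask worker
report_dir = cfg["FOLDER_REPORD_PDF"]  # only needed on the launching node
print(data_dir, report_dir)
```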
The Sequential 1 (price) Stock Multifactor is the only model you can use as a real estimator via price_estimator.py, although you can modify the code to use other models. To trigger the report generation type:
python price_estimator.py
This is based on Dask, so you must have a Dask scheduler and at least one worker running on the same machine. The report generator creates 50 scenarios (i.e. calibration/training) for each ticker in the ticker.json file, so the computation can be very expensive. It is highly recommended to have at least 8 workers with 4 GiB each. More than 1 thread per worker is not necessary.
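A local scheduler and workers can also be started programmatically with `dask.distributed`, which is handy for a quick smoke test before running the full report. A hedged sketch (worker count and memory are scaled down from the sizing advice above; the squared-number task is just a stand-in for a calibration scenario):

```python
from dask.distributed import Client, LocalCluster

# Scaled-down local cluster: single-threaded workers, as recommended above.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit="1GiB")
client = Client(cluster)

# Each calibration scenario would be an independent task; squares stand in here.
futures = [client.submit(lambda s: s * s, s) for s in range(4)]
total = sum(client.gather(futures))
print(total)  # 0 + 1 + 4 + 9 = 14

client.close()
cluster.close()
```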
All the other models are accessible by the back_tester tool (see "Tool" section).
Sentiment analysis is based on market news fetched from Yahoo Finance and Investing.com. This is for demonstration purposes and
no commercial use or distribution of the data is allowed.
The sentiment model engine is based on DistilBERT from HuggingFace, which can be trained on
two different datasets:
Eengel7 is a very small dataset and can be used to quickly fine-tune DistilBERT and test it. For a better score, FinGPT is more appropriate, but the fine-tuning might take several hours.
The sentiment model is not exposed as a tool, but there are unit tests in sentiment_model.py.
Through those unit tests you can fine-tune, by uncommenting:
def test_fine_tune(self)
or start generating your own sentiment scores with:
def test_score_yahoo(self)
def test_score_investing(self)
Looking at these methods, it is straightforward to add an additional source. Adding an additional fine-tuning dataset should also be quite easy. Every contribution here is welcome: the more and better data we have, the better the sentiment analysis is.
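Under the hood, scoring reduces to turning per-headline class probabilities into a single bias number. A toy sketch of one possible aggregation with NumPy (the real engine uses DistilBERT logits; the logits and the three-class layout here are made up):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Row-wise softmax, numerically stabilized."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Made-up logits for 3 headlines over (negative, neutral, positive) classes.
logits = np.array([[0.2, 0.1, 2.0],
                   [1.5, 0.3, 0.1],
                   [0.1, 2.2, 0.4]])
probs = softmax(logits)

# One possible score: mean of P(positive) - P(negative), bounded in [-1, 1].
score = float(np.mean(probs[:, 2] - probs[:, 0]))
print(round(score, 3))
```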
The file config.json also contains some information about the file locations and the file naming:
This is an example of a Sequential model merging an RNN for historical factors and a Dense layer that combines the output of the RNN with the sentiment analysis data.
You can run it with the following command:
user@myserver path
$ python back_tester.py IONQ 5 S1SMF
Where IONQ is the ticker, 5 is the number of scenarios, and S1SMF is the model name. For the list of model names,
look at const_and_utils.py.
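A typical way to wire the CLI argument to a model is a small dispatch table. This sketch is hypothetical: only S1SMF comes from the text above, the other entry is invented for illustration, and the real names live in const_and_utils.py.

```python
# Hypothetical model-name dispatch; only "S1SMF" comes from the docs.
MODELS = {
    "S1SMF": "Sequential 1 (price) Stock Multifactor",
    "T1SMF": "Transformer 1 (price) Stock Multifactor",  # assumed name
}

def resolve(name: str) -> str:
    """Map a CLI model name to its description, failing loudly on typos."""
    if name not in MODELS:
        raise ValueError(f"unknown model {name!r}; expected one of {sorted(MODELS)}")
    return MODELS[name]

print(resolve("S1SMF"))
```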
This is an example of a Sequential model merging an LSTM for historical factors and a Dense layer that combines the output of the LSTM with the sentiment analysis data.
This is an example of a Transformer taking as input both historical factors and historical sentiment data.
As you can see, the RNN + Dense model performs better than the LSTM and Transformer ones. This is also confirmed in this paper.
The LSTM shows a consistent underestimation in this example, while Transformers are more affected by the typical 1-day shift in the historical series estimation.
This effect is quite frequent, as the previous day's value is a good estimator of the current day's value (meaning the correlation between the current day's price and
the previous day's price is very high).
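The high lag-1 correlation is easy to verify on a synthetic random-walk price series, which behaves much like daily closes in this respect:

```python
import numpy as np

# Synthetic daily closes as a random walk: the lag-1 correlation of the
# price *levels* is near 1, so yesterday's close is a strong naive estimator.
rng = np.random.default_rng(0)
returns = rng.normal(0.0005, 0.01, size=500)
prices = 100.0 * np.cumprod(1.0 + returns)

lag1_corr = float(np.corrcoef(prices[1:], prices[:-1])[0, 1])
print(round(lag1_corr, 3))  # very close to 1.0
```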
Sentiment analysis and rate changes have the purpose of reducing the similarity with the previous day's price that is induced as an (over-fitting) effect of the training.
The back tester, run across many stocks, shows overall good performance: in most cases the probability of getting an error of less than 5% is greater than 80%,
as shown in this report. Below is a sample page from it:
The report has been built by training the model multiple times. Each curve shows the probability density for one calibration. It is easy to see that there are good calibrations and less good ones.
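The "error less than 5% with probability greater than 80%" metric can be reproduced on synthetic errors in a few lines (the error distribution here is invented, purely to show the computation):

```python
import numpy as np

# Synthetic absolute relative errors (the real values come from back-test
# scenarios); the metric is the empirical P(|error| < 5%).
rng = np.random.default_rng(1)
errors = np.abs(rng.normal(0.0, 0.025, size=1000))
p_within_5 = float(np.mean(errors < 0.05))
print(round(p_within_5, 2))
```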
To run the report you can use the following command:
user@myserver path
$ python back_tester.py ticker.json 5 S1SMF
This tool compares the day-t estimate for t+1 with the actuals and offers an overall evaluation of the estimate made on day t.
The tool assumes the price_estimator report has been generated on both day t and t+1. Additionally, since the ticker set might change between t and t+1,
the tool checks only the tickers that are evaluated at both t and t+1.
The tool takes the day t as input, e.g.:
user@myserver path
$ python quality_checker.py 2024-01-12
and returns a chart, which can be zoomed in and out, of actual vs. estimate plus the 1, 2, 3 sigma areas. It also returns the distribution of the difference between
the estimated return and the real return in %. See the picture below:
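The intersection logic between the two days can be sketched as follows (tickers and prices are made up; the real tool reads the two price_estimator reports):

```python
import numpy as np

# Made-up day-t estimates and day-t+1 actuals; MSFT drops out of the set,
# so only the tickers present on both days are compared.
est = {"IONQ": 12.4, "AAPL": 190.1, "MSFT": 410.0}
act = {"IONQ": 12.1, "AAPL": 191.0}

common = sorted(est.keys() & act.keys())
diff_pct = np.array([(est[t] - act[t]) / act[t] * 100.0 for t in common])
print(common)  # ['AAPL', 'IONQ']
print(np.round(diff_pct, 2))
```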