In the constantly evolving field of machine learning, we now have the capability to carry out analyses and make predictions that were once thought impossible. Machine learning empowers us to identify relationships between dependent variables, forecast time-series data, and much more.
In this project, we delve into the domain of unstructured data interpretation, with a focus on data from various sources such as news articles and social media commentary. The objective is to leverage this unstructured data and apply it in the context of stock market predictions.
Our goal is to convert unstructured data from sources like social media and news into a feature for forecasting changes in stock opening prices. We utilized the VADER Sentiment Model from NLTK, fine-tuned with a finance phrase bank, to perform sentiment analysis. We also developed a web scraper to extract news within a specified timeframe and utilized random samples according to the Central Limit Theorem to represent the data. The sentiment analysis returned a compound sentiment score in the range of [-1, 1] inclusive. We experimented with two types of models, the Random Forest Regressor and the XGBoost Linear Regressor, focusing primarily on the results from the Random Forest model.
We tested three types of models: one using only opening price change data, one with sentiment included, and one with sentiment and sentiment volatility included. The results were as follows:
Model Type | R-Squared Score |
---|---|
Pure Data | 0.350 |
Sentiment Included | 0.382 |
Sentiment and Volatility | 0.409 |
Here are the prediction graphs for each model:
As evident from the graphs, the model incorporating sentiment and sentiment volatility offers a better fit. This suggests that converting unstructured data, such as news, into structured data that humans and machines can interpret, could potentially enhance tasks like stock prediction.
- Testing more advanced models designed to handle time-series data, such as LSTM, could yield improved results.
- The sentiment analysis could be enhanced by incorporating OpenAI's GPT-3.5 API, potentially providing a better-fitted sentiment score.
- As the VADER model lacks subjective opinions and only offers an out-of-the-box compound score without human tagging, it may misinterpret some articles and is incapable of classification. These limitations could be addressed in future iterations of the project.