A financial analysis AI system built by fine-tuning Llama-3-8B with LoRA on 10-K Q&A data, paired with a RAG pipeline that retrieves SEC filings in real time. The system processes Sections 1A (Risk Factors) and 7 (MD&A) using open-source embeddings (BAAI/bge-large-en-v1.5) to provide contextually accurate answers to complex financial queries, combining parameter-efficient LLM training with semantic search for scalable enterprise analysis.
This project combines:
- A LLaMa-3 model (meta-llama/Meta-Llama-3-8B-Instruct) fine-tuned with LoRA adapters on 10-K Q&A data using the Unsloth AI framework.
- An SEC data pipeline for retrieving 10-K filings in real time using the SEC API.
- Local embeddings with BAAI/bge-large-en-v1.5 model for semantic search to provide contextually accurate answers to complex financial queries.
- In-memory vector storage for contextual retrieval of 10-K filings.
- A RAG pipeline to inject context into the LLM's inference process.
The system answers financial questions using relevant context from SEC 10-K reports (specifically Sections 1A and 7).
- 🦙 Parameter-efficient fine-tuning with LoRA adapters
- 📈 SEC API integration for real-time 10-K retrieval
- 🔍 Semantic search using open-source embeddings
- 💡 End-to-end RAG pipeline implementation
- Clone repository:
git clone https://github.com/mirabdullahyaser/LLaMA3-Financial-Analyst.git
cd LLaMA3-Financial-Analyst
- Install dependencies:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps xformers trl peft accelerate bitsandbytes
pip install sec_api
pip install -U langchain
pip install -U langchain-community
pip install -U sentence-transformers
pip install -U faiss-gpu-cu12
- Setup environment variables:
from google.colab import userdata
# HuggingFace token, required for accessing gated models (like LLaMa 3 8B Instruct)
hf_token = userdata.get("HUGGINGFACEHUB_API_KEY")
# SEC-API Key
sec_api_key = userdata.get("SEC_API_KEY")
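If you are running outside Colab, the same two keys can instead be read from ordinary environment variables; a minimal equivalent sketch (the variable names match those used above):
import os
# Equivalent setup outside Colab: read the keys from the shell environment
hf_token = os.environ["HUGGINGFACEHUB_API_KEY"]  # HuggingFace token for gated models
sec_api_key = os.environ["SEC_API_KEY"]          # SEC-API key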
- Initializing the LLaMa-3 model meta-llama/Meta-Llama-3-8B-Instruct. We will use the built-in GPU on Colab for all of the fine-tuning, via the Unsloth library. Much of the code below is adapted from the Unsloth documentation.
# Unsloth's fast loader for the model and tokenizer
from unsloth import FastLanguageModel

# Load the model and tokenizer from the pre-trained FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
# Specify the pre-trained model to use
model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
# Specifies the maximum number of tokens (words or subwords) that the model can process in a single forward pass
max_seq_length = 2048,
# Data type for the model. None means auto-detection based on hardware, Float16 for specific hardware like Tesla T4
dtype = None,
# Enable 4-bit quantization, By quantizing the weights of the model to 4 bits instead of the usual 16 or 32 bits, the memory required to store these weights is significantly reduced. This allows larger models to be run on hardware with limited memory resources.
load_in_4bit = True,
# Access token for gated models, required for authentication to use models like Meta-Llama-3-8B-Instruct
token = hf_token,
)
- Adding LoRA adapters to the model for parameter-efficient fine-tuning. LoRA, or Low-Rank Adaptation, is a technique used in machine learning to fine-tune large models more efficiently. It works by adding a small, additional set of parameters to the existing model instead of retraining all the parameters from scratch. This makes the fine-tuning process faster and less resource-intensive. Essentially, LoRA helps tailor a pre-trained model to specific tasks or datasets without requiring extensive computational power or memory.
# Apply LoRA (Low-Rank Adaptation) adapters to the model for parameter-efficient fine-tuning
model = FastLanguageModel.get_peft_model(
model,
# Rank of the adaptation matrix. Higher values can capture more complex patterns. Suggested values: 8, 16, 32, 64, 128
r = 16,
# Specify the model layers to which LoRA adapters should be applied
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
# Scaling factor for LoRA. Controls the weight of the adaptation. Typically a small positive integer
lora_alpha = 16,
# Dropout rate for LoRA. A value of 0 means no dropout, which is optimized for performance
lora_dropout = 0,
# Bias handling in LoRA. Setting to "none" is optimized for performance, but other options can be used
bias = "none",
# Enables gradient checkpointing to save memory during training. "unsloth" is optimized for very long contexts
use_gradient_checkpointing = "unsloth",
# Seed for random number generation to ensure reproducibility of results
random_state = 3407,
)
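To see the parameter efficiency in numbers, the wrapped model can report how many parameters the LoRA adapters actually train (assuming the returned object exposes PEFT's standard helper):
# Print the count of trainable LoRA parameters vs. the total parameter count
model.print_trainable_parameters()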
- Prepare the dataset for fine-tuning. We will use a Hugging Face dataset of financial Q&A over Form 10-Ks, provided by user Virat Singh: https://huggingface.co/datasets/virattt/llama-3-8b-financialQA
The code below formats each entry into the prompt defined first for training, taking care to add the special tokens. In this case our end-of-sequence token is <|eot_id|> (the full set of LLaMa 3 special tokens is linked in the references below).
# Defining the expected prompt
ft_prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Below is a user question, paired with retrieved context. Write a response that appropriately answers the question,
include specific details in your response. <|eot_id|>
<|start_header_id|>user<|end_header_id|>
### Question:
{}
### Context:
{}
<|eot_id|>
### Response: <|start_header_id|>assistant<|end_header_id|>
{}"""
# Grabbing end of sentence special token
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
# Function for formatting above prompt with information from Financial QA dataset
def formatting_prompts_func(examples):
questions = examples["question"]
contexts = examples["context"]
responses = examples["answer"]
texts = []
for question, context, response in zip(questions, contexts, responses):
# Must add EOS_TOKEN, otherwise your generation will go on forever!
text = ft_prompt.format(question, context, response) + EOS_TOKEN
texts.append(text)
return { "text" : texts, }
pass
from datasets import load_dataset

dataset = load_dataset("virattt/llama-3-8b-financialQA", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
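It is worth spot-checking one formatted example to confirm the template filled in correctly and the <|eot_id|> / EOS token landed at the end; a quick sanity check:
# Inspect the first formatted training example
print(dataset[0]["text"])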
- Defining the Trainer Arguments for fine-tuning. We will set up and use the Supervised Fine-Tuning Trainer (SFTTrainer) from Hugging Face's TRL (Transformer Reinforcement Learning) library.
Supervised fine-tuning is a process in machine learning where a pre-trained model is further trained on a specific dataset with labeled examples. During this process, the model learns to make predictions or classifications based on the labeled data, improving its performance on the specific task at hand. This technique leverages the general knowledge the model has already acquired during its initial training phase and adapts it to perform well on a more targeted set of examples. Supervised fine-tuning is commonly used to customize models for specific applications, such as sentiment analysis, object recognition, or language translation, by using task-specific annotated data.
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
# The model to be fine-tuned
model = model,
# The tokenizer associated with the model
tokenizer = tokenizer,
# The dataset used for training
train_dataset = dataset,
# The field in the dataset containing the text data
dataset_text_field = "text",
# Maximum sequence length for the training data
max_seq_length = 2048,
# Number of processes to use for data loading
dataset_num_proc = 2,
# Whether to use sequence packing, which can speed up training for short sequences
packing = False,
args = TrainingArguments(
# Batch size per device during training
per_device_train_batch_size = 2,
# Number of gradient accumulation steps to perform before updating the model parameters
gradient_accumulation_steps = 4,
# Number of warmup steps for learning rate scheduler
warmup_steps = 5,
# Total number of training steps
max_steps = 60,
# Number of training epochs, can use this instead of max_steps, for this notebook its ~900 steps given the dataset
# num_train_epochs = 1,
# Learning rate for the optimizer
learning_rate = 2e-4,
# Use 16-bit floating point precision for training if bfloat16 is not supported
fp16 = not is_bfloat16_supported(),
# Use bfloat16 precision for training if supported
bf16 = is_bfloat16_supported(),
# Number of steps between logging events
logging_steps = 1,
# Optimizer to use (in this case, AdamW with 8-bit precision)
optim = "adamw_8bit",
# Weight decay to apply to the model parameters
weight_decay = 0.01,
# Type of learning rate scheduler to use
lr_scheduler_type = "linear",
# Seed for random number generation to ensure reproducibility
seed = 3407,
# Directory to save the output models and logs
output_dir = "outputs",
),
)
- Training the model
trainer_stats = trainer.train()
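Once training finishes, the LoRA adapters can be saved so the fine-tuned weights are reusable without retraining; a minimal sketch (the directory name "lora_model" is arbitrary):
# Save only the LoRA adapter weights plus the tokenizer for later reuse
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")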
Now that we have our fine-tuned language model, inference functions, and a desired prompt format, we need to set up the RAG pipeline to inject the relevant context into each generation.
The flow will follow as such:
User Question -> Context Retrieval from 10-K -> LLM Answers User Question Using Context
To do this we will need to be able to:
- Gather specific sections from 10-Ks
- Parse and chunk the text in them
- Vectorize and embed the chunks into a vector database
- Set up a retriever to semantically search the user's questions over the database to return relevant context
A Form 10-K is an annual report required by the U.S. Securities and Exchange Commission that gives a comprehensive summary of a company's financial performance.
- Function for 10-K Retrieval. To make this easier, we're taking advantage of the SEC API. Signing up is free, and you get 100 API calls a day; each time we load a ticker's symbol it uses three calls.
For this project, we'll focus on loading only Sections 1A and 7:
- 1A: Risk Factors
- 7: Management's Discussion and Analysis of Financial Condition and Results of Operations
from sec_api import QueryApi, ExtractorApi

# Extract Filings Function
def get_filings(ticker):
global sec_api_key
# Finding Recent Filings with QueryAPI
queryApi = QueryApi(api_key=sec_api_key)
query = {
"query": f"ticker:{ticker} AND formType:\"10-K\"",
"from": "0",
"size": "1",
"sort": [{ "filedAt": { "order": "desc" } }]
}
filings = queryApi.get_filings(query)
# Getting 10-K URL
filing_url = filings["filings"][0]["linkToFilingDetails"]
# Extracting Text with ExtractorAPI
extractorApi = ExtractorApi(api_key=sec_api_key)
onea_text = extractorApi.get_section(filing_url, "1A", "text") # Section 1A - Risk Factors
seven_text = extractorApi.get_section(filing_url, "7", "text") # Section 7 - Management’s Discussion and Analysis of Financial Condition and Results of Operations
# Joining Texts
combined_text = onea_text + "\n\n" + seven_text
return combined_text
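A quick usage example (the ticker AAPL is only an illustration; remember each ticker load consumes SEC-API calls as noted above):
# Pull Sections 1A and 7 of the latest 10-K for a sample ticker
filing_text = get_filings("AAPL")
print(filing_text[:500])  # Preview the first 500 characters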
- Setting Up Embeddings Locally. In the spirit of local and fine-tuned models, we'll use an open-source embedding model: the Beijing Academy of Artificial Intelligence's large English embedding model (BAAI/bge-large-en-v1.5). More details on their open-source models are available in their GitHub repo (linked in the references below).
Embeddings are numerical representations of data, typically used to convert complex, high-dimensional data into a lower-dimensional space where similar data points are closer together. In the context of natural language processing (NLP), embeddings are used to represent words, phrases, or sentences as vectors of real numbers. These vectors capture semantic relationships, meaning that words with similar meanings are represented by vectors that are close together in the embedding space.
Embedding models are machine learning models that are trained to create these numerical representations. They learn to encode various types of data into embeddings that capture the essential characteristics and relationships within the data. For example, in NLP, embedding models like Word2Vec, GloVe, and BERT are trained on large text corpora to produce word embeddings. These embeddings can then be used for various downstream tasks, such as text classification, sentiment analysis, or machine translation. In this case, we'll be using them for semantic similarity search.
# LangChain wrapper for running a Hugging Face embedding model locally
from langchain_community.embeddings import HuggingFaceEmbeddings

# HF Model Path
modelPath = "BAAI/bge-large-en-v1.5"
# Create a dictionary with model configuration options, specifying to use the cuda for GPU optimization
model_kwargs = {'device':'cuda'}
encode_kwargs = {'normalize_embeddings': True}
# Initialize an instance of LangChain's HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
model_name=modelPath, # Provide the pre-trained model's path
model_kwargs=model_kwargs, # Pass the model configuration options
encode_kwargs=encode_kwargs # Pass the encoding options
)
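As a quick sanity check, two semantically related sentences should map to nearby vectors; a small sketch using LangChain's embed_query (the example sentences are arbitrary):
import numpy as np

# Embed two related sentences and compare them with cosine similarity
v1 = np.array(embeddings.embed_query("Revenue grew due to strong product sales."))
v2 = np.array(embeddings.embed_query("Product sales increased year over year."))
print(float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))))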
- Processing and Defining the Vector Database. In this flow we take the data from the SEC API function defined above and then go through three steps:
- Text Splitting
- Vectorizing
- Retrieval Function Setup
Text splitting is the process of breaking down large documents or text data into smaller, manageable chunks. This is often necessary when dealing with extensive text data, such as legal documents, financial reports, or any lengthy articles. The purpose of text splitting is to ensure that the data can be effectively processed, analyzed, and indexed by machine learning models and databases.
Vector databases store data in the form of vectors, which are numerical representations of text, images, or other types of data. These vectors capture the semantic meaning of the data, allowing for efficient similarity search and retrieval.
The vector DB we're using here is FAISS (Facebook AI Similarity Search), a lightweight, in-memory solution (no need to save anything to disk) that is not as powerful as other vector DBs but works great for this use case.
# LangChain text splitter and FAISS vector store
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Prompt the user to input the stock ticker they want to analyze
ticker = input("What Ticker Would you Like to Analyze? ex. AAPL: ")
print("-----")
print("Getting Filing Data")
# Retrieve the filing data for the specified ticker
filing_data = get_filings(ticker)
print("-----")
print("Initializing Vector Database")
# Initialize a text splitter to divide the filing data into chunks
text_splitter = RecursiveCharacterTextSplitter(
chunk_size = 1000, # Maximum size of each chunk
chunk_overlap = 500, # Number of characters to overlap between chunks
length_function = len, # Function to determine the length of the chunks
is_separator_regex = False # Whether the separator is a regex pattern
)
# Split the filing data into smaller, manageable chunks
split_data = text_splitter.create_documents([filing_data])
# Create a FAISS vector database from the split data using embeddings
db = FAISS.from_documents(split_data, embeddings)
# Create a retriever object to search within the vector database
retriever = db.as_retriever()
print("-----")
print("Filing Initialized")
- Retrieval. This is the process of querying a vector database to find and return the text chunks or documents relevant to a given query. It involves searching through the indexed embeddings to identify the ones most similar to the query embedding.
How It Works:
- Query Embedding: When a query is made, it is first converted into an embedding using the same embedding model used for the text chunks.
- Similarity Search: The retriever searches the vector database for embeddings that are similar to the query embedding. This similarity is often measured using distance metrics like cosine similarity or Euclidean distance.
- Document Retrieval: The retriever then retrieves the original text chunks or documents associated with the similar embeddings.
- Context Assembly: The retrieved text chunks are assembled to provide a coherent context or answer to the query.
In this function, the query is used to invoke the retriever, which returns a list of documents. The content of these documents is then extracted and returned as the context for the query.
# Retrieval Function
def retrieve_context(query):
global retriever
retrieved_docs = retriever.invoke(query) # Invoke the retriever with the query to get relevant documents
context = []
for doc in retrieved_docs:
context.append(doc.page_content) # Collect the content of each retrieved document
return context
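The loop in the next step also calls the inference and extract_response helpers mentioned earlier; they are not reproduced in this section, so here is a minimal sketch of what they might look like, assuming Unsloth's FastLanguageModel.for_inference for fast generation and the same ft_prompt template used during fine-tuning (treat the generation settings and string parsing as illustrative, not the project's exact implementation):
# Hypothetical inference helpers matching the fine-tuning prompt format above
def inference(question, context):
    FastLanguageModel.for_inference(model)  # Enable Unsloth's fast inference mode
    prompt = ft_prompt.format(question, "\n".join(context), "")  # Leave the response slot empty
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=256, use_cache=True)
    return tokenizer.batch_decode(outputs)[0]

def extract_response(full_text):
    # Keep only the text after the final assistant header, dropping the end-of-turn token
    response = full_text.split("<|end_header_id|>")[-1]
    return response.replace("<|eot_id|>", "").strip()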
- Combining Functions. Now we'll string everything together into a simple while loop that takes the user's question, retrieves context from the vector DB populated with the company's Form 10-K, then runs inference through our fine-tuned model to generate a response. Give it a shot (type "x" to exit the loop).
while True:
question = input(f"What would you like to know about {ticker}'s form 10-K? ")
if question == "x":
break
else:
context = retrieve_context(question) # Context Retrieval
resp = inference(question, context) # Running Inference
parsed_response = extract_response(resp) # Parsing Response
print(f"LLaMa3 Agent: {parsed_response}")
print("-----\n")
Example financial Q&A:
User Query: What region contributes most to international sales?
LLaMa3 Agent: Europe
User Query: What are significant announcements of products during fiscal year 2023?
LLaMa3 Agent: During fiscal year 2024, the Company announced the following significant products: MacBook Pro 14-in.
User Query: What are significant announcements of products during fiscal year 2023?
LLaMa3 Agent: Significant product announcements during fiscal year 2023 included the following: MacBook Pro 14-in.
- Dataset
- LLaMa 3
- Unsloth AI
- Supervised fine-tuning
- LLaMa 3 special tokens
- Beijing Academy of Artificial Intelligence's - Large English Embedding Model
Contributions welcome! Please open an issue first to discuss proposed changes.
For questions or suggestions, please contact Mir Abdullah Yaser via GitHub Issues or mirabdullahyaser@example.com.