Automate Stock Analysis With an Intelligent Research Portal
Automated due diligence to populate your discounted cash flow model using an intelligent file store.
Introduction
Investing is overwhelming. There are so many data points and sources to consider when deciding which companies will generate a return on your hard-earned cash.
A wise investor once told me to consider companies whose products I personally use. It made sense to research the very company that powers Mixpeek: Amazon Web Services. In doing so, I had to scour dozens of assets on their Investor Relations page, spanning earnings calls, 10-K documents, careers pages, and more.
I will walk you through how I enriched my own Discounted Cash Flow Model for Amazon.com, Inc. using entirely off-the-shelf technologies.
My first data point of interest is understanding how their NLP products are performing, so I need to ask a question like:
"What was the OpenSearch Annual Contract Value from 2021?"
Just because the tools are free doesn't mean the cost of ownership is. There is a managed alternative, covered at the end of this post.
Download the Files
There are three different file types we need in order to conduct our due diligence:
- Amazon-Quarterly-Earnings-Call.wav: a webcast in which a public company discusses the financial results of a reporting period.
- Amazon-10K-Filing.pdf: a comprehensive report filed annually by a publicly traded company about its financial performance.
- Amazon-8K-Report.xls: a report of major events that shareholders should know about.
import os
import requests

def download_file(url):
    # Derive the local filename from the last path segment of the URL
    filename = os.path.basename(url)
    # Get the response from the URL
    response = requests.get(url)
    # Check if the request was successful
    if response.status_code == 200:
        # Write the raw bytes to disk under the original filename
        with open(filename, "wb") as file:
            file.write(response.content)
        print(f"Downloaded {filename}")
    else:
        print(f"Failed to download {url} (status {response.status_code})")
Now we can download each file (and any other future ones) individually
download_file("https://s2.q4cdn.com/299287126/files/doc_financials/2022/q3/Amazon-Quarterly-Earnings-Call-Q3-2022-Full-Call-v2.wav")
Extract the Contents
We need to do different things depending on the file type.
PDF
Run text extraction (or Optical Character Recognition, if the document is scanned) on the file, capturing the output as well as the page where each piece of content lives.
from PyPDF2 import PdfReader

def extract_pdf_text(pdf_file):
    # Create a PDF reader object (PdfReader replaces the deprecated PdfFileReader)
    reader = PdfReader(pdf_file)
    # Map each page number to the text extracted from it
    text = {}
    # Loop through each page of the PDF, numbering pages from 1
    for page_number, page in enumerate(reader.pages, start=1):
        text[page_number] = page.extract_text()
    # Return the dictionary of page texts
    return text
# the output
{
1: 'This is the text on page 1 of the PDF file.',
2: 'This is the text on page 2 of the PDF file.'
}
Spreadsheet
Extract the content of each cell, and store the rows in which the content exists in a separate metadata store.
import openpyxl

def extract_excel_data(excel_file):
    # Open the workbook and grab the active sheet
    wb = openpyxl.load_workbook(excel_file)
    sheet = wb.active
    # Map each row number to the list of cell values in that row
    data = {}
    for row in sheet.rows:
        # Get the values from each cell in the row
        row_data = [cell.value for cell in row]
        # Use the row number as the key so it can serve as metadata later
        data[row[0].row] = row_data
    # Return the dictionary of row data
    return data
# The output
{
1: ['Column 1', 'Column 2', 'Column 3'],
2: ['Row 1, Column 1', 'Row 1, Column 2', 'Row 1, Column 3'],
3: ['Row 2, Column 1', 'Row 2, Column 2', 'Row 2, Column 3']
}
Audio
First, split the audio into 30-second chunks, then transcribe each chunk, noting the timestamp at which the words were spoken.
import speech_recognition as sr

def transcribe_audio_file(audio_file):
    # Create a speech recognition object
    r = sr.Recognizer()
    # Collect (transcription, timestamp) pairs
    transcriptions = []
    # Open the audio file
    with sr.AudioFile(audio_file) as source:
        offset = 0
        # Read and transcribe the file in consecutive 30-second chunks
        # (AudioData objects are not sliceable, so we record chunk by chunk)
        while offset < source.DURATION:
            chunk = r.record(source, duration=30)
            try:
                transcription = r.recognize_google(chunk)
            except sr.UnknownValueError:
                transcription = ""
            # The offset (in seconds) is the timestamp where this chunk begins
            transcriptions.append((transcription, offset))
            offset += 30
    # Return the list of transcriptions with their timestamps
    return transcriptions
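Each transcription keeps the starting timestamp of its chunk, mirroring the page numbers and row numbers above. An illustrative output (placeholder text, not the actual call):
# the output (illustrative)
[
    ("good afternoon everyone and welcome to the amazon q3 earnings call", 0),
    ("turning to amazon web services revenue for the quarter", 30),
]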
Run Each Through Vector Embedding
Now, in order to do semantic similarity search, we need to run the extracted output of each file through an embedding model to retrieve the corresponding vector representations.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# docs is the list of text snippets extracted above
# (PDF pages, spreadsheet rows, and audio transcriptions)
corpus_vector = model.encode(docs)
Keep in mind, we will also need to store the corresponding metadata (page number, row number, or timestamp) alongside each snippet and its vector representation.
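Here is a minimal sketch of what that combined store might look like, reusing the extraction functions defined above (the dictionary layout is an illustrative assumption, not a required schema):
# Illustrative corpus layout: each snippet keeps a pointer back to its
# source so a search hit can be traced to a page, row, or timestamp
corpus = []

for page_number, page_text in extract_pdf_text("Amazon-10K-Filing.pdf").items():
    corpus.append({"text": page_text, "meta": {"page_number": page_number}})

for row_number, row_data in extract_excel_data("Amazon-8K-Report.xls").items():
    row_text = " ".join(str(value) for value in row_data if value is not None)
    corpus.append({"text": row_text, "meta": {"row": row_number}})

for transcription, timestamp in transcribe_audio_file("Amazon-Quarterly-Earnings-Call.wav"):
    corpus.append({"text": transcription, "meta": {"timestamp": timestamp}})

# The docs list passed to model.encode() above is just the text fields
docs = [entry["text"] for entry in corpus]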
Now, Do Our Investment Research
Once we have our vector embeddings, we can run vector similarity queries. I personally wanted to know what AWS's competitive moat was for their OpenSearch Serverless announcement.
query = "What is OpenSearch's Serverless competitive moat?"
query_vector = model.encode(query)
scores = util.cos_sim(query_vector, corpus_vector)[0].cpu().tolist()
# response
0.3428829073905945 enhanced technologies, including search, web and infrastructure
...
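To trace a score back to its source rather than eyeballing raw numbers, we can pair each score with its snippet and metadata; a minimal sketch, assuming the corpus structure from the previous section:
# Rank snippets by cosine similarity and print the strongest matches
ranked = sorted(zip(scores, corpus), key=lambda pair: pair[0], reverse=True)
for score, entry in ranked[:3]:
    print(f"{score:.3f}", entry["meta"], entry["text"][:80])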
Never Parse Files Again
Point the Mixpeek agent at your files, and let us handle all the work. Then just search like you would Google.
from mixpeek import Mixpeek
mix = Mixpeek(mixpeek_key="API_KEY")
mix.index(["10k.pdf", "earnings_call.wav", "8k_report.xls"])
results = mix.search("What is OpenSearch's competitive moat?")
[
# earnings call audio file
{
"file_url": "https://s2.q4cdn.com/299287126/files/doc_financials/2022/q3/Amazon-Quarterly-Earnings-Call-Q3-2022-Full-Call-v2.wav",
"file_id": "63738f90829faf6a25053f64",
"importance": "100%",
"meta": {"timestamp": 5785},
},
# 10K document in PDF
{
"file_url": "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/f965e5c3-fded-45d3-bbdb-f750f156dcc9.pdf",
"file_id": "63738fa2829faf6a25053f65",
"importance": "98%",
"meta": {
"page_number": 6,
"context": [
{"text": "enhanced technologies, including"},
{"hit": "search"},
{"text": ", web and infrastructure computing services"},
],
},
},
# 8K spreadsheet report
{
"file_url": "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/d8702a24-e840-41f0-a1e6-16351ca9fe4d.xls",
"file_id": "63738fa2829faf6a25053f66",
"importance": "72%",
"meta": {},
},
]