Introduction

Investing is overwhelming. There are so many data points and sources to consider when deciding which companies will generate a return on your hard-earned cash.

A wise investor once told me to consider companies whose products I personally use. It made sense to research the very company that powers Mixpeek: Amazon Web Services. In doing so, I had to scour dozens of assets on their Investor Relations page, from earnings calls to 10-K filings, careers pages and more.

I will walk you through how I enriched my own Discounted Cash Flow Model for Amazon.com, Inc. using entirely off-the-shelf technologies.

My first data point of interest is understanding how their NLP products are performing, so I need to ask a question like:

"What was the OpenSearch Annual Contract Value from 2021?"

Just because the tools are free doesn't mean the cost of ownership is. There is a managed alternative at the end of this post.

Download the Files

There are three different file types we need in order to conduct our due diligence:

- Amazon-Quarterly-Earnings-Call.wav: the webcast in which a public company discusses the financial results of a reporting period.

- Amazon-10K-Filing.pdf: a comprehensive report filed annually by a publicly traded company about its financial performance.

- Amazon-8K-Report.xls: a report of major events that shareholders should know about.

import os
import requests

def download_file(url):
    # Fetch the response from the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Name the local file after the last segment of the URL
        filename = os.path.basename(url)

        # Write the content to disk
        with open(filename, "wb") as file:
            file.write(response.content)
        print(f"Downloaded {filename}")
    else:
        print(f"Failed to download {url}")

Now we can download each file (and any future ones) individually:

download_file("https://s2.q4cdn.com/299287126/files/doc_financials/2022/q3/Amazon-Quarterly-Earnings-Call-Q3-2022-Full-Call-v2.wav")

Extract the Contents

We need to do different things depending on the file type (a small dispatcher tying the three extractors together follows below).

PDF

Extract the text from the file, capturing the output along with the page where each passage lives. (Optical Character Recognition (OCR) is only needed for scanned PDFs; a digitally generated filing like this 10-K can be parsed directly.)

from PyPDF2 import PdfReader

def extract_pdf_text(pdf_file):
    # Create a PDF reader object (PdfReader replaces the deprecated PdfFileReader)
    pdf_reader = PdfReader(pdf_file)

    # Create a dictionary to store the text extracted from the PDF
    text = {}

    # Loop through each page of the PDF, keyed by 1-based page number
    for page_number, page in enumerate(pdf_reader.pages, start=1):
        # Extract the text from the current page
        text[page_number] = page.extract_text()

    # Return the dictionary of text
    return text

# the output
{
  1: 'This is the text on page 1 of the PDF file.',
  2: 'This is the text on page 2 of the PDF file.'
}

Spreadsheet

Extract the content of each cell, and store the rows in which the content exists in a separate metadata store.

import openpyxl

def extract_excel_data(excel_file):
    # Open the workbook (note: openpyxl reads .xlsx files; a legacy .xls
    # like this 8-K report needs converting first, e.g. via pandas)
    wb = openpyxl.load_workbook(excel_file)

    # Get the active sheet
    sheet = wb.active

    # Create a dictionary to store the data
    data = {}

    # Loop through each row of the sheet
    for row in sheet.rows:
        # Get the values from each cell in the row
        row_data = [cell.value for cell in row]

        # Add the row data to the dictionary, using the row number as the key
        data[row[0].row] = row_data

    # Return the dictionary of data
    return data   
# The output
{
  1: ['Column 1', 'Column 2', 'Column 3'],
  2: ['Row 1, Column 1', 'Row 1, Column 2', 'Row 1, Column 3'],
  3: ['Row 2, Column 1', 'Row 2, Column 2', 'Row 2, Column 3']
}

Audio

First split the audio into 30-second chunks, then transcribe each chunk, noting the timestamp where the words were spoken.

import speech_recognition as sr

def transcribe_audio_file(audio_file, chunk_seconds=30):
    # Create a speech recognition object
    r = sr.Recognizer()

    # Create a list to store the (transcription, timestamp) pairs
    transcriptions = []

    # Open the audio file
    with sr.AudioFile(audio_file) as source:
        offset = 0

        # source.DURATION is the total length of the recording in seconds
        while offset < source.DURATION:
            # record() reads sequentially, so each call returns the next chunk
            chunk = r.record(source, duration=chunk_seconds)

            try:
                # Transcribe the current chunk
                transcription = r.recognize_google(chunk)
            except sr.UnknownValueError:
                # Skip chunks the recognizer can't make out
                transcription = ""

            # Note the timestamp (in seconds) where this chunk starts
            transcriptions.append((transcription, offset))
            offset += chunk_seconds

    # Return the list of transcriptions
    return transcriptions
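
With all three extractors defined, a small dispatcher (my own glue code, not from any library) can route each downloaded file to the right one by extension:

import os

def extract(file_path):
    # Route the file to the matching extractor based on its extension
    ext = os.path.splitext(file_path)[1].lower()
    if ext == ".pdf":
        return extract_pdf_text(file_path)
    if ext in (".xls", ".xlsx"):
        return extract_excel_data(file_path)
    if ext == ".wav":
        return transcribe_audio_file(file_path)
    raise ValueError(f"Unsupported file type: {ext}")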

Run Each Through Vector Embedding

Now, in order to do proper semantic search, we need to run the outputs of each file through an embedding model to retrieve the corresponding vector representations.

💡
Vector embeddings are a deeply complicated topic. Read more about them here.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# `docs` is the flat list of passages extracted above
# (PDF page texts, spreadsheet rows and audio transcriptions)
corpus_vector = model.encode(docs)

Keep in mind, we will also need to store the corresponding metadata for each piece of content alongside its vector representation.
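
A minimal sketch of that pairing (the dictionary layout is my own, not from any particular library), assuming `docs`, `corpus_vector` and a parallel `metadata` list built during extraction:

# Keep each passage's vector next to the metadata that locates it,
# so a search hit can point back to a page, row or timestamp
corpus = [
    {"text": text, "vector": vector, "meta": meta}  # e.g. meta = {"page_number": 6}
    for text, vector, meta in zip(docs, corpus_vector, metadata)
]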

Now Do Our Investment Research

Once we have our vector embeddings, we can run vector similarity queries. I personally wanted to know what AWS's competitive moat is for their OpenSearch Serverless announcement.

query = "What is OpenSearch's Serverless competitive moat?"
query_vector = model.encode(query)

scores = util.cos_sim(query_vector, corpus_vector)[0].cpu().tolist()
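
# Pair each score with its passage to surface the strongest matches
ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
for score, text in ranked[:3]:
    print(score, text)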

# response
0.3428829073905945 enhanced technologies, including search, web and infrastructure
...
         

Never Parse Files Again

Point the Mixpeek agent to your files, and let us handle all the work. Then just search like you would on Google.

from mixpeek import Mixpeek

mix = Mixpeek(mixpeek_key="API_KEY")

mix.index(["10k.pdf", "earnings_call.wav", "8k_report.xls"])

results = mix.search("What is OpenSearch's competitive moat?")    


[
  # earnings call audio file
  {
      "file_url": "https://s2.q4cdn.com/299287126/files/doc_financials/2022/q3/Amazon-Quarterly-Earnings-Call-Q3-2022-Full-Call-v2.wav",
      "file_id": "63738f90829faf6a25053f64",
      "importance": "100%",
      "meta": {"timestamp": 5785},
  },
  # 10K document in PDF
  {
      "file_url": "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/f965e5c3-fded-45d3-bbdb-f750f156dcc9.pdf",
      "file_id": "63738fa2829faf6a25053f65",
      "importance": "98%",
      "meta": {
          "page_number": 6,
          "context": [
              {"text": "enhanced technologies, including"},
              {"hit": "search"},
              {"text": ", web and infrastructure computing services"},
          ],
      },
  },
  # 8K spreadsheet report
  {
      "file_url": "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/d8702a24-e840-41f0-a1e6-16351ca9fe4d.xls",
      "file_id": "63738fa2829faf6a25053f66",
      "importance": "72%",
      "meta": {},
  },
] 
About the author
Ethan Steininger

Former GTM Lead of MongoDB's NLP platform, Atlas Search. Occasionally off the grid in his self-converted camper van.
